Patch ProxyStore data eviction and Parsl network-layer validation #45
Open
NikJur wants to merge 3 commits into
Conversation
Pull request overview
This PR aims to prevent unintended ProxyStore “evict-on-read” data loss between simulation → train/inference steps by switching to explicit key propagation, and it also updates Parsl HTEX addressing to avoid strict IPv4/hostname validation failures on some HPC fabrics.
Changes:
- Reworked DDWE result handling to persist simulation outputs and training outputs via explicit ProxyStore keys.
- Updated the OpenMM NTL9 DDWE example train/inference tasks to resolve simulation/model objects via store.get(key).
- Set HTEX address dynamically via address_by_hostname() for the Vista config.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| deepdrivewe/workflows/ddwe.py | Stores simulation/train results as ProxyStore keys to avoid destructive proxy resolution. |
| deepdrivewe/parsl.py | Uses hostname-derived address for HTEX to improve network-layer validation stability. |
| deepdrivewe/examples/openmm_ntl9_ddwe/train.py | Resolves simulation objects from ProxyStore keys before training. |
| deepdrivewe/examples/openmm_ntl9_ddwe/inference.py | Resolves simulation objects (and possibly training output) from ProxyStore before inference. |
Refactored the Thinker result processors and execution kernels to implement a manual key-propagation schema.
The Issue: In non-streaming workflows, automated proxy resolution triggered "evict-on-read" behavior. This caused trajectory data or model weights to be deleted from the backend before downstream tasks (Training or Inference) could resolve them.
The Fix: Implemented explicit proxy extraction and re-registration of concrete objects as persistent keys. Updated train.py and inference.py to resolve simulation metadata via the store.get(key) interface.
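The key-propagation pattern above can be sketched as follows. This is a minimal illustration, not the PR's actual code: a toy dict-backed class stands in for the ProxyStore backend (the real workflow would use proxystore's Store.put()/Store.get()), and all names and values here are hypothetical.

```python
# Illustrative sketch of evict-on-read data loss vs. explicit key propagation.
# ToyStore is a hypothetical stand-in for a ProxyStore backend.

class ToyStore:
    """Minimal stand-in for a ProxyStore-like key/value backend."""

    def __init__(self):
        self._data = {}
        self._next = 0

    def put(self, obj):
        """Register an object and return a persistent key."""
        key = f"key-{self._next}"
        self._next += 1
        self._data[key] = obj
        return key

    def get(self, key):
        """Resolve a key WITHOUT evicting the underlying data."""
        return self._data[key]

    def proxy_evict_on_read(self, obj):
        """Mimic a proxy created with evict-on-read: first resolve deletes."""
        key = self.put(obj)

        def resolve():
            return self._data.pop(key)  # entry is gone after this call

        return resolve


store = ToyStore()

# Problematic path: an evict-on-read proxy loses the data after the first
# resolution, so a second downstream task cannot resolve it.
proxy = store.proxy_evict_on_read({"weights": [0.1, 0.2]})
first = proxy()  # resolves fine the first time
evicted = False
try:
    proxy()  # raises KeyError: backend entry was already evicted
except KeyError:
    evicted = True

# Fixed path: register the concrete object once, pass the key around, and
# let both the training and inference tasks resolve it non-destructively.
key = store.put({"weights": [0.1, 0.2]})
train_view = store.get(key)      # training task resolves the key
inference_view = store.get(key)  # inference task resolves the same key
```

The point of the fix is the same as in this sketch: downstream tasks receive a plain key rather than a self-evicting proxy, so repeated reads are safe.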
Updated the HighThroughputExecutor configuration to use dynamic, hostname-derived addressing.
The Issue: The default 'localhost' string frequently triggers IPv4 validation errors in strict network environments or on specific high-performance fabrics.
The Fix: Standardized executor initialization using the address_by_hostname() utility. This ensures the executor binds to a valid IPv4 string, satisfying validation requirements while maintaining reachability for distributed workers across the Slurm allocation.
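A minimal sketch of the executor change, using Parsl's documented address_by_hostname() helper; the executor label is illustrative and other site-specific options (providers, worker counts) are omitted here.

```python
# Sketch: derive the HTEX address from the node's hostname instead of
# hard-coding 'localhost', which can fail strict address validation.
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.addresses import address_by_hostname

config = Config(
    executors=[
        HighThroughputExecutor(
            label="htex",                  # illustrative label
            address=address_by_hostname(),  # resolvable on the local fabric
        )
    ],
)
```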
Validation
The implementation was verified through an end-to-end 10-iteration NTL9 ensemble run on the Bede H200 cluster.