Patch ProxyStore data eviction and Parsl network-layer validation #45

Open
NikJur wants to merge 3 commits into ramanathanlab:main from NikJur:fix/parsl-proxystore-eviction

@NikJur NikJur commented Mar 27, 2026

  1. Non-Destructive Data Retrieval Protocol
    Refactored the Thinker result processors and execution kernels to implement a manual key-propagation scheme.

The Issue: In non-streaming workflows, automated proxy resolution triggered "evict-on-read" behavior. This caused trajectory data or model weights to be deleted from the backend before downstream tasks (Training or Inference) could resolve them.

The Fix: Implemented explicit proxy extraction and re-registration of concrete objects as persistent keys. Updated train.py and inference.py to resolve simulation metadata via the store.get(key) interface.
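
The difference between the two read modes can be sketched with a toy in-memory store. This is a minimal illustration only: `ToyStore`, `get_and_evict`, and `run_pipeline` are hypothetical names invented here and are not part of the ProxyStore API. The only call taken from the PR is the non-destructive `store.get(key)` pattern.

```python
import uuid


class ToyStore:
    """In-memory stand-in for an object store (hypothetical, not ProxyStore)."""

    def __init__(self):
        self._data = {}

    def put(self, obj):
        """Store obj under a fresh key and return that key."""
        key = uuid.uuid4().hex
        self._data[key] = obj
        return key

    def get(self, key):
        """Non-destructive read: the object stays in the store."""
        return self._data[key]

    def get_and_evict(self, key):
        """Destructive read, mimicking evict-on-read proxy resolution."""
        return self._data.pop(key)


def run_pipeline(store, key, destructive):
    """Let two downstream consumers read the same key, one after the other."""
    read = store.get_and_evict if destructive else store.get
    results = []
    for consumer in ("train", "inference"):
        try:
            read(key)
            results.append((consumer, "ok"))
        except KeyError:
            # The object was already removed by an earlier reader.
            results.append((consumer, "missing"))
    return results


store = ToyStore()
evicting = run_pipeline(store, store.put({"traj": [0.1, 0.2]}), destructive=True)
persistent = run_pipeline(store, store.put({"traj": [0.1, 0.2]}), destructive=False)

print(evicting)    # [('train', 'ok'), ('inference', 'missing')]
print(persistent)  # [('train', 'ok'), ('inference', 'ok')]
```

With destructive reads, the second consumer finds nothing, which mirrors the failure described above. Passing the key and reading through a plain `get` keeps the object available to every downstream task.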

  2. Parsl Networking Stability (optional)
    Updated the HighThroughputExecutor configuration to utilize dynamic addressing.

The Issue: The default 'localhost' string frequently triggers IPv4 validation errors in strict network environments or on specific high-performance fabrics.

The Fix: Standardized executor initialization on Parsl's address_by_hostname() utility, so the executor advertises a hostname-derived address instead of the literal 'localhost'. This is intended to satisfy address validation while keeping the executor reachable by distributed workers across the Slurm allocation.
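
As a rough illustration of why the literal 'localhost' is a fragile executor address, the sketch below uses only the standard library to check whether a string is an IPv4 literal and whether it points at the loopback interface. The helper names `is_ipv4_literal` and `is_loopback` are hypothetical, and this is not Parsl's own validation logic, which may behave differently.

```python
import ipaddress


def is_ipv4_literal(address: str) -> bool:
    """Return True if address parses as a dotted-quad IPv4 literal."""
    try:
        ipaddress.IPv4Address(address)
    except ValueError:
        # Covers ipaddress.AddressValueError, raised for names like "localhost".
        return False
    return True


def is_loopback(address: str) -> bool:
    """Return True for 'localhost' or any IPv4 loopback literal (127.0.0.0/8)."""
    if address == "localhost":
        return True
    return is_ipv4_literal(address) and ipaddress.IPv4Address(address).is_loopback


for candidate in ("localhost", "127.0.0.1", "10.0.0.5"):
    print(candidate, is_ipv4_literal(candidate), is_loopback(candidate))
```

A strict IPv4 check rejects 'localhost' outright, and a loopback address is only reachable from the node that owns it, so workers on other nodes of the allocation cannot connect back. In the PR, the replacement value comes from address_by_hostname() and is passed as the HighThroughputExecutor address.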

Validation
The implementation was verified through an end-to-end 10-iteration NTL9 ensemble run on the Bede H200 cluster.

Copilot AI review requested due to automatic review settings March 27, 2026 11:55
Copilot AI left a comment

Pull request overview

This PR aims to prevent unintended ProxyStore “evict-on-read” data loss between simulation → train/inference steps by switching to explicit key propagation, and it also updates Parsl HTEX addressing to avoid strict IPv4/hostname validation failures on some HPC fabrics.

Changes:

  • Reworked DDWE result handling to persist simulation outputs and training outputs via explicit ProxyStore keys.
  • Updated the OpenMM NTL9 DDWE example train/inference tasks to resolve simulation/model objects via store.get(key).
  • Set HTEX address dynamically via address_by_hostname() for the Vista config.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 12 comments.

| File | Description |
| --- | --- |
| deepdrivewe/workflows/ddwe.py | Stores simulation/train results as ProxyStore keys to avoid destructive proxy resolution. |
| deepdrivewe/parsl.py | Uses hostname-derived address for HTEX to improve network-layer validation stability. |
| deepdrivewe/examples/openmm_ntl9_ddwe/train.py | Resolves simulation objects from ProxyStore keys before training. |
| deepdrivewe/examples/openmm_ntl9_ddwe/inference.py | Resolves simulation objects (and possibly training output) from ProxyStore before inference. |

Comment thread deepdrivewe/examples/openmm_ntl9_ddwe/inference.py
Comment thread deepdrivewe/workflows/ddwe.py
Comment thread deepdrivewe/parsl.py
Comment thread deepdrivewe/parsl.py
Comment thread deepdrivewe/examples/openmm_ntl9_ddwe/inference.py
Comment thread deepdrivewe/workflows/ddwe.py
Comment thread deepdrivewe/workflows/ddwe.py
Comment thread deepdrivewe/examples/openmm_ntl9_ddwe/train.py
Comment thread deepdrivewe/examples/openmm_ntl9_ddwe/train.py
Comment thread deepdrivewe/examples/openmm_ntl9_ddwe/train.py
2 participants