Fix multi-host execution: --hosts parsing (#322) and MPI cluster collector staging (#303) #347
Conversation
…mlcommons#303): stage MPI collector script on remote hosts + regression tests for update_args, ssh validator and for MPI collector staging
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
russfellows
left a comment
Here is a thorough review of PR #347.
PR #347 Review — Multi-host Execution Fixes (#322 + #303)
The PR is logically well-structured and the changes are technically sound overall. Both bugs are real, the root causes are correctly diagnosed, and the fixes use appropriate approaches. Below is a file-by-file analysis.
✅ mlpstorage (bash wrapper)
The change from $* → "$@" is correct and important — $* collapses all arguments into a single word-split string, while "$@" preserves each argument as a separate word. Adding set -euo pipefail and the -- separator, along with pinning --project to $SCRIPT_DIR, are all solid hardening improvements.
✅ mlpstorage_py/cli_parser.py — update_args
The host normalization using re.split(r'[,\s]+', item.strip()) is the correct approach; it handles all documented input forms (space-separated list, comma-separated, quoted strings with spaces, mixed). Removing the two leaked print() debug statements is a necessary cleanup.
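The normalization can be reduced to a minimal standalone sketch (the function name `normalize_hosts` is hypothetical; the real logic lives inside `update_args`):

```python
import re

def normalize_hosts(raw_hosts):
    """Split each raw --hosts item on commas and/or whitespace,
    dropping empty tokens left by runs of separators."""
    hosts = []
    for item in raw_hosts:
        hosts.extend(tok for tok in re.split(r'[,\s]+', item.strip()) if tok)
    return hosts

# All documented input forms collapse to the same list:
print(normalize_hosts(['h1', 'h2', 'h3']))  # --hosts h1 h2 h3    -> ['h1', 'h2', 'h3']
print(normalize_hosts(['h1,h2,h3']))        # --hosts h1,h2,h3    -> ['h1', 'h2', 'h3']
print(normalize_hosts(['h1 h2 h3']))        # --hosts 'h1 h2 h3'  -> ['h1', 'h2', 'h3']
print(normalize_hosts(['h1, h2  h3']))      # mixed separators    -> ['h1', 'h2', 'h3']
```

Note that MPI slot syntax like `h1:2` contains neither commas nor whitespace, so it passes through untouched.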
One minor concern:
The relevant line:

```python
if hasattr(args, 'hosts') and not getattr(args, 'num_client_hosts', None):
```

The `not getattr(...)` test treats `0` the same as `None` (both are falsy). If a user somehow passes `--num-client-hosts 0`, this would silently re-derive it from `len(hosts)`. A stricter `is None` check would be more correct:

```python
# Suggested:
if hasattr(args, 'hosts') and getattr(args, 'num_client_hosts', None) is None:
```

✅ mlpstorage_py/environment/validators.py
The whitespace-in-host-token guard is a correct defensive layer. One thing to note: in normal flow, update_args (called earlier) will have already split such tokens, so this guard is belt-and-suspenders. That's fine, but a comment noting "this should never fire in normal flow — see update_args" would help future maintainers understand the intent.
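The guard amounts to a simple pre-check before any `ssh` call. A sketch of the shape (the function name `check_host_tokens` is a hypothetical stand-in, not the project's actual code):

```python
def check_host_tokens(hosts):
    """Reject host tokens containing whitespace before any ssh attempt.

    In normal flow update_args has already split such tokens, so this
    should never fire; it is a defensive backstop.
    """
    for host in hosts:
        if any(ch.isspace() for ch in host):
            raise ValueError(
                f"Host token {host!r} contains whitespace -- this usually "
                f"means a quoted --hosts value reached the validator "
                f"unsplit; pass hosts as h1 h2 h3 or h1,h2,h3."
            )
```

Failing fast here replaces a cryptic `ssh: Could not resolve hostname 'h1 h2'` with a message that names the actual cause.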
✅ mlpstorage_py/cluster_collector.py
The core fix for #303 is architecturally correct:

- UUID-based working dir — using `uuid.uuid4().hex[:12]` to create a directory with the same absolute path on every node (via `ssh mkdir -p`) is the right solution. `shutil` is already imported on line 11, so `shutil.rmtree` in the `finally` block is fine.
- Parallel SCP staging — using `ThreadPoolExecutor` with per-host error capture is clean and scalable.
- X11 silencing — `env.pop("DISPLAY", None)` + `env.setdefault("PLM_RSH_AGENT", ...)` is exactly right; `.setdefault` respects a user's existing `PLM_RSH_AGENT`.
- `finally` cleanup — remote and local cleanup is unconditional, which is correct.
Two concerns:

- Unquoted paths in remote shell commands — In `_stage_script_on_remote_hosts` and `_cleanup_remote_staging`, paths are embedded directly into shell strings:

  ```python
  f"mkdir -p {remote_dir}"
  f"rm -rf {remote_dir}"
  ```

  This is safe in practice because `remote_dir` is always `tempfile.gettempdir() + "/mlps_collector_" + 12-hex-chars` (no spaces or special chars possible). But as a best practice, and to be defensive against non-standard `TMPDIR` settings, these should be shell-quoted:

  ```python
  f"mkdir -p '{remote_dir}'"
  f"rm -rf '{remote_dir}'"
  ```
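`shlex.quote` achieves the same protection without hand-placed quotes, and also survives paths that themselves contain single quotes (a sketch, not the PR's code):

```python
import shlex

# A hostile-but-possible TMPDIR still yields a safe shell command:
remote_dir = "/tmp dir with spaces/mlps_collector_abc123def456"
cmd = f"mkdir -p {shlex.quote(remote_dir)}"
print(cmd)  # mkdir -p '/tmp dir with spaces/mlps_collector_abc123def456'

# Plain paths pass through unquoted:
print(shlex.quote("/tmp/mlps_collector_abc123def456"))  # /tmp/mlps_collector_abc123def456
```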
- Log level discrepancy with PR description — The staging progress is logged at `DEBUG` level:

  ```python
  self.logger.debug(f"Staging collector script to {len(remote_hosts_to_stage)} remote host(s)...")
  self.logger.debug(f"Collector script staged on {host}:{remote_dir}")
  ```

  But the PR's smoke test output shows these as `INFO:`. Users running at the default INFO level will not see staging progress, which is valuable operational feedback. Consider promoting the "Staging to N host(s)" and "Successfully staged to X" messages to `logger.info()`.
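The X11-silencing pattern reviewed above can be sketched as a small env-builder (the name `build_mpi_env` is an assumption for illustration, not the PR's actual function):

```python
import os

def build_mpi_env(base_env=None):
    # Copy the environment, drop X11 state so remote ranks don't try to
    # open a display, and install a non-forwarding SSH launcher unless
    # the user already configured one.
    env = dict(os.environ if base_env is None else base_env)
    env.pop("DISPLAY", None)
    env.pop("XAUTHORITY", None)
    env.setdefault(
        "PLM_RSH_AGENT",
        "ssh -o ForwardX11=no -o StrictHostKeyChecking=accept-new",
    )
    return env
```

Because `.setdefault` never overwrites, a user-supplied `PLM_RSH_AGENT` (e.g. one pinning a specific SSH key) survives intact.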
✅ mlpstorage_py/common_args.py
Documentation-only changes; the help text additions accurately describe the accepted input formats and the `--hosts=h1 h2 h3` argparse pitfall. Correct.
✅ Tests
The test coverage is comprehensive. The regression tests in test_mpi_cluster_collector_issue_303.py cover the right paths: normal staging, localhost-only skip, shared_tmp_dir fast path, staging failure with host name in error, X11 env injection, and preservation of user-defined PLM_RSH_AGENT. The update_args test cases in test_cli.py cover all input forms from the issue. The existing test updated for the UUID tempdir layout is correctly adapted.
Summary
| Area | Status | Notes |
|---|---|---|
| `mlpstorage` bash wrapper | ✅ Correct | `"$@"`, `--`, `set -euo pipefail` all correct |
| `--hosts` normalization (`update_args`) | ✅ Correct | Minor: use `is None` instead of truthiness for `num_client_hosts` |
| SSH whitespace guard (`validators.py`) | ✅ Correct | Consider a comment explaining it's defensive |
| UUID-based working dir for MPI | ✅ Correct | Solves the cross-node `/tmp` isolation problem |
| SCP staging + parallel cleanup | ✅ Correct | Minor: quote `remote_dir` in inline shell commands |
| X11 env silencing | ✅ Correct | `setdefault` correctly respects user config |
| Staging log level | Suggestion | Promote to `logger.info()` so users see progress at default log level |
| `num_client_hosts` re-derive condition | Suggestion | Use `is None` for explicit intent |
| Test coverage | ✅ Comprehensive | 18 new cases, all key paths covered |
The fixes are logically correct and complete. The two items flagged (log level and is None check) are minor but worth addressing before merge.
Thank you @russfellows and @FileSystemGuy! Addressing the comments.
Excerpt from the diff under discussion:

```python
finally:
    # Best-effort cleanup on remote hosts and the launch host.
    if use_staging and remote_hosts_to_stage:
        self._cleanup_remote_staging(
```
I strongly suggest we avoid cleaning up the remote storage. I personally see the `rm -rf` command as something we should strongly avoid triggering in a programmatic way.
I suggest two options:

- We avoid removing these files, possibly using the results directory as the staging directory instead of the work directory.
- We remove the individual files explicitly, without recursion, and then delete the directory with `rmdir`.

But I prefer 1.
Please review: @russfellows @wolfgang-desalvador
@FileSystemGuy @russfellows @wolfgang-desalvador
Please review: @wolfgang-desalvador

Done, thank you @idevasena for removing rm!
Fixes #322
Fixes #303
Details
These two bugs are separately reported. The parsing bug (#322) prevents users from getting past argparse with any quoted form of `--hosts`. Once that's fixed, the very next blocker on a real multi-host run is #303 — the MPI cluster collector writes its helper script into a launch-host-only `tempfile.TemporaryDirectory()` and then asks `mpirun` to execute it on every node, which fails with `[Errno 2] No such file or directory` on clusters with node-local `/tmp` (i.e. most of them).

Fixing #322 alone would leave the user staring at cryptic `mpirun` errors and "Authorization required" noise from the collector. Fixing #303 alone is unreachable until the parsing works. Keeping them together lets us actually complete a multi-host `mlpstorage training datasize` or `datagen` run end-to-end.

Both fixes are separable and could be reverted independently if needed.
Summary of changes
Part A — `--hosts` parsing (#322)

The original `update_args` only split on commas, so:

| Input form | Before | After |
|---|---|---|
| `--hosts h1 h2 h3` | `['h1', 'h2', 'h3']` ✅ | `['h1', 'h2', 'h3']` ✅ |
| `--hosts h1,h2,h3` | `['h1', 'h2', 'h3']` ✅ | `['h1', 'h2', 'h3']` ✅ |
| `--hosts 'h1 h2 h3'` | `['h1 h2 h3']` ❌ (fails SSH) | `['h1', 'h2', 'h3']` ✅ |
| `--hosts='h1 h2 h3'` | `['h1 h2 h3']` ❌ (fails SSH) | `['h1', 'h2', 'h3']` ✅ |
| `--hosts='h1,h2,h3'` | `['h1', 'h2', 'h3']` ✅ | `['h1', 'h2', 'h3']` ✅ |

(`--hosts=h1 h2 h3` with no quotes remains an argparse pitfall; it is documented in the help text.)

Also fixes a silent latent bug: `num_client_hosts` was never auto-derived from `len(hosts)` because argparse sets the attribute to `None` (not absent), so `hasattr(...)` always returned True. Switched to `getattr(..., None)`.
print(f'Hosts is: {args.hosts}')debug statements that were appearing in every multi-host run.Wrapper hardening: the
mlpstoragebash wrapper used unquoted$*, which silently (1) re-word-splits any argument containing whitespace and (2) glob-expands patterns in--paramsagainst$CWD. Switched to"$@", added the--separator souv runcannot intercept flags meant for mlpstorage, and pinned--projectto the wrapper's own directory. This closes a previously unreported data-integrity issue where--params 'dataset.data_folder=/data/*'would get expanded by the shell before reaching mlpstorage.SSH pre-validation:
SSH pre-validation: `validate_ssh_connectivity` now rejects host tokens containing whitespace with a message that points at the argparse pitfall directly, rather than letting users hit a cryptic `ssh: Could not resolve hostname 'h1 h2'`.

Part B — MPI cluster collector staging (#303)
`MPIClusterCollector` creates its helper script (`mlps_collector.py`) under `tempfile.TemporaryDirectory()` on the launch host only, then invokes `mpirun` with that absolute path. On clusters with node-local `/tmp` (HPC, Kubernetes, bare-metal without a shared scratch FS), the remote ranks cannot find the script and `mpirun` aborts with `[Errno 2] No such file or directory`. The collector's fallback logic treats this as "collection failed" and reports local-only data — so multi-node runs silently characterise one node and claim it represents the whole cluster.

This change:
- Stages the collector script on every remote host at the same UUID-named path before invoking `mpirun`; cleans up via `ssh rm -rf` in a `finally` block.
- Adds a `shared_tmp_dir` kwarg for clusters that do have a shared scratch FS (skips staging entirely).
- Sets `PLM_RSH_AGENT="ssh -o ForwardX11=no -o StrictHostKeyChecking=accept-new"` and strips `DISPLAY`/`XAUTHORITY` from the subprocess env, silencing the misleading `Authorization required, but no authorization protocol specified` stderr noise.
- Raises a clear error naming the specific failing host when staging for `mpirun` fails.
- Passes `ssh_username` through to mirror `SSHClusterCollector`.
- Uses `.setdefault("PLM_RSH_AGENT", ...)` rather than overwriting, so users with a custom `PLM_RSH_AGENT` (e.g. a specific SSH key) are respected.

Files changed
Verification
Unit tests
Part A (#322): 10 new cases in `TestUpdateArgs` exercising every input form from the issue (including the three failing samples), slot preservation (`h1:2 h2:4`), mixed separators, and the empty-after-parse exit path. 2 new cases in `TestValidateSshConnectivity` confirming whitespace tokens are rejected without invoking `ssh`.

Part B (#303): 6 new cases in `test_mpi_cluster_collector_issue_303.py`: remote-host staging, localhost-only skipping staging, `shared_tmp_dir` fast path, staging-failure error message, `PLM_RSH_AGENT` injection, preservation of user-provided `PLM_RSH_AGENT`. One existing test updated for the new UUID-named tempdir layout.

Full suite: 68/68 passing locally.
Multi-host smoke test (issue reproducer)
On a two-host cluster (`localhost` + `10.10.40.211`), running all four `--hosts` forms from the issue:

All four produced identical output.

And the `metadata.json` records the final parsed host list from Form C as `["localhost", "10.10.40.211"]` — proving the quoted-string form normalized to two distinct hosts.
Before (from the smoke test prior to applying Part B, on the same two hosts):
After (with Part B applied, same invocation):
Tests
All four `--hosts` forms produce identical output