
Fix multi-host execution: --hosts parsing (#322) and MPI cluster collector staging (#303)#347

Merged
FileSystemGuy merged 2 commits into mlcommons:main from idevasena:fix/issue-322-multi-host-parsing
Apr 24, 2026

Conversation

@idevasena
Contributor

Fixes #322
Fixes #303

Details

These two bugs are separately reported. The parsing bug (#322) prevents users from getting past argparse with any quoted form of --hosts. Once that's fixed, the very next blocker on a real multi-host run is #303 — the MPI cluster collector writes its helper script into a launch-host-only tempfile.TemporaryDirectory() and then asks mpirun to execute it on every node, which fails with [Errno 2] No such file or directory on clusters with node-local /tmp (i.e. most of them).

Fixing #322 alone would leave the user staring at cryptic mpirun errors and "Authorization required" noise from the collector. Fixing #303 alone is unreachable until the parsing works. Keeping them together lets us actually complete a multi-host mlpstorage training datasize or datagen run end-to-end.

Both fixes are separable and could be reverted independently if needed.

Summary of changes

Part A — --hosts parsing (#322)

The original update_args only split on commas, so:

User input                          Before                          After
--hosts h1 h2 h3                    ['h1', 'h2', 'h3']              ['h1', 'h2', 'h3']
--hosts h1,h2,h3                    ['h1', 'h2', 'h3']              ['h1', 'h2', 'h3']
--hosts 'h1 h2 h3'                  ['h1 h2 h3'] ❌ (fails SSH)     ['h1', 'h2', 'h3']
--hosts='h1 h2 h3'                  ['h1 h2 h3'] ❌ (fails SSH)     ['h1', 'h2', 'h3']
--hosts='h1,h2,h3'                  ['h1', 'h2', 'h3']              ['h1', 'h2', 'h3']
--hosts=h1 h2 h3 (no quotes)        argparse error                  argparse error (unfixable; documented)

Also fixes a silent latent bug: num_client_hosts was never auto-derived from len(hosts) because argparse sets the attribute to None (not absent), so hasattr(...) always returned True. Switched to getattr(..., None).
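The pitfall can be reproduced in a few lines. This is a minimal illustrative sketch (the flag names mirror the mlpstorage CLI, but this is not the project's actual parser code):

```python
import argparse

# argparse sets unset optional arguments to None rather than leaving the
# attribute absent, so hasattr() is True even when the flag was never passed.
parser = argparse.ArgumentParser()
parser.add_argument('--hosts', nargs='+')
parser.add_argument('--num-client-hosts', type=int)

args = parser.parse_args(['--hosts', 'h1', 'h2'])

print(hasattr(args, 'num_client_hosts'))        # True: attribute exists, value is None
print(getattr(args, 'num_client_hosts', None))  # None: this is what must be tested

# Corrected derivation: only fill in the count when it is genuinely absent.
if getattr(args, 'num_client_hosts', None) is None:
    args.num_client_hosts = len(args.hosts)

print(args.num_client_hosts)  # 2
```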

Also removes two leaked print(f'Hosts is: {args.hosts}') debug statements that were appearing in every multi-host run.

Wrapper hardening: the mlpstorage bash wrapper used unquoted $*, which silently (1) re-word-splits any argument containing whitespace and (2) glob-expands patterns in --params against $CWD. Switched to "$@", added the -- separator so uv run cannot intercept flags meant for mlpstorage, and pinned --project to the wrapper's own directory. This closes a previously unreported data-integrity issue where --params 'dataset.data_folder=/data/*' would get expanded by the shell before reaching mlpstorage.

SSH pre-validation: validate_ssh_connectivity now rejects host tokens containing whitespace with a message that points at the argparse pitfall directly, rather than letting users hit a cryptic ssh: Could not resolve hostname 'h1 h2'.
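The guard's intent can be sketched as follows (check_host_tokens is an illustrative stand-in; the real function is validate_ssh_connectivity, whose signature and wording may differ):

```python
# Reject whitespace-bearing host tokens before any ssh is attempted, pointing
# the user at the argparse pitfall instead of a DNS-resolution error.
def check_host_tokens(hosts):
    for host in hosts:
        if any(ch.isspace() for ch in host):
            raise ValueError(
                f"Host token {host!r} contains whitespace. "
                "This usually means a quoted --hosts value was not split; "
                "use --hosts h1 h2 or --hosts h1,h2."
            )

check_host_tokens(['h1', 'h2'])    # passes silently
# check_host_tokens(['h1 h2'])     # would raise ValueError before ssh runs
```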

Part B — MPI cluster collector staging (#303)

MPIClusterCollector creates its helper script (mlps_collector.py) under tempfile.TemporaryDirectory() on the launch host only, then invokes mpirun with that absolute path. On clusters with node-local /tmp (HPC, Kubernetes, bare-metal without a shared scratch FS), the remote ranks cannot find the script and mpirun aborts with [Errno 2] No such file or directory. The collector's fallback logic treats this as "collection failed" and reports local-only data — so multi-node runs silently characterise one node and claim it represents the whole cluster.

This change:

  • Creates a UUID-named working directory whose absolute path is identical on every node.
  • By default, SCPs the helper script to each remote host in parallel before running mpirun; cleans up via ssh rm -rf in a finally block.
  • Adds an opt-in shared_tmp_dir kwarg for clusters that do have a shared scratch FS (skips staging entirely).
  • Injects PLM_RSH_AGENT="ssh -o ForwardX11=no -o StrictHostKeyChecking=accept-new" and strips DISPLAY / XAUTHORITY from the subprocess env, silencing the misleading Authorization required, but no authorization protocol specified stderr noise.
  • Improves error messages when staging or mpirun fails, naming the specific failing host.
  • Plumbs ssh_username through to mirror SSHClusterCollector.
  • Uses .setdefault("PLM_RSH_AGENT", ...) rather than overwriting, so users with a custom PLM_RSH_AGENT (e.g. a specific SSH key) are respected.
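The staging and environment-hygiene steps above can be sketched as follows. This is an illustrative approximation assuming plain ssh/scp are available; stage_script and its signature are inventions for the sketch, not the collector's actual API:

```python
import os
import subprocess
import tempfile
import uuid
from concurrent.futures import ThreadPoolExecutor

def stage_script(script_path, remote_hosts, ssh_user=None):
    # Same absolute path on every node, made unique per run.
    staging_dir = os.path.join(tempfile.gettempdir(),
                               f"mlps_collector_{uuid.uuid4().hex[:12]}")

    def stage(host):
        target = f"{ssh_user}@{host}" if ssh_user else host
        subprocess.run(["ssh", target, f"mkdir -p '{staging_dir}'"], check=True)
        subprocess.run(["scp", script_path, f"{target}:{staging_dir}/"], check=True)

    # Parallel staging; list() forces iteration so the first failure surfaces.
    with ThreadPoolExecutor() as pool:
        list(pool.map(stage, remote_hosts))
    return staging_dir

# Environment hygiene: respect a user-supplied PLM_RSH_AGENT, strip X11 vars
# so mpirun does not emit "Authorization required" noise on stderr.
env = os.environ.copy()
env.setdefault("PLM_RSH_AGENT",
               "ssh -o ForwardX11=no -o StrictHostKeyChecking=accept-new")
env.pop("DISPLAY", None)
env.pop("XAUTHORITY", None)
```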

Files changed

mlpstorage                                                 |  14 ++-
mlpstorage_py/cli/common_args.py                           |   8 +-
mlpstorage_py/cli_parser.py                                |  32 ++-
mlpstorage_py/cluster_collector.py                         | 317 ++++++++++---
mlpstorage_py/environment/validators.py                    |  16 +
tests/unit/test_cli.py                                     | 110 +++++
tests/unit/test_environment.py                             |  25 ++
mlpstorage_py/tests/test_cluster_collector.py              |  15 +-
mlpstorage_py/tests/test_mpi_cluster_collector_issue_303.py| 248 +++++++++++

Verification

Unit tests

Part A (#322): 10 new cases in TestUpdateArgs exercising every input form from the issue (including the three failing samples), slot preservation (h1:2 h2:4), mixed separators, and the empty-after-parse exit path. 2 new cases in TestValidateSshConnectivity confirming whitespace tokens are rejected without invoking ssh.

Part B (#303): 6 new cases in test_mpi_cluster_collector_issue_303.py: remote-host staging, localhost-only skipping staging, shared_tmp_dir fast path, staging-failure error message, PLM_RSH_AGENT injection, preservation of user-provided PLM_RSH_AGENT. One existing test updated for the new UUID-named tempdir layout.

Full suite: 68/68 passing locally.

Multi-host smoke test (issue reproducer)

On a two-host cluster (localhost + 10.10.40.211), running all four --hosts forms from the issue:

# Form A — canonical (was already working)
mlpstorage training datasize --model unet3d \
    --hosts localhost 10.10.40.211 \
    --client-host-memory-in-gb 247 --max-accelerators 2 \
    --num-client-hosts 2 --accelerator-type a100 --file \
    --results-dir /tmp/smoke-A

# Form B — comma-separated (was already working)
mlpstorage training datasize --model unet3d --hosts localhost,10.10.40.211 ...

# Form C — quoted string (issue #322 Sample 3, PREVIOUSLY FAILED)
mlpstorage training datasize --model unet3d --hosts 'localhost 10.10.40.211' ...

# Form D — equals + quoted (issue #322 Sample 2, PREVIOUSLY FAILED)
mlpstorage training datasize --model unet3d --hosts='localhost 10.10.40.211' ...

All four produced identical output:

2026-04-22 22:09:39 | INFO: Environment validation passed
2026-04-22 22:09:49 | RESULT: Number of training files: 18090
2026-04-22 22:09:49 | RESULT: Total disk space required for training: 2469.87GiB
2026-04-22 22:09:49 | RESULT: Run the following command to generate data:
  mlpstorage training datagen --hosts=localhost,10.10.40.211 --model=unet3d ...

And the metadata.json records the final parsed host list from Form C as ["localhost", "10.10.40.211"] — proving the quoted-string form normalized to two distinct hosts.

Before / after evidence for #303

Before (from the smoke test prior to applying Part B, on the same two hosts):

WARNING: MPI collection failed: ... stderr: Authorization required, but no
  authorization protocol specified
python3: can't open file '/tmp/tmpjkay13x0/mlps_collector.py':
  [Errno 2] No such file or directory
mpirun detected that one or more processes exited with non-zero status
INFO: Falling back to local-only collection

After (with Part B applied, same invocation):

INFO: Staging mlps_collector.py to 1 remote host(s)...
INFO: Successfully staged to 10.10.40.211
INFO: Running MPI collection across 2 host(s)
INFO: MPI collection completed successfully (2 hosts reported)
INFO: Cleaning up staged files on remote hosts

Tests

…mlcommons#303): stage MPI collector script on remote hosts + regression tests for update_args, ssh validator and for MPI collector staging
@idevasena idevasena requested a review from a team April 23, 2026 04:13
@github-actions

github-actions Bot commented Apr 23, 2026

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

Contributor

@russfellows russfellows left a comment


Here is a thorough review of PR #347.


PR #347 Review — Multi-host Execution Fixes (#322 + #303)

The PR is logically well-structured and the changes are technically sound overall. Both bugs are real, the root causes are correctly diagnosed, and the fixes use appropriate approaches. Below is a file-by-file analysis.


mlpstorage (bash wrapper)

The change from $* to "$@" is correct and important — $* collapses all arguments into a single word-split string, while "$@" preserves each argument as a separate word. Adding set -euo pipefail and the -- separator, along with pinning --project to $SCRIPT_DIR, are all solid hardening improvements.


mlpstorage_py/cli_parser.py (update_args)

The host normalization using re.split(r'[,\s]+', item.strip()) is the correct approach; it handles all documented input forms (space-separated list, comma-separated, quoted strings with spaces, mixed). Removing the two leaked print() debug statements is a necessary cleanup.
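The normalization can be demonstrated end-to-end in a small sketch (normalize_hosts is an illustrative name, not the project's function):

```python
import re

# Each argv item may itself contain commas and/or spaces (quoted forms),
# so re-split every item on [,\s]+ and flatten the results.
def normalize_hosts(raw_items):
    hosts = []
    for item in raw_items:
        hosts.extend(tok for tok in re.split(r'[,\s]+', item.strip()) if tok)
    return hosts

assert normalize_hosts(['h1', 'h2', 'h3']) == ['h1', 'h2', 'h3']  # space-separated argv
assert normalize_hosts(['h1,h2,h3'])   == ['h1', 'h2', 'h3']      # comma-separated
assert normalize_hosts(['h1 h2 h3'])   == ['h1', 'h2', 'h3']      # quoted string
assert normalize_hosts(['h1, h2 ,h3']) == ['h1', 'h2', 'h3']      # mixed separators
```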

One minor concern:

# Line: if hasattr(args, 'hosts') and not getattr(args, 'num_client_hosts', None):

The not getattr(...) test is falsy for 0 as well as None. If a user somehow passes --num-client-hosts 0, this would silently re-derive it from len(hosts). A stricter is None check would be more correct:

# Suggested:
if hasattr(args, 'hosts') and getattr(args, 'num_client_hosts', None) is None:

mlpstorage_py/environment/validators.py

The whitespace-in-host-token guard is a correct defensive layer. One thing to note: in normal flow, update_args (called earlier) will have already split such tokens, so this guard is belt-and-suspenders. That's fine, but a comment noting "this should never fire in normal flow — see update_args" would help future maintainers understand the intent.


mlpstorage_py/cluster_collector.py

The core fix for #303 is architecturally correct:

  • UUID-based working dir — using uuid.uuid4().hex[:12] to create a directory with the same absolute path on every node (via ssh mkdir -p) is the right solution. shutil is already imported on line 11, so shutil.rmtree in the finally block is fine.
  • Parallel SCP staging — using ThreadPoolExecutor with per-host error capture is clean and scalable.
  • X11 silencing — env.pop("DISPLAY", None) + env.setdefault("PLM_RSH_AGENT", ...) is exactly right; .setdefault respects a user's existing PLM_RSH_AGENT.
  • finally cleanup — remote and local cleanup is unconditional, which is correct.

Two concerns:

  1. Unquoted paths in remote shell commands — In _stage_script_on_remote_hosts and _cleanup_remote_staging, paths are embedded directly into shell strings:

    f"mkdir -p {remote_dir}"
    f"rm -rf {remote_dir}"

    This is safe in practice because remote_dir is always tempfile.gettempdir() + "/mlps_collector_" + 12-hex-chars (no spaces or special chars possible). But as a best practice and to be defensive against non-standard TMPDIR settings, these should be shell-quoted:

    f"mkdir -p '{remote_dir}'"
    f"rm -rf '{remote_dir}'"
  2. Log level discrepancy with PR description — The staging progress is logged at DEBUG level:

    self.logger.debug(f"Staging collector script to {len(remote_hosts_to_stage)} remote host(s)...")
    self.logger.debug(f"Collector script staged on {host}:{remote_dir}")

    But the PR's smoke test output shows these as INFO:. Users running at the default INFO level will not see staging progress, which is valuable operational feedback. Consider promoting the "Staging to N host(s)" and "Successfully staged to X" messages to logger.info().
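On point 1, a variant of the quoting suggestion, assuming Python's standard library is acceptable here: shlex.quote handles arbitrary TMPDIR values, including ones containing single quotes, where hand-written '...' quoting would break:

```python
import shlex

# Safe characters pass through unchanged.
remote_dir = "/tmp/mlps_collector_a1b2c3d4e5f6"
print(f"mkdir -p {shlex.quote(remote_dir)}")  # mkdir -p /tmp/mlps_collector_a1b2c3d4e5f6

# Anything else is quoted correctly for the remote shell, even embedded quotes.
weird = "/odd tmp/dir's path"
print(f"rm -rf {shlex.quote(weird)}")
```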


mlpstorage_py/cli/common_args.py

Documentation-only changes; the help text additions accurately describe the accepted input formats and the --hosts=h1 h2 argparse pitfall. Correct.


✅ Tests

The test coverage is comprehensive. The regression tests in test_mpi_cluster_collector_issue_303.py cover the right paths: normal staging, localhost-only skip, shared_tmp_dir fast path, staging failure with host name in error, X11 env injection, and preservation of user-defined PLM_RSH_AGENT. The update_args test cases in test_cli.py cover all input forms from the issue. The existing test updated for the UUID tempdir layout is correctly adapted.


Summary

Area                                   Status            Notes
mlpstorage bash wrapper                ✅ Correct        "$@", --, set -euo pipefail all correct
--hosts normalization (update_args)    ✅ Correct        Minor: use is None instead of truthiness for num_client_hosts
SSH whitespace guard (validators.py)   ✅ Correct        Consider a comment explaining it's defensive
UUID-based working dir for MPI         ✅ Correct        Solves the cross-node /tmp isolation problem
SCP staging + parallel cleanup         ✅ Correct        Minor: quote remote_dir in inline shell commands
X11 env silencing                      ✅ Correct        setdefault correctly respects user config
Staging log level                      ⚠️ Concern        Promote to logger.info() so users see progress at default log level
num_client_hosts re-derive condition   ⚠️ Minor          Use is None for explicit intent
Test coverage                          ✅ Comprehensive  18 new cases, all key paths covered

The fixes are logically correct and complete. The two items flagged (log level and is None check) are minor but worth addressing before merge.

FileSystemGuy
FileSystemGuy previously approved these changes Apr 23, 2026
@idevasena
Contributor Author

Thank you @russfellows and @FileSystemGuy! Addressing the comments.

Comment thread mlpstorage_py/cluster_collector.py Outdated
finally:
# Best-effort cleanup on remote hosts and the launch host.
if use_staging and remote_hosts_to_stage:
self._cleanup_remote_staging(

@wolfgang-desalvador wolfgang-desalvador Apr 23, 2026


I strongly suggest we avoid cleaning up the remote storage. I personally see the rm -rf command as something we should strongly avoid triggering in a programmatic way.

I suggest two options:

  • We avoid removing these files, possibly using the results directory as the staging directory instead of the work directory
  • We remove the individual files explicitly, without recursion, then delete the directory with rmdir

But I prefer option 1.

@FileSystemGuy
Contributor

Please review: @russfellows @wolfgang-desalvador

@idevasena
Contributor Author

@FileSystemGuy @russfellows @wolfgang-desalvador
Pushed 1b7ce1f addressing both reviews.

cluster_collector.py:

  • Stage collector script under <results_dir>/collector-staging/ instead of tempfile-based per-node working dir
  • Remove remote cleanup entirely: no rm -rf, no rmdir, no rmtree over SSH. Staged script persists as a run artifact for post-mortem.
  • Rename shared_tmp_dir kwarg to shared_staging_dir
  • Quote staging_dir in remaining mkdir -p shell commands

  • Promote staging progress logs to INFO so multi-host orchestration is visible at default log level
  • Log the absolute staged-script path for debuggability

cli_parser.py:

  • Use is None instead of truthy test when re-deriving num_client_hosts, so explicit --num-client-hosts 0 is preserved

validators.py:

  • Document whitespace-in-host guard as belt-and-suspenders defense for direct API callers that bypass update_args

tests:

  • Rework test_mpi_cluster_collector_issue_303.py for results-dir layout
  • Drop cleanup-path tests (no longer applicable)
  • Add persistence, rerun-idempotency, absolute-path, and missing-results-dir assertions
  • Add test_num_client_hosts_zero_is_preserved regression for cli_parser is-None fix
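The explicit-zero behaviour the new regression test guards can be illustrated with a hedged sketch (derive_num_client_hosts is an illustrative stand-in for the fixed logic in cli_parser.py, not the project's function):

```python
# Mirrors the corrected condition: derive the count from the host list only
# when the user never supplied one (None), preserving an explicit 0.
def derive_num_client_hosts(num_client_hosts, hosts):
    return len(hosts) if num_client_hosts is None else num_client_hosts

assert derive_num_client_hosts(None, ['h1', 'h2']) == 2  # derived from hosts
assert derive_num_client_hosts(0, ['h1', 'h2']) == 0     # explicit 0 preserved
# The old truthy test (`if not num_client_hosts`) would have overwritten 0.
```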

@FileSystemGuy
Contributor

Please review: @wolfgang-desalvador

@wolfgang-desalvador

> Please review: @wolfgang-desalvador

Done, thank you @idevasena for removing rm!

@FileSystemGuy FileSystemGuy merged commit ee70bf3 into mlcommons:main Apr 24, 2026
2 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators Apr 24, 2026