
Fix multi-host execution: --hosts parsing (#322) and MPI cluster collector staging (#303)#347

Merged
FileSystemGuy merged 2 commits into mlcommons:main from idevasena:fix/issue-322-multi-host-parsing
Apr 24, 2026

Conversation

@idevasena
Contributor

Fixes #322
Fixes #303

Details

These two bugs are separately reported. The parsing bug (#322) prevents users from getting past argparse with any quoted form of --hosts. Once that's fixed, the very next blocker on a real multi-host run is #303 — the MPI cluster collector writes its helper script into a launch-host-only tempfile.TemporaryDirectory() and then asks mpirun to execute it on every node, which fails with [Errno 2] No such file or directory on clusters with node-local /tmp (i.e. most of them).

Fixing #322 alone would leave the user staring at cryptic mpirun errors and "Authorization required" noise from the collector. Fixing #303 alone is unreachable until the parsing works. Keeping them together lets us actually complete a multi-host mlpstorage training datasize or datagen run end-to-end.

Both fixes are separable and could be reverted independently if needed.

Summary of changes

Part A — --hosts parsing (#322)

The original update_args only split on commas, so:

User input                          Before                          After
--hosts h1 h2 h3                    ['h1', 'h2', 'h3']              ['h1', 'h2', 'h3']
--hosts h1,h2,h3                    ['h1', 'h2', 'h3']              ['h1', 'h2', 'h3']
--hosts 'h1 h2 h3'                  ['h1 h2 h3'] ❌ (fails SSH)     ['h1', 'h2', 'h3']
--hosts='h1 h2 h3'                  ['h1 h2 h3'] ❌ (fails SSH)     ['h1', 'h2', 'h3']
--hosts='h1,h2,h3'                  ['h1', 'h2', 'h3']              ['h1', 'h2', 'h3']
--hosts=h1 h2 h3 (no quotes)        argparse error                  argparse error (unfixable; documented)

Also fixes a silent latent bug: num_client_hosts was never auto-derived from len(hosts) because argparse sets the attribute to None (not absent), so hasattr(...) always returned True. Switched to getattr(..., None).
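The pitfall can be reproduced in a few lines. This is a minimal illustrative sketch (the flag names mirror the mlpstorage CLI, but this is not the project's actual parser code):

```python
import argparse

# argparse sets unset optional arguments to None rather than leaving the
# attribute absent, so hasattr() is True even when the flag was never passed.
parser = argparse.ArgumentParser()
parser.add_argument('--hosts', nargs='+')
parser.add_argument('--num-client-hosts', type=int)

args = parser.parse_args(['--hosts', 'h1', 'h2'])

print(hasattr(args, 'num_client_hosts'))        # True: attribute exists, value is None
print(getattr(args, 'num_client_hosts', None))  # None: this is what must be tested

# Corrected derivation: only fill in the count when it is genuinely absent.
if getattr(args, 'num_client_hosts', None) is None:
    args.num_client_hosts = len(args.hosts)

print(args.num_client_hosts)  # 2
```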

Also removes two leaked print(f'Hosts is: {args.hosts}') debug statements that were appearing in every multi-host run.

Wrapper hardening: the mlpstorage bash wrapper used unquoted $*, which silently (1) re-word-splits any argument containing whitespace and (2) glob-expands patterns in --params against $CWD. Switched to "$@", added the -- separator so uv run cannot intercept flags meant for mlpstorage, and pinned --project to the wrapper's own directory. This closes a previously unreported data-integrity issue where --params 'dataset.data_folder=/data/*' would get expanded by the shell before reaching mlpstorage.

SSH pre-validation: validate_ssh_connectivity now rejects host tokens containing whitespace with a message that points at the argparse pitfall directly, rather than letting users hit a cryptic ssh: Could not resolve hostname 'h1 h2'.
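The guard's intent can be sketched as follows (check_host_tokens is an illustrative stand-in; the real function is validate_ssh_connectivity, whose signature and wording may differ):

```python
# Reject whitespace-bearing host tokens before any ssh is attempted, pointing
# the user at the argparse pitfall instead of a DNS-resolution error.
def check_host_tokens(hosts):
    for host in hosts:
        if any(ch.isspace() for ch in host):
            raise ValueError(
                f"Host token {host!r} contains whitespace. "
                "This usually means a quoted --hosts value was not split; "
                "use --hosts h1 h2 or --hosts h1,h2."
            )

check_host_tokens(['h1', 'h2'])    # passes silently
# check_host_tokens(['h1 h2'])     # would raise ValueError before ssh runs
```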

Part B — MPI cluster collector staging (#303)

MPIClusterCollector creates its helper script (mlps_collector.py) under tempfile.TemporaryDirectory() on the launch host only, then invokes mpirun with that absolute path. On clusters with node-local /tmp (HPC, Kubernetes, bare-metal without a shared scratch FS), the remote ranks cannot find the script and mpirun aborts with [Errno 2] No such file or directory. The collector's fallback logic treats this as "collection failed" and reports local-only data — so multi-node runs silently characterise one node and claim it represents the whole cluster.

This change:

  • Creates a UUID-named working directory whose absolute path is identical on every node.
  • By default, SCPs the helper script to each remote host in parallel before running mpirun; cleans up via ssh rm -rf in a finally block.
  • Adds an opt-in shared_tmp_dir kwarg for clusters that do have a shared scratch FS (skips staging entirely).
  • Injects PLM_RSH_AGENT="ssh -o ForwardX11=no -o StrictHostKeyChecking=accept-new" and strips DISPLAY / XAUTHORITY from the subprocess env, silencing the misleading Authorization required, but no authorization protocol specified stderr noise.
  • Improves error messages when staging or mpirun fails, naming the specific failing host.
  • Plumbs ssh_username through to mirror SSHClusterCollector.
  • Uses .setdefault("PLM_RSH_AGENT", ...) rather than overwriting, so users with a custom PLM_RSH_AGENT (e.g. a specific SSH key) are respected.
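The staging and environment-hygiene steps above can be sketched as follows. This is an illustrative approximation assuming plain ssh/scp are available; stage_script and its signature are inventions for the sketch, not the collector's actual API:

```python
import os
import subprocess
import tempfile
import uuid
from concurrent.futures import ThreadPoolExecutor

def stage_script(script_path, remote_hosts, ssh_user=None):
    # Same absolute path on every node, made unique per run.
    staging_dir = os.path.join(tempfile.gettempdir(),
                               f"mlps_collector_{uuid.uuid4().hex[:12]}")

    def stage(host):
        target = f"{ssh_user}@{host}" if ssh_user else host
        subprocess.run(["ssh", target, f"mkdir -p '{staging_dir}'"], check=True)
        subprocess.run(["scp", script_path, f"{target}:{staging_dir}/"], check=True)

    # Parallel staging; list() forces iteration so the first failure surfaces.
    with ThreadPoolExecutor() as pool:
        list(pool.map(stage, remote_hosts))
    return staging_dir

# Environment hygiene: respect a user-supplied PLM_RSH_AGENT, strip X11 vars
# so mpirun does not emit "Authorization required" noise on stderr.
env = os.environ.copy()
env.setdefault("PLM_RSH_AGENT",
               "ssh -o ForwardX11=no -o StrictHostKeyChecking=accept-new")
env.pop("DISPLAY", None)
env.pop("XAUTHORITY", None)
```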

Files changed

mlpstorage                                                 |  14 ++-
mlpstorage_py/cli/common_args.py                           |   8 +-
mlpstorage_py/cli_parser.py                                |  32 ++-
mlpstorage_py/cluster_collector.py                         | 317 ++++++++++---
mlpstorage_py/environment/validators.py                    |  16 +
tests/unit/test_cli.py                                     | 110 +++++
tests/unit/test_environment.py                             |  25 ++
mlpstorage_py/tests/test_cluster_collector.py              |  15 +-
mlpstorage_py/tests/test_mpi_cluster_collector_issue_303.py| 248 +++++++++++

Verification

Unit tests

Part A (#322): 10 new cases in TestUpdateArgs exercising every input form from the issue (including the three failing samples), slot preservation (h1:2 h2:4), mixed separators, and the empty-after-parse exit path. 2 new cases in TestValidateSshConnectivity confirming whitespace tokens are rejected without invoking ssh.

Part B (#303): 6 new cases in test_mpi_cluster_collector_issue_303.py: remote-host staging, localhost-only skipping staging, shared_tmp_dir fast path, staging-failure error message, PLM_RSH_AGENT injection, preservation of user-provided PLM_RSH_AGENT. One existing test updated for the new UUID-named tempdir layout.

Full suite: 68/68 passing locally.

Multi-host smoke test (issue reproducer)

On a two-host cluster (localhost + 10.10.40.211), running all four --hosts forms from the issue:

# Form A — canonical (was already working)
mlpstorage training datasize --model unet3d \
    --hosts localhost 10.10.40.211 \
    --client-host-memory-in-gb 247 --max-accelerators 2 \
    --num-client-hosts 2 --accelerator-type a100 --file \
    --results-dir /tmp/smoke-A

# Form B — comma-separated (was already working)
mlpstorage training datasize --model unet3d --hosts localhost,10.10.40.211 ...

# Form C — quoted string (issue #322 Sample 3, PREVIOUSLY FAILED)
mlpstorage training datasize --model unet3d --hosts 'localhost 10.10.40.211' ...

# Form D — equals + quoted (issue #322 Sample 2, PREVIOUSLY FAILED)
mlpstorage training datasize --model unet3d --hosts='localhost 10.10.40.211' ...

All four produced identical output:

2026-04-22 22:09:39 | INFO: Environment validation passed
2026-04-22 22:09:49 | RESULT: Number of training files: 18090
2026-04-22 22:09:49 | RESULT: Total disk space required for training: 2469.87GiB
2026-04-22 22:09:49 | RESULT: Run the following command to generate data:
  mlpstorage training datagen --hosts=localhost,10.10.40.211 --model=unet3d ...

And the metadata.json records the final parsed host list from Form C as ["localhost", "10.10.40.211"] — proving the quoted-string form normalized to two distinct hosts.

Before / after evidence for #303

Before (from the smoke test prior to applying Part B, on the same two hosts):

WARNING: MPI collection failed: ... stderr: Authorization required, but no
  authorization protocol specified
python3: can't open file '/tmp/tmpjkay13x0/mlps_collector.py':
  [Errno 2] No such file or directory
mpirun detected that one or more processes exited with non-zero status
INFO: Falling back to local-only collection

After (with Part B applied, same invocation):

INFO: Staging mlps_collector.py to 1 remote host(s)...
INFO: Successfully staged to 10.10.40.211
INFO: Running MPI collection across 2 host(s)
INFO: MPI collection completed successfully (2 hosts reported)
INFO: Cleaning up staged files on remote hosts

Tests

…mlcommons#303): stage MPI collector script on remote hosts + regression tests for update_args, ssh validator and for MPI collector staging
@idevasena idevasena requested a review from a team April 23, 2026 04:13
@github-actions

github-actions Bot commented Apr 23, 2026

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

Contributor

@russfellows russfellows left a comment


Here is a thorough review of PR #347.


PR #347 Review — Multi-host Execution Fixes (#322 + #303)

The PR is logically well-structured and the changes are technically sound overall. Both bugs are real, the root causes are correctly diagnosed, and the fixes use appropriate approaches. Below is a file-by-file analysis.


mlpstorage (bash wrapper)

The change from $* to "$@" is correct and important — $* collapses all arguments into a single word-split string, while "$@" preserves each argument as a separate word. Adding set -euo pipefail and the -- separator, along with pinning --project to $SCRIPT_DIR, are all solid hardening improvements.


mlpstorage_py/cli_parser.py (update_args)

The host normalization using re.split(r'[,\s]+', item.strip()) is the correct approach; it handles all documented input forms (space-separated list, comma-separated, quoted strings with spaces, mixed). Removing the two leaked print() debug statements is a necessary cleanup.
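The normalization can be demonstrated end-to-end in a small sketch (normalize_hosts is an illustrative name, not the project's function):

```python
import re

# Each argv item may itself contain commas and/or spaces (quoted forms),
# so re-split every item on [,\s]+ and flatten the results.
def normalize_hosts(raw_items):
    hosts = []
    for item in raw_items:
        hosts.extend(tok for tok in re.split(r'[,\s]+', item.strip()) if tok)
    return hosts

assert normalize_hosts(['h1', 'h2', 'h3']) == ['h1', 'h2', 'h3']  # space-separated argv
assert normalize_hosts(['h1,h2,h3'])   == ['h1', 'h2', 'h3']      # comma-separated
assert normalize_hosts(['h1 h2 h3'])   == ['h1', 'h2', 'h3']      # quoted string
assert normalize_hosts(['h1, h2 ,h3']) == ['h1', 'h2', 'h3']      # mixed separators
```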

One minor concern:

# Line: if hasattr(args, 'hosts') and not getattr(args, 'num_client_hosts', None):

The not getattr(...) test is falsy for 0 as well as None. If a user somehow passes --num-client-hosts 0, this would silently re-derive it from len(hosts). A stricter is None check would be more correct:

# Suggested:
if hasattr(args, 'hosts') and getattr(args, 'num_client_hosts', None) is None:

mlpstorage_py/environment/validators.py

The whitespace-in-host-token guard is a correct defensive layer. One thing to note: in normal flow, update_args (called earlier) will have already split such tokens, so this guard is belt-and-suspenders. That's fine, but a comment noting "this should never fire in normal flow — see update_args" would help future maintainers understand the intent.


mlpstorage_py/cluster_collector.py

The core fix for #303 is architecturally correct:

  • UUID-based working dir — using uuid.uuid4().hex[:12] to create a directory with the same absolute path on every node (via ssh mkdir -p) is the right solution. shutil is already imported on line 11, so shutil.rmtree in the finally block is fine.
  • Parallel SCP staging — using ThreadPoolExecutor with per-host error capture is clean and scalable.
  • X11 silencing — env.pop("DISPLAY", None) + env.setdefault("PLM_RSH_AGENT", ...) is exactly right; .setdefault respects a user's existing PLM_RSH_AGENT.
  • finally cleanup — remote and local cleanup is unconditional, which is correct.

Two concerns:

  1. Unquoted paths in remote shell commands — In _stage_script_on_remote_hosts and _cleanup_remote_staging, paths are embedded directly into shell strings:

    f"mkdir -p {remote_dir}"
    f"rm -rf {remote_dir}"

    This is safe in practice because remote_dir is always tempfile.gettempdir() + "/mlps_collector_" + 12-hex-chars (no spaces or special chars possible). But as a best practice and to be defensive against non-standard TMPDIR settings, these should be shell-quoted:

    f"mkdir -p '{remote_dir}'"
    f"rm -rf '{remote_dir}'"
  2. Log level discrepancy with PR description — The staging progress is logged at DEBUG level:

    self.logger.debug(f"Staging collector script to {len(remote_hosts_to_stage)} remote host(s)...")
    self.logger.debug(f"Collector script staged on {host}:{remote_dir}")

    But the PR's smoke test output shows these as INFO:. Users running at the default INFO level will not see staging progress, which is valuable operational feedback. Consider promoting the "Staging to N host(s)" and "Successfully staged to X" messages to logger.info().
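On point 1, a variant of the quoting suggestion, assuming Python's standard library is acceptable here: shlex.quote handles arbitrary TMPDIR values, including ones containing single quotes, where hand-written '...' quoting would break:

```python
import shlex

# Safe characters pass through unchanged.
remote_dir = "/tmp/mlps_collector_a1b2c3d4e5f6"
print(f"mkdir -p {shlex.quote(remote_dir)}")  # mkdir -p /tmp/mlps_collector_a1b2c3d4e5f6

# Anything else is quoted correctly for the remote shell, even embedded quotes.
weird = "/odd tmp/dir's path"
print(f"rm -rf {shlex.quote(weird)}")
```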


mlpstorage_py/cli/common_args.py

Documentation-only changes; the help text additions accurately describe the accepted input formats and the --hosts=h1 h2 argparse pitfall. Correct.


✅ Tests

The test coverage is comprehensive. The regression tests in test_mpi_cluster_collector_issue_303.py cover the right paths: normal staging, localhost-only skip, shared_tmp_dir fast path, staging failure with host name in error, X11 env injection, and preservation of user-defined PLM_RSH_AGENT. The update_args test cases in test_cli.py cover all input forms from the issue. The existing test updated for the UUID tempdir layout is correctly adapted.


Summary

Area                                   Status            Notes
mlpstorage bash wrapper                ✅ Correct        "$@", --, set -euo pipefail all correct
--hosts normalization (update_args)    ✅ Correct        Minor: use is None instead of truthiness for num_client_hosts
SSH whitespace guard (validators.py)   ✅ Correct        Consider a comment explaining it's defensive
UUID-based working dir for MPI         ✅ Correct        Solves the cross-node /tmp isolation problem
SCP staging + parallel cleanup         ✅ Correct        Minor: quote remote_dir in inline shell commands
X11 env silencing                      ✅ Correct        setdefault correctly respects user config
Staging log level                      ⚠️ Concern        Promote to logger.info() so users see progress at default log level
num_client_hosts re-derive condition   ⚠️ Minor          Use is None for explicit intent
Test coverage                          ✅ Comprehensive  18 new cases, all key paths covered

The fixes are logically correct and complete. The two items flagged (log level and is None check) are minor but worth addressing before merge.

FileSystemGuy
FileSystemGuy previously approved these changes Apr 23, 2026
@idevasena
Contributor Author

Thank you @russfellows and @FileSystemGuy! Addressing the comments.

Comment thread mlpstorage_py/cluster_collector.py Outdated
finally:
# Best-effort cleanup on remote hosts and the launch host.
if use_staging and remote_hosts_to_stage:
self._cleanup_remote_staging(

@wolfgang-desalvador wolfgang-desalvador Apr 23, 2026


I strongly suggest we avoid cleaning up the remote storage. I personally see the rm -rf command as something we should strongly avoid triggering in a programmatic way.

I suggest two options:

  • We avoid removing these files, possibly using the results directory as the staging directory instead of the work directory
  • We remove the individual files explicitly, without recursion, then delete the directory with rmdir

But I prefer option 1.

@FileSystemGuy
Contributor

Please review: @russfellows @wolfgang-desalvador

@idevasena
Contributor Author

@FileSystemGuy @russfellows @wolfgang-desalvador
Pushed 1b7ce1f addressing both reviews.

cluster_collector.py:

  • Stage collector script under <results_dir>/collector-staging/ instead of tempfile-based per-node working dir
  • Remove remote cleanup entirely: no rm -rf, no rmdir, no rmtree over SSH. Staged script persists as a run artifact for post-mortem.
  • Rename shared_tmp_dir kwarg to shared_staging_dir
  • Quote staging_dir in remaining mkdir -p shell commands

  • Promote staging progress logs to INFO so multi-host orchestration is visible at default log level
  • Log the absolute staged-script path for debuggability

cli_parser.py:

  • Use is None instead of truthy test when re-deriving num_client_hosts, so explicit --num-client-hosts 0 is preserved

validators.py:

  • Document whitespace-in-host guard as belt-and-suspenders defense for direct API callers that bypass update_args

tests:

  • Rework test_mpi_cluster_collector_issue_303.py for results-dir layout
  • Drop cleanup-path tests (no longer applicable)
  • Add persistence, rerun-idempotency, absolute-path, and missing-results-dir assertions
  • Add test_num_client_hosts_zero_is_preserved regression for cli_parser is-None fix
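The explicit-zero behaviour the new regression test guards can be illustrated with a hedged sketch (derive_num_client_hosts is an illustrative stand-in for the fixed logic in cli_parser.py, not the project's function):

```python
# Mirrors the corrected condition: derive the count from the host list only
# when the user never supplied one (None), preserving an explicit 0.
def derive_num_client_hosts(num_client_hosts, hosts):
    return len(hosts) if num_client_hosts is None else num_client_hosts

assert derive_num_client_hosts(None, ['h1', 'h2']) == 2  # derived from hosts
assert derive_num_client_hosts(0, ['h1', 'h2']) == 0     # explicit 0 preserved
# The old truthy test (`if not num_client_hosts`) would have overwritten 0.
```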

@FileSystemGuy
Contributor

Please review: @wolfgang-desalvador

@wolfgang-desalvador

> Please review: @wolfgang-desalvador

Done, thank you @idevasena for removing rm!

@FileSystemGuy FileSystemGuy merged commit ee70bf3 into mlcommons:main Apr 24, 2026
2 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators Apr 24, 2026