
Bug fixes and performance enhancements: object storage, checkpointing, Parquet loading#359

Open
russfellows wants to merge 9 commits into mlcommons:main from russfellows:branch-3-0-1/bug-fixes-perf-enhancements

Conversation

@russfellows
Contributor

Summary

Five independent bug fixes and enhancements across object storage credential loading, results directory configuration, IPC context handling in streaming checkpointing, MPI transport configuration, and Parquet data loader test correctness. Also adds 28 new unit tests, fixes 15 upstream test failures (wrong package name: mlpstorage → mlpstorage_py), and includes a comprehensive object-storage usage guide and dated benchmark results.

Tests: 949 passing, 4 skipped, 0 failures.


Changes

1. Object Storage: .env Credential Loading (mlpstorage_py/benchmarks/dlio.py)

_apply_object_storage_params() now logs the full path of the .env file it loads and raises a FileNotFoundError with an actionable message (listing required env vars and pointing to .env.example) when --object mode is active but no .env is found. Previously failed silently with cryptic downstream auth errors.
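The intended behavior can be sketched as follows. This is a minimal illustration, not the PR's actual implementation: the helper name `load_object_storage_env`, the `REQUIRED_ENV_VARS` list, and the inline KEY=VALUE parser (standing in for python-dotenv's `load_dotenv()`) are all assumptions.

```python
import os
from pathlib import Path

# Illustrative variable names; the real required set lives in dlio.py
REQUIRED_ENV_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_ENDPOINT_URL")

def load_object_storage_env(search_dir=".", logger=None):
    """Load object-storage credentials from <search_dir>/.env.

    Logs the resolved path on success; raises an actionable
    FileNotFoundError when --object mode has no .env to read.
    """
    env_path = Path(search_dir) / ".env"
    if not env_path.is_file():
        raise FileNotFoundError(
            f"--object mode requires a .env file; none found at {env_path.resolve()}. "
            f"Required variables: {', '.join(REQUIRED_ENV_VARS)}. "
            "Copy .env.example and fill in your values."
        )
    if logger is not None:
        logger.info("Loading object-storage credentials from %s", env_path.resolve())
    # Minimal KEY=VALUE parser standing in for python-dotenv's load_dotenv()
    for line in env_path.read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```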

2. Results Directory: Environment Variable Override (mlpstorage_py/config.py, mlpstorage_py/main.py)

DEFAULT_RESULTS_DIR now reads MLPERF_RESULTS_DIR from the environment, falling back to the system temp directory. A WARNING-level message is emitted at startup when results are going to the temp dir and MLPERF_RESULTS_DIR is not set. Also adds the missing import os to main.py.
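The resolution order reads roughly like the sketch below (function name and the exact fallback subdirectory are illustrative, not the PR's code):

```python
import os
import tempfile

def default_results_dir(logger=None):
    """Resolve the results directory: MLPERF_RESULTS_DIR wins, temp dir is the fallback."""
    configured = os.environ.get("MLPERF_RESULTS_DIR")
    if configured:
        return configured
    fallback = os.path.join(tempfile.gettempdir(), "mlperf_results")
    if logger is not None:
        logger.warning(
            "MLPERF_RESULTS_DIR is not set; results will be written to %s", fallback
        )
    return fallback
```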

3. Checkpointing: IPC Context Fix (mlpstorage_py/checkpointing/streaming_checkpoint.py)

Queue and Event IPC objects are now created from the same multiprocessing context (ctx) as the child process. Previously they came from the default context, causing RuntimeError: SemLock can only be shared between processes through inheritance on platforms whose default start method is not fork (macOS, Windows) or whenever the spawn start method is selected.
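The pattern of the fix is sketched below; the function name `launch_streamer` and its signature are hypothetical, but the context-matching rule is the substance of the change:

```python
import multiprocessing as mp

def launch_streamer(worker, start_method="spawn"):
    """Launch a streaming child with IPC objects from the SAME context.

    Creating the Queue/Event via module-level mp.Queue()/mp.Event() uses
    the default context; mixing that with a spawn-created Process raises
    'SemLock can only be shared between processes through inheritance'.
    """
    ctx = mp.get_context(start_method)
    queue = ctx.Queue()          # same context as the child, not mp.Queue()
    stop_event = ctx.Event()     # same context as the child, not mp.Event()
    child = ctx.Process(target=worker, args=(queue, stop_event), daemon=True)
    child.start()
    return child, queue, stop_event
```

The same code path works for fork, forkserver, and spawn, since the objects and the process always come from one context.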

4. MPI: Disable VADER BTL (mlpstorage_py/utils.py)

Appends --mca btl ^vader to single-host MPI flags to prevent segfaults from Open MPI's VADER shared-memory BTL on certain kernel/Open MPI version combinations. Standard documented workaround; no performance impact for storage benchmarking.
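A sketch of the flag assembly (the helper name `single_host_mpi_flags` is hypothetical; the PR appends the flags inside utils.py):

```python
def single_host_mpi_flags(num_procs, extra_flags=None):
    """Assemble single-host mpirun flags with the VADER BTL disabled.

    The caret in '^vader' tells Open MPI to use every BTL *except* the
    vader shared-memory transport, sidestepping the known segfault while
    leaving tcp/self available.
    """
    flags = ["-np", str(num_procs)]
    if extra_flags:
        flags.extend(extra_flags)
    flags.extend(["--mca", "btl", "^vader"])
    return flags
```

The resulting invocation is the documented workaround, e.g. `mpirun -np 8 --mca btl ^vader ...`.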

5. Dependencies (pyproject.toml, uv.lock)

  • Added python-dotenv >= 1.0.0 (required for .env loading in dlio.py)
  • Updated s3dlio >= 0.9.95 (byte-range GET + stat() for Parquet footer reads)
  • Pinned dlio-benchmark to russfellows/dlio_benchmark@feat/parquet-dgen-streaming which adds ParquetReaderS3Iterable with footer-caching fix (pending merge into upstream dlio_benchmark)
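Under uv, a pin like this is typically expressed as a pyproject.toml fragment along the following lines (table layout and exact entries are a sketch, not copied from the PR):

```toml
[project]
dependencies = [
    "python-dotenv>=1.0.0",
    "s3dlio>=0.9.95",
    "dlio-benchmark",
]

[tool.uv.sources]
dlio-benchmark = { git = "https://github.com/russfellows/dlio_benchmark", branch = "feat/parquet-dgen-streaming" }
```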

6. Security: .gitignore credential exclusion

Added .env.* / !.env.example to prevent accidental credential file commits.
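The pattern pair in .gitignore looks like:

```gitignore
# Block credential env files; the re-include keeps the template tracked
.env.*
!.env.example
```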

7. Test fixes: upstream mlpstorage → mlpstorage_py references

Fixed 15 test failures in tests/unit/test_benchmarks_vectordb.py and tests/unit/test_cli.py introduced by upstream commits that imported from mlpstorage (non-existent package; correct name is mlpstorage_py). Also mocked _validate_vdb_dependencies so VectorDB tests don't require optional pymilvus/tabulate packages in the base environment.
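The mocking pattern can be illustrated with a stand-in class. Everything below is a sketch: the real `VectorDBBenchmark` lives in mlpstorage_py, and the real check raises the project's DependencyError rather than the ImportError used here.

```python
from unittest import mock

class VectorDBBenchmark:
    """Stand-in for the real class; illustrative only."""
    def __init__(self):
        # Runs in __init__ before verify_benchmark, so tests must patch it out
        self._validate_vdb_dependencies()

    def _validate_vdb_dependencies(self):
        raise ImportError("optional packages pymilvus/tabulate not installed")

# The fixture-style patch: construction succeeds without the optional deps.
with mock.patch.object(VectorDBBenchmark, "_validate_vdb_dependencies",
                       return_value=None):
    bench = VectorDBBenchmark()
```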

8. Updated Parquet reader tests (tests/unit/test_parquet_reader.py)

7 tests updated to match the refactored ParquetReaderS3Iterable API: cache stores compressed byte count (int) not a pyarrow Table; close() is a no-op; no per-sample LRU eviction (cache clears at finalize()).
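The cache semantics the updated tests assert can be summarized in a small sketch (class and method names are hypothetical; only the int-not-Table, no-op `close()`, and clear-at-`finalize()` behaviors come from the PR description):

```python
class RowGroupByteCache:
    """Sketch of the refactored cache contract."""
    def __init__(self):
        self._cache = {}

    def record(self, uri, row_group, nbytes):
        # Stores the COMPRESSED byte count (int), not a decoded pyarrow Table
        self._cache[(uri, row_group)] = int(nbytes)

    def close(self):
        pass  # intentionally a no-op; the cache survives close()

    def finalize(self):
        self._cache.clear()  # the only point where entries are evicted
```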

9. Documentation

  • docs/OBJECT_STORAGE_GUIDE.md — new 292-line guide for S3/MinIO/object storage setup, .env configuration, and troubleshooting
  • docs/MULTI_ENDPOINT_GUIDE.md — expanded with s3dlio multi-endpoint load-balancing examples
  • README cross-references updated

- Fix all from/import statements: mlpstorage.X -> mlpstorage_py.X (33 py files)
- Fix all mock.patch() string paths: mlpstorage.X -> mlpstorage_py.X (~16 files)
- Replace 4 library-specific YAML configs with 1 workload-only s3_workload_unet3d.yaml
  (runtime params such as bucket, endpoint, storage_library belong in .env, not YAML)
- Add .env.example documenting all runtime parameters
- Update 22 shell scripts: pip/venv setup -> uv sync pattern
- Update tests/README.md: pip/venv -> uv, mlpstorage -> mlpstorage_py imports
- Update tests/object-store/README.md:
  - Replace 'cd mlp-storage && source .venv/bin/activate' with 'uv run python ...'
  - Update Library Selection section: YAML key -> runtime --param approach
  - Remove s3torchconnector from library selection table (keep historical results)
  - Update prerequisites: source .venv + source .env -> uv sync

Unit tests: 763 pass (previously 0 due to ModuleNotFoundError: mlpstorage)
- Extract --file/--object from add_universal_arguments into new
  add_storage_type_arguments() function; VectorDB/KVCache parsers
  no longer require it; training/checkpointing parsers call it
- Update training/checkpointing tests to pass --file in parse_args
- Wrap _collect_cluster_start/_collect_cluster_end with
  progress_context to show spinner during SSH/MPI collection
- Pass validate_structure=False to ReportGenerator in test fixtures
  that use empty temporary directories
- Change logger.error -> logger.warning for nonexistent results dir
  in get_runs_files; skip dirs with multiple metadata files
- Add _uri_for_filename alias to ParquetReaderS3Iterable
- Make --file/--object optional (required=False) so ALL benchmark
  parsers can carry the flag; VectorDB and KV-cache parsers now
  include it so the argument is available everywhere
- Fix progress.py: replace logger.status() (non-existent Logger
  method) with logger.info() in both progress_context and
  create_stage_progress non-interactive fallback paths
- Update tests to assert logger.info() instead of logger.status()
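The `add_storage_type_arguments()` extraction described above might look like the sketch below. The function name comes from the commit message; the body, including the mutually exclusive grouping, is an assumption for illustration.

```python
import argparse

def add_storage_type_arguments(parser):
    """Attach the optional --file/--object storage-type flags to a parser.

    required=False lets every benchmark parser (training, checkpointing,
    VectorDB, KV-cache) carry the flags without demanding them.
    """
    group = parser.add_mutually_exclusive_group(required=False)  # exclusivity assumed
    group.add_argument("--file", action="store_true",
                       help="benchmark against a file system")
    group.add_argument("--object", action="store_true",
                       help="benchmark against object storage")
    return parser
```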

dlio_benchmark changes (local fork + installed venv):
- Replace broken \r-in-logger progress() with a Rich-based
  implementation using SpinnerColumn + BarColumn; falls back
  to plain stdout writes if Rich is unavailable
…rams

Reduce tests/object-store/ from 30+ files to 4 clean tests:
  - run_training.sh      — datagen + training via mlpstorage CLI
  - run_checkpointing.sh — checkpoint write + read via dlio_benchmark
  - test_s3lib_get_bench.py      — GET throughput benchmark (updated)
  - test_direct_write_comparison.py — native write/read benchmark (updated)

All runtime parameters (bucket, endpoint, storage library, credentials)
now come exclusively from environment variables or .env — no hardcoded
site-specific values remain in any test script or config file.

Changes:
- Archive 26 per-library scripts and result docs to old-archive/
- Archive 3 per-library checkpoint YAMLs to old-archive/
- Add configs/dlio/workload/llama3_8b_checkpoint.yaml: clean model-only
  YAML with all storage runtime params supplied via Hydra CLI overrides
- run_training.sh: BUCKET, STORAGE_LIBRARY, MODEL, NP all overridable
- run_checkpointing.sh: BUCKET, STORAGE_LIBRARY, NP, CHECKPOINTS all overridable
- test_s3lib_get_bench.py: use BUCKET env var (was hardcoded mlp-s3dlio);
  fail fast with clear error if bucket not set
- test_direct_write_comparison.py: use BUCKET env var as shared default;
  add validation error if required buckets not set
- Rewrite README.md: concise, accurate, uv-based instructions for all 4 tests

Unit tests: 905 passed, 4 skipped (no regressions)
…ore tests

- pyproject.toml: point dlio-benchmark at russfellows/dlio_benchmark@dev,
  which contains minio connection-pool fix and s3torchconnector bool fix
- uv.lock: regenerated after pyproject.toml change (resolved b1696e1)
- configs/dlio/workload: remove 17 library-specific YAML files (minio,
  s3dlio, s3torch variants) — all storage params are now supplied via
  --params CLI overrides from .env; generic YAMLs remain
- configs/dlio/workload/*.yaml (4 files): remove spurious 'region' field
- tests/object-store/README.md: complete rewrite with accurate instructions
- tests/object-store/run_training.sh: add s3torchconnector support,
  spawn multiprocessing, disable checkpoint in training tests
- tests/object-store/run_checkpointing.sh: set NP=4, add s3torchconnector
- tests/object-store/run_datagen.sh: new helper script
- tests/object-store/run_cleanup.sh: new helper script
- tests/object-store/old-archive/: archive stale test utility files
…d parquet loading

Object storage (dlio.py):
- _apply_object_storage_params() now logs the .env file path it loads
- Raises FileNotFoundError with actionable message if --object mode finds no .env

Config (config.py):
- DEFAULT_RESULTS_DIR reads MLPERF_RESULTS_DIR env var, falls back to tempdir

Main (main.py):
- Add import os (was missing after tempdir warning addition)
- Warn at startup when results will be written to system temp dir

Checkpointing (streaming_checkpoint.py):
- IPC Queue/Event created from same multiprocessing context as child process
- Fixes SemLock fork/spawn mismatch on non-fork start methods

MPI (utils.py):
- Add --mca btl ^vader to single-host MPI flags to prevent VADER segfaults

Dependencies (pyproject.toml, uv.lock):
- s3dlio >= 0.9.95
- python-dotenv >= 1.0.0
- dlio-benchmark pinned to russfellows/dlio_benchmark feat/parquet-dgen-streaming

Security (.gitignore):
- Block .env.* credential files; keep .env.example

Unit tests (933 passing, 4 skipped):
- tests/unit/test_config.py: 4 tests for DEFAULT_RESULTS_DIR env-var / tempdir behavior
- tests/unit/test_main_warnings.py: 4 tests for tempdir warning in run_benchmark()
- tests/unit/test_dlio_object_storage.py: 20 tests for _apply_object_storage_params()
- tests/unit/test_parquet_reader.py: updated 7 tests for new dlio-benchmark API
  (cache stores int byte-count not Table; no LRU eviction; close() is no-op)

Docs:
- docs/OBJECT_STORAGE_GUIDE.md moved from .github/ to docs/
- README.md, docs/README.md, tests/README.md: cross-reference links updated

Benchmark results and analysis (new in tests/):
- tests/benchmarks/: bench_*.py scripts (concurrency, phases, put_bytes, rt_switch, write_sizes, zerocopy)
- tests/object-store/: NPZ analysis, RetinaNet bench results, s3ultra results, scaling analysis, multi-endpoint test
- tests/Checkpoint_test_results.md, DLRM_test_results.md, Flux_test_results.md
- tests/RetinaNet_test_results.md, Parquet_dataloading.md, TEST-PLAN-2026-04-25.md
- tests/DLIO-optimization-analysis-2026-04-25.md
…iles

tests/unit/test_benchmarks_vectordb.py:
- Fix all patch() paths and inline imports (mlpstorage.* → mlpstorage_py.*)
- Add _validate_vdb_dependencies mock to all 14 tests that instantiate
  VectorDBBenchmark; that method runs in __init__ before verify_benchmark
  and raises DependencyError when optional packages (pymilvus, tabulate)
  are not installed in the base uv env

tests/unit/test_cli.py:
- Fix three import blocks (mlpstorage.cli, mlpstorage.cli_parser,
  mlpstorage.config → mlpstorage_py.*)
- Fix bare Namespace → argparse.Namespace in test_num_client_hosts_zero_is_preserved

All 15 previously-failing upstream tests now pass.
Full suite: 949 passed, 4 skipped.
@russfellows russfellows requested a review from a team on April 27, 2026 at 21:58
@github-actions

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@FileSystemGuy
Contributor

I thought I had converted all the "mlpstorage" to "mlpstorage_py" everywhere, so I'm surprised that you had to do those changes (again?). Did I just miss these cases earlier?

@idevasena
Contributor

Reproducing the tests locally on a clean uv sync from this PR head, I get 922 passed, 27 failed, 4 skipped — not the 949 passed / 4 skipped / 0 failed the PR description claims. The 27 failures are all in the two files the PR introduces or rewrites: test_dlio_object_storage.py (17 of the 20 new tests) and test_parquet_reader.py (10 failures, against the 7 tests the PR says it updated). The remaining 922 tests pass cleanly.

=========================== short test summary info ============================
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsEnvLoading::test_logs_path_when_env_file_found_in_cwd
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsEnvLoading::test_raises_file_not_found_when_no_env_file_anywhere
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsEnvLoading::test_error_message_includes_required_vars
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsEnvLoading::test_logs_when_dotenv_upward_search_succeeds
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsBucketValidation::test_raises_value_error_when_bucket_missing
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_injects_storage_type_s3
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_injects_storage_root_as_bucket
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_injects_default_library_s3dlio
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_injects_custom_library
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_injects_default_uri_scheme_s3
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_sets_force_path_style_when_endpoint_url_present
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_no_force_path_style_without_endpoint_url
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_no_force_path_style_for_direct_scheme
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_no_force_path_style_for_file_scheme
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_user_supplied_storage_root_not_overridden
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_user_supplied_force_path_style_not_overridden
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_logs_injected_params_summary
FAILED tests/unit/test_parquet_reader.py::TestParquetReaderS3Iterable::test_get_sample_caches_row_group
FAILED tests/unit/test_parquet_reader.py::TestParquetReaderS3Iterable::test_cache_grows_as_rgs_are_accessed
FAILED tests/unit/test_parquet_reader.py::TestParquetReaderS3Iterable::test_all_rg_entries_persist_within_epoch
FAILED tests/unit/test_parquet_reader.py::TestParquetReaderS3Iterable::test_close_does_not_evict_rg_cache
FAILED tests/unit/test_parquet_reader.py::TestParquetReaderS3Iterable::test_close_preserves_all_files_rg_cache
FAILED tests/unit/test_parquet_reader.py::TestParquetReaderS3Iterable::test_uri_for_absolute_passthrough
FAILED tests/unit/test_parquet_reader.py::TestParquetReaderS3Iterable::test_uri_for_relative_filename
FAILED tests/unit/test_parquet_reader.py::TestParquetReaderS3Iterable::test_uri_strips_leading_slash
FAILED tests/unit/test_parquet_reader.py::TestParquetReaderS3Iterable::test_column_filtering_records_byte_count
FAILED tests/unit/test_parquet_reader.py::TestParquetReaderS3Iterable::test_no_column_filter_reads_all
================== 27 failed, 922 passed, 4 skipped in 9.75s ===================
smrc@dskbd029:~/Storage_Repo_Tests/mlc-storage-pr359$ uv run pytest tests/unit -q --co | wc -l  
1160
smrc@dskbd029:~/Storage_Repo_Tests/mlc-storage-pr359$ grep -E "passed|failed|skipped" /tmp/pr359_unit.log | tail -5
        tests/unit/test_benchmarks_base.py::TestTimeSeriesCollectionIntegration::test_timeseries_skipped_for_datagen_command (fixtures used: mock_logger, request, tmp_path, tmp_path_factory).
        tests/unit/test_benchmarks_base.py::TestTimeSeriesCollectionIntegration::test_timeseries_skipped_for_configview_command (fixtures used: mock_logger, request, tmp_path, tmp_path_factory).
        tests/unit/test_benchmarks_base.py::TestBenchmarkProgress::test_cluster_collection_skipped_logs_debug (fixtures used: mock_logger, request, tmp_path, tmp_path_factory).
        tests/unit/test_validation_helpers.py::TestValidateBenchmarkEnvironment::test_success_logs_passed_message.
================== 27 failed, 922 passed, 4 skipped in 9.75s ===================

@russfellows
Contributor Author

russfellows commented Apr 28, 2026 via email

@idevasena
Contributor

Got it. I'm reviewing your DLIO PR mlcommons/DLIO_local_changes#16 and using a local version of it for testing this PR.

Comment thread on pyproject.toml
@russfellows
Contributor Author

russfellows commented Apr 29, 2026 via email
