Bug fixes and performance enhancements: object storage, checkpointing, Parquet loading #359
Open
russfellows wants to merge 9 commits into mlcommons:main from
Conversation
- Fix all from/import statements: mlpstorage.X -> mlpstorage_py.X (33 py files)
- Fix all mock.patch() string paths: mlpstorage.X -> mlpstorage_py.X (~16 files)
- Replace 4 library-specific YAML configs with 1 workload-only s3_workload_unet3d.yaml (runtime params such as bucket, endpoint, storage_library belong in .env, not YAML)
- Add .env.example documenting all runtime parameters
- Update 22 shell scripts: pip/venv setup -> uv sync pattern
- Update tests/README.md: pip/venv -> uv, mlpstorage -> mlpstorage_py imports
- Update tests/object-store/README.md:
  - Replace 'cd mlp-storage && source .venv/bin/activate' with 'uv run python ...'
  - Update Library Selection section: YAML key -> runtime --param approach
  - Remove s3torchconnector from library selection table (keep historical results)
  - Update prerequisites: source .venv + source .env -> uv sync

Unit tests: 763 pass (previously 0 due to ModuleNotFoundError: mlpstorage)
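The `.env.example` added here presumably documents parameters along these lines (every variable name below is an illustrative placeholder, not a confirmed key from the file):

```shell
# .env.example: runtime parameters for object-storage runs.
# Copy to .env and fill in site-specific values; .env itself is gitignored.
# NOTE: the names below are illustrative placeholders, not confirmed keys.

BUCKET=my-benchmark-bucket               # target bucket for datagen/training
S3_ENDPOINT_URL=http://minio.local:9000  # object-store endpoint
STORAGE_LIBRARY=s3dlio                   # client library to benchmark

# Credentials: never commit real values
AWS_ACCESS_KEY_ID=changeme
AWS_SECRET_ACCESS_KEY=changeme
```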
- Extract --file/--object from add_universal_arguments into a new add_storage_type_arguments() function; VectorDB/KVCache parsers no longer require it; training/checkpointing parsers call it
- Update training/checkpointing tests to pass --file in parse_args
- Wrap _collect_cluster_start/_collect_cluster_end with progress_context to show a spinner during SSH/MPI collection
- Pass validate_structure=False to ReportGenerator in test fixtures that use empty temporary directories
- Change logger.error -> logger.warning for a nonexistent results dir in get_runs_files; skip dirs with multiple metadata files
- Add _uri_for_filename alias to ParquetReaderS3Iterable
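The extraction described above might look like the following argparse sketch (the function name comes from the commit message; the flag semantics are assumptions):

```python
import argparse

def add_storage_type_arguments(parser: argparse.ArgumentParser) -> None:
    """Attach --file/--object only to parsers whose benchmarks need them."""
    group = parser.add_mutually_exclusive_group(required=False)
    group.add_argument("--file", action="store_true",
                       help="benchmark file/POSIX storage")
    group.add_argument("--object", action="store_true",
                       help="benchmark object storage")

# Training/checkpointing parsers opt in; VectorDB/KVCache parsers simply
# skip the call, so they no longer reject runs that omit the flag.
training = argparse.ArgumentParser(prog="training")
add_storage_type_arguments(training)
print(training.parse_args(["--file"]).file)  # True
```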
- Make --file/--object optional (required=False) so ALL benchmark parsers can carry the flag; VectorDB and KV-cache parsers now include it so the argument is available everywhere
- Fix progress.py: replace logger.status() (a non-existent Logger method) with logger.info() in both progress_context and create_stage_progress non-interactive fallback paths
- Update tests to assert logger.info() instead of logger.status()

dlio_benchmark changes (local fork + installed venv):
- Replace the broken \r-in-logger progress() with a Rich-based implementation using SpinnerColumn + BarColumn; falls back to plain stdout writes if Rich is unavailable
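The Rich-based replacement with a plain-stdout fallback could be sketched as follows (a minimal illustration of the pattern, not the fork's actual code):

```python
import sys

try:
    from rich.progress import BarColumn, Progress, SpinnerColumn, TextColumn
    HAVE_RICH = True
except ImportError:  # Rich not installed: degrade gracefully
    HAVE_RICH = False

def _plain_progress(done: int, total: int) -> None:
    # Fallback: carriage-return updates go straight to stdout, not through
    # the logger, which is what broke the original \r-in-logger version.
    sys.stdout.write(f"\rprogress: {done}/{total}")
    sys.stdout.flush()

def run_with_progress(total: int) -> None:
    if HAVE_RICH:
        with Progress(SpinnerColumn(), TextColumn("{task.description}"),
                      BarColumn()) as progress:
            task = progress.add_task("processing", total=total)
            for _ in range(total):
                progress.advance(task)
    else:
        for i in range(total):
            _plain_progress(i + 1, total)
        sys.stdout.write("\n")
```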
…rams

Reduce tests/object-store/ from 30+ files to 4 clean tests:
- run_training.sh — datagen + training via the mlpstorage CLI
- run_checkpointing.sh — checkpoint write + read via dlio_benchmark
- test_s3lib_get_bench.py — GET throughput benchmark (updated)
- test_direct_write_comparison.py — native write/read benchmark (updated)

All runtime parameters (bucket, endpoint, storage library, credentials) now come exclusively from environment variables or .env — no hardcoded site-specific values remain in any test script or config file.

Changes:
- Archive 26 per-library scripts and result docs to old-archive/
- Archive 3 per-library checkpoint YAMLs to old-archive/
- Add configs/dlio/workload/llama3_8b_checkpoint.yaml: a clean model-only YAML with all storage runtime params supplied via Hydra CLI overrides
- run_training.sh: BUCKET, STORAGE_LIBRARY, MODEL, NP all overridable
- run_checkpointing.sh: BUCKET, STORAGE_LIBRARY, NP, CHECKPOINTS all overridable
- test_s3lib_get_bench.py: use the BUCKET env var (was hardcoded mlp-s3dlio); fail fast with a clear error if the bucket is not set
- test_direct_write_comparison.py: use the BUCKET env var as a shared default; add a validation error if required buckets are not set
- Rewrite README.md: concise, accurate, uv-based instructions for all 4 tests

Unit tests: 905 passed, 4 skipped (no regressions)
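The "all overridable" behavior in the run scripts is most likely the standard shell default-expansion idiom, roughly (defaults here are placeholders):

```shell
#!/usr/bin/env sh
# Sketch of the override pattern: callers can export any of these before
# invoking the script; otherwise the placeholder defaults below apply.
BUCKET="${BUCKET:-my-test-bucket}"
STORAGE_LIBRARY="${STORAGE_LIBRARY:-s3dlio}"
NP="${NP:-4}"

echo "bucket=${BUCKET} library=${STORAGE_LIBRARY} np=${NP}"
```

For example, `BUCKET=prod-bucket NP=8 ./run_training.sh` would override two of the defaults while leaving the third in place.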
…ore tests

- pyproject.toml: point dlio-benchmark at russfellows/dlio_benchmark@dev, which contains the minio connection-pool fix and the s3torchconnector bool fix
- uv.lock: regenerated after the pyproject.toml change (resolved b1696e1)
- configs/dlio/workload: remove 17 library-specific YAML files (minio, s3dlio, s3torch variants) — all storage params are now supplied via --params CLI overrides from .env; generic YAMLs remain
- configs/dlio/workload/*.yaml (4 files): remove the spurious 'region' field
- tests/object-store/README.md: complete rewrite with accurate instructions
- tests/object-store/run_training.sh: add s3torchconnector support, spawn multiprocessing, disable checkpointing in training tests
- tests/object-store/run_checkpointing.sh: set NP=4, add s3torchconnector
- tests/object-store/run_datagen.sh: new helper script
- tests/object-store/run_cleanup.sh: new helper script
- tests/object-store/old-archive/: archive stale test utility files
…d parquet loading

Object storage (dlio.py):
- _apply_object_storage_params() now logs the .env file path it loads
- Raises FileNotFoundError with an actionable message if --object mode finds no .env

Config (config.py):
- DEFAULT_RESULTS_DIR reads the MLPERF_RESULTS_DIR env var, falling back to a tempdir

Main (main.py):
- Add import os (was missing after the tempdir warning addition)
- Warn at startup when results will be written to the system temp dir

Checkpointing (streaming_checkpoint.py):
- IPC Queue/Event created from the same multiprocessing context as the child process
- Fixes the SemLock fork/spawn mismatch on non-fork start methods

MPI (utils.py):
- Add --mca btl ^vader to single-host MPI flags to prevent VADER segfaults

Dependencies (pyproject.toml, uv.lock):
- s3dlio >= 0.9.95
- python-dotenv >= 1.0.0
- dlio-benchmark pinned to russfellows/dlio_benchmark feat/parquet-dgen-streaming

Security (.gitignore):
- Block .env.* credential files; keep .env.example

Unit tests (933 passing, 4 skipped):
- tests/unit/test_config.py: 4 tests for DEFAULT_RESULTS_DIR env-var / tempdir behavior
- tests/unit/test_main_warnings.py: 4 tests for the tempdir warning in run_benchmark()
- tests/unit/test_dlio_object_storage.py: 20 tests for _apply_object_storage_params()
- tests/unit/test_parquet_reader.py: updated 7 tests for the new dlio-benchmark API (cache stores an int byte-count, not a Table; no LRU eviction; close() is a no-op)

Docs:
- docs/OBJECT_STORAGE_GUIDE.md moved from .github/ to docs/
- README.md, docs/README.md, tests/README.md: cross-reference links updated

Benchmark results and analysis (new in tests/):
- tests/benchmarks/: bench_*.py scripts (concurrency, phases, put_bytes, rt_switch, write_sizes, zerocopy)
- tests/object-store/: NPZ analysis, RetinaNet bench results, s3ultra results, scaling analysis, multi-endpoint test
- tests/Checkpoint_test_results.md, DLRM_test_results.md, Flux_test_results.md
- tests/RetinaNet_test_results.md, Parquet_dataloading.md, TEST-PLAN-2026-04-25.md
- tests/DLIO-optimization-analysis-2026-04-25.md
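The IPC fix listed under Checkpointing boils down to creating the Queue and Event from the same multiprocessing context that starts the child; a self-contained sketch of the technique (names assumed, not the actual streaming_checkpoint.py code):

```python
import multiprocessing as mp

def _worker(queue, done):
    # Child process signals completion through the shared IPC objects.
    queue.put("shard-written")
    done.set()

def run_streaming_worker(start_method: str) -> str:
    # The fix: create Queue/Event from the SAME context used to start the
    # child. Mixing the default (fork) context's SemLock-backed objects with
    # a spawn-started process raises:
    #   RuntimeError: SemLock can only be shared between processes through inheritance
    ctx = mp.get_context(start_method)
    queue = ctx.Queue()
    done = ctx.Event()
    proc = ctx.Process(target=_worker, args=(queue, done))
    proc.start()
    msg = queue.get(timeout=30)
    done.wait(timeout=30)
    proc.join(timeout=30)
    return msg
```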
…iles

tests/unit/test_benchmarks_vectordb.py:
- Fix all patch() paths and inline imports (mlpstorage.* → mlpstorage_py.*)
- Add a _validate_vdb_dependencies mock to all 14 tests that instantiate VectorDBBenchmark; that method runs in __init__ before verify_benchmark and raises DependencyError when optional packages (pymilvus, tabulate) are not installed in the base uv env

tests/unit/test_cli.py:
- Fix three import blocks (mlpstorage.cli, mlpstorage.cli_parser, mlpstorage.config → mlpstorage_py.*)
- Fix bare Namespace → argparse.Namespace in test_num_client_hosts_zero_is_preserved

All 15 previously-failing upstream tests now pass. Full suite: 949 passed, 4 skipped.
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
FileSystemGuy approved these changes on Apr 27, 2026
Contributor:
I thought I had converted all of the "mlpstorage" references to "mlpstorage_py" everywhere, so I'm surprised you had to make those changes (again?). Did I just miss these cases earlier?
Contributor:
Reproducing the tests locally on a clean uv sync from this PR head, I get 922 passed, 27 failed, 4 skipped — not the 949/4/0 the PR description claims. The 27 failures are in two files the PR introduces or rewrites: test_dlio_object_storage.py (17 of the 20 new tests) and test_parquet_reader.py (10 failures against its 7 rewritten tests). The remaining 922 tests pass cleanly.
Contributor (Author):
Yes, I am pretty certain this cannot, and SHOULD not, work until the dlio_benchmark PR has been merged.
These require changes in how the readers work.
Here is that PR: mlcommons/DLIO_local_changes#16
Regards,
—Russ
… On Apr 28, 2026, at 4:32 PM, Devasena I wrote:
=========================== short test summary info ============================
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsEnvLoading::test_logs_path_when_env_file_found_in_cwd
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsEnvLoading::test_raises_file_not_found_when_no_env_file_anywhere
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsEnvLoading::test_error_message_includes_required_vars
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsEnvLoading::test_logs_when_dotenv_upward_search_succeeds
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsBucketValidation::test_raises_value_error_when_bucket_missing
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_injects_storage_type_s3
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_injects_storage_root_as_bucket
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_injects_default_library_s3dlio
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_injects_custom_library
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_injects_default_uri_scheme_s3
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_sets_force_path_style_when_endpoint_url_present
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_no_force_path_style_without_endpoint_url
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_no_force_path_style_for_direct_scheme
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_no_force_path_style_for_file_scheme
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_user_supplied_storage_root_not_overridden
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_user_supplied_force_path_style_not_overridden
FAILED tests/unit/test_dlio_object_storage.py::TestApplyObjectStorageParamsInjection::test_logs_injected_params_summary
FAILED tests/unit/test_parquet_reader.py::TestParquetReaderS3Iterable::test_get_sample_caches_row_group
FAILED tests/unit/test_parquet_reader.py::TestParquetReaderS3Iterable::test_cache_grows_as_rgs_are_accessed
FAILED tests/unit/test_parquet_reader.py::TestParquetReaderS3Iterable::test_all_rg_entries_persist_within_epoch
FAILED tests/unit/test_parquet_reader.py::TestParquetReaderS3Iterable::test_close_does_not_evict_rg_cache
FAILED tests/unit/test_parquet_reader.py::TestParquetReaderS3Iterable::test_close_preserves_all_files_rg_cache
FAILED tests/unit/test_parquet_reader.py::TestParquetReaderS3Iterable::test_uri_for_absolute_passthrough
FAILED tests/unit/test_parquet_reader.py::TestParquetReaderS3Iterable::test_uri_for_relative_filename
FAILED tests/unit/test_parquet_reader.py::TestParquetReaderS3Iterable::test_uri_strips_leading_slash
FAILED tests/unit/test_parquet_reader.py::TestParquetReaderS3Iterable::test_column_filtering_records_byte_count
FAILED tests/unit/test_parquet_reader.py::TestParquetReaderS3Iterable::test_no_column_filter_reads_all
================== 27 failed, 922 passed, 4 skipped in 9.75s ===================
***@***.***:~/Storage_Repo_Tests/mlc-storage-pr359$ uv run pytest tests/unit -q --co | wc -l
1160
***@***.***:~/Storage_Repo_Tests/mlc-storage-pr359$ grep -E "passed|failed|skipped" /tmp/pr359_unit.log | tail -5
tests/unit/test_benchmarks_base.py::TestTimeSeriesCollectionIntegration::test_timeseries_skipped_for_datagen_command (fixtures used: mock_logger, request, tmp_path, tmp_path_factory).
tests/unit/test_benchmarks_base.py::TestTimeSeriesCollectionIntegration::test_timeseries_skipped_for_configview_command (fixtures used: mock_logger, request, tmp_path, tmp_path_factory).
tests/unit/test_benchmarks_base.py::TestBenchmarkProgress::test_cluster_collection_skipped_logs_debug (fixtures used: mock_logger, request, tmp_path, tmp_path_factory).
tests/unit/test_validation_helpers.py::TestValidateBenchmarkEnvironment::test_success_logs_passed_message.
================== 27 failed, 922 passed, 4 skipped in 9.75s ===================
|
Contributor
|
Got it. Reviewing your DLIO PR mlcommons/DLIO_local_changes#16 and use local version of it for this PR testing. |
idevasena reviewed these changes on Apr 29, 2026
Contributor (Author):
Devasena,
I am certain you are correct. Perhaps my branch tag is wrong, or the tests themselves are now wrong; I am not sure.
Obviously, more testing would be good. Unfortunately, I don’t have any time left this week, as I spent my weekend working on these changes.
If someone else can examine the tests given the updated dlio_benchmark parquet readers, that would be great.
Regards,
—Russ
… On Apr 28, 2026, at 7:04 PM, Devasena I wrote:
@idevasena commented on this pull request, in pyproject.toml:
> @@ -82,7 +85,7 @@ url = "https://download.pytorch.org/whl/cpu"
explicit = true
[tool.uv.sources]
-dlio-benchmark = { git = "https://github.com/mlcommons/DLIO_local_changes.git" }
+dlio-benchmark = { git = "https://github.com/russfellows/dlio_benchmark.git", branch = "feat/parquet-dgen-streaming" }
I thought this instance of pointing dlio-benchmark at your fork in pyproject.toml would suffice.
Summary
Five independent bug fixes and enhancements across object storage credential loading, results directory configuration, IPC context handling in streaming checkpointing, MPI transport configuration, and Parquet data loader test correctness. Also adds 28 new unit tests, corrects 15 upstream test failures (wrong package name `mlpstorage` → `mlpstorage_py`), and adds a comprehensive object-storage usage guide plus dated benchmark results.

Tests: 949 passing, 4 skipped, 0 failures.
Changes

1. Object Storage: `.env` Credential Loading (`mlpstorage_py/benchmarks/dlio.py`)

   `_apply_object_storage_params()` now logs the full path of the `.env` file it loads and raises a `FileNotFoundError` with an actionable message (listing required env vars and pointing to `.env.example`) when `--object` mode is active but no `.env` is found. Previously it failed silently with cryptic downstream auth errors.

2. Results Directory: Environment Variable Override (`mlpstorage_py/config.py`, `mlpstorage_py/main.py`)

   `DEFAULT_RESULTS_DIR` now reads `MLPERF_RESULTS_DIR` from the environment, falling back to the system temp directory. A WARNING-level message is emitted at startup when results are going to the temp dir and `MLPERF_RESULTS_DIR` is not set. Also adds the missing `import os` to `main.py`.

3. Checkpointing: IPC Context Fix (`mlpstorage_py/checkpointing/streaming_checkpoint.py`)

   `Queue` and `Event` IPC objects are now created from the same multiprocessing context (`ctx`) as the child process. Previously the default context was used, causing `RuntimeError: SemLock can only be shared between processes through inheritance` on non-fork platforms (macOS, Windows, `PYTHONSTARTMETHOD=spawn`).

4. MPI: Disable VADER BTL (`mlpstorage_py/utils.py`)

   Appends `--mca btl ^vader` to single-host MPI flags to prevent segfaults from Open MPI's VADER shared-memory BTL on certain kernel/Open MPI version combinations. This is a standard documented workaround with no performance impact for storage benchmarking.

5. Dependencies (`pyproject.toml`, `uv.lock`)

   - `python-dotenv >= 1.0.0` (required for `.env` loading in `dlio.py`)
   - `s3dlio >= 0.9.95` (byte-range GET + `stat()` for Parquet footer reads)
   - `dlio-benchmark` pinned to `russfellows/dlio_benchmark@feat/parquet-dgen-streaming`, which adds `ParquetReaderS3Iterable` with the footer-caching fix (pending merge into upstream dlio_benchmark)

6. Security: `.gitignore` credential exclusion

   Added `.env.*` / `!.env.example` to prevent accidental credential file commits.

7. Test fixes: upstream `mlpstorage` → `mlpstorage_py` references

   Fixed 15 test failures in `tests/unit/test_benchmarks_vectordb.py` and `tests/unit/test_cli.py` introduced by upstream commits that imported from `mlpstorage` (a non-existent package; the correct name is `mlpstorage_py`). Also mocked `_validate_vdb_dependencies` so VectorDB tests don't require the optional `pymilvus`/`tabulate` packages in the base environment.

8. Updated Parquet reader tests (`tests/unit/test_parquet_reader.py`)

   7 tests updated to match the refactored `ParquetReaderS3Iterable` API: the cache stores the compressed byte count (an `int`), not a pyarrow `Table`; `close()` is a no-op; there is no per-sample LRU eviction (the cache clears at `finalize()`).

9. Documentation

   - `docs/OBJECT_STORAGE_GUIDE.md` — new 292-line guide for S3/MinIO/object storage setup, `.env` configuration, and troubleshooting
   - `docs/MULTI_ENDPOINT_GUIDE.md` — expanded with `s3dlio` multi-endpoint load-balancing examples
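For illustration, the fail-fast `.env` behavior in item 1 can be sketched with a stdlib-only upward search (the helper names, required-variable list, and message text below are assumptions; the real code uses python-dotenv inside `_apply_object_storage_params()`):

```python
import os
from pathlib import Path
from typing import Optional

REQUIRED_VARS = ("BUCKET", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")  # illustrative

def find_env_file(start: Path, stop: Optional[Path] = None) -> Optional[Path]:
    """Search start and its parents for a .env file (mirrors dotenv's upward search)."""
    for directory in (start, *start.parents):
        candidate = directory / ".env"
        if candidate.is_file():
            return candidate
        if stop is not None and directory == stop:
            break
    return None

def load_object_storage_env(start: Path, stop: Optional[Path] = None) -> Path:
    """Load .env for --object mode, failing fast with an actionable message."""
    env_path = find_env_file(start, stop)
    if env_path is None:
        raise FileNotFoundError(
            "--object mode requires a .env file providing "
            + ", ".join(REQUIRED_VARS)
            + "; copy .env.example to .env and fill in endpoint/credentials."
        )
    # Minimal KEY=VALUE parsing; python-dotenv's load_dotenv() does the real work.
    for line in env_path.read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
    return env_path
```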