I tried the new training models (flux, dlrm, and retinanet) and they all get stuck at the "Starting epoch 1" step forever. Meanwhile, I checked the NVMe R/W throughput, and there was almost no reading from the NVMe. The old models (unet3d, cosmoflow, and resnet) used in mlpstorage v2 have no such issue.
(.venv) root@cnit-zz-01:~/raysmond/workspace/mlperf_v3/storage# ./mlpstorage training run --num-client-hosts 1 --hosts 127.0.0.1 --model flux --accelerator-type b200 --num-accelerators 2 --client-host-memory-in-gb 64 --param dataset.num_files_train=400 --data-dir $DATA_DIR --results-dir $RESULTS_DIR --allow-run-as-root --loops 1 --file
Setting attr from num_accelerators to 2
⠋ Validating environment... 0:00:00
2026-04-29 15:53:22|INFO: Environment validation passed
2026-04-29 15:53:22|STATUS: Benchmark results directory: /root/raysmond/workspace/mlperf_v3/result/training/flux/run/20260429_155322
2026-04-29 15:53:22|INFO: Collector script staged at /workspace/storage/results/collector-staging/mlps_collector.py (persisted as run artifact)
2026-04-29 15:53:22|INFO: Running MPI collection across 1 host(s)
2026-04-29 15:53:23|INFO: MPI collection completed successfully (1 hosts reported)
2026-04-29 15:53:23|INFO: Created benchmark run: training_run_flux_20260429_155322
2026-04-29 15:53:23|STATUS: Verifying benchmark run for training_run_flux_20260429_155322
2026-04-29 15:53:23|RESULT: Minimum file count dictated by dataset size to memory size ratio.
2026-04-29 15:53:23|STATUS: Closed: [CLOSED] Closed parameter override allowed: dataset.num_files_train = 400 (Parameter: Overrode Parameters)
2026-04-29 15:53:23|ERROR: INVALID: [INVALID] Insufficient number of training files (Parameter: dataset.num_files_train, Expected: >= 8675, Actual: 400)
2026-04-29 15:53:23|STATUS: Benchmark run is INVALID due to 1 issues ([RunID(program='training', command='run', model='flux', run_datetime='20260429_155322')])
2026-04-29 15:53:23|WARNING: Running the benchmark without verification for open or closed configurations. These results are not valid for submission. Use --open or --closed to specify a configuration.
⠋ Validating environment... ━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/4 0:00:00
2026-04-29 15:53:23|INFO: Results directory: None
2026-04-29 15:53:23|WARNING: Results directory is not set, using default results directory
2026-04-29 15:53:23|INFO: Collector script staged at /workspace/storage/results/collector-staging/mlps_collector.py (persisted as run artifact)
2026-04-29 15:53:23|INFO: Running MPI collection across 1 host(s)
⠋ Collecting cluster info... ━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━ 2/4 0:00:00
2026-04-29 15:53:24|INFO: MPI collection completed successfully (1 hosts reported)
2026-04-29 15:53:24|STATUS: Running benchmark command:: mpirun -n 2 -host 127.0.0.1:2 --bind-to none --map-by socket --allow-run-as-root /root/raysmond/workspace/mlperf_v3/storage2/.venv/bin/dlio_benchmark workload=flux_b200 ++hydra.run.dir=/root/raysmond/workspace/mlperf_v3/result/training/flux/run/20260429_155322 ++hydra.output_subdir=dlio_config ++workload.dataset.num_files_train=400 ++workload.dataset.data_folder=/mnt/1030_6T/flux --config-dir=/root/raysmond/workspace/mlperf_v3/storage2/configs/dlio
[OUTPUT] [DEBUG DLIOBenchmark.__init__] After LoadConfig:
[OUTPUT] storage_type = <StorageType.LOCAL_FS: 'local_fs'>
[OUTPUT] storage_root = './'
[OUTPUT] storage_options= None
[OUTPUT] data_folder = '/mnt/1030_6T/flux'
[OUTPUT] framework = <FrameworkType.PYTORCH: 'pytorch'>
[OUTPUT] num_files_train= 400
[OUTPUT] record_length = 65536
[OUTPUT] generate_data = False
[OUTPUT] do_train = True
[OUTPUT] do_checkpoint = False
[OUTPUT] epochs = 1
[OUTPUT] batch_size = 48
[OUTPUT] 2026-04-29T15:53:27.298496 Running DLIO [Training] with 2 process(es)
[WARNING] The amount of dataset is smaller than the host memory; data might be cached after the first epoch. Increase the size of dataset to eliminate the caching effect!!!
[OUTPUT] 2026-04-29T15:53:27.316719 Max steps per epoch: 1200 = 288 * 400 / 48 / 2 (samples per file * num files / batch size / comm size)
[OUTPUT] 2026-04-29T15:53:27.384825 Starting epoch 1: 1200 steps expected
⠙ Running benchmark... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━ 3/4 0:03:31
I captured the NVMe R/W metrics with node-exporter. The chart shows that for most of the time the NVMe has no read I/O at all, with occasional bursts of a few tens of MB/s every few minutes. Based on our experience testing v2, a sustained read throughput of over 10 GB/s would be the expected norm.
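As a rough cross-check of the node-exporter chart, per-device read throughput can also be sampled directly on the host while the benchmark is running, for example (the nvme0n1 device name is only an assumption; substitute whatever device actually backs /mnt/1030_6T):

# extended per-device stats every second; rMB/s should be in the GB/s range during an epoch
iostat -xm 1 nvme0n1
# or watch the raw kernel read counters for the same device
watch -n1 'grep nvme0n1 /proc/diskstats'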

I'm running the benchmark on Ubuntu 24.04 and here is some additional system information:
(.venv) root@cnit-zz-01:~/raysmond/workspace/mlperf_v3/storage2# mpirun --version
mpirun (Open MPI) 4.1.6
Report bugs to http://www.open-mpi.org/community/help/
(.venv) root@cnit-zz-01:~/raysmond/workspace/mlperf_v3/storage2# uv pip list | grep dlio
dlio-benchmark 3.0.0
s3dlio 0.9.86
/mnt/1030_6T is the mount point of a 6 TB PCIe Gen5 NVMe drive.
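For completeness, this is a minimal way to confirm which block device backs that mount and its size, using only standard tools:

df -h /mnt/1030_6T
lsblk -o NAME,SIZE,MODEL,MOUNTPOINT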