[need help] mlpstorage training run stuck at the first epoch forever with almost no reading on the NVMe SSD #362


Description

@Raysmond

I tried the new training models (flux, dlrm, and retinanet), and they all get stuck at the "Starting epoch 1" step forever. Meanwhile, I checked the NVMe R/W throughput and there was almost no reading from the NVMe. The old models used in mlpstorage v2 (unet3d, cosmoflow, and resnet) have no such issue.

(.venv) root@cnit-zz-01:~/raysmond/workspace/mlperf_v3/storage# ./mlpstorage training run     --num-client-hosts 1     --hosts 127.0.0.1     --model flux     --accelerator-type b200     --num-accelerators 2     --client-host-memory-in-gb 64     --param dataset.num_files_train=400     --data-dir $DATA_DIR     --results-dir $RESULTS_DIR     --allow-run-as-root --loops 1 --file
Setting attr from num_accelerators to 2
⠋ Validating environment... 0:00:002026-04-29 15:53:22|INFO: Environment validation passed
2026-04-29 15:53:22|STATUS: Benchmark results directory: /root/raysmond/workspace/mlperf_v3/result/training/flux/run/20260429_155322
2026-04-29 15:53:22|INFO: Collector script staged at /workspace/storage/results/collector-staging/mlps_collector.py (persisted as run artifact)
2026-04-29 15:53:22|INFO: Running MPI collection across 1 host(s)
2026-04-29 15:53:23|INFO: MPI collection completed successfully (1 hosts reported)
2026-04-29 15:53:23|INFO: Created benchmark run: training_run_flux_20260429_155322
2026-04-29 15:53:23|STATUS: Verifying benchmark run for training_run_flux_20260429_155322
2026-04-29 15:53:23|RESULT: Minimum file count dictated by dataset size to memory size ratio.
2026-04-29 15:53:23|STATUS: Closed: [CLOSED] Closed parameter override allowed: dataset.num_files_train = 400 (Parameter: Overrode Parameters)
2026-04-29 15:53:23|ERROR: INVALID: [INVALID] Insufficient number of training files (Parameter: dataset.num_files_train, Expected: >= 8675, Actual: 400)
2026-04-29 15:53:23|STATUS: Benchmark run is INVALID due to 1 issues ([RunID(program='training', command='run', model='flux', run_datetime='20260429_155322')])
2026-04-29 15:53:23|WARNING: Running the benchmark without verification for open or closed configurations. These results are not valid for submission. Use --open or --closed to specify a configuration.
⠋ Validating environment... ━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/4 0:00:002026-04-29 15:53:23|INFO: Results directory: None
2026-04-29 15:53:23|WARNING: Results directory is not set, using default results directory
2026-04-29 15:53:23|INFO: Collector script staged at /workspace/storage/results/collector-staging/mlps_collector.py (persisted as run artifact)
2026-04-29 15:53:23|INFO: Running MPI collection across 1 host(s)
⠋ Collecting cluster info... ━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━ 2/4 0:00:002026-04-29 15:53:24|INFO: MPI collection completed successfully (1 hosts reported)
2026-04-29 15:53:24|STATUS: Running benchmark command:: mpirun -n 2 -host 127.0.0.1:2 --bind-to none --map-by socket --allow-run-as-root /root/raysmond/workspace/mlperf_v3/storage2/.venv/bin/dlio_benchmark workload=flux_b200 ++hydra.run.dir=/root/raysmond/workspace/mlperf_v3/result/training/flux/run/20260429_155322 ++hydra.output_subdir=dlio_config ++workload.dataset.num_files_train=400 ++workload.dataset.data_folder=/mnt/1030_6T/flux --config-dir=/root/raysmond/workspace/mlperf_v3/storage2/configs/dlio
[OUTPUT] [DEBUG DLIOBenchmark.__init__] After LoadConfig:
[OUTPUT]   storage_type   = <StorageType.LOCAL_FS: 'local_fs'>
[OUTPUT]   storage_root   = './'
[OUTPUT]   storage_options= None
[OUTPUT]   data_folder    = '/mnt/1030_6T/flux'
[OUTPUT]   framework      = <FrameworkType.PYTORCH: 'pytorch'>
[OUTPUT]   num_files_train= 400
[OUTPUT]   record_length  = 65536
[OUTPUT]   generate_data  = False
[OUTPUT]   do_train       = True
[OUTPUT]   do_checkpoint  = False
[OUTPUT]   epochs         = 1
[OUTPUT]   batch_size     = 48
[OUTPUT] 2026-04-29T15:53:27.298496 Running DLIO [Training] with 2 process(es)
[WARNING] The amount of dataset is smaller than the host memory; data might be cached after the first epoch. Increase the size of dataset to eliminate the caching effect!!!
[OUTPUT] 2026-04-29T15:53:27.316719 Max steps per epoch: 1200 = 288 * 400 / 48 / 2 (samples per file * num files / batch size / comm size)
[OUTPUT] 2026-04-29T15:53:27.384825 Starting epoch 1: 1200 steps expected
⠙ Running benchmark... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━ 3/4 0:03:31
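
While the run sits at this step, it may help to check what the dlio_benchmark worker processes are actually doing. A rough sketch (the PID is a placeholder, not taken from an actual run):

# Find the dlio_benchmark workers spawned by mpirun
pgrep -af dlio_benchmark

# Dump the current Python stack of one worker to see where it is blocked
# (requires py-spy, e.g. `pip install py-spy`)
py-spy dump --pid <PID>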

I captured the NVMe R/W metrics with node-exporter. The chart shows that, for most of the time, the NVMe has no read I/O at all, with only occasional bursts of tens of MB/s every few minutes. Based on our experience testing v2, sustained read throughput of over 10 GB/s would be the expected norm.
[Image: node-exporter chart of NVMe read/write throughput during the run]
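
The read traffic can also be spot-checked directly on the host, outside of node-exporter, with something like the following (nvme0n1 and the :9100 scrape port are placeholders for the actual device and endpoint):

# Raw read-bytes counter exposed by node-exporter for the NVMe device
curl -s http://localhost:9100/metrics | grep 'node_disk_read_bytes_total{device="nvme0n1"}'

# Live per-device throughput in MB/s, refreshed every second
iostat -xm 1 nvme0n1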

I'm running the benchmark on Ubuntu 24.04 and here is some additional system information:

(.venv) root@cnit-zz-01:~/raysmond/workspace/mlperf_v3/storage2# mpirun --version
mpirun (Open MPI) 4.1.6

Report bugs to http://www.open-mpi.org/community/help/
(.venv) root@cnit-zz-01:~/raysmond/workspace/mlperf_v3/storage2# uv pip list | grep dlio
dlio-benchmark          3.0.0
s3dlio                  0.9.86

/mnt/1030_6T is the mount directory for a PCIe Gen5 NVMe drive with 6 TB of capacity.
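
For reference, a rough fio sequential-read run along these lines could be used to rule out the drive itself as the bottleneck (the file path, size, and job count here are illustrative only):

fio --name=seqread --filename=/mnt/1030_6T/fio_testfile --rw=read --bs=1M --size=50G \
    --direct=1 --ioengine=libaio --iodepth=32 --numjobs=4 --group_reporting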
