[need help] mlpstorage training run stuck at the first epoch forever with almost no reading on the NVMe SSD #362


Description

@Raysmond

I tried the new training models (flux, dlrm, and retinanet), and they all get stuck at the "Starting epoch 1" step forever. Meanwhile, I checked the NVMe R/W throughput and there was almost no reading from the NVMe. The old models used in mlpstorage v2 (unet3d, cosmoflow, and resnet) have no such issue.

(.venv) root@cnit-zz-01:~/raysmond/workspace/mlperf_v3/storage# ./mlpstorage training run     --num-client-hosts 1     --hosts 127.0.0.1     --model flux     --accelerator-type b200     --num-accelerators 2     --client-host-memory-in-gb 64     --param dataset.num_files_train=400     --data-dir $DATA_DIR     --results-dir $RESULTS_DIR     --allow-run-as-root --loops 1 --file
Setting attr from num_accelerators to 2
⠋ Validating environment... 0:00:002026-04-29 15:53:22|INFO: Environment validation passed
2026-04-29 15:53:22|STATUS: Benchmark results directory: /root/raysmond/workspace/mlperf_v3/result/training/flux/run/20260429_155322
2026-04-29 15:53:22|INFO: Collector script staged at /workspace/storage/results/collector-staging/mlps_collector.py (persisted as run artifact)
2026-04-29 15:53:22|INFO: Running MPI collection across 1 host(s)
2026-04-29 15:53:23|INFO: MPI collection completed successfully (1 hosts reported)
2026-04-29 15:53:23|INFO: Created benchmark run: training_run_flux_20260429_155322
2026-04-29 15:53:23|STATUS: Verifying benchmark run for training_run_flux_20260429_155322
2026-04-29 15:53:23|RESULT: Minimum file count dictated by dataset size to memory size ratio.
2026-04-29 15:53:23|STATUS: Closed: [CLOSED] Closed parameter override allowed: dataset.num_files_train = 400 (Parameter: Overrode Parameters)
2026-04-29 15:53:23|ERROR: INVALID: [INVALID] Insufficient number of training files (Parameter: dataset.num_files_train, Expected: >= 8675, Actual: 400)
2026-04-29 15:53:23|STATUS: Benchmark run is INVALID due to 1 issues ([RunID(program='training', command='run', model='flux', run_datetime='20260429_155322')])
2026-04-29 15:53:23|WARNING: Running the benchmark without verification for open or closed configurations. These results are not valid for submission. Use --open or --closed to specify a configuration.
⠋ Validating environment... ━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/4 0:00:002026-04-29 15:53:23|INFO: Results directory: None
2026-04-29 15:53:23|WARNING: Results directory is not set, using default results directory
2026-04-29 15:53:23|INFO: Collector script staged at /workspace/storage/results/collector-staging/mlps_collector.py (persisted as run artifact)
2026-04-29 15:53:23|INFO: Running MPI collection across 1 host(s)
⠋ Collecting cluster info... ━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━ 2/4 0:00:002026-04-29 15:53:24|INFO: MPI collection completed successfully (1 hosts reported)
2026-04-29 15:53:24|STATUS: Running benchmark command:: mpirun -n 2 -host 127.0.0.1:2 --bind-to none --map-by socket --allow-run-as-root /root/raysmond/workspace/mlperf_v3/storage2/.venv/bin/dlio_benchmark workload=flux_b200 ++hydra.run.dir=/root/raysmond/workspace/mlperf_v3/result/training/flux/run/20260429_155322 ++hydra.output_subdir=dlio_config ++workload.dataset.num_files_train=400 ++workload.dataset.data_folder=/mnt/1030_6T/flux --config-dir=/root/raysmond/workspace/mlperf_v3/storage2/configs/dlio
[OUTPUT] [DEBUG DLIOBenchmark.__init__] After LoadConfig:
[OUTPUT]   storage_type   = <StorageType.LOCAL_FS: 'local_fs'>
[OUTPUT]   storage_root   = './'
[OUTPUT]   storage_options= None
[OUTPUT]   data_folder    = '/mnt/1030_6T/flux'
[OUTPUT]   framework      = <FrameworkType.PYTORCH: 'pytorch'>
[OUTPUT]   num_files_train= 400
[OUTPUT]   record_length  = 65536
[OUTPUT]   generate_data  = False
[OUTPUT]   do_train       = True
[OUTPUT]   do_checkpoint  = False
[OUTPUT]   epochs         = 1
[OUTPUT]   batch_size     = 48
[OUTPUT] 2026-04-29T15:53:27.298496 Running DLIO [Training] with 2 process(es)
[WARNING] The amount of dataset is smaller than the host memory; data might be cached after the first epoch. Increase the size of dataset to eliminate the caching effect!!!
[OUTPUT] 2026-04-29T15:53:27.316719 Max steps per epoch: 1200 = 288 * 400 / 48 / 2 (samples per file * num files / batch size / comm size)
[OUTPUT] 2026-04-29T15:53:27.384825 Starting epoch 1: 1200 steps expected
⠙ Running benchmark... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━ 3/4 0:03:31
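
While the run sits at this step, it may help to check what the dlio_benchmark worker processes are actually doing. A rough sketch (the PID is a placeholder, not taken from an actual run):

# Find the dlio_benchmark workers spawned by mpirun
pgrep -af dlio_benchmark

# Dump the current Python stack of one worker to see where it is blocked
# (requires py-spy, e.g. `pip install py-spy`)
py-spy dump --pid <PID>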

I captured the NVMe R/W metrics with node-exporter. The chart shows that, for most of the time, the NVMe has no read I/O at all, with only occasional bursts of tens of MB/s every few minutes. Based on our experience testing v2, sustained read throughput of over 10 GB/s would be the expected norm.
[Image: node-exporter chart of NVMe read/write throughput during the run]
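
The read traffic can also be spot-checked directly on the host, outside of node-exporter, with something like the following (nvme0n1 and the :9100 scrape port are placeholders for the actual device and endpoint):

# Raw read-bytes counter exposed by node-exporter for the NVMe device
curl -s http://localhost:9100/metrics | grep 'node_disk_read_bytes_total{device="nvme0n1"}'

# Live per-device throughput in MB/s, refreshed every second
iostat -xm 1 nvme0n1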

I'm running the benchmark on Ubuntu 24.04 and here is some additional system information:

(.venv) root@cnit-zz-01:~/raysmond/workspace/mlperf_v3/storage2# mpirun --version
mpirun (Open MPI) 4.1.6

Report bugs to http://www.open-mpi.org/community/help/
(.venv) root@cnit-zz-01:~/raysmond/workspace/mlperf_v3/storage2# uv pip list | grep dlio
dlio-benchmark          3.0.0
s3dlio                  0.9.86

/mnt/1030_6T is the mount directory for a PCIe Gen5 NVMe drive with 6 TB of capacity.
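
For reference, a rough fio sequential-read run along these lines could be used to rule out the drive itself as the bottleneck (the file path, size, and job count here are illustrative only):

fio --name=seqread --filename=/mnt/1030_6T/fio_testfile --rw=read --bs=1M --size=50G \
    --direct=1 --ioengine=libaio --iodepth=32 --numjobs=4 --group_reporting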
