Summary
Assess and, if needed, improve the Parquet data generator used by mlpstorage to ensure it can produce datasets within a reasonable time on representative storage backends. The ideal SLA is that generation of a representative Parquet dataset takes no more than 30 minutes to 1 hour, if necessary by scaling the data generation across multiple hosts during testing.
Motivation
Parquet datasets used by workloads such as DLRM and FLUX can be very large. If the data generator is slow, dataset preparation becomes a significant bottleneck for users running the benchmark, increases time-to-first-result, and discourages experimentation with different dataset sizes or parameters.
We need a clear baseline of the current generator's throughput, an understanding of whether it is limited by storage, by CPU, or by the generator implementation itself, and an assessment of whether the ideal SLA of no more than 30 min – 1 h can be met when generation is scaled across multiple hosts.
Proposed methodology
- Define a representative Parquet dataset size and a fixed host configuration.
- Measure the current Parquet data generator throughput (GB/s and files/s) on representative storage backends.
- Profile the generator to identify whether the limit is CPU-side (encoding, Python overhead, single-threaded sections) or storage-side (write bandwidth, IOPS).
- Test scaling the data generation across multiple hosts and measure how generation time evolves with the number of hosts.
- Record for each run:
  - Generation throughput (GB/s, files/s)
  - CPU utilization (overall and per worker)
  - Storage-side write metrics (write bandwidth, IOPS, average request size)
  - Wall-clock time to produce the full dataset
  - Scaling efficiency vs. number of hosts
- Compare the measured wall-clock time against the ideal SLA of no more than 30 min – 1 h.
Success criteria
- The Parquet data generator reaches a generation time that does not make dataset preparation an obstacle to testing and benchmark tuning, ideally no more than 30 min – 1 h when generation is scaled across multiple hosts.
Related