
[Training v3.0 consolidation] Ensure the Parquet data generator delivers reasonable generation throughput #358

@wolfgang-desalvador

Description

Summary

Assess and, if needed, improve the Parquet data generator used by mlpstorage so that it can produce datasets within a reasonable time on representative storage backends. The ideal SLA is that generating a representative Parquet dataset takes no more than 30 minutes to 1 hour, if necessary by scaling the data generation across multiple hosts during testing.

Motivation

Parquet datasets used by workloads such as DLRM and FLUX can be very large. If the data generator is slow, dataset preparation becomes a significant bottleneck for users running the benchmark, increases time-to-first-result, and discourages experimentation with different dataset sizes or parameters.

We need a clear baseline of the current generator throughput, an understanding of whether it is limited by storage, by CPU, or by the generator implementation itself, and a check of whether the ideal SLA of 30 minutes to 1 hour can be reached when scaling generation across multiple hosts.
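One cheap first signal for step 3 of the methodology below is comparing process CPU time to wall-clock time over a generation run: a ratio near 1 suggests the generator is CPU-bound (encoding, Python overhead), while a ratio well below 1 suggests it is mostly waiting on storage. A minimal sketch (the function name is hypothetical, not part of mlpstorage):

```python
import time

def cpu_wall_ratio(fn):
    """Run fn() once and return (process CPU time) / (wall-clock time).

    A ratio close to 1.0 hints at a CPU-bound workload; a ratio much
    lower than 1.0 hints at time spent blocked on I/O. This is only a
    coarse screen before proper profiling (e.g. cProfile, iostat).
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return cpu / wall

if __name__ == "__main__":
    # Example: a pure-CPU loop should score close to 1.0.
    print(cpu_wall_ratio(lambda: sum(range(2_000_000))))
```

This does not replace a real profile, but it quickly classifies a run as CPU-side or storage-side before drilling down per worker.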

Proposed methodology

  1. Define a representative Parquet dataset size and a fixed host configuration.
  2. Measure the current Parquet data generator throughput (GB/s and files/s) on representative storage backends.
  3. Profile the generator to identify whether the limit is CPU-side (encoding, Python overhead, single-threaded sections) or storage-side (write bandwidth, IOPS).
  4. Test scaling the data generation across multiple hosts and measure how generation time evolves with the number of hosts.
  5. Record for each run:
    • Generation throughput (GB/s, files/s)
    • CPU utilization (overall and per worker)
    • Storage-side write metrics (write bandwidth, IOPS, average request size)
    • Wall-clock time to produce the full dataset
    • Scaling efficiency vs. number of hosts
  6. Compare the measured wall-clock time against the ideal SLA of no more than 30 min – 1 h.
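Steps 2 and 5 can be collected with a small timing harness wrapped around the generator. The sketch below assumes nothing about mlpstorage's internals: `write_fn` stands in for whatever call produces one Parquet file, and `dummy_writer` is a hypothetical placeholder used only to make the example runnable.

```python
import os
import time
import tempfile

def measure_generation_throughput(write_fn, n_files, out_dir):
    """Time write_fn(path) over n_files output files and report
    aggregate throughput. write_fn is any callable that creates one
    dataset file at the given path (for mlpstorage this would wrap
    the real Parquet generator; here we only time it and sum the
    resulting file sizes)."""
    start = time.perf_counter()
    total_bytes = 0
    for i in range(n_files):
        path = os.path.join(out_dir, f"part-{i:05d}.parquet")
        write_fn(path)
        total_bytes += os.path.getsize(path)
    elapsed = time.perf_counter() - start
    return {
        "wall_clock_s": elapsed,
        "gb_per_s": total_bytes / elapsed / 1e9,
        "files_per_s": n_files / elapsed,
    }

# Stand-in writer: 1 MiB of zeros per "file". Replace with the real
# generator call when measuring for this issue.
def dummy_writer(path):
    with open(path, "wb") as f:
        f.write(b"\0" * (1 << 20))

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        stats = measure_generation_throughput(dummy_writer, 8, d)
        print(f"{stats['files_per_s']:.1f} files/s, "
              f"{stats['gb_per_s']:.3f} GB/s, "
              f"{stats['wall_clock_s']:.2f} s wall clock")
```

Running the same harness with 1, 2, 4, ... hosts (one instance per host, writing to the shared backend) and dividing the single-host wall-clock time by the N-host time gives the scaling-efficiency figure asked for in step 5.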

Success criteria

  • The Parquet data generator reaches a generation time that does not become an obstacle for testing and benchmark tuning, ideally no more than 30 min to 1 h when scaled across multiple hosts.
