
[Training v3.0 consolidation] Ensure the Parquet data generator delivers reasonable generation throughput #358

@wolfgang-desalvador

Description

Summary

Assess and, if needed, improve the Parquet data generator used by mlpstorage so that it can produce datasets within a reasonable time on representative storage backends. The ideal SLA is that generating a representative Parquet dataset takes no more than 30 minutes to 1 hour, if necessary by scaling the data generation across multiple hosts during testing.

Motivation

Parquet datasets used by workloads such as DLRM and FLUX can be very large. If the data generator is slow, dataset preparation becomes a significant bottleneck for users running the benchmark, increases time-to-first-result, and discourages experimentation with different dataset sizes or parameters.

We need a clear baseline of the current generator throughput, an understanding of whether it is limited by storage, by CPU, or by the generator implementation itself, and a check of whether the ideal SLA of 30 minutes to 1 hour can be reached when scaling generation across multiple hosts.
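One cheap first signal for step 3 of the methodology below is comparing process CPU time to wall-clock time over a generation run: a ratio near 1 suggests the generator is CPU-bound (encoding, Python overhead), while a ratio well below 1 suggests it is mostly waiting on storage. A minimal sketch (the function name is hypothetical, not part of mlpstorage):

```python
import time

def cpu_wall_ratio(fn):
    """Run fn() once and return (process CPU time) / (wall-clock time).

    A ratio close to 1.0 hints at a CPU-bound workload; a ratio much
    lower than 1.0 hints at time spent blocked on I/O. This is only a
    coarse screen before proper profiling (e.g. cProfile, iostat).
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return cpu / wall

if __name__ == "__main__":
    # Example: a pure-CPU loop should score close to 1.0.
    print(cpu_wall_ratio(lambda: sum(range(2_000_000))))
```

This does not replace a real profile, but it quickly classifies a run as CPU-side or storage-side before drilling down per worker.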

Proposed methodology

  1. Define a representative Parquet dataset size and a fixed host configuration.
  2. Measure the current Parquet data generator throughput (GB/s and files/s) on representative storage backends.
  3. Profile the generator to identify whether the limit is CPU-side (encoding, Python overhead, single-threaded sections) or storage-side (write bandwidth, IOPS).
  4. Test scaling the data generation across multiple hosts and measure how generation time evolves with the number of hosts.
  5. Record for each run:
    • Generation throughput (GB/s, files/s)
    • CPU utilization (overall and per worker)
    • Storage-side write metrics (write bandwidth, IOPS, average request size)
    • Wall-clock time to produce the full dataset
    • Scaling efficiency vs. number of hosts
  6. Compare the measured wall-clock time against the ideal SLA of no more than 30 min – 1 h.
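Steps 2 and 5 can be collected with a small timing harness wrapped around the generator. The sketch below assumes nothing about mlpstorage's internals: `write_fn` stands in for whatever call produces one Parquet file, and `dummy_writer` is a hypothetical placeholder used only to make the example runnable.

```python
import os
import time
import tempfile

def measure_generation_throughput(write_fn, n_files, out_dir):
    """Time write_fn(path) over n_files output files and report
    aggregate throughput. write_fn is any callable that creates one
    dataset file at the given path (for mlpstorage this would wrap
    the real Parquet generator; here we only time it and sum the
    resulting file sizes)."""
    start = time.perf_counter()
    total_bytes = 0
    for i in range(n_files):
        path = os.path.join(out_dir, f"part-{i:05d}.parquet")
        write_fn(path)
        total_bytes += os.path.getsize(path)
    elapsed = time.perf_counter() - start
    return {
        "wall_clock_s": elapsed,
        "gb_per_s": total_bytes / elapsed / 1e9,
        "files_per_s": n_files / elapsed,
    }

# Stand-in writer: 1 MiB of zeros per "file". Replace with the real
# generator call when measuring for this issue.
def dummy_writer(path):
    with open(path, "wb") as f:
        f.write(b"\0" * (1 << 20))

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        stats = measure_generation_throughput(dummy_writer, 8, d)
        print(f"{stats['files_per_s']:.1f} files/s, "
              f"{stats['gb_per_s']:.3f} GB/s, "
              f"{stats['wall_clock_s']:.2f} s wall clock")
```

Running the same harness with 1, 2, 4, ... hosts (one instance per host, writing to the shared backend) and dividing the single-host wall-clock time by the N-host time gives the scaling-efficiency figure asked for in step 5.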

Success criteria

  • The Parquet data generator reaches a generation time that does not become an obstacle for testing and benchmark tuning, ideally no more than 30 min to 1 h when scaled across multiple hosts.
