Summary
This issue proposes evaluating Apache Arrow IPC as an alternative on-disk data format for the DLRM workload, complementing the Parquet optimization effort tracked in #354.
Motivation
DLRM currently relies on Parquet for sample storage. Parquet decoding and reader-side overhead may limit per-accelerator throughput on high-bandwidth storage backends. While #354 explores how far the Parquet path can be pushed via generation and reader tuning, it is worth assessing an alternative format that preserves the columnar model but reduces deserialization cost.
Arrow IPC is a natural candidate: it stores data in the same in-memory layout used by Arrow consumers, removing most of the decode work performed when reading Parquet, while remaining widely supported in the data ecosystem.
Proposed methodology
- Define a baseline DLRM dataset equivalent in content to the current Parquet dataset.
- Generate the dataset in Arrow IPC format (both file and stream variants) using representative chunk sizes.
- Implement or enable an Arrow IPC reader path in the DLRM pipeline.
- Run the DLRM workload on the same accelerator/host configuration used for the Parquet baseline, recording:
  - Achieved throughput (GB/s, samples/s) per accelerator
  - CPU utilization (overall and per worker)
  - Storage-side metrics (read IOPS, average request size)
  - Reader latency distribution
  - On-disk dataset size vs. Parquet
- Compare against the tuned Parquet baseline from #354 on the same metrics.
Success criteria
Arrow IPC shows a measurable per-accelerator throughput improvement over the tuned Parquet baseline from #354, without an unacceptable increase in on-disk dataset size.
Related
- #354: Parquet generation and reader tuning for the DLRM workload