Summary
This issue proposes evaluating Apache Arrow IPC as an alternative on-disk data format for the DLRM workload, complementing the Parquet optimization effort tracked in #354.
Motivation
DLRM currently relies on Parquet for sample storage. Parquet decoding and reader-side overhead may limit per-accelerator throughput on high-bandwidth storage backends. While #354 explores how far the Parquet path can be pushed via generation and reader tuning, it is worth assessing an alternative format that preserves the columnar model but reduces deserialization cost.
Arrow IPC is a natural candidate: it stores data in the same in-memory layout used by Arrow consumers, removing most of the decode work performed when reading Parquet, while remaining widely supported in the data ecosystem.
Proposed methodology
- Define a baseline DLRM dataset equivalent in content to the current Parquet dataset.
- Generate the dataset in Arrow IPC format (both file and stream variants) using representative chunk sizes.
- Implement or enable an Arrow IPC reader path in the DLRM pipeline.
- Run the DLRM workload on the same accelerator/host configuration used for the Parquet baseline, recording:
  - Achieved throughput (GB/s, samples/s) per accelerator
  - CPU utilization (overall and per worker)
  - Storage-side metrics (read IOPS, average request size)
  - Reader latency distribution
  - On-disk dataset size vs. Parquet
- Compare against the tuned Parquet baseline from #354 on the same metrics.
Success criteria
Arrow IPC shows a measurable per-accelerator throughput improvement over the tuned Parquet baseline from #354, without an unacceptable increase in on-disk dataset size.
Related
- #354: Parquet generation and reader tuning for the DLRM workload