Summary
Improve the Parquet reader path used by the FLUX workload by keeping a single, persistent file handle open per file for the lifetime of the reader. The main goal is to preserve the reader's row group cache across successive reads of the same file; today that cache is discarded whenever the file handle is closed and reopened.
Motivation
The current FLUX reader pattern appears to open the Parquet file (or recreate the ParquetFile / dataset reader) for each row group access. Each reopen forces:
- Re-reading and re-parsing the Parquet footer.
- Discarding any column chunk / row group state cached by the reader itself.
By keeping a single file handle (and the associated ParquetFile / dataset reader) open per file in each worker, we can:
- Reuse decoded footer/metadata.
- Preserve the reader-level row group and column chunk cache across consecutive reads.
- Reduce CPU overhead and storage-side open calls.
- Improve effective read throughput, especially for high-bandwidth backends where the per-call overhead dominates.
Proposed change
- Ensure that repeated reads of the same Parquet file reuse the already-open file handle: the handle is not closed between reads, and the reader's row group cache is not evicted. The change must be tested on the FLUX workload.
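One way to picture the proposed behavior is a small per-worker cache that opens each file once and hands back the same reader on every later access. The sketch below is illustrative only: `PersistentReaderCache`, `open_fn`, and `open_count` are hypothetical names, not part of the FLUX reader, and the default opener assumes the reader is built on `pyarrow.parquet.ParquetFile`. Whatever state the underlying reader holds (parsed footer, cached row groups) lives as long as the cache does.

```python
from typing import Any, Callable, Dict


def _open_parquet(path: str) -> Any:
    # Assumption: the reader is a pyarrow ParquetFile. Imported lazily so the
    # cache itself has no hard dependency on pyarrow.
    import pyarrow.parquet as pq

    return pq.ParquetFile(path)


class PersistentReaderCache:
    """Keeps one open reader per path until close() is called (hypothetical helper)."""

    def __init__(self, open_fn: Callable[[str], Any] = _open_parquet) -> None:
        self._open_fn = open_fn
        self._readers: Dict[str, Any] = {}
        self.open_count = 0  # number of times a file was actually opened

    def get(self, path: str) -> Any:
        # Open the file only on first access; later calls reuse the same reader.
        reader = self._readers.get(path)
        if reader is None:
            reader = self._open_fn(path)
            self._readers[path] = reader
            self.open_count += 1
        return reader

    def read_row_group(self, path: str, index: int) -> Any:
        # Delegates to the reader's own read_row_group(index) method.
        return self.get(path).read_row_group(index)

    def close(self) -> None:
        # Close every reader that exposes close(), then forget them all.
        for reader in self._readers.values():
            close = getattr(reader, "close", None)
            if callable(close):
                close()
        self._readers.clear()
```

A worker would create one cache at startup, call `read_row_group(path, i)` for each access, and call `close()` at shutdown. Because every handle stays open for the life of the cache, a worker that touches many files may need an upper bound on open handles; that policy is left out of this sketch.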
Validation
- Microbenchmark: read N row groups from a representative FLUX Parquet file with and without the persistent handle; compare wall-clock time, CPU time, and number of storage opens.
- End-to-end FLUX run on a fixed accelerator/host configuration; compare:
  - Throughput (GB/s, samples/s) per accelerator
  - CPU utilization per worker
  - Storage-side open/read counts
  - Reader latency distribution
- Confirm correctness: identical sample content and ordering vs. the previous reader.
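The microbenchmark and the correctness check could share one small harness, sketched below under a few assumptions. `bench_row_group_reads` and its parameters are hypothetical names; `open_fn` stands in for whatever constructs the reader (for example `pyarrow.parquet.ParquetFile`), and the `opens` counter tracks client-side open calls only, so storage-side open counts would still have to come from the backend's own metrics.

```python
import time
from typing import Any, Callable, Dict, List


def _close(reader: Any) -> None:
    # Close the reader if it exposes close(); otherwise do nothing.
    close = getattr(reader, "close", None)
    if callable(close):
        close()


def bench_row_group_reads(
    open_fn: Callable[[str], Any],
    path: str,
    row_groups: List[int],
    persistent: bool,
) -> Dict[str, Any]:
    """Read the given row groups, reopening per access unless persistent is True."""
    opens = 0
    results: List[Any] = []
    reader = None
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    for index in row_groups:
        if reader is None:
            reader = open_fn(path)
            opens += 1
        results.append(reader.read_row_group(index))
        if not persistent:
            # Baseline pattern: drop the handle after every row group.
            _close(reader)
            reader = None
    if reader is not None:
        _close(reader)
    return {
        "wall_s": time.perf_counter() - wall_start,
        "cpu_s": time.process_time() - cpu_start,
        "opens": opens,
        "results": results,
    }
```

Running it twice on the same file, once with `persistent=False` and once with `persistent=True`, gives the wall-clock time, CPU time, and open count for each mode. Comparing the two `results` lists element by element, in order, covers the content-and-ordering check for the row groups read.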
Success criteria
- Measurable reduction in per-row-group read overhead.
- Improved FLUX throughput or reduced CPU cost on at least one representative storage backend.
- No regression in correctness or memory footprint.