
[Training v3.0 consolidation] Keep a single Parquet file handle open in the FLUX reader to avoid row group cache eviction #356

@wolfgang-desalvador

Description

Summary

Improve the Parquet reader path used by the FLUX workload by keeping a single, persistent file handle open per file for the lifetime of the reader. The main goal is to avoid losing the reader's row group cache across subsequent reads of the same file; today that cache is discarded whenever the file handle is closed and reopened.

Motivation

The current FLUX reader pattern appears to open the Parquet file (or recreate the ParquetFile / dataset reader) for each row group access; a hypothetical reconstruction follows the list below. Each reopen forces:

  • Re-reading and re-parsing the Parquet footer.
  • Discarding any column chunk / row group state cached by the reader itself.
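As a point of reference, here is a hypothetical reconstruction of that reopen-per-access pattern in pyarrow (the function name and path are illustrative, not the actual FLUX reader code):

```python
import pyarrow.parquet as pq

def read_row_group_reopening(path: str, row_group: int):
    """Reopen-per-access: every call pays the footer parse again."""
    pf = pq.ParquetFile(path)   # re-reads and re-parses the footer
    try:
        return pf.read_row_group(row_group)
    finally:
        pf.close()              # discards all reader-side state
```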

By keeping a single file handle (and the associated ParquetFile / dataset reader) open per worker (see the sketch after this list), we can:

  • Reuse decoded footer/metadata.
  • Preserve the reader-level row group and column chunk cache across consecutive reads.
  • Reduce CPU overhead and storage-side open calls.
  • Improve effective read throughput, especially for high-bandwidth backends where the per-call overhead dominates.
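A minimal sketch of the persistent-handle variant, assuming a recent pyarrow where `ParquetFile` exposes `close()` (the file name is a placeholder):

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("flux_shard.parquet")  # footer read and parsed once
try:
    for i in range(pf.metadata.num_row_groups):
        batch = pf.read_row_group(i)  # no reopen: decoded metadata is reused
        # ... hand `batch` to the training pipeline ...
finally:
    pf.close()
```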

Proposed change

  • Ensure that subsequent re-reads of the same Parquet file neither close the file handle nor evict the reader's row group cache; one possible shape is the per-worker handle cache sketched below. The change must be validated on the FLUX workload.
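One possible shape for this, sketched as a hypothetical per-worker handle cache (the class and method names are ours, not an existing FLUX interface):

```python
import pyarrow.parquet as pq

class ParquetHandleCache:
    """Keeps one open ParquetFile per path for the lifetime of a worker,
    so re-reads of the same file neither close the handle nor evict
    reader-side state."""

    def __init__(self) -> None:
        self._handles: dict[str, pq.ParquetFile] = {}

    def get(self, path: str) -> pq.ParquetFile:
        pf = self._handles.get(path)
        if pf is None:
            pf = pq.ParquetFile(path)  # footer parsed only on first access
            self._handles[path] = pf
        return pf

    def close_all(self) -> None:
        for pf in self._handles.values():
            pf.close()
        self._handles.clear()
```

A real implementation would likely also need a bound on the number of open handles (e.g. LRU eviction) so that file descriptor usage and memory footprint stay in check when workers touch many files.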

Validation

  1. Microbenchmark: read N row groups from a representative FLUX Parquet file with and without the persistent handle; compare wall-clock time, CPU time, and number of storage opens (a minimal sketch follows this list).
  2. End-to-end FLUX run on a fixed accelerator/host configuration; compare:
    • Throughput (GB/s, samples/s) per accelerator
    • CPU utilization per worker
    • Storage-side open/read counts
    • Reader latency distribution
  3. Confirm correctness: identical sample content and ordering vs. the previous reader.
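As a starting point for step 1, a hypothetical wall-clock microbenchmark (CPU time and storage-side open counts would need OS-level tooling or backend metrics; all names here are illustrative):

```python
import time
import pyarrow.parquet as pq

def bench(path: str, repeats: int = 5) -> None:
    meta_pf = pq.ParquetFile(path)
    n = meta_pf.metadata.num_row_groups
    meta_pf.close()

    # Reopen-per-read: pays the footer parse for every row group.
    t0 = time.perf_counter()
    for _ in range(repeats):
        for i in range(n):
            pf = pq.ParquetFile(path)
            pf.read_row_group(i)
            pf.close()
    reopen_s = time.perf_counter() - t0

    # Persistent handle: footer parsed once, handle reused throughout.
    t0 = time.perf_counter()
    pf = pq.ParquetFile(path)
    for _ in range(repeats):
        for i in range(n):
            pf.read_row_group(i)
    pf.close()
    persistent_s = time.perf_counter() - t0

    print(f"reopen-per-read: {reopen_s:.3f}s  persistent: {persistent_s:.3f}s")
```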

Success criteria

  • Measurable reduction in per-row-group read overhead.
  • Improved FLUX throughput or reduced CPU cost on at least one representative storage backend.
  • No regression in correctness or memory footprint.
