Skip to content

benchmarks for writing REE arrays to parquet#9936

Open
Rich-T-kid wants to merge 3 commits into
apache:mainfrom
Rich-T-kid:rich-T-kid/REE-to-Parquet-BenchMarks
Open

benchmarks for writing REE arrays to parquet#9936
Rich-T-kid wants to merge 3 commits into
apache:mainfrom
Rich-T-kid:rich-T-kid/REE-to-Parquet-BenchMarks

Conversation

@Rich-T-kid
Copy link
Copy Markdown
Contributor

@Rich-T-kid Rich-T-kid commented May 7, 2026

Which issue does this PR close?

Rationale for this change

there is no way to currently tell which approach to writing out REE columns to parquet is more performant. This PR aims to solve that.

What changes are included in this PR?

Added a create_string_ree_bench_batch() function that builds record batches of REE data — it plugs into the existing benchmark structure.

For controlling the shape of the generated REE arrays, I currently have two constants, MIN_RUN and MAX_RUN, that bound the run length. The intent is to let benchmarks cover long uniform runs as well as shorter / more sparse data, rather than only one shape.

An alternative would be a small params struct with defaults that callers can override — happy to switch to that if it's preferred, but that would require changing other callsites

Are these changes tested?

yes

Are there any user-facing changes?

no

@Rich-T-kid Rich-T-kid changed the title draft of benchmarks benchmarks for writing REE arrays to parquet May 14, 2026
@Rich-T-kid Rich-T-kid force-pushed the rich-T-kid/REE-to-Parquet-BenchMarks branch from b59c826 to 3179d31 Compare May 14, 2026 16:48
@github-actions github-actions Bot added parquet Changes to the parquet crate arrow Changes to the arrow crate labels May 14, 2026
null_density: f32,
true_density: f32,
) -> Result<ArrayRef> {
const MIN_RUN: usize = 8;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be worth it to add these as parameters or is it fine like this?

…riter to allow for ree to be included without erroring out
@Rich-T-kid Rich-T-kid force-pushed the rich-T-kid/REE-to-Parquet-BenchMarks branch from 3179d31 to 801e4a5 Compare May 14, 2026 16:52
let mut file = Empty::default();
let mut writer =
ArrowWriter::try_new(&mut file, batch.schema(), Some(props.clone())).unwrap();
let Ok(mut writer) = ArrowWriter::try_new(&mut file, batch.schema(), Some(props.clone()))
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach adds no overhead for the regular (non-ree) branches since unwrap should boil down to the same machine code as this expression. the issue is this doesn't allow for errors to be propagated for other reasons. this shouldnt be an issue for existing benchmarks since they all run fine with no issues. is there another way to go about this?

@Rich-T-kid Rich-T-kid marked this pull request as ready for review May 14, 2026 16:56
Comment thread parquet/benches/arrow_writer.rs Outdated
let batch = create_string_bench_batch_non_null(BATCH_SIZE, 0.25, 0.75).unwrap();
batches.push(("string_non_null", batch));

let batch = create_string_ree_bench_batch(BATCH_SIZE, 0.25, 0.75).unwrap();
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe other types can be added? other primitive types such as [int,float] as well as writing out nested types like [list,struct,ree over ree]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add benchmarks for REE to parquet

1 participant