benchmarks for writing REE arrays to parquet#9936
Open
Rich-T-kid wants to merge 3 commits into
Open
Conversation
b59c826 to
3179d31
Compare
Rich-T-kid
commented
May 14, 2026
| null_density: f32, | ||
| true_density: f32, | ||
| ) -> Result<ArrayRef> { | ||
| const MIN_RUN: usize = 8; |
Contributor
Author
There was a problem hiding this comment.
Would it be worth it to add these as parameters or is it fine like this?
…riter to allow for ree to be included without erroring out
3179d31 to
801e4a5
Compare
Rich-T-kid
commented
May 14, 2026
| let mut file = Empty::default(); | ||
| let mut writer = | ||
| ArrowWriter::try_new(&mut file, batch.schema(), Some(props.clone())).unwrap(); | ||
| let Ok(mut writer) = ArrowWriter::try_new(&mut file, batch.schema(), Some(props.clone())) |
Contributor
Author
There was a problem hiding this comment.
This approach adds no overhead for the regular (non-ree) branches since unwrap should boil down to the same machine code as this expression. the issue is this doesn't allow for errors to be propagated for other reasons. this shouldnt be an issue for existing benchmarks since they all run fine with no issues. is there another way to go about this?
Rich-T-kid
commented
May 14, 2026
| let batch = create_string_bench_batch_non_null(BATCH_SIZE, 0.25, 0.75).unwrap(); | ||
| batches.push(("string_non_null", batch)); | ||
|
|
||
| let batch = create_string_ree_bench_batch(BATCH_SIZE, 0.25, 0.75).unwrap(); |
Contributor
Author
There was a problem hiding this comment.
maybe other types can be added? other primitive types such as [int,float] as well as writing out nested types like [list,struct,ree over ree]
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Rationale for this change
there is no way to currently tell which approach to writing out REE columns to parquet is more performant. This PR aims to solve that.
What changes are included in this PR?
Added a
create_string_ree_bench_batch()function that builds record batches of REE data — it plugs into the existing benchmark structure.For controlling the shape of the generated REE arrays, I currently have two constants,
MIN_RUNandMAX_RUN, that bound the run length. The intent is to let benchmarks cover long uniform runs as well as shorter / more sparse data, rather than only one shape.An alternative would be a small params struct with defaults that callers can override — happy to switch to that if it's preferred, but that would require changing other callsites
Are these changes tested?
yes
Are there any user-facing changes?
no