feat(parquet): fuse level encoding with counting and histogram updates#9795
Conversation
etseidl
left a comment
There was a problem hiding this comment.
Looks nice, thanks @HippoBaro.
|
run benchmark arrow_writer |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing fuse_lvl_encoding_hist_counting (849685c) to b93240a (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
alamb
left a comment
There was a problem hiding this comment.
Thanks @HippoBaro -- I think this looks really close. Thank you @etseidl for the review
| self.rep_levels_encoder | ||
| .put_with_observer(levels, |level, count| { | ||
| new_rows += (count as u32) * (level == 0) as u32; | ||
| if let Some(ref mut h) = self.page_metrics.repetition_level_histogram { |
There was a problem hiding this comment.
You might be able to move this check out of the loop (so call put if self.page_metrics.repetition_level_histogram is none and and call with_with_observer if it s some
Maybe could improve the inner loop even more
There was a problem hiding this comment.
I did a quick test of that idea earlier and it didn't seem worth the added complexity. But it probably deserves a second look on better hardware 😄.
There was a problem hiding this comment.
Just did the experiment locally with this:
match self.page_metrics.definition_level_histogram.as_mut() {
Some(histogram) => encoder.put_with_observer(levels, |level, count| {
values_to_write += count * (level == max_def) as usize;
histogram.increment_by(level, count as i64);
}),
None => encoder.put_with_observer(levels, |level, count| {
values_to_write += count * (level == max_def) as usize;
}),
};Benchmarks show a 2-3% improvements on average for list_primitive.
There was a problem hiding this comment.
I've pushed the above change!
849685c to
8c4bd85
Compare
Add `put_with_observer()` to `LevelEncoder` so callers can piggyback row/null counting and histogram updates onto level encoding without extra passes over the level buffers. Update `write_mini_batch()` to encode definition and repetition levels while collecting the associated metrics in the same pass, and hoist the histogram-enabled branch out of the inner loop. Add `LevelHistogram::increment_by()` for counted updates, keep `update_from_levels()` as a deprecated compatibility wrapper, and remove the now-unnecessary PageMetrics histogram update helpers. Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
8c4bd85 to
28f25c0
Compare
|
run benchmark arrow_writer |
|
Hi @HippoBaro, thanks for the request (#9795 (comment)). Only whitelisted users can trigger benchmarks. Allowed users: Dandandan, Fokko, Jefffrey, Omega359, adriangb, alamb, asubiotto, brunal, buraksenn, cetra3, codephage2020, comphead, erenavsarogullari, etseidl, friendlymatthew, gabotechs, geoffreyclaude, grtlr, haohuaijin, jonathanc-n, kevinjqliu, klion26, kosiew, kumarUjjawal, kunalsinghdadhwal, liamzwbao, mbutrovich, mkleen, mzabaluev, neilconway, rluvaton, sdf-jkl, timsaucer, xudong963, zhuqi-lucas. File an issue against this benchmark runner |
I thought so 😅 Was worth a try! |
|
run benchmark arrow_writer |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing fuse_lvl_encoding_hist_counting (28f25c0) to b93240a (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
alamb
left a comment
There was a problem hiding this comment.
I think it looks very nice persoanlly. The only last reamining question in my mind is if we should be super safe and restore put with a deprecated note.
I think if we want to include it in the next arrow (58.2.0, I am hoping to cut the release in the next day or two) we should include the deprecation
If we are ok with waiting until arrow 59 (in a month or so) we can probably not worry about put
|
Thanks @etseidl and @HippoBaro |
|
@etseidl do you have any concerns about this PR or would you like to review it again before we merge? |
Just a fix to the deprecation message then I think we're good to go 🚀 |
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
|
🚀 |
Which issue does this PR close?
Rationale for this change
See #9731
What changes are included in this PR?
Add
put_with_observer()toLevelEncoderthat calls anFnMut(i16, usize)observer for each value during encoding. This allows callers to piggyback counting and histogram updates into the encoding pass without extra iterations over the level buffer.Previously,
write_mini_batch()made 3 separate passes over each level array: one to count non-null values or row boundaries, one to update the level histogram, and one to RLE-encode. Now all three operations happen in a single pass via the observer closure.Replace
LevelHistogram::update_from_levels()with a newLevelHistogram::increment_by()that accepts a count, and remove the now-unnecessaryupdate_definition_level_histogram()andupdate_repetition_level_histogram()methods from PageMetrics.Are these changes tested?
All tests passing; existing tests give 100% coverage.
Are there any user-facing changes?
None