feat(parquet): compact level representation with generic writer dispatch #9831
alamb merged 4 commits into apache:main
Conversation
etseidl
left a comment
Made it through the first pass. Everything looks good, but I want to do a second pass when my brain is more awake 😅
/// API wraps caller-provided `&[i16]` slices directly as `Materialized`, while the Arrow
/// writer path converts owned `LevelData` via `.as_ref()` (which may also produce `Uniform`).
#[derive(Debug, Clone, Copy)]
pub(crate) enum LevelDataRef<'a> {
Just curious why this is defined here rather than where LevelData is defined.
This is intentional: LevelData lives in the Arrow level builder, but GenericColumnWriter::write_batch_internal needs the borrowed representation for both Arrow-owned LevelData and the public non-Arrow write_batch API that receives &[i16]. Putting LevelDataRef next to LevelData would make column::writer depend upward on the Arrow writer module.
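As a rough illustration of that split, here is a simplified sketch of the two types. The variant names mirror the PR, but the field shapes and the `as_ref()` body are assumptions, not the actual arrow-rs definitions:

```rust
/// Owned levels, built by the Arrow level builder (simplified sketch).
#[derive(Debug, Clone)]
pub enum LevelData {
    /// No levels at all (e.g. a required, non-repeated column).
    Absent,
    /// A dense buffer of level values.
    Materialized(Vec<i16>),
    /// Every level has the same value (e.g. an all-null page).
    Uniform { value: i16, count: usize },
}

/// Borrowed levels, consumed by the generic column writer. The public
/// non-Arrow `write_batch` API can wrap a caller's `&[i16]` directly as
/// `Materialized` without going through `LevelData` at all.
#[derive(Debug, Clone, Copy)]
pub enum LevelDataRef<'a> {
    Absent,
    Materialized(&'a [i16]),
    Uniform { value: i16, count: usize },
}

impl LevelData {
    /// Borrow the owned data; `Uniform` stays `Uniform`, so no dense
    /// buffer is ever materialized for uniform runs.
    pub fn as_ref(&self) -> LevelDataRef<'_> {
        match self {
            LevelData::Absent => LevelDataRef::Absent,
            LevelData::Materialized(v) => LevelDataRef::Materialized(v),
            LevelData::Uniform { value, count } => {
                LevelDataRef::Uniform { value: *value, count: *count }
            }
        }
    }
}

fn main() {
    let owned = LevelData::Uniform { value: 0, count: 1024 };
    // Borrowing preserves the compact representation.
    assert!(matches!(owned.as_ref(), LevelDataRef::Uniform { count: 1024, .. }));
    // A caller-provided slice can be wrapped directly.
    let levels = [1i16, 1, 0, 1];
    let _borrowed = LevelDataRef::Materialized(&levels[..]);
}
```

Keeping `LevelDataRef` next to the writer means both producers (Arrow-owned `LevelData` and raw `&[i16]` slices) converge on one borrowed type at the consumption site.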
Thanks. Just wanted to make sure it was intentional.
alamb
left a comment
I went over this PR carefully this morning and it is really nice work 👌 -- thank you so much @HippoBaro (and @etseidl !). As I mentioned I think this PR makes the code easier to read AND faster which seems like a win all around in my mind
I reviewed correctness and code coverage (via cargo-llvm) and it all looks good to me
I had a bunch of suggestions that might improve the comments / readability but I don't think any of them is required to merge this PR.
let rep_levels = self.rep_levels.as_ref().map(|levels| {
    levels[chunk.level_offset..chunk.level_offset + chunk.num_levels].to_vec()
});
let def_levels = self.def_levels.slice(chunk.level_offset, chunk.num_levels);
The new encapsulation makes this much easier to read I think -- so not only does this PR make the code faster, I also think it makes it easier to follow ❤️
    self.page_metrics.num_page_nulls += (levels.len() - values_to_write) as u64;
    values_to_write
}
LevelDataRef::Uniform { value, count } => {
I think this is the key part of the whole PR, right? A special case when the levels information is the same (all null, all non null)
Yes! This whole refactor allows me to add this Uniform case, which makes all-null (or really any uniform) data much faster to encode.
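To make the win concrete, here is a hedged, self-contained sketch (not the actual writer code; `values_to_write` and the single-`max_def_level` rule are simplifications) of why the `Uniform` arm matters: the dense path must scan the whole buffer, while a uniform run answers the same question in O(1).

```rust
// Simplified stand-in for the writer's borrowed level representation.
#[derive(Debug, Clone, Copy)]
enum LevelDataRef<'a> {
    Materialized(&'a [i16]),
    Uniform { value: i16, count: usize },
}

/// Count the levels that carry an actual value, i.e. levels equal to
/// `max_def_level` (a simplification of the real definition-level rules).
fn values_to_write(levels: LevelDataRef<'_>, max_def_level: i16) -> usize {
    match levels {
        // Dense path: O(n) scan over the buffer.
        LevelDataRef::Materialized(buf) => {
            buf.iter().filter(|&&l| l == max_def_level).count()
        }
        // Uniform path: O(1). An all-null run (value < max) contributes
        // zero values; an all-valid run contributes `count` values.
        LevelDataRef::Uniform { value, count } => {
            if value == max_def_level { count } else { 0 }
        }
    }
}

fn main() {
    let dense = [1i16, 0, 1, 1];
    assert_eq!(values_to_write(LevelDataRef::Materialized(&dense), 1), 3);
    // All-null page: answered without materializing a million-entry buffer.
    assert_eq!(
        values_to_write(LevelDataRef::Uniform { value: 0, count: 1_000_000 }, 1),
        0
    );
}
```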
The clickbench results seem to report a few slower queries. I will rerun to reproduce.
Thank you all for the reviews and feedback! I’ll push a follow-up commit to address them, hopefully by EOD.
Some of the benchmark runner results looked like this might slow things down. I started two more benchmark runs to see if there is a repeatable pattern.
🤔 some of the benchmarks show a consistent slowdown -- @HippoBaro do you have any insight as to why they would get slower?
Hum, I reran the above benchmarks locally on my laptop, both before and after rebasing onto

I’m not sure I trust this particular benchmark runner result. For example, the worst offender in your list,

So I strongly suspect this is benchmark noise 🤷
@alamb These numbers look a lot more reasonable. The single-digit regressions do look consistent, though. I can’t reproduce them locally either. On my Apple laptop, both

It could be a CPU architecture thing. If those regressions are unacceptable, I can try to profile the code on an x86 VM later in the week.
I'll try reproducing on my Xeon workstation.
So the regression is def there. Around 5% on average for the list_primitive's. I ran the list_primitive/default through samply, and it looks like a good bit of the slowdown is in

Would it be possible to have different versions of

Honestly I'm far more tolerant of write regressions than read. And 5% just for lists seems a fair trade.
Thank you @etseidl! That’s super useful context 🙇 Let me see if I can refactor this.
Thank you @etseidl and @alamb for pushing back! The regression was fairly straightforward: the compact level representation added extra branching/dispatch on a hot path. At first I thought this was the price to pay for speeding up the

It turns out that I hadn't considered a good opportunity to batch writes there as well. We now batch consecutive non-empty list rows into a single child level write, then walk the appended repetition levels backwards to mark list-row boundaries. On my laptop these benchmarks show low single-digit improvements for the
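The batching idea described here can be sketched roughly as follows. The helper names are made up for illustration; the actual arrow-rs code differs, but the shape is the same: emit levels for a run of non-empty list rows in one pass, then recover row boundaries from the repetition levels (a rep level of 0 starts a new row).

```rust
/// Append repetition levels for a run of consecutive non-empty list rows
/// in a single pass, returning the whole batch of levels.
fn append_rep_levels(lengths: &[usize]) -> Vec<i16> {
    let mut levels = Vec::with_capacity(lengths.iter().sum());
    for &len in lengths {
        levels.push(0); // first child of the row starts a new record
        levels.extend(std::iter::repeat(1).take(len - 1)); // remaining children
    }
    levels
}

/// Walk the appended levels backwards and collect the index where each
/// list row starts (every position whose rep level is 0).
fn row_boundaries(levels: &[i16]) -> Vec<usize> {
    let mut bounds: Vec<usize> = levels
        .iter()
        .enumerate()
        .rev()
        .filter(|(_, &l)| l == 0)
        .map(|(i, _)| i)
        .collect();
    bounds.reverse();
    bounds
}

fn main() {
    // Three non-empty rows with 2, 1, and 3 child values.
    let levels = append_rep_levels(&[2, 1, 3]);
    assert_eq!(levels, vec![0, 1, 0, 0, 1, 1]);
    assert_eq!(row_boundaries(&levels), vec![0, 2, 3]);
}
```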
might this still be a regression?
A bit, but only when compared to previously optimized code. Compared to before @HippoBaro started, it's still quite nice. (comparing to #9654 f5d6dc3) FWIW, the sparse regression isn't as bad on my WS |
Force-pushed from ae4f2b8 to 4d0f400
Adds a bulk encoding method for repeated level values. After a small warmup to enter RLE accumulation mode, remaining values are extended in O(1) via the existing `extend_run` path. Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
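The bulk path this commit describes can be sketched with a toy run-length accumulator. `RleSketch` and its methods are stand-ins for the real encoder (the actual `RleEncoder` produces the Parquet hybrid RLE/bit-packed byte stream; only the run-extension idea is shown here):

```rust
/// Toy run-length accumulator: stores (value, run length) pairs.
struct RleSketch {
    runs: Vec<(i16, usize)>,
}

impl RleSketch {
    fn new() -> Self {
        RleSketch { runs: Vec::new() }
    }

    /// Append one value (the "warmup"/general path).
    fn put(&mut self, v: i16) {
        match self.runs.last_mut() {
            Some((value, len)) if *value == v => *len += 1,
            _ => self.runs.push((v, 1)),
        }
    }

    /// Append `n` copies of `v` at once: O(1) when it extends the
    /// current run, rather than `n` repeated `put` calls.
    fn extend_run(&mut self, v: i16, n: usize) {
        match self.runs.last_mut() {
            Some((value, len)) if *value == v => *len += n,
            _ => self.runs.push((v, n)),
        }
    }
}

fn main() {
    let mut enc = RleSketch::new();
    enc.put(0); // warmup: establish the run
    enc.extend_run(0, 999_999); // bulk extension, O(1)
    assert_eq!(enc.runs, vec![(0, 1_000_000)]);
}
```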
Introduces a `LevelData` enum (`Absent`, `Materialized`, `Uniform`) to replace `Option<Vec<i16>>` for definition and repetition levels, and a borrowed `LevelDataRef` counterpart for the writer path. Uniform columns (e.g. required fields, all-null pages) are now encoded in O(1) without materializing a dense list. The CDC chunker, column writer, and arrow writer are migrated to the new types. Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
Replace the long positional uniform roundtrip helper with a named `ColumnRoundTripUniform` fixture. This makes each test spell out the level inputs, schema levels, and expected read-back values explicitly. Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
Force-pushed from 4d0f400 to 73df248
Thank you @etseidl, @alamb, and @Dandandan 💯 After trying a few alternatives to reduce the cost of the extra level representation variants, I think the remaining overhead in the

I think that's fair. This PR is part of a larger series that, taken together, improves runtime across most writer benchmarks. I rebased the PR against
The MSRV failure is unrelated to this PR. For more details:
run benchmarks arrow_writer
🤖 Arrow criterion benchmark running (GKE): comparing compact_lvl_repr (73df248) to ded985c (merge-base)
Merging up from main to get a clean CI run
🤖 Arrow criterion benchmark completed (GKE)
I think the benchmarks are looking good now -- maybe there is some small slowdown but it also looks like it is within the noise threshold.
Thank you for your diligence and patience @HippoBaro 🙏 (and @etseidl and @Dandandan)
Which issue does this PR close?
Rationale for this change
See #9731
What changes are included in this PR?
Represent definition and repetition levels as `LevelData`/`LevelDataRef` with `Absent`, `Materialized`, and `Uniform` variants, and thread this through Arrow level generation, CDC chunking, and the generic column writer. Uniform level runs, such as required fields and all-null pages, can now be encoded without materializing dense `Vec<i16>` buffers. Add bulk run support to `LevelEncoder`/`RleEncoder` so repeated levels are encoded in amortized O(1) after the RLE warmup, while preserving histogram, row count, null count, page splitting, and CDC chunk accounting.

Are these changes tested?

All tests passing. Coverage exercises bulk RLE level encoding, compact/uniform `LevelData` slicing and writer roundtrips across Parquet v1/v2, and CDC/Arrow writer behavior including all-null and nested-level cases.

Are there any user-facing changes?

None.