Bulk-fill definition levels for majority-null leaf columns #9967

RyanJamesStewart wants to merge 6 commits into
Conversation
HippoBaro
left a comment
I love it! I think the code can be simplified a bit, and we can probably remove the unsafe by slicing the NullBuffer and using nulls.valid_indices()/nulls.iter() instead of constructing BitIndexIterator directly.
NullBuffer will do the BitIndexIterator business for you!
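As a standalone sketch of that idiom (hedged: toy data and `arrow_buffer` APIs only, not the PR's actual `write_leaf` context):

```rust
use arrow_buffer::NullBuffer;

fn main() {
    let max_def_level: i16 = 1;
    // Validity of an 8-element array: true = non-null.
    let nulls = NullBuffer::from(vec![true, false, false, true, false, false, false, true]);
    let (start, len) = (2, 5); // sub-range [2, 7)

    // Slice the NullBuffer to the sub-range; its iterators then replace a
    // hand-rolled BitIndexIterator with manual offset arithmetic.
    let range_nulls = nulls.slice(start, len);
    let def_levels: Vec<i16> = range_nulls
        .iter()
        .map(|valid| max_def_level - (!valid as i16))
        .collect();
    let non_null_indices: Vec<usize> =
        range_nulls.valid_indices().map(|i| i + start).collect();

    assert_eq!(def_levels, vec![0, 1, 0, 0, 0]);
    assert_eq!(non_null_indices, vec![3]);
}
```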
```rust
        max_def_level - (!valid as i16)
    }));
}
info.non_null_indices.reserve(len - valid_in_range);
```
This is the null count right? We should reserve valid_in_range directly I think?
run benchmark arrow_writer
🤖 Arrow criterion benchmark running (GKE): comparing perf/arrow-parquet-null-heavy (7341370) to 48fa8a7 (merge-base).
🤖 Arrow criterion benchmark completed (GKE).
…umns

When writing a nullable leaf (primitive) Arrow array, `write_leaf` built the definition-level buffer one element at a time, mapping each null bit to a level. For columns that are mostly null this does ~`num_rows` of branchy work and allocates a `num_rows` level buffer even though almost every level is the same value.

Add a length-gated bulk-fill path: when the column is majority-null and the sub-range is large enough to amortize the gate's per-call cost, build the definition levels by bulk-filling the null level (a vectorized memset) and overwriting only the non-null positions found via `NullBuffer::valid_indices()`. The per-row path is kept for non-majority-null arrays and for the small sub-ranges produced by list/struct write paths, so those shapes are not regressed.

Contributes to apache#9731. Complements apache#9954's all-null fast path by covering the sparse (mostly-but-not-entirely-null) case it does not handle.

Threshold sweep on Ryzen 9 9950X (parquet/arrow_writer benches, /default variant, vs main):

```
T    primitive  list_primitive  primitive_sparse  list_primitive_sparse
----------------------------------------------------------------------
0    -3.0%      +2.6%           -36.1%            +7.8%
16   -1.4%      +1.8%           -34.8%            +2.8%
32   -1.1%      -0.1%           -35.1%            +1.7%
64   -1.1%      +0.7%           -34.5%            +1.7%   <- chosen
128  -1.0%      +1.5%           -35.1%            +2.4%
256  -1.4%      +1.4%           -35.1%            +2.7%
```

T=0 reproduces the per-call slice/popcount regression on list_primitive_sparse_99pct_null (+7.8%, matches the criterion bot's original measurement on the unguarded version). The +1.7% floor at T>=32 is the structural cost of evaluating the gate itself across ~10K small write_leaf calls in the list path; reducing it further would require hoisting the decision into the caller. T=64 matches T=32 on every shape and gives ~12x margin over the avg list length of ~5.

Final benchmarks vs main on Ryzen 9 9950X (T=64, /default variants):

```
primitive/default                         -1.5%
primitive_non_null/default                -2.8%
primitive_sparse_99pct_null/default       -35.1%
primitive_all_null/default                -66.4%
list_primitive/default                    +1.8%  (within noise)
list_primitive_non_null/default           -0.7%  (no change, p=0.45)
list_primitive_sparse_99pct_null/default  +3.0%  (gate-check floor)
struct_sparse_99pct_null/default          -4.9%
bool/default                              +2.2%
```
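The shape of that fast path, as a hedged standalone sketch (the function and data here are illustrative, not the PR's exact code):

```rust
use arrow_buffer::NullBuffer;

// Illustrative only: bulk-fill the null level, then scatter the non-null
// level over the (few) valid positions.
fn bulk_fill_def_levels(
    nulls: &NullBuffer,
    start: usize,
    len: usize,
    max_def_level: i16,
) -> Vec<i16> {
    let range_nulls = nulls.slice(start, len);
    // Vectorized memset: every slot starts at the null level.
    let mut def_levels = vec![max_def_level - 1; len];
    // Overwrite only the valid positions (cheap when the range is mostly null).
    for i in range_nulls.valid_indices() {
        def_levels[i] = max_def_level;
    }
    def_levels
}

fn main() {
    let nulls = NullBuffer::from(vec![false, false, true, false]);
    assert_eq!(bulk_fill_def_levels(&nulls, 0, 4, 1), vec![0, 0, 1, 0]);
}
```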
Force-pushed 7341370 to d825a1b (compare)
Thanks — applied both. Switched the fast path to `NullBuffer::valid_indices()`, dropping the `unsafe`.

Also caught a regression the benchmark bot surfaced that I had missed in my own measurement — I hadn't benchmarked the list paths. Added a length gate on entering the new path. Breakeven for the list-sparse case is between T=0 and T=32. The +1.7% floor at T≥32 is the structural cost of evaluating the gate across ~10K calls, not the fast-path execution; reducing it further would require hoisting the decision into the caller.

Re-triggering the bench.
run benchmark arrow_writer
Hi @RyanJamesStewart, thanks for the request (#9967 (comment)). Only whitelisted users can trigger benchmarks. Allowed users: Dandandan, Fokko, Jefffrey, Omega359, adriangb, alamb, asubiotto, brunal, buraksenn, cetra3, codephage2020, coderfender, comphead, erenavsarogullari, etseidl, friendlymatthew, gabotechs, geoffreyclaude, grtlr, haohuaijin, jonathanc-n, kevinjqliu, klion26, kosiew, kumarUjjawal, kunalsinghdadhwal, liamzwbao, mbutrovich, mkleen, mzabaluev, neilconway, rluvaton, sdf-jkl, timsaucer, xudong963, zhuqi-lucas.
Pushed a fmt fixup. On the bench re-trigger: it would be good to have the post-gate numbers on record before further review, if anyone on the whitelist is willing to re-run.
```rust
let bits = nulls.inner();
info.def_levels.extend_from_iter(range.clone().map(|i| {
    // Safety: range.end was asserted to be in bounds earlier
    let valid = unsafe { bits.value_unchecked(i) };
    max_def_level - (!valid as i16)
}));
info.non_null_indices.reserve(len);
info.non_null_indices.extend(
    BitIndexIterator::new(bits.inner(), bits.offset() + range.start, len)
        .map(|i| i + range.start),
);
```
```diff
-let bits = nulls.inner();
-info.def_levels.extend_from_iter(range.clone().map(|i| {
-    // Safety: range.end was asserted to be in bounds earlier
-    let valid = unsafe { bits.value_unchecked(i) };
-    max_def_level - (!valid as i16)
-}));
-info.non_null_indices.reserve(len);
-info.non_null_indices.extend(
-    BitIndexIterator::new(bits.inner(), bits.offset() + range.start, len)
-        .map(|i| i + range.start),
-);
+info.def_levels.extend_from_iter(
+    nulls.iter().map(|valid| max_def_level - (!valid as i16)),
+);
```
No?
Want to make sure I'm reading the suggestion right. `nulls` here is the full-array `NullBuffer` (line 953, unsliced; we just assert `range.end <= nulls.len()`), so `nulls.iter()` would produce `nulls.len()` levels rather than `len`. Did you have in mind slicing first the way the bulk-fill branch a few lines up does, i.e. `let range_nulls = nulls.slice(range.start, len)` and then iterating `range_nulls`? That would also let `non_null_indices` use `range_nulls.valid_indices().map(|i| i + range.start)`, which I think still needs to happen for the downstream value path. Or were you proposing a broader restructure where `non_null_indices` isn't populated on this branch?
> Did you have in mind slicing first the way the bulk-fill branch a few lines up does, i.e. `let range_nulls = nulls.slice(range.start, len)` and then iterating `range_nulls`? That would also let `non_null_indices` use `range_nulls.valid_indices().map(|i| i + range.start)`, which I think still needs to happen for the downstream value path
I think this.
run benchmark arrow_writer
🤖 Arrow criterion benchmark running (GKE): comparing perf/arrow-parquet-null-heavy (a717169) to 48fa8a7 (merge-base).
🤖 Arrow criterion benchmark completed (GKE).
etseidl
left a comment
Thanks @RyanJamesStewart, this looks like a nice improvement. Just a few nits.
```rust
// Bulk-fill fast path. Gated on:
// - `len >= 64`: per-call slice/popcount/iterator overhead only
//   amortizes on sizable sub-ranges. List/struct paths call
//   write_leaf many times with tiny ranges (avg list length 1-5);
//   paying any per-call popcount there would regress them. A
//   threshold sweep at T={0,16,32,64,128,256} on Ryzen 9 9950X
//   shows the regression floor settles by T=32 and the choice of
//   64 gives ~12x margin over avg list length without losing the
//   flat-primitive wins.
// - `nulls.null_count() * 2 >= nulls.len()`: cached `null_count()`
//   is O(1), so this check is free. We use the buffer-level density
//   as a heuristic for the sub-range; for full-array writes (the
//   primary target — flat primitive columns) it's exact.
// Note: even when this gate skips the fast path, evaluating the gate
// itself across high-frequency call sites (~10K calls in some list
// benchmarks) is a small structural cost (~+1-2% on list-sparse
// cases). It's the price of having any gate at all on this hot path;
// reducing it further would require hoisting the decision into the
// caller. The wins on the targeted shapes (-35% sparse-primitive,
// -66% all-null primitive) far outweigh it.
```
Please refer to https://github.com/apache/arrow-rs/blob/main/CONTRIBUTING.md#ai-generated-submissions
I think this exposition is better in the PR than the code.
```rust
// reducing it further would require hoisting the decision into the
// caller. The wins on the targeted shapes (-35% sparse-primitive,
// -66% all-null primitive) far outweigh it.
if len >= 64 && nulls.null_count() * 2 >= nulls.len() {
```
Please replace 64 with a constant. Documentation for the constant can point to this PR for an explanation as to how 64 was arrived at.
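For illustration, the ask could look something like this (a sketch only; `use_bulk_fill` is a hypothetical helper wrapping the gate from the diff above, not code from the PR):

```rust
use arrow_buffer::NullBuffer;

/// Minimum sub-range length for the bulk-fill definition-level path.
/// See PR #9967 for the threshold sweep that arrived at 64.
const BULK_FILL_MIN_LEN: usize = 64;

// Hypothetical helper for illustration.
fn use_bulk_fill(len: usize, nulls: &NullBuffer) -> bool {
    // null_count() is cached on NullBuffer, so the density check is O(1).
    len >= BULK_FILL_MIN_LEN && nulls.null_count() * 2 >= nulls.len()
}

fn main() {
    let all_null = NullBuffer::new_null(128);
    assert!(use_bulk_fill(128, &all_null));
    // Tiny list sub-ranges stay on the per-row path regardless of density.
    assert!(!use_bulk_fill(5, &all_null));
}
```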
- Extract the `64` threshold into `BULK_FILL_MIN_LEN` const with a doc comment pointing to this PR for the sweep rationale.
- Trim the in-code comment block; threshold-sweep rationale moves to the PR description per CONTRIBUTING.md guidance on AI-generated submissions.
- Refactor the slow-path else branch to mirror the fast-path slice idiom: `let range_nulls = nulls.slice(range.start, len);` then `range_nulls.iter()` for `def_levels` and `range_nulls.valid_indices().map(|i| i + range.start)` for `non_null_indices`. Drops the `unsafe value_unchecked` and the manual `BitIndexIterator::new` offset arithmetic.
- Drop the now-unused `BitIndexIterator` import.

Behavior unchanged. `cargo test -p parquet --features arrow --lib arrow_writer` green (136 tests); clippy and fmt clean.
Force-pushed a717169 to 4eaed80 (compare)
Pushed v2, with the four changes from review listed in the commit message above.

Tests, clippy, and fmt clean locally; CI is in flight. The fast-path branch and gate are unchanged, so the targeted bench shape from the latest bot run should carry over.
etseidl
left a comment
Thanks @RyanJamesStewart, looking better. A few suggestions to reduce some code duplication.
```rust
    info.non_null_indices
        .extend(range_nulls.valid_indices().map(|i| i + range.start));
} else {
    let range_nulls = nulls.slice(range.start, len);
```
```rust
info.non_null_indices
    .extend(range_nulls.valid_indices().map(|i| i + range.start));
```
and remove this (seems a reserve went missing as well).
…ulls arm

The `logical_nulls = None` arm of `write_leaf` extends `non_null_indices` by `len` items but didn't pre-reserve capacity. Add the matching `reserve(len)` before the `extend(range.clone())` so the no-nulls path allocates once instead of relying on `Vec`'s amortized growth.

Behavior unchanged. Caught by @etseidl's review round on PR apache#9967.
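The allocation pattern in question, as a trivial standalone sketch (illustrative names only, not the PR's code):

```rust
fn main() {
    let range = 0usize..1_000;
    let len = range.len();
    let mut non_null_indices: Vec<usize> = Vec::new();
    // Reserve the exact count up front so extend() fills one allocation
    // instead of growing geometrically.
    non_null_indices.reserve(len);
    non_null_indices.extend(range);
    assert_eq!(non_null_indices.len(), len);
    assert!(non_null_indices.capacity() >= len);
}
```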
run benchmark arrow_writer
🤖 Arrow criterion benchmark running (GKE): comparing perf/arrow-parquet-null-heavy (3f69dff) to 48fa8a7 (merge-base).
🤖 Arrow criterion benchmark completed (GKE).
run benchmark arrow_writer env:
BENCH_FILTER: list_prim
🤖 Arrow criterion benchmark running (GKE): comparing perf/arrow-parquet-null-heavy (3f69dff) to 48fa8a7 (merge-base).
Hoist the shared work in `write_leaf`'s `Some(nulls)` arm out of the if/else where it can be done without changing the cost profile:

- `let range_nulls = nulls.slice(range.start, len)`: O(1) metadata adjustment, safe to compute once and share between branches.
- `info.non_null_indices.extend(range_nulls.valid_indices()...)`: a single iterator consumption after the branches, identical work in both paths.

The `range_nulls.null_count()` popcount and `reserve(valid_in_range)` stay INSIDE the fast-path branch where bulk-fill needs the count anyway. An earlier revision (59ced5b) hoisted the popcount out and caused 11-24% regressions on `list_primitive*` benches because `write_leaf` is called ~10K times with avg range ~5 from list paths; paying the popcount per call dominated the saved hash probes.

Bench (`list_primitive_sparse_99pct_null/default`, local hardware, this revision): 9.83 ms, in line with main's 10.7 ms on GKE c4a-highmem-16, and well below the 12.1 ms the prior dedup attempt showed.

Behavior unchanged. 136 arrow_writer tests green; clippy + fmt clean.
Force-pushed 3f69dff to df1f323 (compare)
🤖 Arrow criterion benchmark completed (GKE).
…erve

The partial dedup in 695f178 (hoisting `let range_nulls = nulls.slice(...)` before the gate and `info.non_null_indices.extend(...)` after the if/else) regressed `list_primitive/default` 21% and `list_primitive_sparse_99pct_null/default` 13% on the GKE c4a-highmem-16 bench. Same operations as the v2 shape, but the compiler produced different code for the nullable-list path.

Restore the v2 structure (slice declared inside each branch, both extends inline). Keep the `None`-arm `info.non_null_indices.reserve(len)` from df1f323 since that fixed `list_primitive_non_null/default` 1.12 -> 1.00 and addresses etseidl's reserve nit substantively.

Bench shapes by revision on c4a-highmem-16:

- v2 (4eaed80): list_primitive/default 1.00, non_null 1.00, sparse_99pct 1.00
- v3 (695f178, full dedup): 1.24, 1.12, 1.12
- v4 (df1f323, partial dedup + reserve): 1.21, 1.00, 1.13
- v5 (this commit): expected back to v2's clean shape with reserve fix retained

136 arrow_writer tests green; fmt + clippy clean.
Sorry @RyanJamesStewart I led you astray. I think the regression is due to the `nulls.slice()` call. Perhaps go back to something like the original:

```rust
} else {
    let nulls = nulls.inner();
    info.def_levels.extend_from_iter(range.clone().map(|i| {
        // Safety: range.end was asserted to be in bounds earlier
        let valid = unsafe { nulls.value_unchecked(i) };
        max_def_level - (!valid as i16)
    }));
    info.non_null_indices.reserve(len);
    info.non_null_indices.extend(
        BitIndexIterator::new(nulls.inner(), nulls.offset() + range.start, len)
            .map(|i| i + range.start),
    );
}
```
run benchmark arrow_writer env:
BENCH_FILTER: list_primitive/
🤖 Arrow criterion benchmark running (GKE): comparing perf/arrow-parquet-null-heavy (cd166cc) to 48fa8a7 (merge-base).
🤖 Arrow criterion benchmark completed (GKE).
Per etseidl's diagnosis on PR apache#9967: the `nulls.slice()` call in write_leaf's not-enough-nulls arm is the regression source on list_primitive paths, not the dedup structure. The list_primitive paths call write_leaf ~10K times with avg range ~5, where per-call NullBuffer slice + range_nulls.iter() / valid_indices overhead dominates.

Restore the v1 (original) slow-arm pattern:

- `let bits = nulls.inner()` once
- `def_levels.extend_from_iter(range.clone().map(...))` with `bits.value_unchecked(i)` (safe: range.end asserted in bounds)
- `non_null_indices` via `BitIndexIterator::new(bits.inner(), bits.offset() + range.start, len)`

Fast path keeps `nulls.slice()` / `valid_indices()` since bulk-fill needs them and `len >= BULK_FILL_MIN_LEN` (64) amortizes the cost.

136 arrow_writer tests green; fmt + clippy clean.
Pushed 552c773 with that slow-arm pattern: direct `BitIndexIterator` on `bits.inner()` with `bits.offset() + range.start`, `value_unchecked` inside the `def_levels` extend. No `nulls.slice()` per call on the cold path. Fast path keeps `nulls.slice()` + `valid_indices()` since the `len >= 64` gate amortizes the slice cost there. Tests, clippy, fmt clean locally on the branch. Ready when you can retrigger the bench.
run benchmark arrow_writer
🤖 Arrow criterion benchmark running (GKE): comparing perf/arrow-parquet-null-heavy (552c773) to 48fa8a7 (merge-base).
🤖 Arrow criterion benchmark completed (GKE).
run benchmark arrow_writer env:
BENCH_FILTER: string/zstd_parquet_2|string_non_null/parquet_2
🤖 Arrow criterion benchmark running (GKE): comparing perf/arrow-parquet-null-heavy (552c773) to 48fa8a7 (merge-base).
🤖 Arrow criterion benchmark completed (GKE).
etseidl
left a comment
Benches look good now. Thanks for working with me @RyanJamesStewart.
@HippoBaro any last thoughts?
Which issue does this PR close?

Contributes to #9731 (it does not fully close it; see the final paragraph below).

AI assistance
Implementation drafted with AI assistance and iterated against the benchmarks below. I've reviewed and own the code, including the gate threshold, which I picked from the sweep in the "Threshold (`BULK_FILL_MIN_LEN`)" section below. Per the project's CONTRIBUTING guidance on AI-generated submissions.

Rationale for this change
When writing a nullable leaf (primitive) Arrow array, `write_leaf` builds the definition-level buffer one element at a time, mapping each null bit to a level. For columns that are mostly null this does ~`num_rows` of branchy work and allocates a `num_rows`-element level buffer even though almost every produced level is the same value. #9954 adds an O(1) fast path for the entirely null case; this PR covers the sparse (mostly-but-not-entirely null) case it doesn't handle, the literal subject of #9731 ("a column that is 99% null … ~100x more work than necessary").

What changes are included in this PR?
A single popcount pass over the null mask (`Buffer::count_set_bits_offset`, O(num_rows/64)) counts the valid values in the range. When the slice is majority-null, the definition-level buffer is bulk-filled with the null level (a vectorized `Vec::resize` memset) and only the non-null positions (from `NullBuffer::valid_indices()`) are overwritten. The existing per-row path is kept for non-majority-null slices, so balanced and null-light columns are unaffected. The fast path slices once (`let range_nulls = nulls.slice(range.start, len)`) and uses `range_nulls.valid_indices().map(|i| i + range.start)` for `non_null_indices`, with no `unsafe`; the per-row path keeps the original direct `BitIndexIterator` walk, since per-call slicing regressed the list benches (see the review thread). Output is byte-identical: the level values are unchanged, just produced via memset+scatter on the fast path instead of an element-at-a-time map.
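The counting step in isolation, as a hedged standalone sketch (toy `arrow_buffer` demo; the PR does this inside `write_leaf`):

```rust
use arrow_buffer::NullBuffer;

fn main() {
    // Validity of a 6-element array; only index 2 is valid in [1, 5).
    let nulls = NullBuffer::from(vec![true, false, true, false, false, false]);
    let (start, len) = (1, 4);

    // slice() recomputes null_count over just the sub-range (the popcount),
    // so the valid count falls out by subtraction.
    let range_nulls = nulls.slice(start, len);
    let valid_in_range = len - range_nulls.null_count();

    assert_eq!(valid_in_range, 1);
    // Majority-null, so this range would take the bulk-fill path.
    assert!(range_nulls.null_count() * 2 >= range_nulls.len());
}
```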
Threshold (`BULK_FILL_MIN_LEN`)

The bulk-fill fast path is gated on two conditions:

- `len >= BULK_FILL_MIN_LEN` (currently 64). Per-call slice/popcount/iterator overhead only amortizes on sizable sub-ranges. List/struct paths call `write_leaf` many times with tiny ranges (avg list length 1-5); paying any per-call popcount there would regress them. A threshold sweep at T = {0, 16, 32, 64, 128, 256} on Ryzen 9 9950X shows the regression floor settles by T=32, and the choice of 64 gives ~12x margin over the average list length without losing the flat-primitive wins.
- `nulls.null_count() * 2 >= nulls.len()`. The cached `null_count()` is O(1), so this check is free. We use the buffer-wide density as a heuristic for the sub-range; for full-array writes (the primary target, flat primitive columns) it's exact.

Even when the gate skips the fast path, evaluating it across high-frequency call sites (~10K calls in some list benchmarks) is a small structural cost (~1-2% on list-sparse cases). The wins on the targeted shapes (-35% sparse-primitive, -66% all-null primitive) far outweigh that. Reducing the cost further would require hoisting the decision into the caller.
Are these changes tested?

Existing tests cover this path: `cargo test -p parquet --features arrow --lib arrow_writer` is green (136 tests, full of nulls and roundtrips); the full `cargo test -p parquet --features arrow` is green modulo the pre-existing `PARQUET_TEST_DATA` submodule failures (unrelated, same on `main`). `cargo clippy -p parquet --features arrow --lib` and `cargo fmt --check` are clean. The `unsafe get_unchecked_mut` flagged in the original revision was replaced via `NullBuffer::valid_indices()` on the fast path; the slow path briefly dropped its `unsafe value_unchecked` too, but the original pattern was restored after the safe slicing variant regressed the list benches.

Are there any user-facing changes?
None.
Benchmarks
`cargo bench -p parquet --bench arrow_writer`, 1M rows × 7 nullable primitive columns, local Ryzen 9 9950X. The CI benchmark bot (GKE `c4a-highmem-16`, Neoverse-V2) on the post-fixup revision shows the same shape with stronger relative wins on the targeted cases.

Microbench of the definition-level fill in isolation: 10.3x @ 100%-null, 8.6x @ 99%, 5.2x @ 90%, 1.9x @ 50%, 0.93x @ 10%, 0.81x @ 0%. Crossover ≈ 12-15% null, clean win above ~25%; the `>= 50% null` guard is conservative.

This is the materialization-cost half of #9731 (~30% of the 99%-null write); the walk-cost half, a run-length input to the level encoder so the column writer doesn't even iterate all `num_rows` levels, is the larger structural change #9653 is heading toward. This PR is deliberately small and isolated so it lands independently of and rebases cleanly under that work.