Skip to content

test(parquet): replace InMemoryArrayReader with PrimitiveArrayReader in tests#9847

Merged
alamb merged 1 commit intoapache:mainfrom
HippoBaro:unpadded_child_mode_test
May 5, 2026
Merged

test(parquet): replace InMemoryArrayReader with PrimitiveArrayReader in tests#9847
alamb merged 1 commit intoapache:mainfrom
HippoBaro:unpadded_child_mode_test

Conversation

@HippoBaro
Copy link
Copy Markdown
Contributor

@HippoBaro HippoBaro commented Apr 28, 2026

Which issue does this PR close?

Rationale for this change

InMemoryArrayReader couples a column's in-memory representation (an Arrow array) with its storage representation (def/rep levels) and assumes a 1:1 mapping between array elements and levels. This holds when list readers consume fully-padded child arrays — one element per level, nulls included.

Upcoming work (see #9848) pushes null filtering from ListArrayReader down into the child reader at the storage level, breaking that 1:1 assumption: the child returns fewer array elements than levels, and the mapping between them depends on the filtering logic itself. Keeping the mock would mean reimplementing that logic: testing filtered output against a second, hand-rolled filter.

What changes are included in this PR?

Replace InMemoryArrayReader with real PrimitiveArrayReader instances backed by in-memory Parquet pages. Tests now accept raw non-null values and levels (matching what Parquet actually stores) and exercise the production RecordReader path.

Are these changes tested?

This is a net positive in test coverage. The existing tests now exercise real readers instead of in-memory and already-dense representations.

Are there any user-facing changes?

None.

`InMemoryArrayReader` coupled a column's in-memory representation (an
Arrow array) with its storage representation (def/rep levels) by
assuming a 1:1 mapping between array elements and levels. This held
when list readers consumed fully-padded child arrays — one element per
level, nulls included.

Upcoming work pushes null filtering from `ListArrayReader` down into the
child reader at the storage level, breaking that 1:1 assumption: the
child returns fewer array elements than levels, and the mapping between
them depends on the filtering logic itself. Keeping the mock would mean
reimplementing that logic: testing filtered output against a second,
hand-rolled filter.

Replace `InMemoryArrayReader` with real `PrimitiveArrayReader` instances
backed by in-memory Parquet pages. Tests now accept raw non-null values
and levels (matching what Parquet actually stores) and exercise the
production `RecordReader` path.

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
@github-actions github-actions Bot added the parquet Changes to the parquet crate label Apr 28, 2026
@alamb alamb changed the title test(parquet): replace InMemoryArrayReader with PrimitiveArrayReader test(parquet): replace InMemoryArrayReader with PrimitiveArrayReader in tests May 1, 2026
Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @HippoBaro -- looks good to me. I had some API / readability suggestions but I don't think they are required. We can do them as a follow on PR (or never)

/// `values` must contain only the non-null values (entries where
/// `def_levels[i] == max_def_level`), in order, as Parquet encodes them.
pub fn make_int32_page_reader(
values: &[i32],
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I double checked that all callsites actually used DataType::Int32 which makes sense to me

Some(vec![0, 1, 2, 3, 1]),
Some(vec![0, 1, 1, 1, 1]),
);
let array_reader_1 = make_int32_page_reader(&[4], &[0, 1, 2, 3, 1], &[0, 1, 1, 1, 1], 3, 1);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an API suggestion (maybe for another PR, not required):

I would personally find this much easier to understand if the fields were named so I didn't have to refer to make_int32_page_reader to figure out what&[0, 1, 2, 3, 1] and &[0, 1, 1, 1, 1] meant.

Perhaps we could do this with an builder in the test. Something like this perhaps:

let array_reader_1 = TestArrayReaderBuilder::new()
 .with_non_null_i32_values(&[4]) // <-- this is especially non obvous if you are used to normal Arrow arrays
 .with_def_levels(&[0, 1, 2, 3, 1])
 .with_rep_levels(&[0, 1, 1, 1, 1])
 .with_max_def_level(3)
 .with_max_rep_level(1)
 .build()

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh that would be lovely

Copy link
Copy Markdown
Contributor

@alamb alamb May 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/// `RecordReader` code path (including selective padding when a parent
/// `ListArrayReader` calls `enable_selective_padding`).
///
/// `values` must contain only the non-null values (entries where
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alamb alamb merged commit 21f739c into apache:main May 5, 2026
20 checks passed
@alamb
Copy link
Copy Markdown
Contributor

alamb commented May 5, 2026

Thanks again @HippoBaro

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants