[WIP][POC] Pfor encoding by prtkgaur · Pull Request #3595 · apache/parquet-java

prtkgaur · 2026-06-03T19:43:38Z

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Implements the PFOR (Patched Frame of Reference) integer compression encoding for INT32 and INT64 columns in the pfor package:
- PforConstants: header/vector sizes, max exceptions (65535)
- PforEncoderDecoder: histogram-based cost model for optimal bit width
- PforValuesWriter: IntPforValuesWriter + LongPforValuesWriter with vector-buffered encoding and interleaved page layout
- PforValuesReader: abstract base with lazy per-vector decoding
- PforValuesReaderForInt: INT32 decoder using BytePacker
- PforValuesReaderForLong: INT64 decoder using BytePackerForLong
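To make the "histogram-based cost model" concrete, here is a minimal sketch of how such a chooser might work for INT32. It is not the PR's code: the class name `PforCostSketch`, the method names, and the per-exception cost of 48 bits (a 16-bit position plus a 32-bit value) are all assumptions made for illustration. The idea is to subtract the minimum (the frame of reference), count how many deltas need each bit width, and then pick the width that minimizes packed bits plus exception bits.

```java
import java.util.Arrays;

/** Hypothetical sketch of a histogram-based bit-width chooser; not the PR's implementation. */
final class PforCostSketch {
    // Assumed cost of storing one exception, in bits: 16-bit position + 32-bit value.
    static final int EXCEPTION_COST_BITS = 16 + 32;

    /** Bits needed to hold a delta interpreted as an unsigned 32-bit value (0 for delta == 0). */
    static int bitsNeeded(int delta) {
        return 32 - Integer.numberOfLeadingZeros(delta);
    }

    /** Returns the bit width that minimizes packed size plus assumed exception cost. */
    static int chooseBitWidth(int[] values) {
        if (values.length == 0) {
            return 0;
        }
        int min = Arrays.stream(values).min().getAsInt();

        // histogram[w] = number of deltas that need exactly w bits.
        int[] histogram = new int[33];
        for (int v : values) {
            // v - min may wrap, but the wrapped result is the correct unsigned delta.
            histogram[bitsNeeded(v - min)]++;
        }

        int bestWidth = 32;
        long bestCost = Long.MAX_VALUE;
        int exceptions = values.length;
        for (int width = 0; width <= 32; width++) {
            // After this subtraction, `exceptions` counts deltas needing more than `width` bits.
            exceptions -= histogram[width];
            long cost = (long) width * values.length
                    + (long) exceptions * EXCEPTION_COST_BITS;
            if (cost < bestCost) { // strict '<' keeps the smallest width on ties
                bestCost = cost;
                bestWidth = width;
            }
        }
        return bestWidth;
    }
}
```

For eight clustered values plus one large outlier, this sketch picks a narrow width and treats the outlier as an exception rather than widening every slot to fit it.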

Wires PFOR encoding into the parquet-java read/write pipeline:
- Encoding.java: add PFOR enum with INT32/INT64 reader dispatch
- ParquetProperties.java: add pforEnabled column property with isPforEnabled() and builder methods withPforEncoding()
- DefaultV2ValuesWriterFactory.java: PFOR takes priority over BYTE_STREAM_SPLIT and DELTA_BINARY_PACKED for INT32/INT64
- ParquetMetadataConverter.java: guard for PFOR until thrift spec is merged upstream

64 tests covering:
- PforEncoderDecoderTest: bit width utilities and histogram-based cost model
- PforBitPackingTest: round-trip correctness across bit widths 0-64, partial groups, page header format
- PforValuesEndToEndTest: full writer→reader pipeline including reset/reuse, skip, edge cases, random data

Benchmarks encode/decode throughput for int32/int64 across 8 data distributions inspired by Snowflake's NumericComprBenchmark: constant, sequential, small range, high-base-small-range (timestamps), with outliers (exception path), random, TPC-DS date keys, TPC-DS quantity. Uses junit-benchmarks (matches existing delta encoding benchmarks). Prints compression ratios for all distributions during setup. Excluded from normal test runs by surefire's benchmark exclusion.

Writer:
- Pre-allocate reusable buffers (deltasBuffer, excPosBuffer, excValBuffer, metadataBuf, packBuf, packPadBuf) in constructor instead of allocating new arrays on every encodeAndFlushVector call
- Replace ByteBuffer.allocate().order(LITTLE_ENDIAN) with manual byte shifts into reusable metadataBuf for vector info and exception writes
- Emit valid header for totalCount==0 (reader can distinguish empty page from missing encoding) instead of BytesInput.empty()

Reader:
- Add numElements > valuesCount validation (handles nullable columns where page row count > encoded values)
- Move getShortLE/getIntLE/getLongLE from private static in concrete readers to protected static in PforValuesReader base class
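As an illustration of the "manual byte shifts" item, the following sketch writes and reads little-endian values directly in a reusable byte array, avoiding a per-call ByteBuffer allocation. The class name `LittleEndianSketch`, the `put*` helper names, and the `(byte[], int)` signatures are assumptions. The PR names getShortLE/getIntLE/getLongLE but does not show their signatures here.

```java
/** Hypothetical little-endian helpers over a reusable byte[]; signatures are assumed. */
final class LittleEndianSketch {

    /** Writes the low 16 bits of value at buf[offset..offset+1], least significant byte first. */
    static void putShortLE(byte[] buf, int offset, int value) {
        buf[offset] = (byte) value;
        buf[offset + 1] = (byte) (value >>> 8);
    }

    /** Writes value at buf[offset..offset+3], least significant byte first. */
    static void putIntLE(byte[] buf, int offset, int value) {
        buf[offset] = (byte) value;
        buf[offset + 1] = (byte) (value >>> 8);
        buf[offset + 2] = (byte) (value >>> 16);
        buf[offset + 3] = (byte) (value >>> 24);
    }

    /** Reads an unsigned 16-bit value (0..65535) stored least significant byte first. */
    static int getShortLE(byte[] buf, int offset) {
        return (buf[offset] & 0xFF) | ((buf[offset + 1] & 0xFF) << 8);
    }

    /** Reads a 32-bit value stored least significant byte first. */
    static int getIntLE(byte[] buf, int offset) {
        return (buf[offset] & 0xFF)
                | ((buf[offset + 1] & 0xFF) << 8)
                | ((buf[offset + 2] & 0xFF) << 16)
                | ((buf[offset + 3] & 0xFF) << 24);
    }
}
```

The `& 0xFF` masks matter on the read side: Java bytes are signed, so without them a byte such as 0x80 would sign-extend and corrupt the higher bits of the result.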

Tests cover:
- Bad packing mode, log vector size out of range, bad value byte width
- Negative num_elements, numElements > valuesCount
- Header-only page, truncated offset array, truncated vector data
- Corrupted offset pointing past buffer end
- Skip past end, negative skip, read past end
- Skip across vector boundaries (correctness check)
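The header checks these tests exercise can be sketched roughly as below. Only the 7-byte header size and the names of the validated fields come from this PR description. The field order, the field widths (three single bytes followed by a little-endian int32), the accepted packing mode, and the log vector size bound are assumptions chosen so that the sketch adds up to 7 bytes, and the class name `PforHeaderCheckSketch` is hypothetical.

```java
/** Hypothetical header validation; the field layout and valid ranges are assumed, not taken from the PR. */
final class PforHeaderCheckSketch {
    static final int HEADER_SIZE = 7;          // stated in the PR description
    static final int ASSUMED_PACKING_MODE = 0; // assumption: a single supported mode
    static final int ASSUMED_MAX_LOG_VECTOR_SIZE = 16; // assumption

    /**
     * Validates the header and returns num_elements.
     * Assumed layout: [0] packing mode, [1] log vector size,
     * [2] value byte width, [3..6] num_elements as little-endian int32.
     */
    static int validate(byte[] page, int valuesCount) {
        if (page.length < HEADER_SIZE) {
            throw new IllegalArgumentException("truncated header: " + page.length + " bytes");
        }
        int packingMode = page[0] & 0xFF;
        int logVectorSize = page[1] & 0xFF;
        int valueByteWidth = page[2] & 0xFF;
        int numElements = (page[3] & 0xFF)
                | ((page[4] & 0xFF) << 8)
                | ((page[5] & 0xFF) << 16)
                | ((page[6] & 0xFF) << 24);

        if (packingMode != ASSUMED_PACKING_MODE) {
            throw new IllegalArgumentException("bad packing mode: " + packingMode);
        }
        if (logVectorSize > ASSUMED_MAX_LOG_VECTOR_SIZE) {
            throw new IllegalArgumentException("log vector size out of range: " + logVectorSize);
        }
        if (valueByteWidth != 4 && valueByteWidth != 8) { // INT32 or INT64
            throw new IllegalArgumentException("bad value byte width: " + valueByteWidth);
        }
        if (numElements < 0) {
            throw new IllegalArgumentException("negative num_elements: " + numElements);
        }
        if (numElements > valuesCount) {
            throw new IllegalArgumentException(
                    "num_elements " + numElements + " exceeds valuesCount " + valuesCount);
        }
        return numElements;
    }
}
```

Rejecting a malformed header up front means later per-vector decoding can trust the element count and vector size instead of re-checking them on every read or skip.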

Pre-allocate reusable decode buffers (deltasBuffer, excPositionsBuffer, unpackPadBuf, unpackTempBuf) in allocateDecodedBuffer instead of allocating new arrays on every decodeVector call. Mirrors the writer-side improvement from the previous commit.

getBytes() now emits a valid 7-byte header even when totalCount==0, so the reader can distinguish an empty PFOR page from a missing encoding. Update assertions from size==0 to size==PFOR_HEADER_SIZE.

sfc-gh-pgaur added 8 commits April 21, 2026 00:15

Update empty-input tests for header-always-emitted behavior

680b9a9



prtkgaur wants to merge 8 commits into apache:master from prtkgaur:pfor-encoding
