[WIP][POC] Pfor encoding #3595

Draft
prtkgaur wants to merge 8 commits into apache:master from prtkgaur:pfor-encoding

prtkgaur commented Jun 3, 2026

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Implements the PFOR (Patched Frame of Reference) integer compression
encoding for INT32 and INT64 columns in the pfor package:
- PforConstants: header/vector sizes, max exceptions (65535)
- PforEncoderDecoder: histogram-based cost model for optimal bit width
- PforValuesWriter: IntPforValuesWriter + LongPforValuesWriter with
  vector-buffered encoding and interleaved page layout
- PforValuesReader: abstract base with lazy per-vector decoding
- PforValuesReaderForInt: INT32 decoder using BytePacker
- PforValuesReaderForLong: INT64 decoder using BytePackerForLong
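A histogram-based cost model of the kind listed for PforEncoderDecoder might look roughly like the sketch below. It is not taken from the PR: the class name, the 6-byte exception cost, and the use of the minimum value as the frame of reference are all assumptions made for illustration.

```java
/** Illustrative only: a hypothetical PFOR bit-width chooser, not the PR's code. */
final class PforCostModelSketch {
    // Assumed cost of one exception: 2-byte position + 4-byte value.
    static final int EXCEPTION_COST_BITS = (2 + 4) * 8;

    /** Bits needed to hold a delta, treating it as unsigned (0 for delta == 0). */
    static int bitsNeeded(int delta) {
        return 32 - Integer.numberOfLeadingZeros(delta);
    }

    /** Returns the bit width minimizing packed bits plus exception overhead. */
    static int chooseBitWidth(int[] values) {
        if (values.length == 0) {
            return 0;
        }
        // Frame of reference: the minimum value (an assumption of this sketch).
        int min = values[0];
        for (int v : values) {
            min = Math.min(min, v);
        }
        // histogram[w] = number of deltas that need exactly w bits.
        // v - min may wrap as a signed int, but its unsigned bit pattern is correct.
        int[] histogram = new int[33];
        for (int v : values) {
            histogram[bitsNeeded(v - min)]++;
        }
        int bestWidth = 32;
        long bestCost = Long.MAX_VALUE;
        int exceptions = values.length; // deltas wider than the candidate width
        for (int width = 0; width <= 32; width++) {
            exceptions -= histogram[width]; // these deltas now fit in `width` bits
            long cost = (long) values.length * width
                    + (long) exceptions * EXCEPTION_COST_BITS;
            if (cost < bestCost) {
                bestCost = cost;
                bestWidth = width;
            }
        }
        return bestWidth;
    }
}
```

With mostly small deltas and a single large outlier, the model prefers a narrow width and pays for one exception rather than widening every packed value.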

Wires PFOR encoding into the parquet-java read/write pipeline:
- Encoding.java: add PFOR enum with INT32/INT64 reader dispatch
- ParquetProperties.java: add pforEnabled column property with
  isPforEnabled() and builder methods withPforEncoding()
- DefaultV2ValuesWriterFactory.java: PFOR takes priority over
  BYTE_STREAM_SPLIT and DELTA_BINARY_PACKED for INT32/INT64
- ParquetMetadataConverter.java: guard for PFOR until thrift spec
  is merged upstream

Adds 64 tests covering:
- PforEncoderDecoderTest: bit width utilities and histogram-based cost model
- PforBitPackingTest: round-trip correctness across bit widths 0-64, partial groups, page header format
- PforValuesEndToEndTest: full writer→reader pipeline including reset/reuse, skip, edge cases, random data

Benchmarks encode/decode throughput for int32/int64 across 8 data
distributions inspired by Snowflake's NumericComprBenchmark: constant,
sequential, small range, high-base-small-range (timestamps), with
outliers (exception path), random, TPC-DS date keys, TPC-DS quantity.

Uses junit-benchmarks (matches existing delta encoding benchmarks).
Prints compression ratios for all distributions during setup.
Excluded from normal test runs by surefire's benchmark exclusion.

Writer:
- Pre-allocate reusable buffers (deltasBuffer, excPosBuffer, excValBuffer,
  metadataBuf, packBuf, packPadBuf) in constructor instead of allocating
  new arrays on every encodeAndFlushVector call
- Replace ByteBuffer.allocate().order(LITTLE_ENDIAN) with manual byte
  shifts into reusable metadataBuf for vector info and exception writes
- Emit valid header for totalCount==0 (reader can distinguish empty page
  from missing encoding) instead of BytesInput.empty()
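As a rough illustration of the manual byte-shift approach described for the writer, the sketch below writes little-endian values into a caller-owned, reusable byte array. The class and method names (`LittleEndianWriteSketch`, `putShortLE`, `putIntLE`, `putLongLE`) are hypothetical and are not claimed to match the PR.

```java
/** Illustrative only: little-endian writes into a reusable buffer. */
final class LittleEndianWriteSketch {
    /** Writes the low 16 bits of value at offset; returns the next free offset. */
    static int putShortLE(byte[] buf, int offset, int value) {
        buf[offset] = (byte) value;
        buf[offset + 1] = (byte) (value >>> 8);
        return offset + 2;
    }

    /** Writes a 32-bit value, least significant byte first. */
    static int putIntLE(byte[] buf, int offset, int value) {
        buf[offset] = (byte) value;
        buf[offset + 1] = (byte) (value >>> 8);
        buf[offset + 2] = (byte) (value >>> 16);
        buf[offset + 3] = (byte) (value >>> 24);
        return offset + 4;
    }

    /** Writes a 64-bit value, least significant byte first. */
    static int putLongLE(byte[] buf, int offset, long value) {
        for (int i = 0; i < 8; i++) {
            buf[offset + i] = (byte) (value >>> (8 * i));
        }
        return offset + 8;
    }
}
```

Because the buffer is passed in, a writer can allocate it once and reuse it for every vector, which avoids the per-call `ByteBuffer.allocate()` the commit message mentions.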

Reader:
- Add numElements > valuesCount validation (handles nullable columns where
  page row count > encoded values)
- Move getShortLE/getIntLE/getLongLE from private static in concrete
  readers to protected static in PforValuesReader base class
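The helper names `getShortLE`, `getIntLE`, and `getLongLE` come from the commit message; their bodies below are a guess at what such helpers typically look like, operating on a plain `byte[]` (an assumption, since the PR may read from a different buffer type). The `checkNumElements` method and its use of `IllegalArgumentException` are likewise hypothetical stand-ins for the described validation.

```java
/** Illustrative only: little-endian reads plus a bounds check on num_elements. */
final class LittleEndianReadSketch {
    /** Reads an unsigned 16-bit value as an int in [0, 65535]. */
    static int getShortLE(byte[] buf, int offset) {
        return (buf[offset] & 0xFF) | ((buf[offset + 1] & 0xFF) << 8);
    }

    /** Reads a 32-bit value stored least significant byte first. */
    static int getIntLE(byte[] buf, int offset) {
        return (buf[offset] & 0xFF)
                | ((buf[offset + 1] & 0xFF) << 8)
                | ((buf[offset + 2] & 0xFF) << 16)
                | ((buf[offset + 3] & 0xFF) << 24);
    }

    /** Reads a 64-bit value stored least significant byte first. */
    static long getLongLE(byte[] buf, int offset) {
        long result = 0;
        for (int i = 7; i >= 0; i--) {
            result = (result << 8) | (buf[offset + i] & 0xFF);
        }
        return result;
    }

    /**
     * Rejects a header count that is negative or exceeds the page's value count.
     * A smaller count is allowed, since null entries are not encoded.
     */
    static int checkNumElements(int numElements, int valuesCount) {
        if (numElements < 0 || numElements > valuesCount) {
            throw new IllegalArgumentException(
                    "num_elements " + numElements + " out of range [0, " + valuesCount + "]");
        }
        return numElements;
    }
}
```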

Tests cover:
- Bad packing mode, log vector size out of range, bad value byte width
- Negative num_elements, numElements > valuesCount
- Header-only page, truncated offset array, truncated vector data
- Corrupted offset pointing past buffer end
- Skip past end, negative skip, read past end
- Skip across vector boundaries (correctness check)
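The skip and read-past-end cases above interact with lazy per-vector decoding. The sketch below shows one way that interaction can work; it is a simplified model, not the PR's reader. Here a "vector" is just a slice of an `int[]`, "decoding" is a copy, and the class name, `decodeCalls` counter, and exception types are hypothetical.

```java
import java.util.Arrays;

/** Illustrative only: skip() moves a cursor; a vector is decoded on first read. */
final class LazyVectorReaderSketch {
    private final int[][] encodedVectors; // stand-in for encoded vector bytes
    private final int vectorSize;
    private final int totalCount;
    private int position = 0;
    private int decodedVectorIndex = -1;  // -1 means nothing decoded yet
    private int[] decoded;
    int decodeCalls = 0;                  // exposed so laziness can be observed

    LazyVectorReaderSketch(int[] values, int vectorSize) {
        this.vectorSize = vectorSize;
        this.totalCount = values.length;
        int vectorCount = (values.length + vectorSize - 1) / vectorSize;
        this.encodedVectors = new int[vectorCount][];
        for (int i = 0; i < vectorCount; i++) {
            int from = i * vectorSize;
            int to = Math.min(from + vectorSize, values.length);
            encodedVectors[i] = Arrays.copyOfRange(values, from, to);
        }
    }

    /** Advances the cursor without decoding anything. */
    void skip(int n) {
        if (n < 0 || n > totalCount - position) {
            throw new IllegalArgumentException("invalid skip: " + n);
        }
        position += n;
    }

    /** Decodes the vector containing the cursor only if it is not already cached. */
    int readInteger() {
        if (position >= totalCount) {
            throw new IllegalStateException("read past end");
        }
        int vectorIndex = position / vectorSize;
        if (vectorIndex != decodedVectorIndex) {
            decoded = encodedVectors[vectorIndex].clone(); // stands in for a real decode
            decodedVectorIndex = vectorIndex;
            decodeCalls++;
        }
        return decoded[position++ % vectorSize];
    }
}
```

In this model, skipping over an entire vector never triggers a decode for it, which is the property a cross-boundary skip test would check alongside value correctness.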

Pre-allocate reusable decode buffers (deltasBuffer, excPositionsBuffer,
unpackPadBuf, unpackTempBuf) in allocateDecodedBuffer instead of
allocating new arrays on every decodeVector call. Mirrors the writer-side
improvement from the previous commit.

getBytes() now emits a valid 7-byte header even when totalCount==0,
so the reader can distinguish an empty PFOR page from a missing
encoding. Update assertions from size==0 to size==PFOR_HEADER_SIZE.
