[WIP][POC] Pfor encoding #3595
Draft
prtkgaur wants to merge 8 commits into
Implements the PFOR (Patched Frame of Reference) integer compression encoding for INT32 and INT64 columns in the pfor package:
- PforConstants: header/vector sizes, max exceptions (65535)
- PforEncoderDecoder: histogram-based cost model for optimal bit width
- PforValuesWriter: IntPforValuesWriter + LongPforValuesWriter with vector-buffered encoding and interleaved page layout
- PforValuesReader: abstract base with lazy per-vector decoding
- PforValuesReaderForInt: INT32 decoder using BytePacker
- PforValuesReaderForLong: INT64 decoder using BytePackerForLong
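To illustrate what a histogram-based cost model for choosing the bit width might look like, here is a minimal sketch. It is not the PR's PforEncoderDecoder: the class name `PforCostSketch`, its method names, and the per-exception cost (a 16-bit position plus a 64-bit raw value) are all assumptions made for illustration. The idea is to count how many deltas need each bit width, then pick the width that minimizes packed bits plus exception bits.

```java
// Hypothetical sketch of a PFOR bit-width cost model; not the PR's code.
final class PforCostSketch {
    // Assumed cost of one exception: 16-bit position + 64-bit raw value.
    static final int EXCEPTION_BITS = 16 + 64;

    /** Bits needed to store a delta, treating it as unsigned (0 for delta == 0). */
    static int bitsNeeded(long delta) {
        return 64 - Long.numberOfLeadingZeros(delta);
    }

    /** Returns the bit width minimizing packed bits plus exception bits. */
    static int bestBitWidth(long[] values) {
        if (values.length == 0) {
            return 0;
        }
        // Frame of reference: the minimum value in the vector.
        long min = values[0];
        for (long v : values) {
            min = Math.min(min, v);
        }
        // histogram[w] = number of deltas that need exactly w bits.
        // v - min may wrap, but the wrapped result is the correct unsigned delta.
        int[] histogram = new int[65];
        for (long v : values) {
            histogram[bitsNeeded(v - min)]++;
        }
        int bestWidth = 64;
        long bestCost = Long.MAX_VALUE;
        int exceptions = 0; // deltas wider than the candidate width
        for (int width = 64; width >= 0; width--) {
            long cost = (long) values.length * width
                    + (long) exceptions * EXCEPTION_BITS;
            if (cost <= bestCost) { // ties go to the narrower width
                bestCost = cost;
                bestWidth = width;
            }
            exceptions += histogram[width];
        }
        return bestWidth;
    }

    public static void main(String[] args) {
        long[] values = {100, 101, 102, 103, 104, 105, 106, 107, 1_000_000};
        System.out.println("chosen bit width: " + bestBitWidth(values));
    }
}
```

With the sample input in `main`, eight deltas fit in 3 bits and one outlier needs 20, so under the assumed exception cost the sketch picks a width of 3 and stores the outlier as an exception.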
Wires PFOR encoding into the parquet-java read/write pipeline:
- Encoding.java: add PFOR enum with INT32/INT64 reader dispatch
- ParquetProperties.java: add pforEnabled column property with isPforEnabled() and builder methods withPforEncoding()
- DefaultV2ValuesWriterFactory.java: PFOR takes priority over BYTE_STREAM_SPLIT and DELTA_BINARY_PACKED for INT32/INT64
- ParquetMetadataConverter.java: guard for PFOR until thrift spec is merged upstream
64 tests covering:
- PforEncoderDecoderTest: bit width utilities and histogram-based cost model
- PforBitPackingTest: round-trip correctness across bit widths 0-64, partial groups, page header format
- PforValuesEndToEndTest: full writer→reader pipeline including reset/reuse, skip, edge cases, random data
Benchmarks encode/decode throughput for int32/int64 across 8 data distributions inspired by Snowflake's NumericComprBenchmark: constant, sequential, small range, high-base-small-range (timestamps), with outliers (exception path), random, TPC-DS date keys, TPC-DS quantity. Uses junit-benchmarks (matches existing delta encoding benchmarks). Prints compression ratios for all distributions during setup. Excluded from normal test runs by surefire's benchmark exclusion.
Writer:
- Pre-allocate reusable buffers (deltasBuffer, excPosBuffer, excValBuffer, metadataBuf, packBuf, packPadBuf) in constructor instead of allocating new arrays on every encodeAndFlushVector call
- Replace ByteBuffer.allocate().order(LITTLE_ENDIAN) with manual byte shifts into reusable metadataBuf for vector info and exception writes
- Emit valid header for totalCount==0 (reader can distinguish empty page from missing encoding) instead of BytesInput.empty()

Reader:
- Add numElements > valuesCount validation (handles nullable columns where page row count > encoded values)
- Move getShortLE/getIntLE/getLongLE from private static in concrete readers to protected static in PforValuesReader base class
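The manual little-endian byte shifts described above can be sketched roughly as follows. The helper names echo the getShortLE/getIntLE/getLongLE helpers mentioned in this commit, but the signatures, the `put*` counterparts, and the class name `LittleEndianSketch` are assumptions for illustration, not the PR's actual code. The point is that writing into a caller-supplied byte array avoids allocating a ByteBuffer per call.

```java
// Hypothetical little-endian helpers over a reusable byte[]; signatures are assumed.
final class LittleEndianSketch {
    /** Writes value at buf[offset..offset+3], least significant byte first. */
    static void putIntLE(byte[] buf, int offset, int value) {
        buf[offset] = (byte) value;
        buf[offset + 1] = (byte) (value >>> 8);
        buf[offset + 2] = (byte) (value >>> 16);
        buf[offset + 3] = (byte) (value >>> 24);
    }

    /** Writes value at buf[offset..offset+7], least significant byte first. */
    static void putLongLE(byte[] buf, int offset, long value) {
        for (int i = 0; i < 8; i++) {
            buf[offset + i] = (byte) (value >>> (8 * i));
        }
    }

    static short getShortLE(byte[] buf, int offset) {
        return (short) ((buf[offset] & 0xFF) | ((buf[offset + 1] & 0xFF) << 8));
    }

    static int getIntLE(byte[] buf, int offset) {
        return (buf[offset] & 0xFF)
                | ((buf[offset + 1] & 0xFF) << 8)
                | ((buf[offset + 2] & 0xFF) << 16)
                | ((buf[offset + 3] & 0xFF) << 24);
    }

    static long getLongLE(byte[] buf, int offset) {
        long result = 0;
        // Start from the most significant byte so each shift moves it into place.
        for (int i = 7; i >= 0; i--) {
            result = (result << 8) | (buf[offset + i] & 0xFFL);
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] metadataBuf = new byte[12]; // reused across calls in a real writer
        putIntLE(metadataBuf, 0, 0x12345678);
        putLongLE(metadataBuf, 4, -1234567890123L);
        System.out.println(Integer.toHexString(getIntLE(metadataBuf, 0)));
        System.out.println(getLongLE(metadataBuf, 4));
    }
}
```

The `& 0xFF` masks matter on the read side because Java bytes are signed; without them, a byte with its high bit set would sign-extend and corrupt the upper bits of the result.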
Tests cover:
- Bad packing mode, log vector size out of range, bad value byte width
- Negative num_elements, numElements > valuesCount
- Header-only page, truncated offset array, truncated vector data
- Corrupted offset pointing past buffer end
- Skip past end, negative skip, read past end
- Skip across vector boundaries (correctness check)
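As a rough picture of the skip and read behaviors these tests exercise, the sketch below models a reader that decodes a vector only when a value from it is actually read. Everything here is hypothetical: the class `LazyVectorReaderSketch`, its fields, and the use of plain `long[]` arrays as a stand-in for encoded vector bytes are assumptions, and it throws standard Java exceptions rather than whatever exception type the PR's reader uses.

```java
// Hypothetical model of lazy per-vector decoding with bounds-checked skip.
final class LazyVectorReaderSketch {
    private final long[][] encodedVectors; // stand-in for encoded vector bytes
    private final int vectorSize;
    private final int totalCount;
    private int position = 0;
    private int decodedVectorIndex = -1;
    private long[] decoded;
    int decodeCalls = 0; // exposed so callers can observe laziness

    LazyVectorReaderSketch(long[][] encodedVectors, int vectorSize, int totalCount) {
        this.encodedVectors = encodedVectors;
        this.vectorSize = vectorSize;
        this.totalCount = totalCount;
    }

    /** Advances the position without decoding; rejects negative or overlong skips. */
    void skip(int n) {
        if (n < 0 || n > totalCount - position) {
            throw new IllegalArgumentException("invalid skip: " + n);
        }
        position += n;
    }

    /** Reads the next value, decoding its vector only if it is not already decoded. */
    long readLong() {
        if (position >= totalCount) {
            throw new IllegalStateException("read past end");
        }
        int vectorIndex = position / vectorSize;
        if (vectorIndex != decodedVectorIndex) {
            // Placeholder for real unpacking and exception patching.
            decoded = encodedVectors[vectorIndex].clone();
            decodedVectorIndex = vectorIndex;
            decodeCalls++;
        }
        long value = decoded[position % vectorSize];
        position++;
        return value;
    }

    public static void main(String[] args) {
        long[][] vectors = {{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10}};
        LazyVectorReaderSketch reader = new LazyVectorReaderSketch(vectors, 4, 10);
        reader.skip(5); // crosses the first vector boundary without decoding it
        System.out.println(reader.readLong());
        System.out.println("vectors decoded: " + reader.decodeCalls);
    }
}
```

Because `skip` only moves the position, a skip that jumps over an entire vector never pays for decoding it, which is the property a skip-across-boundaries correctness check would want to confirm alongside the returned values.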
Pre-allocate reusable decode buffers (deltasBuffer, excPositionsBuffer, unpackPadBuf, unpackTempBuf) in allocateDecodedBuffer instead of allocating new arrays on every decodeVector call. Mirrors the writer-side improvement from the previous commit.
getBytes() now emits a valid 7-byte header even when totalCount==0, so the reader can distinguish an empty PFOR page from a missing encoding. Update assertions from size==0 to size==PFOR_HEADER_SIZE.
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?