Fix errors and inconsistencies in Variant format documentation #574
Open
iemejia wants to merge 1 commit into
VariantEncoding.md:
- Fix BINARY -> BYTE_ARRAY (BINARY is not a Parquet physical type)
- Add note on decimal little-endian vs big-endian difference
- Fix decimal implied-precision formula for val <= 0
- Label undocumented reserved bits in metadata/object/array headers
- Make sorted_strings description consistent across three definitions
- Use INT(N, true) notation consistent with LogicalTypes.md
- Hyphenate compound adjectives ("3 byte" -> "3-byte", etc.)
VariantShredding.md:
- Fix Python syntax error: iterating dict yields keys only;
add .items() for (name, field) unpacking
- Replace BINARY with BYTE_ARRAY
- Fix comma -> colon inside JSON-like literal in table cell
- Remove trailing space inside backticks in table header
- Use INT(N, true) notation consistent with LogicalTypes.md
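For context on the decimal byte-order note listed above, here is a minimal sketch of the difference, assuming the Variant decimal payload stores its unscaled value little-endian while Parquet's DECIMAL on `FIXED_LEN_BYTE_ARRAY`/`BYTE_ARRAY` stores it as big-endian two's complement. The variable names and the 4-byte width are illustrative choices, not text from the PR.

```python
# Unscaled integer for the decimal -123.45 at scale 2 (illustrative value).
unscaled = -12345

# Assumed Variant layout: the unscaled value is written little-endian.
variant_bytes = unscaled.to_bytes(4, byteorder="little", signed=True)

# Assumed Parquet DECIMAL layout on byte arrays: big-endian two's complement.
parquet_bytes = unscaled.to_bytes(4, byteorder="big", signed=True)

# Same number, reversed byte order.
same_value = (
    int.from_bytes(variant_bytes, "little", signed=True)
    == int.from_bytes(parquet_bytes, "big", signed=True)
)
```

A reader or writer that copies decimal bytes between the two representations without reversing them would silently produce a different number, which is presumably why the note is worth adding.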
alamb
reviewed
Jun 3, 2026
  `sorted_strings` is a 1-bit value indicating whether dictionary strings are sorted and unique.
  `offset_size_minus_one` is a 2-bit value providing the number of bytes per dictionary size and offset field.
  The actual number of bytes, `offset_size`, is `offset_size_minus_one + 1`.
+ Bit 5 (marked `R`) is reserved; it must be set to 0 by writers and ignored by readers.
Contributor
This "must be set to 0 by writers" wording seems to me to change the spec. Previously the bit's value was not specified. I think we should leave it unspecified, though it would be nice to clarify that it should be ignored by readers.
Suggested change
- Bit 5 (marked `R`) is reserved; it must be set to 0 by writers and ignored by readers.
+ Bit 5 (marked `R`) is reserved; it must be ignored by readers.
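To make the reader-side behaviour in this suggestion concrete, here is a minimal sketch of a metadata header decoder that never inspects bit 5. The function name is hypothetical, and the positions of the other fields (version in bits 3-0, `sorted_strings` in bit 4, `offset_size_minus_one` in bits 7-6) are an assumption based on one reading of VariantEncoding.md; only the reserved bit's position comes from the diff above.

```python
def parse_metadata_header(header: int) -> dict:
    """Decode a 1-byte Variant metadata header, ignoring reserved bit 5."""
    if not 0 <= header <= 0xFF:
        raise ValueError("header must fit in one byte")
    version = header & 0x0F                       # bits 3-0 (assumed layout)
    sorted_strings = bool((header >> 4) & 0x01)   # bit 4 (assumed layout)
    # Bit 5 is reserved: it is deliberately neither read nor validated.
    offset_size_minus_one = (header >> 6) & 0x03  # bits 7-6 (assumed layout)
    return {
        "version": version,
        "sorted_strings": sorted_strings,
        "offset_size": offset_size_minus_one + 1,
    }


# Two headers that differ only in the reserved bit decode identically.
with_reserved_clear = parse_metadata_header(0b01010001)
with_reserved_set = parse_metadata_header(0b01110001)
```

Under the suggested wording, a reader like this stays correct whether or not writers zero the bit, so the spec does not need to add a new writer requirement.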
  The actual number of bytes is computed as `field_offset_size_minus_one + 1` and `field_id_size_minus_one + 1`.
  `is_large` is a 1-bit value that indicates how many bytes are used to encode the number of elements.
  If `is_large` is `0`, 1 byte is used, and if `is_large` is `1`, 4 bytes are used.
+ Bit 5 (marked `R`) is reserved; it must be set to 0 by writers and ignored by readers.
Contributor
same comment as above -- I don't think we should mandate setting to 0
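The object header fields quoted in this hunk can be decoded along the following lines. This is a sketch, not the reference implementation: the names `ObjectHeader` and `parse_object_header` are hypothetical, and the bit positions of the non-reserved fields (`field_offset_size_minus_one` in bits 1-0, `field_id_size_minus_one` in bits 3-2, `is_large` in bit 4 of the 6-bit value header) are assumed from one reading of VariantEncoding.md.

```python
from typing import NamedTuple


class ObjectHeader(NamedTuple):
    field_offset_size: int   # bytes per field offset
    field_id_size: int       # bytes per field id
    num_elements_size: int   # bytes used to encode the element count


def parse_object_header(value_header: int) -> ObjectHeader:
    """Decode the 6-bit object value header; reserved bit 5 is ignored."""
    field_offset_size = (value_header & 0b11) + 1
    field_id_size = ((value_header >> 2) & 0b11) + 1
    is_large = (value_header >> 4) & 0b1
    # is_large == 0 means a 1-byte count, is_large == 1 means a 4-byte count.
    return ObjectHeader(field_offset_size, field_id_size, 4 if is_large else 1)


header = parse_object_header(0b010110)
header_reserved_set = parse_object_header(0b110110)
```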
  The actual number of bytes is computed as `field_offset_size_minus_one + 1`.
  `is_large` is a 1-bit value that indicates how many bytes are used to encode the number of elements.
  If `is_large` is `0`, 1 byte is used, and if `is_large` is `1`, 4 bytes are used.
+ Bits 5-3 (marked `RRR`) are reserved; they must be set to 0 by writers and ignored by readers.
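For the array header, "ignored by readers" can be checked as a property: every pattern of the three reserved bits should decode to the same result. The sketch below assumes `field_offset_size_minus_one` occupies bits 1-0 and `is_large` bit 2 of the 6-bit value header; the function name is hypothetical.

```python
def parse_array_header(value_header: int) -> tuple:
    """Return (field_offset_size, num_elements_size); bits 5-3 are ignored."""
    field_offset_size = (value_header & 0b11) + 1
    is_large = (value_header >> 2) & 0b1
    return field_offset_size, (4 if is_large else 1)


# Try all eight reserved-bit patterns on top of the same low bits.
low_bits = 0b101  # is_large = 1, field_offset_size_minus_one = 1
results = {parse_array_header((reserved << 3) | low_bits) for reserved in range(8)}
```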
  | Float | float | `14` | FLOAT | IEEE little-endian |
- | Binary | binary | `15` | BINARY | 4 byte little-endian size, followed by bytes |
- | String | string | `16` | STRING | 4 byte little-endian size, followed by UTF-8 encoded bytes |
+ | Binary | binary | `15` | BYTE_ARRAY | 4-byte little-endian size, followed by bytes |
Contributor
Double checked against https://parquet.apache.org/docs/file-format/types/
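As an illustration of the "4-byte little-endian size, followed by bytes" format in the Binary row, here is a short sketch using Python's `struct` module. It covers only the size-prefixed payload described in the table cell and ignores any header byte that precedes it; the function names are hypothetical.

```python
import struct


def encode_binary_payload(data: bytes) -> bytes:
    """Prefix the raw bytes with their length as an unsigned 4-byte LE integer."""
    return struct.pack("<I", len(data)) + data


def decode_binary_payload(buf: bytes, pos: int = 0) -> bytes:
    """Read a 4-byte little-endian size at pos, then return that many bytes."""
    (size,) = struct.unpack_from("<I", buf, pos)
    start = pos + 4
    end = start + size
    if end > len(buf):
        raise ValueError("buffer is shorter than the declared size")
    return buf[start:end]


encoded = encode_binary_payload(b"abc")
decoded = decode_binary_payload(encoded)
```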
  | boolean | BOOLEAN | |
- | int8 | INT32 | INT(8, signed=true) |
- | int16 | INT32 | INT(16, signed=true) |
+ | int8 | INT32 | INT(8, true) |
Contributor
I found the signed=true easier to understand to be honest, but this is technically accurate too
  object_fields = {
      name: construct_variant(metadata, field.value, field.typed_value)
-     for (name, field) in typed_value
+     for (name, field) in typed_value.items()
Contributor
Why this change? Does the original not work? Did you test that this still works after the proposed change?
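One way to answer the question empirically is the small check below. It assumes `typed_value` behaves like a plain Python `dict` keyed by field name, which may not hold if the spec's snippet is meant as pseudocode; the helper names and sample data are hypothetical.

```python
def unpack_without_items(mapping):
    # Iterating a dict yields keys only, so each key string itself gets unpacked.
    return [(name, field) for (name, field) in mapping]


def unpack_with_items(mapping):
    # .items() yields (key, value) pairs, which is what the unpacking expects.
    return [(name, field) for (name, field) in mapping.items()]


# A key that is not exactly two characters long fails to unpack.
try:
    unpack_without_items({"event_type": "shredded-field"})
    raised = False
except ValueError:
    raised = True

# A two-character key unpacks "successfully" into its two characters,
# silently producing the wrong (name, field) pair.
silent_wrong = unpack_without_items({"ts": "shredded-field"})

correct = unpack_with_items({"ts": "shredded-field"})
```

If `typed_value` is instead some iterable of `(name, field)` pairs, the original line would already work and `.items()` would not be needed, so the answer depends on which reading the spec intends.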
Summary
Fix bugs, terminology errors, and inconsistencies in the Variant format specification documents.
Changes
VariantEncoding.md
- Make `sorted_strings` description consistent across three definitions
- Use `INT(N, true)` notation consistent with LogicalTypes.md
VariantShredding.md
- Add `.items()` for (name, field) unpacking
- Use `INT(N, true)` notation consistent with LogicalTypes.md
Validation
No semantic/behavioral changes to the format specification. All fixes are documentation-only.
Split from #572 for easier review.