Fix errors and inconsistencies in Variant format documentation by iemejia · Pull Request #574 · apache/parquet-format

iemejia · 2026-06-02T18:32:34Z

Summary

Fix bugs, terminology errors, and inconsistencies in the Variant format specification documents.

Changes

VariantEncoding.md

Fix BINARY -> BYTE_ARRAY (BINARY is not a Parquet physical type)
Add note on decimal little-endian vs big-endian difference
Fix decimal implied-precision formula for val <= 0
Label undocumented reserved bits in metadata/object/array headers
Make sorted_strings description consistent across three definitions
Use INT(N, true) notation consistent with LogicalTypes.md
Hyphenate compound adjectives ("3 byte" -> "3-byte", etc.)
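
For the decimal endianness note, a minimal Python sketch of the difference follows. It assumes a 4-byte unscaled value. Variant stores the unscaled integer little-endian, while Parquet's DECIMAL on byte-array physical types uses big-endian two's complement. The variable names are illustrative only and do not come from the spec.

```python
# Unscaled value for the decimal -123.45 at scale 2 (illustrative numbers).
unscaled = -12345

# Variant encoding: little-endian two's complement.
variant_bytes = unscaled.to_bytes(4, byteorder="little", signed=True)

# Parquet DECIMAL on FIXED_LEN_BYTE_ARRAY / BYTE_ARRAY: big-endian two's complement.
parquet_bytes = unscaled.to_bytes(4, byteorder="big", signed=True)

# Same bytes, opposite order; decoding with the wrong order gives a wrong value.
decoded_variant = int.from_bytes(variant_bytes, byteorder="little", signed=True)
decoded_parquet = int.from_bytes(parquet_bytes, byteorder="big", signed=True)
```

A reader that converts between the two representations has to reverse the byte order of the unscaled value.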

VariantShredding.md

Fix Python error: iterating a dict yields keys only; add .items() for (name, field) unpacking
Replace BINARY with BYTE_ARRAY
Fix comma -> colon inside JSON-like literal in table cell
Remove trailing space inside backticks in table header
Use INT(N, true) notation consistent with LogicalTypes.md

Validation

No semantic/behavioral changes to the format specification. All fixes are documentation-only.

Split from #572 for easier review.


alamb

Thanks @iemejia

I left some comments for review

alamb · 2026-06-03T15:55:17Z

 `sorted_strings` is a 1-bit value indicating whether dictionary strings are sorted and unique.
 `offset_size_minus_one` is a 2-bit value providing the number of bytes per dictionary size and offset field.
 The actual number of bytes, `offset_size`, is `offset_size_minus_one + 1`.
+Bit 5 (marked `R`) is reserved; it must be set to 0 by writers and ignored by readers.


This "must be set to 0 by writers" wording seems to me to change the spec. Previously it was not specified. I think we should leave it unspecified, though it would be nice to clarify that readers should ignore it.

Suggested change

-Bit 5 (marked `R`) is reserved; it must be set to 0 by writers and ignored by readers.
+Bit 5 (marked `R`) is reserved; it must be ignored by readers.
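
To illustrate what "ignored by readers" means in practice, here is a rough Python sketch of a metadata header parser. The layout is an assumption based on my reading of VariantEncoding.md: `version` in bits 3-0, `sorted_strings` in bit 4, the reserved bit in bit 5, and `offset_size_minus_one` in bits 7-6. The function name `parse_metadata_header` is hypothetical.

```python
def parse_metadata_header(header: int) -> dict:
    """Decode the 1-byte Variant metadata header (assumed layout, see lead-in)."""
    version = header & 0x0F                    # bits 3-0 (assumed position)
    sorted_strings = (header >> 4) & 0x1       # bit 4
    # Bit 5 is reserved: it is deliberately never read, so any value is accepted.
    offset_size = ((header >> 6) & 0x3) + 1    # bits 7-6, stored as size minus one
    return {
        "version": version,
        "sorted_strings": bool(sorted_strings),
        "offset_size": offset_size,
    }


# Two headers that differ only in the reserved bit decode identically.
with_reserved_clear = parse_metadata_header(0b01010001)
with_reserved_set = parse_metadata_header(0b01110001)
```

Under the suggested wording, both inputs are valid, and a reader written this way does not need to care which one a writer produced.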

alamb · 2026-06-03T15:55:58Z

 The actual number of bytes is computed as `field_offset_size_minus_one + 1` and `field_id_size_minus_one + 1`.
 `is_large` is a 1-bit value that indicates how many bytes are used to encode the number of elements.
 If `is_large` is `0`, 1 byte is used, and if `is_large` is `1`, 4 bytes are used.
+Bit 5 (marked `R`) is reserved; it must be set to 0 by writers and ignored by readers.


same comment as above -- I don't think we should mandate setting to 0

alamb · 2026-06-03T15:56:05Z

 The actual number of bytes is computed as `field_offset_size_minus_one + 1`.
 `is_large` is a 1-bit value that indicates how many bytes are used to encode the number of elements.
 If `is_large` is `0`, 1 byte is used, and if `is_large` is `1`, 4 bytes are used.
+Bits 5-3 (marked `RRR`) are reserved; they must be set to 0 by writers and ignored by readers.


same as above
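
The same reader behavior for the object and array headers can be sketched as follows. The bit positions are assumptions from my reading of the 6-bit `value_header` diagrams in VariantEncoding.md (object: `field_offset_size_minus_one` in bits 1-0, `field_id_size_minus_one` in bits 3-2, `is_large` in bit 4, bit 5 reserved; array: `field_offset_size_minus_one` in bits 1-0, `is_large` in bit 2, bits 5-3 reserved). Both function names are hypothetical.

```python
def parse_object_value_header(value_header: int) -> dict:
    """Decode the 6-bit object value_header (assumed layout, see lead-in)."""
    field_offset_size = (value_header & 0x3) + 1        # bits 1-0
    field_id_size = ((value_header >> 2) & 0x3) + 1     # bits 3-2
    is_large = (value_header >> 4) & 0x1                # bit 4
    # Bit 5 is reserved and intentionally not inspected.
    return {
        "field_offset_size": field_offset_size,
        "field_id_size": field_id_size,
        "num_elements_size": 4 if is_large else 1,
    }


def parse_array_value_header(value_header: int) -> dict:
    """Decode the 6-bit array value_header (assumed layout, see lead-in)."""
    field_offset_size = (value_header & 0x3) + 1        # bits 1-0
    is_large = (value_header >> 2) & 0x1                # bit 2
    # Bits 5-3 are reserved and intentionally not inspected.
    return {
        "field_offset_size": field_offset_size,
        "num_elements_size": 4 if is_large else 1,
    }


obj = parse_object_value_header(0b010110)
obj_reserved_set = parse_object_value_header(0b110110)
arr = parse_array_value_header(0b000101)
arr_reserved_set = parse_array_value_header(0b111101)
```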

alamb · 2026-06-03T15:57:24Z

 | Float                | float                       | `14`    | FLOAT                       | IEEE little-endian                                                                                                  |
-| Binary               | binary                      | `15`    | BINARY                      | 4 byte little-endian size, followed by bytes                                                                        |
-| String               | string                      | `16`    | STRING                      | 4 byte little-endian size, followed by UTF-8 encoded bytes                                                          |
+| Binary               | binary                      | `15`    | BYTE_ARRAY                  | 4-byte little-endian size, followed by bytes                                                                        |


Double checked against https://parquet.apache.org/docs/file-format/types/
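
For context, the encoding column in these rows ("4-byte little-endian size, followed by bytes") can be sketched in Python as below. The helper names are hypothetical, and treating the size as unsigned is my assumption, since the table only says "size".

```python
import struct


def encode_sized_bytes(payload: bytes) -> bytes:
    """Prefix the payload with its length as a 4-byte little-endian integer."""
    return struct.pack("<I", len(payload)) + payload


def decode_sized_bytes(buf: bytes, pos: int = 0) -> tuple[bytes, int]:
    """Read a length-prefixed value at pos; return (payload, next position)."""
    (size,) = struct.unpack_from("<I", buf, pos)
    start = pos + 4
    return buf[start:start + size], start + size


# A string value is the same layout with UTF-8 encoded bytes as the payload.
encoded = encode_sized_bytes("héllo".encode("utf-8"))
payload, next_pos = decode_sized_bytes(encoded)
```

The rename from BINARY to BYTE_ARRAY changes only the name of the equivalent Parquet type in the table, not this byte layout.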

alamb · 2026-06-03T15:57:48Z

 | boolean                     | BOOLEAN                           |                          |
-| int8                        | INT32                             | INT(8, signed=true)      |
-| int16                       | INT32                             | INT(16, signed=true)     |
+| int8                        | INT32                             | INT(8, true)             |


I found the signed=true easier to understand to be honest, but this is technically accurate too

alamb · 2026-06-03T15:58:43Z

            object_fields = {
                name: construct_variant(metadata, field.value, field.typed_value)
-                for (name, field) in typed_value
+                for (name, field) in typed_value.items()


Why this change? Does the original not work? Did you test that this still works after the proposed change?
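
A quick way to check this, assuming `typed_value` behaves like a plain Python dict keyed by field name (the spec pseudocode does not pin down its type), is the sketch below. The field names and the `SimpleNamespace` stand-ins are invented for illustration. Note that under this assumption the original fails at runtime with a `ValueError` rather than a syntax error.

```python
from types import SimpleNamespace

# Hypothetical stand-in for a shredded object's typed_value: name -> field group.
typed_value = {
    "event_type": SimpleNamespace(value=None, typed_value="login"),
    "event_ts": SimpleNamespace(value=None, typed_value=1729794114937),
}


def unpack_without_items(mapping):
    # Iterating a dict yields its keys only, so each key string is unpacked
    # character by character into (name, field) and the unpacking fails.
    return {name: field.typed_value for (name, field) in mapping}


def unpack_with_items(mapping):
    # .items() yields (key, value) pairs, matching the (name, field) target.
    return {name: field.typed_value for (name, field) in mapping.items()}


try:
    unpack_without_items(typed_value)
    original_error = None
except ValueError as exc:
    original_error = exc

fields = unpack_with_items(typed_value)
```

If `typed_value` were instead an iterable of (name, field) pairs, the original loop would work as written, so the fix depends on which representation the pseudocode intends.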

iemejia mentioned this pull request Jun 2, 2026

Fix specification typos, grammar, inconsistencies, and errors #572

Closed

iemejia wants to merge 1 commit into apache:master from iemejia:fix/variant-docs
