Skip to content

Fix errors and inconsistencies in Variant format documentation#574

Open
iemejia wants to merge 1 commit into
apache:masterfrom
iemejia:fix/variant-docs
Open

Fix errors and inconsistencies in Variant format documentation#574
iemejia wants to merge 1 commit into
apache:masterfrom
iemejia:fix/variant-docs

Conversation

@iemejia
Copy link
Copy Markdown
Member

@iemejia iemejia commented Jun 2, 2026

Summary

Fix bugs, terminology errors, and inconsistencies in the Variant format specification documents.

Changes

VariantEncoding.md

  • Fix BINARY -> BYTE_ARRAY (BINARY is not a Parquet physical type)
  • Add note on decimal little-endian vs big-endian difference
  • Fix decimal implied-precision formula for val <= 0
  • Label undocumented reserved bits in metadata/object/array headers
  • Make sorted_strings description consistent across three definitions
  • Use INT(N, true) notation consistent with LogicalTypes.md
  • Hyphenate compound adjectives ("3 byte" -> "3-byte", etc.)

VariantShredding.md

  • Fix Python syntax error: iterating dict yields keys only; add .items() for (name, field) unpacking
  • Replace BINARY with BYTE_ARRAY
  • Fix comma -> colon inside JSON-like literal in table cell
  • Remove trailing space inside backticks in table header
  • Use INT(N, true) notation consistent with LogicalTypes.md

Validation

No semantic/behavioral changes to the format specification. All fixes are documentation-only.

Split from #572 for easier review.

VariantEncoding.md:
- Fix BINARY -> BYTE_ARRAY (BINARY is not a Parquet physical type)
- Add note on decimal little-endian vs big-endian difference
- Fix decimal implied-precision formula for val <= 0
- Label undocumented reserved bits in metadata/object/array headers
- Make sorted_strings description consistent across three definitions
- Use INT(N, true) notation consistent with LogicalTypes.md
- Hyphenate compound adjectives ("3 byte" -> "3-byte", etc.)

VariantShredding.md:
- Fix Python syntax error: iterating dict yields keys only;
  add .items() for (name, field) unpacking
- Replace BINARY with BYTE_ARRAY
- Fix comma -> colon inside JSON-like literal in table cell
- Remove trailing space inside backticks in table header
- Use INT(N, true) notation consistent with LogicalTypes.md
Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @iemejia

I left some comments for review

Comment thread VariantEncoding.md
`sorted_strings` is a 1-bit value indicating whether dictionary strings are sorted and unique.
`offset_size_minus_one` is a 2-bit value providing the number of bytes per dictionary size and offset field.
The actual number of bytes, `offset_size`, is `offset_size_minus_one + 1`.
Bit 5 (marked `R`) is reserved; it must be set to 0 by writers and ignored by readers.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

THis "must be set to 0 by writers" seems to me to change the spec. Previously it was not specified. I think we should leave it as unspecified -- it would be nice to clarify it should be ignored by readers

Suggested change
Bit 5 (marked `R`) is reserved; it must be set to 0 by writers and ignored by readers.
Bit 5 (marked `R`) is reserved; it must be ignored by readers.

Comment thread VariantEncoding.md
The actual number of bytes is computed as `field_offset_size_minus_one + 1` and `field_id_size_minus_one + 1`.
`is_large` is a 1-bit value that indicates how many bytes are used to encode the number of elements.
If `is_large` is `0`, 1 byte is used, and if `is_large` is `1`, 4 bytes are used.
Bit 5 (marked `R`) is reserved; it must be set to 0 by writers and ignored by readers.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as above -- I don't think we should mandate setting to 0

Comment thread VariantEncoding.md
The actual number of bytes is computed as `field_offset_size_minus_one + 1`.
`is_large` is a 1-bit value that indicates how many bytes are used to encode the number of elements.
If `is_large` is `0`, 1 byte is used, and if `is_large` is `1`, 4 bytes are used.
Bits 5-3 (marked `RRR`) are reserved; they must be set to 0 by writers and ignored by readers.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

Comment thread VariantEncoding.md
| Float | float | `14` | FLOAT | IEEE little-endian |
| Binary | binary | `15` | BINARY | 4 byte little-endian size, followed by bytes |
| String | string | `16` | STRING | 4 byte little-endian size, followed by UTF-8 encoded bytes |
| Binary | binary | `15` | BYTE_ARRAY | 4-byte little-endian size, followed by bytes |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread VariantShredding.md
| boolean | BOOLEAN | |
| int8 | INT32 | INT(8, signed=true) |
| int16 | INT32 | INT(16, signed=true) |
| int8 | INT32 | INT(8, true) |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found the signed=true easier to understand to be honest, but this is technically accurate too

Comment thread VariantShredding.md
object_fields = {
name: construct_variant(metadata, field.value, field.typed_value)
for (name, field) in typed_value
for (name, field) in typed_value.items()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change? Does the original not work? Did you test that this still works after the proposed change?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants