Skip to content

Fix specification typos, grammar, inconsistencies, and errors#572

Closed
iemejia wants to merge 4 commits into
apache:masterfrom
iemejia:master
Closed

Fix specification typos, grammar, inconsistencies, and errors#572
iemejia wants to merge 4 commits into
apache:masterfrom
iemejia:master

Conversation

@iemejia
Copy link
Copy Markdown
Member

@iemejia iemejia commented Jun 2, 2026

Summary

This PR fixes numerous typos, grammar issues, inconsistencies, and minor errors across the Parquet format specification documents. The changes span 13 files with 4 focused cleanup commits.

Changes

Commit 1: Fix specification inconsistencies, typos, and errors

  • BloomFilter.md: Fix block_check pseudocode (setBit -> isSet); fix struct name to match thrift
  • parquet.thrift: Fix typos ("to be be", "documention", "not necessary"); remove off-by-one in DataPageHeaderV2 comment
  • README.md: Fix repetition level value for non-nested columns (1 -> 0); update defunct Twitter CoC links to ASF
  • LogicalTypes.md: Fix embedded types ordering contradiction; add nanosecond to TIME precision
  • VariantEncoding.md: Fix BINARY -> BYTE_ARRAY; add decimal endianness note
  • Compression.md: Fix ZSTD RFC reference (8478 -> 8878)
  • Encryption.md: Fix double-negative; align GCM invocation limit to NIST
  • Encodings.md: Remove misleading "always preferred" claim for DELTA_LENGTH_BYTE_ARRAY

Commit 2: Fix more specification inconsistencies and clarify ambiguous descriptions

  • PageIndex.md, parquet.thrift: Fix double-quote typo
  • VariantShredding.md: Fix Python syntax error; replace BINARY with BYTE_ARRAY
  • BloomFilter.md: Include missing bloom_filter_length field
  • Encodings.md: "bitwidth of each block" -> "each miniblock"
  • LogicalTypes.md: Align DECIMAL precision/scale wording with thrift
  • Geospatial.md: Use uppercase edge-interpolation algorithm names to match thrift enum
  • VariantEncoding.md: Label undocumented reserved bits; fix decimal implied-precision formula

Commit 3: Fix additional typos, grammar, invalid HTML, and consistency issues (28 fixes)

  • CONTRIBUTING.md: 7 typos (docuemnt, interopability, libaries, etc.)
  • Encryption.md: 6 fixes (plural agreement, explictly, smart quotes, double spaces)
  • LogicalTypes.md: 7 fixes (invalid <tr colspan=3>, NaN casing, grammar)
  • parquet.thrift: 4 fixes (article agreement, terminal periods, BIT_PACKED comment)
  • Encodings.md, Compression.md, PageIndex.md, VariantEncoding.md: Minor fixes

Commit 4: Fix additional typos, grammar, hyphenation, and consistency issues (52 fixes)

  • parquet.thrift: Article agreement, edge interpolation, proper noun capitalization
  • Geospatial.md: Compound adjectives, comma splices, heading formatting
  • LogicalTypes.md: Grammar, Oxford commas, "can not" -> "cannot"
  • README.md: Plural agreement, compound adjective hyphenation, proper nouns
  • BinaryProtocolExtensions.md: FileMetaData casing, FlatBuffers capitalization
  • Encodings.md, Compression.md, Encryption.md, BloomFilter.md, PageIndex.md, VariantShredding.md, VariantEncoding.md: Various grammar, punctuation, and consistency fixes

Validation

  • Thrift definition compiles cleanly after all changes
  • No semantic/behavioral changes to the format specification
  • All fixes are documentation-only (typos, grammar, consistency, correctness of descriptions)

iemejia added 4 commits June 2, 2026 17:08
- BloomFilter.md: Fix block_check pseudocode (setBit -> isSet)
- BloomFilter.md: Fix struct name to match thrift (BloomFilterHeader)
- parquet.thrift: Fix typos ("to be be", "documention", "not necessary")
- parquet.thrift: Remove off-by-one in DataPageHeaderV2 is_compressed comment
- README.md: Fix repetition level value for non-nested columns (1 -> 0)
- README.md: Update defunct Twitter Code of Conduct links to ASF
- LogicalTypes.md: Fix embedded types ordering contradiction
- LogicalTypes.md: Add nanosecond to TIME precision description
- VariantEncoding.md: Fix BINARY -> BYTE_ARRAY for equivalent Parquet type
- VariantEncoding.md: Add note on decimal little-endian vs big-endian difference
- Compression.md: Fix ZSTD RFC reference (8478 -> 8878)
- Encryption.md: Fix double-negative and align GCM invocation limit to NIST
- Encodings.md: Remove misleading "always preferred" claim for DELTA_LENGTH_BYTE_ARRAY
…ions

Round-2 cleanup pass over the spec docs. Highlights:

Bugs / clear errors:
- PageIndex.md, parquet.thrift: fix ""Blart Versenwald III" double-quote typo
- VariantShredding.md: fix Python syntax error (`: Variant:` -> `-> Variant:`)
- VariantShredding.md: replace BINARY with BYTE_ARRAY (BINARY is not a Parquet
  physical type); fix `,` -> `:` inside a JSON-like literal in a table cell
- BloomFilter.md: include missing `bloom_filter_length` field in the
  ColumnMetaData snippet (it exists in parquet.thrift)
- Encodings.md: "bitwidth of each block" -> "each miniblock" in
  DELTA_BINARY_PACKED description
- README.md: add missing colon and code formatting in BIT_PACKED/RLE sentence
- LogicalTypes.md: fix TIME unit list punctuation, "Despite there is" grammar

Cross-document consistency:
- LogicalTypes.md: align DECIMAL precision/scale wording with parquet.thrift
  ("should" -> "must")
- Geospatial.md: use uppercase edge-interpolation algorithm names
  (SPHERICAL/VINCENTY/...) to match parquet.thrift enum and LogicalTypes.md
- Geospatial.md: make srid: prefix consistent (lowercase example)
- VariantEncoding.md, VariantShredding.md: use `INT(N, true)` notation
  consistent with LogicalTypes.md syntax
- VariantEncoding.md: make sorted_strings description consistent across the
  three places it is defined
- Encodings.md: PLAIN BOOLEAN no longer links to the deprecated MSB-first
  BITPACKED section; references the RLE/bit-packing hybrid section instead
- parquet.thrift: disambiguate `compressed_page_size` comment in PageLocation
  (it includes the header; the field of the same name on PageHeader does not)

Coherence/clarity:
- VariantEncoding.md: label undocumented reserved bits in metadata header,
  object value_header, and array value_header diagrams
- VariantEncoding.md: fix decimal implied-precision formula for val <= 0
- Encodings.md: BIT_PACKED is already replaced, not "will be replaced"
- Encryption.md: replace "allows to" with idiomatic English
- Encryption.md: "Data PageHeader" -> "Data Page Header" (spacing)
- BloomFilter.md: "64 bits version" -> "the 64-bit version"
- Geospatial.md: fix XYZM table column alignment
Round 3 of specification cleanup. 28 minor fixes across 8 files:

CONTRIBUTING.md (7 typos):
  docuemnt, an prototype, demostrate, interopability,
  libaries, highlighed, compatiblity

Encryption.md (6):
  - 'reflects the identity' -> 'reflect the identity' (plural subject)
  - explictly -> explicitly
  - Data/Dictionary PageHeader -> Data/Dictionary Page Header (spacing,
    same as previous fix in section 4.1, second occurrence in section 4.13)
  - Removed double space after 'right after'
  - 'the the FileMetaData' -> 'the FileMetaData'
  - Smart quotes 'PAR1' -> ASCII "PAR1" for magic-bytes literal

Encodings.md (1):
  'at at time' -> 'at a time'

PageIndex.md (1):
  Added missing terminal period after parquet.thrift link

parquet.thrift (4):
  - 'a element' -> 'an element' (DataPageHeaderV2.num_nulls comment)
  - Terminal periods after 'It was never used' and 'use PLAIN instead'
  - Rewrote BIT_PACKED comment to clarify it is superseded by RLE and
    cross-reference Encodings.md

Compression.md (1):
  Removed double space before [Brotli] link

LogicalTypes.md (7):
  - Terminal periods after INT(8, true) and INT(8, false) paragraphs
  - Removed invalid <tr colspan=3> from three logical-type tables
    (<tr> does not accept colspan; the intended colspan is already
    on the <th> header row)
  - Made 'precision' consistent with backticked identifier style
  - 'NANs' -> 'NaNs'
  - 'Despite there is no' -> 'Although there is no'
    (same fix as round 2 in a different paragraph)
  - 'In case of not present' -> 'If not present'

VariantEncoding.md (1):
  Hyphenated '1 byte', '2 byte', '4 byte', '8 byte' to '1-byte', etc.
  in primitive-types table (character-count-neutral edit, column widths
  preserved)
Round 4 of specification cleanup. 52 minor fixes across 13 files.

parquet.thrift (6):
  - L676: 'a OffsetIndex' -> 'an OffsetIndex'
  - L427/446/452: 'edges interpolation' -> 'edge interpolation'
    (Geospatial doc comments; align with thrift enum/struct naming)
  - L44: 'frameworks(e.g. hive, pig)' -> 'frameworks (e.g. Hive, Pig)'
    (missing space + capitalize proper nouns)
  - L816: 'GZip' -> 'GZIP' (match Compression.md heading and enum casing)

Geospatial.md (8):
  - L31: Missing space before parenthesis: 'OGC(' -> 'OGC ('
  - L50: 'well known' -> 'well-known' (compound adjective, 2 occurrences)
  - L61: ', and is also commonly used' -> '. It is also commonly used'
    (comma splice between independent clauses)
  - L97: 'Y Values' -> 'Y values' (mid-sentence; sibling 'X values' lowercase)
  - L137: Added 'The' before bullet sentence for grammaticality
  - L157: Markdown heading: bare line -> proper '# Coordinate Axis Order'
  - L72/73: 'edges interpolation' -> 'edge interpolation'

LogicalTypes.md (11):
  - 'to describes' -> 'that describes'
  - DECIMAL scale/precision: 'integer literal annotation' wording made
    precise ('non-negative integer' / 'positive integer')
  - TIME L302-303: 'annotation' -> 'annotations'; 'implementation' ->
    'implementations' (plural agreement)
  - L302/353/360: Oxford comma added before 'or NANOS' (consistency with
    earlier paragraph)
  - L449/460/478/479: 'can not' -> 'cannot' (consistent with rest of repo)
  - L608/623: 'edges interpolation' -> 'edge interpolation'

README.md (8):
  - L191: 'encoded values for the data page is' -> 'are' (plural subject)
  - L193/195: 'it would always' / 'if encoded, it will' -> 'they would' /
    'they will' (refer to plural repetition/definition levels)
  - L137-140: '1 bit' / '32 bit' / '64 bit' / '96 bit' / 'fixed length' ->
    hyphenated forms (compound adjectives)
  - L225: 'GZip' -> 'GZIP' (same as thrift)
  - L240: 'rc or avro files' -> 'RCFile or Avro files' (proper nouns)
  - L243: Removed trailing period from heading
  - L257: 'fine grained' -> 'fine-grained' (compound adjective)

CONTRIBUTING.md (6):
  - L46: Removed extra 'the'; added comma after 'feature'
  - L46: 'features desirability' -> "a feature's desirability"
    (possessive apostrophe)
  - L53: 'After the first two steps are complete' + 'After the vote passes':
    added missing commas after introductory clauses
  - L58: 'an external dependencies' -> 'an external dependency' (agreement)
  - L70: Comma splice between independent clauses -> semicolon
  - L90: Removed trailing period from heading

BinaryProtocolExtensions.md (10):
  - L29/53/56/79: 'FileMetadata' -> 'FileMetaData' (4 occurrences; match
    the canonical thrift struct name used elsewhere in this file)
  - L29: 'implementers which MUST' -> 'implementers who MUST' (who for
    people)
  - L70: 'extension shared publicly' -> 'extension is shared publicly'
    (missing copula)
  - L53/55/72/77/80: 'Flatbuffers' / 'flatbuffer' -> 'FlatBuffers' (5
    occurrences; project's official capitalization, already used twice
    elsewhere in the same file)

Encodings.md (5):
  - L57: 'can not' -> 'cannot'
  - L72: '4 byte little endian' -> '4-byte little endian' (compound
    adjective; preserved 'little endian' as in surrounding text)
  - L134-135: '32 bit register' / '64 bit register' -> '32-bit' / '64-bit'
  - L178: 'max definition level as 3' -> 'max definition level was 3'
    (parallel construction with previous clause)
  - L175: 'RLE/Bit packed described above' -> 'RLE/Bit-Packed described
    above' (match section heading 'Bit-packed (Deprecated)')
  - L235: 'list of bit packed ints' -> 'list of bit-packed ints'

Compression.md (2):
  - L51: 'provided by Google Snappy [library]' -> 'provided by the
    [Snappy compression library]' (match parallel construction used for
    GZIP, BROTLI, ZSTD sibling entries in the same file)
  - L61: Comma splice between independent clauses -> semicolon

Encryption.md (7):
  - L82: Removed double space after 'pages and'
  - L250: Removed double space after 'files'
  - L289: Heading '## 5 File Format' -> '## 5. File Format' (period to
    match sibling headings ## 5.1, ## 5.2, ## 5.3)
  - L256-258: '2 byte short, little endian' -> '2-byte short, little-endian'
    (compound adjectives; 3 lines)
  - L396: 'from a secret data' -> 'from secret data' (mass noun; no
    article)
  - L412: '/** Column metadata for this chunk.. **/' -> 'this chunk' (two
    trailing dots is neither sentence end nor ellipsis)

BloomFilter.md (4):
  - L268-270: 'multi-block bloom filter' / 'bloom filter header' /
    'bloom filter bitset' / 'bloom filter bit set' -> normalized casing
    ('Bloom filter' as a proper noun, consistent with the rest of this
    file and with the thrift comments) and 'bit set' -> 'bitset'
  - L311: Added terminal period in '/** The size of bitset in bytes **/'
  - L317: Added terminal period in '/** The compression used in the
    Bloom filter **/'

PageIndex.md (2):
  - L20: Title-case in heading: 'page index' -> 'Page Index' (proper noun;
    rest of heading already title-cased)
  - L40: 'one data page per the retrieved column' -> 'one data page per
    retrieved column' (article+'per' is ungrammatical)

VariantShredding.md (2):
  - L297: Python bug 'for (name, field) in typed_value' ->
    'for (name, field) in typed_value.items()' (iterating a dict yields
    keys only; the unpacking would fail at runtime — verified with
    'python3 -c' that the original raises ValueError)
  - L151: Removed trailing space inside backticks of table header
    ('`typed_value `' -> '`typed_value`')

VariantEncoding.md (3 lines, 8 hyphenations):
  - L145: '3 byte offsets' -> '3-byte offsets' (compound adjective; the
    same sentence already uses '1-byte', '2-byte', '4-byte' hyphenated)
  - L373-375: 'a 1 or 4 byte' / 'a 1, 2, 3 or 4 byte' -> '1- or 4-byte' /
    '1-, 2-, 3-, or 4-byte' (hyphenated number/unit compounds)
  - L386: '3 byte IDs' -> '3-byte IDs'
  - L387: 'one or four byte value' -> 'one- or four-byte value'

Thrift validation passes after these edits (only pre-existing doctext
warnings on lines 18, 22, 588 remain — unrelated to any fix).
@alamb
Copy link
Copy Markdown
Contributor

alamb commented Jun 2, 2026

Thanks @iemejia - it would help me review this if you could break it into several smaller PRs -- as I think different people have different levels of expertise and I think it will be eaiser to get reviews on smaller proposed changes

For example, I am not sure if the edge --> edges interpolation is technically accurate but I feel bad pinging the geospatial people on such a large PR

@iemejia
Copy link
Copy Markdown
Member Author

iemejia commented Jun 2, 2026

Closing in favor of smaller, concept-focused PRs for easier review:

@iemejia iemejia closed this Jun 2, 2026
Comment thread Compression.md
[Snappy compression format](https://github.com/google/snappy/blob/master/format_description.txt).
If any ambiguity arises when implementing this format, the implementation
provided by Google Snappy [library](https://github.com/google/snappy/)
provided by the [Snappy compression library](https://github.com/google/snappy/)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure Google's credit is being dropped?

Comment thread Encodings.md
The plain encoding is used whenever a more efficient encoding cannot be used. It
stores the data in the following format:
- BOOLEAN: [Bit Packed](#BITPACKED), LSB first
- BOOLEAN: bit-packed, LSB first (using the same packing scheme as the
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks like it drops BITPACKED hyper link, does it not exist in the doc?

Comment thread Encodings.md

Supported Types: BYTE_ARRAY

This encoding is always preferred over PLAIN for byte array columns.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why was this dropped?

Comment thread Encryption.md
key shall not not be used for more than 2^31 (~2 billion) pages. In Parquet files encrypted with
multiple keys (footer and column keys), the constraint on the number of invocations is applied
to each key separately.
key shall not be used for more than 2^32 total module encryptions, as per the NIST specification.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems like a substantive change?

Comment thread Geospatial.md
* `<authority>:<code>` - where `<authority>` represents some well known authorities, and `code` is the code used by the authority to identify the CRS. Examples are - `OGC:CRS84`, `OGC:CRS83`, `OGC:CRS27`, `EPSG:4326`, `EPSG:3857`, `IGNF:ATI`. See [https://spatialreference.org/](https://spatialreference.org/) for definitions of coordinate reference systems provided by some well known authorities.
* `srid:<identifier>` - A reference using a [Spatial reference identifier (SRID)](https://en.wikipedia.org/wiki/Spatial_reference_system#Identifier), where <identifier> is the numeric SRID value. For example: `SRID:0`.
* `<authority>:<code>` - where `<authority>` represents some well-known authorities, and `code` is the code used by the authority to identify the CRS. Examples are - `OGC:CRS84`, `OGC:CRS83`, `OGC:CRS27`, `EPSG:4326`, `EPSG:3857`, `IGNF:ATI`. See [https://spatialreference.org/](https://spatialreference.org/) for definitions of coordinate reference systems provided by some well-known authorities.
* `srid:<identifier>` - A reference using a [Spatial reference identifier (SRID)](https://en.wikipedia.org/wiki/Spatial_reference_system#Identifier), where <identifier> is the numeric SRID value. For example: `srid:0`.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure here whether this is expected to be upper-cased or lower cased.

Comment thread Geospatial.md
* `thomas`: Thomas, Paul D. Spheroidal geodesics, reference systems, & local geometry. US Naval Oceanographic Office, 1970.
* `andoyer`: Thomas, Paul D. Mathematical models for navigation systems. US Naval Oceanographic Office, 1965.
* `karney`: [Karney, Charles FF. "Algorithms for geodesics." Journal of Geodesy 87 (2013): 43-55](https://link.springer.com/content/pdf/10.1007/s00190-012-0578-z.pdf), and [GeographicLib](https://geographiclib.sourceforge.io/)
* `SPHERICAL`: edges are interpolated as geodesics on a sphere.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not clear if capitalization here. it seems like this is a substantive changes.

Comment thread LogicalTypes.md
in the unscaled value.

If not specified, the scale is 0. Scale must be zero or a positive integer less
than or equal to the precision. Precision is required and must be a non-zero positive
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is still good to call out non-zero here even though it is repetive.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants