Skip to content

Fix errors and grammar in BloomFilter.md and PageIndex.md#577

Open
iemejia wants to merge 1 commit into
apache:masterfrom
iemejia:fix/bloom-filter-page-index-docs
Open

Fix errors and grammar in BloomFilter.md and PageIndex.md#577
iemejia wants to merge 1 commit into
apache:masterfrom
iemejia:fix/bloom-filter-page-index-docs

Conversation

@iemejia
Copy link
Copy Markdown
Member

@iemejia iemejia commented Jun 2, 2026

Summary

Fix pseudocode bugs, missing fields, and grammar issues in the Bloom filter and Page index documentation.

Changes

BloomFilter.md

  • Fix block_check pseudocode: setBit -> isSet (checking bits, not setting them)
  • Fix struct name to match thrift (BloomFilterHeader)
  • Include missing bloom_filter_length field in ColumnMetaData snippet
  • Normalize "Bloom filter" casing (proper noun) throughout
  • "64 bits version" -> "the 64-bit version"
  • Normalize "bit set" -> "bitset" (consistent with thrift)
  • Add terminal periods in thrift-style doc comments

PageIndex.md

  • Fix "Blart Versenwald III" double-quote typo
  • Fix heading capitalization: "page index" -> "Page Index"
  • Fix article: "one data page per the retrieved column" -> "per retrieved column"
  • Add missing terminal period after parquet.thrift link

Validation

No semantic/behavioral changes to the format specification. All fixes are documentation-only.

Split from #572 for easier review.

BloomFilter.md:
- Fix block_check pseudocode: setBit -> isSet (checking, not setting)
- Fix struct name to match thrift (BloomFilterHeader)
- Include missing bloom_filter_length field in ColumnMetaData snippet
- Normalize "Bloom filter" casing (proper noun) throughout
- "64 bits version" -> "the 64-bit version"
- Normalize "bit set" -> "bitset" (consistent with thrift)
- Add terminal periods in thrift-style doc comments

PageIndex.md:
- Fix "Blart Versenwald III" double-quote typo
- Fix heading capitalization: "page index" -> "Page Index"
- Fix article: "one data page per the retrieved column" -> "per retrieved column"
- Add missing terminal period after parquet.thrift link
Copy link
Copy Markdown
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks!

Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me -- thanks @etseidl and @iemejia

Comment thread BloomFilter.md
for j in [0..31] {
if (masked.getWord(i).isSet(j)) {
if (not b.getWord(i).setBit(j)) {
if (not b.getWord(i).isSet(j)) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is a nice find

Comment thread BloomFilter.md
**/
struct BloomFilterPageHeader {
/** The size of bitset in bytes **/
struct BloomFilterHeader {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I verified this matches what is in

struct BloomFilterHeader {

Comment thread BloomFilter.md
...
/** Byte offset from beginning of file to Bloom filter data. **/
14: optional i64 bloom_filter_offset;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified it is in

/** Size of Bloom filter data including the serialized header, in bytes.
* Added in 2.10 so readers may not read this field from old files and
* it can be obtained after the BloomFilterHeader has been deserialized.
* Writers should write this field so readers can read the bloom filter
* in a single I/O.
*/
15: optional i32 bloom_filter_length;

Comment thread BloomFilter.md
struct BloomFilterPageHeader {
/** The size of bitset in bytes **/
struct BloomFilterHeader {
/** The size of bitset in bytes. **/
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any chance you can update parquet.thrift to match these changes?

struct BloomFilterHeader {
/** The size of bitset in bytes **/
1: required i32 numBytes;
/** The algorithm for setting bits. **/
2: required BloomFilterAlgorithm algorithm;
/** The hash function used for Bloom filter. **/
3: required BloomFilterHash hash;
/** The compression used in the Bloom filter **/
4: required BloomFilterCompression compression;

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants