Skip to content

fix(bpe): widen pair_counts from i32 to i64 to prevent overflow on large corpora#2059

Open
xodn348 wants to merge 1 commit into
huggingface:mainfrom
xodn348:fix/bpe-trainer-i32-overflow
Open

fix(bpe): widen pair_counts from i32 to i64 to prevent overflow on large corpora#2059
xodn348 wants to merge 1 commit into
huggingface:mainfrom
xodn348:fix/bpe-trainer-i32-overflow

Conversation

@xodn348
Copy link
Copy Markdown

@xodn348 xodn348 commented May 17, 2026

Summary

BpeTrainer accumulates pair counts in an AHashMap<Pair, i32>. On large
corpora — especially code corpora where indentation means a space-space pair
can appear billions of times — the i32 counter wraps to a negative value.
The wrapped pair is then excluded from the merge heap even though it is the
single most frequent pair in the vocabulary, producing completely wrong merge
ordering.

For example, training on a corpus of 30M lines each containing 80 leading
spaces yields ~2.37 billion occurrences of the (Ġ, Ġ) pair, just above
i32::MAX = 2,147,483,647. With i32, the counter wraps and the space-space
merge is silently skipped; with i64 (max ~9.2×10¹⁸) the merge is correctly
chosen first.

The fix is minimal: widen the accumulator type from i32 to i64 in
count_pairs and the per-word update loop in do_train. No other logic
changes.

Issue

Fixes #2058

Local verification

$ cd tokenizers/tokenizers

$ cargo fmt -- --check
# no output → format clean

$ cargo clippy --all-targets --all-features -- -D warnings
   ...
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 19.92s
# no warnings

$ cargo test --lib 2>&1 | tail -5
test models::unigram::trainer::tests::test_unk_token ... ok
test pre_tokenizers::punctuation::tests::deserialization_erroneous - should panic ... ok

test result: ok. 200 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.12s
=== LOCAL_TEST_PASSED ===

Risk

The change is a strict type widening: i32 → i64. All arithmetic,
comparisons, and casts in the affected path remain valid (positive counts
cast to u64 correctly; the -1/+1 change deltas from word.merge() are
widened with change as i64, which is lossless). The only observable
behaviour change is that pair counts no longer wrap on corpora with >2.1B
occurrences of a single pair. Existing behaviour on corpora that fit in i32
is unchanged.

…rge corpora

On corpora where any token pair appears more than i32::MAX (~2.1B) times,
the pair-count map wraps to a negative value and that pair is silently
dropped from the merge queue even though it is the most frequent pair.
Code corpora are especially susceptible because a run of spaces can push
the (space, space) pair count well past the threshold.

Widen the accumulator type from i32 to i64 (max ~9.2e18) throughout
count_pairs and the per-word update in do_train:
  - AHashMap<Pair, i32>  →  AHashMap<Pair, i64>
  - counts[i] as i32     →  counts[i] as i64
  - change * counts[iw] as i32 → (change as i64) * counts[iw] as i64

No other behaviour changes; all 200 lib unit tests pass, rustfmt clean,
clippy -D warnings clean.

Fixes huggingface#2058
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BPE trainer: i32 overflow in pair_counts corrupts merge ordering on large code corpora

1 participant