fix(bpe): widen pair_counts from i32 to i64 to prevent overflow on large corpora by xodn348 · Pull Request #2059 · huggingface/tokenizers

xodn348 · 2026-05-17T11:21:19Z

Summary

BpeTrainer accumulates pair counts in an AHashMap<Pair, i32>. On large
corpora — especially code corpora where indentation means a space-space pair
can appear billions of times — the i32 counter wraps to a negative value.
The wrapped pair is then excluded from the merge heap even though it is the
single most frequent pair in the vocabulary, producing completely wrong merge
ordering.

For example, training on a corpus of 30M lines each containing 80 leading
spaces yields ~2.37 billion occurrences of the (Ġ, Ġ) pair, just above
i32::MAX = 2,147,483,647. With i32, the counter wraps and the space-space
merge is silently skipped; with i64 (max ~9.2×10¹⁸) the merge is correctly
chosen first.

The fix is minimal: widen the accumulator type from i32 to i64 in
count_pairs and the per-word update loop in do_train. No other logic
changes.

Issue

Fixes #2058

Local verification

$ cd tokenizers/tokenizers

$ cargo fmt -- --check
# no output → format clean

$ cargo clippy --all-targets --all-features -- -D warnings
   ...
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 19.92s
# no warnings

$ cargo test --lib 2>&1 | tail -5
test models::unigram::trainer::tests::test_unk_token ... ok
test pre_tokenizers::punctuation::tests::deserialization_erroneous - should panic ... ok

test result: ok. 200 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.12s
=== LOCAL_TEST_PASSED ===

Risk

The change is a strict type widening: i32 → i64. All arithmetic,
comparisons, and casts in the affected path remain valid (positive counts
cast to u64 correctly; the -1/+1 change deltas from word.merge() are
widened with change as i64, which is lossless). The only observable
behaviour change is that pair counts no longer wrap on corpora with >2.1B
occurrences of a single pair. Existing behaviour on corpora that fit in i32
is unchanged.

…rge corpora On corpora where any token pair appears more than i32::MAX (~2.1B) times, the pair-count map wraps to a negative value and that pair is silently dropped from the merge queue even though it is the most frequent pair. Code corpora are especially susceptible because a run of spaces can push the (space, space) pair count well past the threshold. Widen the accumulator type from i32 to i64 (max ~9.2e18) throughout count_pairs and the per-word update in do_train: - AHashMap<Pair, i32> → AHashMap<Pair, i64> - counts[i] as i32 → counts[i] as i64 - change * counts[iw] as i32 → (change as i64) * counts[iw] as i64 No other behaviour changes; all 200 lib unit tests pass, rustfmt clean, clippy -D warnings clean. Fixes huggingface#2058

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(bpe): widen pair_counts from i32 to i64 to prevent overflow on large corpora#2059

fix(bpe): widen pair_counts from i32 to i64 to prevent overflow on large corpora#2059
xodn348 wants to merge 1 commit into
huggingface:mainfrom
xodn348:fix/bpe-trainer-i32-overflow

xodn348 commented May 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

xodn348 commented May 17, 2026

Summary

Issue

Local verification

Risk

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant