fix(bpe): widen pair_counts from i32 to i64 to prevent overflow on large corpora#2059
Open
xodn348 wants to merge 1 commit into
Open
fix(bpe): widen pair_counts from i32 to i64 to prevent overflow on large corpora#2059xodn348 wants to merge 1 commit into
xodn348 wants to merge 1 commit into
Conversation
…rge corpora On corpora where any token pair appears more than i32::MAX (~2.1B) times, the pair-count map wraps to a negative value and that pair is silently dropped from the merge queue even though it is the most frequent pair. Code corpora are especially susceptible because a run of spaces can push the (space, space) pair count well past the threshold. Widen the accumulator type from i32 to i64 (max ~9.2e18) throughout count_pairs and the per-word update in do_train: - AHashMap<Pair, i32> → AHashMap<Pair, i64> - counts[i] as i32 → counts[i] as i64 - change * counts[iw] as i32 → (change as i64) * counts[iw] as i64 No other behaviour changes; all 200 lib unit tests pass, rustfmt clean, clippy -D warnings clean. Fixes huggingface#2058
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
BpeTraineraccumulates pair counts in anAHashMap<Pair, i32>. On largecorpora — especially code corpora where indentation means a space-space pair
can appear billions of times — the
i32counter wraps to a negative value.The wrapped pair is then excluded from the merge heap even though it is the
single most frequent pair in the vocabulary, producing completely wrong merge
ordering.
For example, training on a corpus of 30M lines each containing 80 leading
spaces yields ~2.37 billion occurrences of the
(Ġ, Ġ)pair, just abovei32::MAX = 2,147,483,647. Withi32, the counter wraps and the space-spacemerge is silently skipped; with
i64(max ~9.2×10¹⁸) the merge is correctlychosen first.
The fix is minimal: widen the accumulator type from
i32toi64incount_pairsand the per-word update loop indo_train. No other logicchanges.
Issue
Fixes #2058
Local verification
Risk
The change is a strict type widening:
i32 → i64. All arithmetic,comparisons, and casts in the affected path remain valid (positive counts
cast to
u64correctly; the-1/+1change deltas fromword.merge()arewidened with
change as i64, which is lossless). The only observablebehaviour change is that pair counts no longer wrap on corpora with >2.1B
occurrences of a single pair. Existing behaviour on corpora that fit in
i32is unchanged.