Make BPE/WordPiece training deterministic#2066
Open
ATOM00blue wants to merge 1 commit into
Open
Conversation
The BPE trainer creates the continuing-subword tokens lazily while iterating over the word counts in `tokenize_words`. Since the word counts are stored in a hash map, the iteration order varied from run to run, so these subword tokens were assigned different ids each time. This made the trained vocabulary (and therefore the resulting tokenizer) non-deterministic even though no part of training uses randomness. Iterate over the word counts in a stable, sorted order before assigning ids, mirroring the existing sorting done in `compute_alphabet`. Add a test that trains on a fixed corpus and checks the produced vocabulary against the expected one.
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR makes BPE training deterministic by ensuring word-count iteration happens in a stable order, preventing run-to-run vocabulary/id differences caused by hash map iteration order.
Changes:
- Sort
wcentries before creatingWord/count vectors to make token id assignment deterministic. - Add a regression test asserting consistent vocab/id output for a representative dataset (see #1794).
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+379
to
+382
| let mut sorted_words = wc.iter().collect::<Vec<_>>(); | ||
| sorted_words.sort_unstable_by(|a, b| a.0.cmp(b.0)); | ||
|
|
||
| for (word, count) in sorted_words { |
Comment on lines
+878
to
+885
| fn test_train_is_deterministic() { | ||
| // Training on the same data must always produce the exact same | ||
| // vocabulary. The continuing-subword tokens (here prefixed with `##`) | ||
| // are created on the fly while iterating over the word counts. Iterating | ||
| // the word counts in a non-deterministic (hash map) order used to assign | ||
| // those tokens different ids from one run to the next, which made the | ||
| // trained tokenizer non-deterministic. See #1794. | ||
| let word_counts: AHashMap<CompactString, u64> = [ |
Comment on lines
+899
to
+905
| let trainer = BpeTrainer::builder() | ||
| .show_progress(false) | ||
| .continuing_subword_prefix("##".into()) | ||
| .build(); | ||
| let mut model = BPE::default(); | ||
| trainer.do_train(&word_counts, &mut model).unwrap(); | ||
| let trained_vocab: AHashMap<String, u32> = model.get_vocab().into_iter().collect(); |
Comment on lines
+964
to
+965
| assert_eq!(trained_vocab, expected_vocab); | ||
| } |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #1794
Problem
Training a BPE or WordPiece tokenizer on the same data can produce a different vocabulary from one run to another, which in turn changes how text is tokenized. The reproducer from #1794 (a
WordPieceTrainerwith no dropout) yields two different tokenizations across runs.This is not caused by dropout or a missing seed: there is no randomness anywhere in the training algorithm. The non-determinism comes from hash-map iteration order.
Root cause
In
BpeTrainer::tokenize_words, the continuing-subword tokens (e.g. the##-prefixed tokens used by WordPiece) are created lazily while iterating over the word counts:Because
wcis a hash map, the iteration order changes between runs, so these subword tokens get assigned different ids each time. Those ids then feed into the merge tie-breaking (Merge::cmpcompares the pair ids when counts are equal), so on count ties the trainer can even pick different merges, producing genuinely different tokenizations.The base alphabet does not have this problem because
compute_alphabetalready sorts before assigning ids; only the subword tokens created intokenize_wordswere left in hash order.Fix
Iterate over the word counts in a stable, sorted order before assigning ids, mirroring the sorting already done in
compute_alphabet. Training has no other source of non-determinism, so this makes the produced vocabulary fully reproducible.Test
Added
test_train_is_deterministic, which trains on a fixed corpus with a##continuing-subword prefix and asserts the exact produced vocabulary. It fails onmain(the subword token ids vary with the hash seed) and passes with this change.All existing trainer/model tests and the
trainingintegration tests continue to pass.Developed with AI coding assistance; reviewed and submitted by the author.