Skip to content

Make BPE/WordPiece training deterministic#2066

Open
ATOM00blue wants to merge 1 commit into
huggingface:mainfrom
ATOM00blue:fix/deterministic-bpe-training
Open

Make BPE/WordPiece training deterministic#2066
ATOM00blue wants to merge 1 commit into
huggingface:mainfrom
ATOM00blue:fix/deterministic-bpe-training

Conversation

@ATOM00blue
Copy link
Copy Markdown

Fixes #1794

Problem

Training a BPE or WordPiece tokenizer on the same data can produce a different vocabulary from one run to another, which in turn changes how text is tokenized. The reproducer from #1794 (a WordPieceTrainer with no dropout) yields two different tokenizations across runs.

This is not caused by dropout or a missing seed: there is no randomness anywhere in the training algorithm. The non-determinism comes from hash-map iteration order.

Root cause

In BpeTrainer::tokenize_words, the continuing-subword tokens (e.g. the ##-prefixed tokens used by WordPiece) are created lazily while iterating over the word counts:

for (word, count) in wc {        // wc is an AHashMap
    ...
    if !w2id.contains_key(&CompactString::from(&s)) {
        id2w.push(...);          // id assigned in iteration order
        w2id.insert(...);
    }
}

Because wc is a hash map, the iteration order changes between runs, so these subword tokens get assigned different ids each time. Those ids then feed into the merge tie-breaking (Merge::cmp compares the pair ids when counts are equal), so on count ties the trainer can even pick different merges, producing genuinely different tokenizations.

The base alphabet does not have this problem because compute_alphabet already sorts before assigning ids; only the subword tokens created in tokenize_words were left in hash order.

Fix

Iterate over the word counts in a stable, sorted order before assigning ids, mirroring the sorting already done in compute_alphabet. Training has no other source of non-determinism, so this makes the produced vocabulary fully reproducible.

Test

Added test_train_is_deterministic, which trains on a fixed corpus with a ## continuing-subword prefix and asserts the exact produced vocabulary. It fails on main (the subword token ids vary with the hash seed) and passes with this change.

cargo test --lib models::bpe::trainer

All existing trainer/model tests and the training integration tests continue to pass.


Developed with AI coding assistance; reviewed and submitted by the author.

The BPE trainer creates the continuing-subword tokens lazily while
iterating over the word counts in `tokenize_words`. Since the word
counts are stored in a hash map, the iteration order varied from run to
run, so these subword tokens were assigned different ids each time. This
made the trained vocabulary (and therefore the resulting tokenizer)
non-deterministic even though no part of training uses randomness.

Iterate over the word counts in a stable, sorted order before assigning
ids, mirroring the existing sorting done in `compute_alphabet`. Add a
test that trains on a fixed corpus and checks the produced vocabulary
against the expected one.
Copilot AI review requested due to automatic review settings May 22, 2026 02:48
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR makes BPE training deterministic by ensuring word-count iteration happens in a stable order, preventing run-to-run vocabulary/id differences caused by hash map iteration order.

Changes:

  • Sort wc entries before creating Word/count vectors to make token id assignment deterministic.
  • Add a regression test asserting consistent vocab/id output for a representative dataset (see #1794).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +379 to +382
let mut sorted_words = wc.iter().collect::<Vec<_>>();
sorted_words.sort_unstable_by(|a, b| a.0.cmp(b.0));

for (word, count) in sorted_words {
Comment on lines +878 to +885
fn test_train_is_deterministic() {
// Training on the same data must always produce the exact same
// vocabulary. The continuing-subword tokens (here prefixed with `##`)
// are created on the fly while iterating over the word counts. Iterating
// the word counts in a non-deterministic (hash map) order used to assign
// those tokens different ids from one run to the next, which made the
// trained tokenizer non-deterministic. See #1794.
let word_counts: AHashMap<CompactString, u64> = [
Comment on lines +899 to +905
let trainer = BpeTrainer::builder()
.show_progress(false)
.continuing_subword_prefix("##".into())
.build();
let mut model = BPE::default();
trainer.do_train(&word_counts, &mut model).unwrap();
let trained_vocab: AHashMap<String, u32> = model.get_vocab().into_iter().collect();
Comment on lines +964 to +965
assert_eq!(trained_vocab, expected_vocab);
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

WordPieceTrainer.train_from_iterator is not deterministic

2 participants