Make BPE/WordPiece training deterministic by ATOM00blue · Pull Request #2066 · huggingface/tokenizers

ATOM00blue · 2026-05-22T02:48:00Z

Problem

Training a BPE or WordPiece tokenizer on the same data can produce a different vocabulary from one run to another, which in turn changes how text is tokenized. The reproducer from #1794 (a WordPieceTrainer with no dropout) yields two different tokenizations across runs.

This is not caused by dropout or a missing seed: there is no randomness anywhere in the training algorithm. The non-determinism comes from hash-map iteration order.

Root cause

In BpeTrainer::tokenize_words, the continuing-subword tokens (e.g. the ##-prefixed tokens used by WordPiece) are created lazily while iterating over the word counts:

for (word, count) in wc {        // wc is an AHashMap
    ...
    if !w2id.contains_key(&CompactString::from(&s)) {
        id2w.push(...);          // id assigned in iteration order
        w2id.insert(...);
    }
}

Because wc is a hash map, the iteration order changes between runs, so these subword tokens get assigned different ids each time. Those ids then feed into the merge tie-breaking (Merge::cmp compares the pair ids when counts are equal), so on count ties the trainer can even pick different merges, producing genuinely different tokenizations.

The base alphabet does not have this problem because compute_alphabet already sorts before assigning ids; only the subword tokens created in tokenize_words were left in hash order.
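To make the tie-breaking effect concrete, here is a minimal sketch of an ordering that compares counts first and falls back to the pair ids on ties. The `Merge` struct and the `best_pair` helper below are simplified stand-ins written for illustration, not the library's code, and the tie-break direction (the smaller pair wins) is an assumption.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// Simplified stand-in for a merge candidate (hypothetical, for illustration).
#[derive(Debug, PartialEq, Eq)]
struct Merge {
    pair: (u32, u32), // ids of the two tokens to merge
    count: u64,       // how often the pair occurs
}

impl Ord for Merge {
    fn cmp(&self, other: &Self) -> Ordering {
        // Higher count wins; on equal counts the smaller pair of ids wins
        // (assumed direction), so the ids themselves decide ties.
        self.count
            .cmp(&other.count)
            .then_with(|| other.pair.cmp(&self.pair))
    }
}

impl PartialOrd for Merge {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

/// Returns the pair a max-heap would pop first, or `None` if there are no candidates.
fn best_pair(candidates: &[((u32, u32), u64)]) -> Option<(u32, u32)> {
    let heap: BinaryHeap<Merge> = candidates
        .iter()
        .map(|&(pair, count)| Merge { pair, count })
        .collect();
    heap.peek().map(|m| m.pair)
}
```

Suppose `x` has id 1 and the pairs `x ##a` and `x ##b` both occur 3 times. If one run assigns `##a` = 5 and `##b` = 6, the winning pair `(1, 5)` is `x ##a`; if another run swaps those two ids, the same winning pair `(1, 5)` now denotes `x ##b`, so a different merge is learned.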

Fix

Iterate over the word counts in a stable, sorted order before assigning ids, mirroring the sorting already done in compute_alphabet. Training has no other source of non-determinism, so this makes the produced vocabulary fully reproducible.
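The idea can be sketched in isolation as follows. This is a simplified model, not the trainer's code: `assign_subword_ids` is a hypothetical helper, it uses the standard `HashMap` instead of `AHashMap`, and it creates every token lazily, whereas the real trainer takes the base alphabet from `compute_alphabet`.

```rust
use std::collections::HashMap;

/// Hypothetical helper: splits each word into characters, prefixes every
/// non-initial character with `prefix`, and assigns ids in first-seen order.
/// Returns the id-to-token table (the index in the vector is the id).
fn assign_subword_ids(wc: &HashMap<String, u64>, prefix: &str) -> Vec<String> {
    // Sort the entries by word so the loop below no longer depends on hash order.
    // Keys are unique, so an unstable sort gives a fully determined order.
    let mut sorted_words: Vec<(&String, &u64)> = wc.iter().collect();
    sorted_words.sort_unstable_by(|a, b| a.0.cmp(b.0));

    let mut w2id: HashMap<String, u32> = HashMap::new();
    let mut id2w: Vec<String> = Vec::new();
    for (word, _count) in sorted_words {
        for (i, c) in word.chars().enumerate() {
            let token = if i == 0 {
                c.to_string()
            } else {
                format!("{}{}", prefix, c)
            };
            if !w2id.contains_key(&token) {
                // The id is the position at which the token was first seen.
                w2id.insert(token.clone(), id2w.len() as u32);
                id2w.push(token);
            }
        }
    }
    id2w
}
```

For the word counts `{"ba": 1, "ab": 2}` and the prefix `##`, the sorted order visits `ab` before `ba`, so the tokens are always created in the order `a`, `##b`, `b`, `##a`, whatever order the map itself iterates in.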

Test

Added test_train_is_deterministic, which trains on a fixed corpus with a ## continuing-subword prefix and asserts the exact produced vocabulary. It fails on main (the subword token ids vary with the hash seed) and passes with this change.
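A complementary way to probe the same property, separate from the exact-vocabulary assertion used in the PR, is to rebuild the word counts several times and check that every run yields the same order. The sketch below is illustrative only: all names are hypothetical, and it relies on the standard `HashMap`, where each freshly built map typically gets its own hash seed, rather than on `AHashMap`.

```rust
use std::collections::HashMap;

/// Builds a fresh map; each call creates a new `HashMap` with its own hasher state.
fn build_counts(words: &[(&str, u64)]) -> HashMap<String, u64> {
    words.iter().map(|&(w, c)| (w.to_string(), c)).collect()
}

/// Pre-fix behaviour, simplified: ids follow raw hash-map iteration order.
fn order_by_hash(wc: &HashMap<String, u64>) -> Vec<String> {
    wc.keys().cloned().collect()
}

/// Post-fix behaviour, simplified: ids follow sorted key order.
fn order_sorted(wc: &HashMap<String, u64>) -> Vec<String> {
    let mut keys: Vec<String> = wc.keys().cloned().collect();
    keys.sort_unstable();
    keys
}

/// Rebuilds the counts `runs` times and reports whether `assign` always
/// produced the same order as on the first run.
fn all_runs_agree<F>(words: &[(&str, u64)], runs: usize, assign: F) -> bool
where
    F: Fn(&HashMap<String, u64>) -> Vec<String>,
{
    let first = assign(&build_counts(words));
    (1..runs).all(|_| assign(&build_counts(words)) == first)
}
```

Calling `all_runs_agree(&words, 50, order_sorted)` should always return `true`. The same call with `order_by_hash` may return `false`, but that outcome depends on the hash seeds, which is why a regression test should assert only on the sorted path.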

cargo test --lib models::bpe::trainer

All existing trainer/model tests and the training integration tests continue to pass.

Developed with AI coding assistance; reviewed and submitted by the author.

The BPE trainer creates the continuing-subword tokens lazily while iterating over the word counts in `tokenize_words`. Since the word counts are stored in a hash map, the iteration order varied from run to run, so these subword tokens were assigned different ids each time. This made the trained vocabulary (and therefore the resulting tokenizer) non-deterministic even though no part of training uses randomness. Iterate over the word counts in a stable, sorted order before assigning ids, mirroring the existing sorting done in `compute_alphabet`. Add a test that trains on a fixed corpus and checks the produced vocabulary against the expected one.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR makes BPE training deterministic by ensuring word-count iteration happens in a stable order, preventing run-to-run vocabulary/id differences caused by hash map iteration order.

Changes:

- Sort `wc` entries before creating Word/count vectors to make token id assignment deterministic.
- Add a regression test asserting consistent vocab/id output for a representative dataset (see #1794).


+        let mut sorted_words = wc.iter().collect::<Vec<_>>();
+        sorted_words.sort_unstable_by(|a, b| a.0.cmp(b.0));
+
+        for (word, count) in sorted_words {


+    fn test_train_is_deterministic() {
+        // Training on the same data must always produce the exact same
+        // vocabulary. The continuing-subword tokens (here prefixed with `##`)
+        // are created on the fly while iterating over the word counts. Iterating
+        // the word counts in a non-deterministic (hash map) order used to assign
+        // those tokens different ids from one run to the next, which made the
+        // trained tokenizer non-deterministic. See #1794.
+        let word_counts: AHashMap<CompactString, u64> = [


+        let trainer = BpeTrainer::builder()
+            .show_progress(false)
+            .continuing_subword_prefix("##".into())
+            .build();
+        let mut model = BPE::default();
+        trainer.do_train(&word_counts, &mut model).unwrap();
+        let trained_vocab: AHashMap<String, u32> = model.get_vocab().into_iter().collect();


+        assert_eq!(trained_vocab, expected_vocab);
+    }


Copilot AI review requested due to automatic review settings May 22, 2026 02:48

Copilot AI reviewed May 22, 2026

ATOM00blue wants to merge 1 commit into huggingface:main from ATOM00blue:fix/deterministic-bpe-training
