Document that train_new_from_iterator uses BPE for WordPiece by adityasingh2400 · Pull Request #2065 · huggingface/tokenizers

adityasingh2400 · 2026-05-21T23:31:31Z

Summary

Closes #1777.

Users following the HF LLM course call train_new_from_iterator() on a WordPiece tokenizer and silently get a BPE-trained vocabulary, because the library uses BPE internally for that path. @ArthurZucker confirmed in #1777 that this should be documented.

This PR adds the clarification in two places:

- docs/source-doc-builder/components.mdx: a <Tip warning> callout under the Models section pointing readers to WordPieceTrainer + tokenizer.train(...) when they actually want WordPiece.
- bindings/python/src/trainers.rs: a Note: block in the WordPieceTrainer Python docstring so the same caveat surfaces in API help.
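
For reviewers who want to see the path the callout points to, here is a minimal sketch of training through WordPieceTrainer directly instead of train_new_from_iterator(). The toy corpus, vocabulary size, and special-token list are assumptions made for illustration and are not part of this PR. It uses tokenizer.train_from_iterator(...) rather than tokenizer.train(...) only so that the example needs no files on disk.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# Hypothetical toy corpus; any iterator of strings would do.
corpus = [
    "tokenizers split text into subword units",
    "wordpiece and bpe are both subword models",
    "training builds a vocabulary from the corpus",
]

# Build a tokenizer around an untrained WordPiece model.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Configure the trainer explicitly (values chosen for illustration only).
trainer = WordPieceTrainer(
    vocab_size=200,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    show_progress=False,
)

# Train in memory; tokenizer.train(files, trainer) is the file-based equivalent.
tokenizer.train_from_iterator(corpus, trainer=trainer)

vocab = tokenizer.get_vocab()
tokens = tokenizer.encode("subword tokenizers").tokens
print(len(vocab), tokens)
```

The trainer is passed explicitly here, so the choice of training configuration is visible at the call site rather than being made inside a convenience wrapper.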

No code changes. I skipped regenerating the .pyi stubs since this is docstring-only and my local toolchain may not match CI; happy to add a stub refresh commit if you want.

Test plan

- git diff shows only docs/docstring edits, no code changes
- <Tip> syntax matches existing usage in quicktour.mdx
- Docstring Note: block follows the existing Args: / Example: style in trainers.rs

Users following the HF LLM course hit train_new_from_iterator on a WordPiece tokenizer and silently get a BPE-trained model. ArthurZucker confirmed in huggingface#1777 the discrepancy should be documented. Add a callout in the components doc plus a note on WordPieceTrainer so users know to use the trainer with .train() directly when they want WordPiece. Fixes huggingface#1777

adityasingh2400 wants to merge 1 commit into huggingface:main from adityasingh2400:doc-clarify-wordpiece-train-new-from-iterator-bpe-1777
