Document that train_new_from_iterator uses BPE for WordPiece#2065

Open
adityasingh2400 wants to merge 1 commit into huggingface:main from adityasingh2400:doc-clarify-wordpiece-train-new-from-iterator-bpe-1777
@adityasingh2400

Summary

Closes #1777.

Users following the HF LLM course call train_new_from_iterator() on a WordPiece tokenizer and silently get a BPE-trained vocabulary, because the library uses BPE internally for that path. @ArthurZucker confirmed in #1777 that this should be documented.

This PR adds the clarification in two places:

  • docs/source-doc-builder/components.mdx: a <Tip warning> callout under the Models section pointing readers to WordPieceTrainer + tokenizer.train(...) when they actually want WordPiece.
  • bindings/python/src/trainers.rs: a Note: block in the WordPieceTrainer Python docstring so the same caveat surfaces in API help.
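
For reviewers who want to see the path the new callout points to, here is a minimal sketch of training with `WordPieceTrainer` directly, assuming the `tokenizers` Python package is installed. The toy corpus, vocabulary size, and special tokens are illustrative assumptions, not values from this PR. The sketch uses `train_from_iterator`, the in-memory counterpart of `tokenizer.train(files, trainer)`, so it runs without files on disk.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# Toy corpus for illustration only (not taken from the PR).
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the lazy dog sleeps while the quick fox runs",
]

# Start from an empty WordPiece model and split on whitespace/punctuation.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Pass the WordPiece trainer explicitly instead of relying on
# train_new_from_iterator(); vocab_size and special_tokens are assumed values.
trainer = WordPieceTrainer(
    vocab_size=100,
    special_tokens=["[UNK]", "[PAD]"],
    show_progress=False,
)
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Inspect the result: the model type and a sample encoding.
print(type(tokenizer.model).__name__)
print(tokenizer.encode("the quick dog").tokens)
```

To train from files instead, the equivalent call would be `tokenizer.train(files, trainer)` with a list of paths.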

No code changes. I skipped regenerating the .pyi stubs since this is docstring-only and my local toolchain may not match CI; happy to add a stub refresh commit if you want.

Test plan

  • git diff shows only docs/docstring edits, no code changes
  • <Tip> syntax matches existing usage in quicktour.mdx
  • Docstring Note: block follows the existing Args: / Example: style in trainers.rs

Users following the HF LLM course call train_new_from_iterator on a
WordPiece tokenizer and silently get a BPE-trained model. ArthurZucker
confirmed in huggingface#1777 that the discrepancy should be documented.
Add a callout in the components doc plus a note on WordPieceTrainer so
users know to use the trainer with .train() directly when they want
WordPiece.

Fixes huggingface#1777

Development

Successfully merging this pull request may close these issues.

Suggestion to Clarify WordPiece Documentation
