serialize tokenizer vocab and added_tokens compactly by ArthurZucker · Pull Request #2056 · huggingface/tokenizers

ArthurZucker · 2026-05-13T02:42:09Z

Summary

OrderedVocabIter (used by BPE / WordPiece / WordLevel) now serializes the whole vocab map as a single line.
AddedVocabulary now emits each entry of added_tokens as a single line, one token per array element. The outer array still wraps, so diffs stay clean.

serde_json's raw_value feature is enabled to support RawValue::from_string / RawValue::serialize.

Result: pretty output now looks like

{
  "version": "1.0",
  ...
  "added_tokens": [
    {"id":0,"content":"[SPECIAL_0]","single_word":false,"lstrip":false,"rstrip":false,"normalized":false,"special":true},
    ...
  ],
  ...
  "model": {
    "type": "BPE",
    "vocab": {"<unk>":0,"a":1,"b":2,...},
    "merges": [...]
  }
}

Pretty-printed tokenizer.json files spend most of their bytes on indentation inside the vocab map and added_tokens list. For large tokenizers this is ~10x the size of compact JSON.

Keep the top-level structure pretty for inspectability, but emit the vocab map and each added_token as a single-line value via serde_json::value::RawValue. The outer array of added_tokens still respects the serializer's pretty flag, so pretty output remains diff-friendly with one token per line.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
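To make the layout described above concrete, here is a minimal sketch that uses only the Rust standard library. It is not the PR's implementation: the PR goes through serde_json::value::RawValue, while this sketch builds the strings by hand to show the resulting shape and the size difference. All function names (`json_string`, `by_id`, `vocab_compact`, `vocab_pretty`, `model_json`) are hypothetical and exist only for illustration.

```rust
use std::collections::BTreeMap;

/// Quote a token as a JSON string, escaping the characters JSON requires.
fn json_string(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Vocab entries sorted by id (assumption: this mirrors what an ordered vocab iterator yields).
fn by_id(vocab: &BTreeMap<String, u32>) -> Vec<(&String, &u32)> {
    let mut entries: Vec<_> = vocab.iter().collect();
    entries.sort_by_key(|(_, id)| **id);
    entries
}

/// The whole vocab map on a single line, with no whitespace.
fn vocab_compact(vocab: &BTreeMap<String, u32>) -> String {
    let body: Vec<String> = by_id(vocab)
        .into_iter()
        .map(|(tok, id)| format!("{}:{}", json_string(tok), id))
        .collect();
    format!("{{{}}}", body.join(","))
}

/// The legacy layout: one indented entry per line.
fn vocab_pretty(vocab: &BTreeMap<String, u32>, indent: usize) -> String {
    let pad = " ".repeat(indent + 2);
    let body: Vec<String> = by_id(vocab)
        .into_iter()
        .map(|(tok, id)| format!("{}{}: {}", pad, json_string(tok), id))
        .collect();
    format!("{{\n{}\n{}}}", body.join(",\n"), " ".repeat(indent))
}

/// Pretty top-level structure; the vocab is either compact or legacy pretty.
fn model_json(vocab: &BTreeMap<String, u32>, compact_vocab: bool) -> String {
    let v = if compact_vocab {
        vocab_compact(vocab)
    } else {
        vocab_pretty(vocab, 4)
    };
    format!(
        "{{\n  \"model\": {{\n    \"type\": \"BPE\",\n    \"vocab\": {}\n  }}\n}}",
        v
    )
}

fn main() {
    let mut vocab = BTreeMap::new();
    for (id, tok) in ["<unk>", "a", "b", "ab"].iter().enumerate() {
        vocab.insert(tok.to_string(), id as u32);
    }
    let legacy = model_json(&vocab, false);
    let compact = model_json(&vocab, true);
    println!("{}", compact);
    println!("legacy: {} bytes, compact: {} bytes", legacy.len(), compact.len());
}
```

The per-entry cost in the legacy layout is the newline plus indentation plus the space after the colon, which is why the saving grows with vocab size while the top-level keys stay readable.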

HuggingFaceDocBuilderDev · 2026-05-13T02:45:09Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Same trick as vocab/added_tokens: each `[left, right]` merge pair is emitted as a single-line RawValue. The outer array still respects the pretty flag, so pretty tokenizer.json shows one merge per line instead of 4-5 lines per pair.

Emit the whole added_tokens and merges arrays as one compact RawValue each, instead of one element per line. Matches Transformers.js output and maximises the size savings for large tokenizers.

- Inline lib test exercises the pre-compaction layout (multi-line vocab, multi-line added_tokens, legacy space-separated merges) and asserts the new compact serializer round-trips it.
- Integration test loads deepseek-ai/DeepSeek-V4-Flash-Base/tokenizer.json (downloaded by the Makefile alongside other test fixtures) and verifies that the file deserializes, that the re-serialized output is strictly smaller than the original (legacy pretty = 6.4 MB, compact pretty = 4.9 MB, ~23% savings), and that the result round-trips identically.
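The size claim in the integration test can be checked with a few lines of arithmetic. The sketch below is a hypothetical helper, not code from the PR: `savings_percent` and `check_compaction` are illustrative names, and the check only mirrors the "strictly smaller" condition described above.

```rust
/// Relative size reduction in percent, given sizes before and after re-serialization.
fn savings_percent(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}

/// Returns the savings in percent, or an error if the re-serialized
/// output is not strictly smaller than the original.
fn check_compaction(original: &str, reserialized: &str) -> Result<f64, String> {
    if reserialized.len() >= original.len() {
        return Err(format!(
            "expected a strictly smaller output, got {} >= {} bytes",
            reserialized.len(),
            original.len()
        ));
    }
    Ok(savings_percent(original.len() as f64, reserialized.len() as f64))
}

fn main() {
    // Figures quoted in the commit message: 6.4 MB legacy pretty, 4.9 MB compact pretty.
    let pct = savings_percent(6.4, 4.9);
    println!("{:.1}% smaller", pct);

    // Toy inputs standing in for the legacy and compact files.
    let legacy = "{\n  \"a\": 1,\n  \"b\": 2\n}";
    let compact = "{\"a\":1,\"b\":2}";
    match check_compaction(legacy, compact) {
        Ok(p) => println!("toy example: {:.1}% smaller", p),
        Err(e) => println!("check failed: {}", e),
    }
}
```

With the quoted figures, (6.4 - 4.9) / 6.4 is about 0.234, which matches the ~23% stated in the commit message.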

The Rust core uses serde_json::value::RawValue to keep vocab, merges, and added_tokens on single lines when saving tokenizer.json. The pyo3 __repr__/__str__ goes through a custom serde Serializer that didn't recognise the RawValue smuggling protocol, so it was leaking $serde_json::private::RawValue(...) markers into Tokenizer reprs.

Teach the custom serializer to intercept the magic struct name, parse the embedded JSON, and recurse so the inner shape renders as a normal Python repr.

Enable serde_json's preserve_order feature so the keys inside the recursed Value::Object follow the insertion order the BPE serialize_struct produced rather than alphabetical.

Newer napi-rs builds throw "Failed to get Array length" when an undefined arg is read as an array, replacing the older "Given napi value is not an array". Loosen the matcher to accept either so CI passes on both versions.

ArthurZucker added 5 commits May 13, 2026 11:56

serialize BPE merges compactly too

6b6d873

fully compact added_tokens and merges arrays + apply cargo fmt

8c33e4e

ArthurZucker wants to merge 6 commits into main from feat/compact-vocab-output