serialize tokenizer vocab and added_tokens compactly #2056

Open
ArthurZucker wants to merge 6 commits into main from feat/compact-vocab-output

@ArthurZucker ArthurZucker commented May 13, 2026

Summary

  • OrderedVocabIter (used by BPE / WordPiece / WordLevel) now serializes the whole vocab map as a single line.
  • AddedVocabulary now emits each entry of added_tokens as a single line, one token per array element; the outer array still wraps, so diffs stay clean.

serde_json's raw_value feature is enabled to support RawValue::from_string / RawValue::serialize.

Result: pretty output now looks like

{
  "version": "1.0",
  ...
  "added_tokens": [
    {"id":0,"content":"[SPECIAL_0]","single_word":false,"lstrip":false,"rstrip":false,"normalized":false,"special":true},
    ...
  ],
  ...
  "model": {
    "type": "BPE",
    "vocab": {"<unk>":0,"a":1,"b":2,...},
    "merges": [...]
  }
}

Pretty-printed tokenizer.json files spend most of their bytes on
indentation inside the vocab map and added_tokens list. For large
tokenizers the pretty-printed file is ~10x the size of compact JSON.

Keep the top-level structure pretty for inspectability, but emit the
vocab map and each added_token as a single-line value via
serde_json::value::RawValue. The outer array of added_tokens still
respects the serializer's pretty flag, so pretty output remains
diff-friendly with one token per line.
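
To make the layout difference concrete, here is a minimal std-only sketch that formats the same toy vocab both ways. It is a hand-rolled illustration, not the PR's code path (which goes through serde_json::value::RawValue); the helper names `quote`, `pretty_vocab`, and `compact_vocab` are hypothetical, and the escaping is deliberately simplified.

```rust
/// Simplified JSON string quoting: escapes only `"` and `\`.
/// Assumption: enough for this toy vocab, not a complete JSON encoder.
fn quote(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        if c == '"' || c == '\\' {
            out.push('\\');
        }
        out.push(c);
    }
    out.push('"');
    out
}

/// Legacy-style layout: one `"token": id` entry per line, nested two
/// indent levels deep (as it would sit inside `"model": { "vocab": ... }`).
fn pretty_vocab(vocab: &[(&str, u32)], indent: &str) -> String {
    let rows: Vec<String> = vocab
        .iter()
        .map(|(tok, id)| format!("{0}{0}{1}: {2}", indent, quote(tok), id))
        .collect();
    format!("{{\n{}\n{}}}", rows.join(",\n"), indent)
}

/// Compact layout: the whole map on a single line, no whitespace.
fn compact_vocab(vocab: &[(&str, u32)]) -> String {
    let entries: Vec<String> = vocab
        .iter()
        .map(|(tok, id)| format!("{}:{}", quote(tok), id))
        .collect();
    format!("{{{}}}", entries.join(","))
}
```

Calling `compact_vocab(&[("<unk>", 0), ("a", 1), ("b", 2)])` yields the single-line form shown in the example output above, while `pretty_vocab` on the same input pays an indent, a space, and a newline for every entry.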

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Same trick as vocab/added_tokens: each `[left, right]` merge pair is
emitted as a single-line RawValue. The outer array still respects the
pretty flag, so pretty tokenizer.json shows one merge per line instead
of 4-5 lines per pair.

Emit the whole added_tokens and merges arrays as one compact RawValue
each, instead of one element per line. This matches Transformers.js
output and maximises the size savings for large tokenizers.
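
The two merges layouts discussed in these commits (one pair per line, then the whole array on one line) can be sketched with std-only string formatting. This is an illustration of the resulting text only, not the serializer in the PR; `quote`, `merges_one_per_line`, and `merges_compact` are hypothetical names, and the escaping is simplified. For the pairs `("a", "b")` and `("ab", "c")`, the first function produces four lines and the second produces a single line.

```rust
/// Simplified JSON string quoting: escapes only `"` and `\`.
/// Assumption: sufficient for this illustration, not a full JSON encoder.
fn quote(s: &str) -> String {
    let mut out = String::from("\"");
    for c in s.chars() {
        if c == '"' || c == '\\' {
            out.push('\\');
        }
        out.push(c);
    }
    out.push('"');
    out
}

/// Formats one `[left, right]` merge pair without any whitespace.
fn pair(left: &str, right: &str) -> String {
    format!("[{},{}]", quote(left), quote(right))
}

/// Intermediate layout: the outer array is broken across lines,
/// with exactly one compact merge pair per line.
fn merges_one_per_line(merges: &[(&str, &str)], indent: &str) -> String {
    let rows: Vec<String> = merges
        .iter()
        .map(|(l, r)| format!("{}{}", indent, pair(l, r)))
        .collect();
    format!("[\n{}\n]", rows.join(",\n"))
}

/// Final layout: the whole merges array as one compact value.
fn merges_compact(merges: &[(&str, &str)]) -> String {
    let pairs: Vec<String> = merges.iter().map(|(l, r)| pair(l, r)).collect();
    format!("[{}]", pairs.join(","))
}
```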

- Inline lib test exercises the pre-compaction layout (multi-line vocab,
  multi-line added_tokens, legacy space-separated merges) and asserts
  that the new compact serializer round-trips it.
- Integration test loads deepseek-ai/DeepSeek-V4-Flash-Base/tokenizer.json
  (downloaded by the Makefile alongside other test fixtures) and verifies
  that the file deserializes, that the re-serialized output is strictly
  smaller than the original (legacy pretty = 6.4 MB, compact pretty =
  4.9 MB, ~23% savings), and that the result round-trips identically.
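
The quoted savings figure follows from the two file sizes. As a quick check, the hypothetical helper below computes the relative reduction; with the rounded sizes from the commit message (6.4 MB and 4.9 MB) it gives roughly 23.4%, consistent with the stated ~23%.

```rust
/// Relative size reduction in percent when going from `before` to `after`.
/// Both arguments must use the same unit (here: megabytes).
fn savings_percent(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}
```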

The Rust core uses serde_json::value::RawValue to keep vocab, merges,
and added_tokens on single lines when saving tokenizer.json. The pyo3
__repr__/__str__ goes through a custom serde Serializer that didn't
recognise the RawValue smuggling protocol, so it was leaking
$serde_json::private::RawValue(...) markers into Tokenizer reprs.

Teach the custom serializer to intercept the magic struct name, parse
the embedded JSON, and recurse so the inner shape renders as a normal
Python repr. Enable serde_json's preserve_order feature so the keys
inside the recursed Value::Object follow the insertion order the BPE
serialize_struct produced rather than alphabetical.

Newer napi-rs builds throw "Failed to get Array length" when an
undefined arg is read as an array, replacing the older "Given napi
value is not an array". Loosen the matcher to accept either so CI
passes on both versions.
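
The loosened check amounts to accepting either of the two messages quoted above. The sketch below expresses that predicate in Rust purely for illustration; the actual matcher lives in the bindings' test suite, and the constant and function names here are hypothetical.

```rust
/// Message produced by older napi-rs builds (quoted from the commit message).
const OLD_MESSAGE: &str = "Given napi value is not an array";
/// Message produced by newer napi-rs builds (quoted from the commit message).
const NEW_MESSAGE: &str = "Failed to get Array length";

/// Returns true if `message` contains either variant of the
/// "argument is not an array" error, so a test passes on both versions.
fn is_not_array_error(message: &str) -> bool {
    message.contains(OLD_MESSAGE) || message.contains(NEW_MESSAGE)
}
```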