serialize tokenizer vocab and added_tokens compactly #2056
Open
ArthurZucker wants to merge 6 commits into
Pretty-printed tokenizer.json files spend most of their bytes on indentation inside the vocab map and added_tokens list. For large tokenizers this is ~10x the size of compact JSON. Keep the top-level structure pretty for inspectability, but emit the vocab map and each added_token as a single-line value via serde_json::value::RawValue. The outer array of added_tokens still respects the serializer's pretty flag, so pretty output remains diff-friendly with one token per line.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
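The layout this commit describes can be sketched without any dependencies. The block below is only an illustration of the target output shape: it builds the strings by hand rather than going through `serde_json::value::RawValue` as the PR does, and the `AddedToken` struct, `json_string`, `compact_vocab`, `compact_token`, and `render` are hypothetical names that model just three of the token fields.

```rust
/// Hypothetical stand-in for an added token; only three of the fields
/// from the PR's sample output are modelled here.
pub struct AddedToken {
    pub id: u32,
    pub content: String,
    pub special: bool,
}

/// Minimal JSON string escaping (quotes, backslashes, control characters).
pub fn json_string(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// The whole vocab map on one line, e.g. {"<unk>":0,"a":1}.
pub fn compact_vocab(vocab: &[(&str, u32)]) -> String {
    let entries: Vec<String> = vocab
        .iter()
        .map(|(tok, id)| format!("{}:{}", json_string(tok), id))
        .collect();
    format!("{{{}}}", entries.join(","))
}

/// One added token on one line.
pub fn compact_token(t: &AddedToken) -> String {
    format!(
        "{{\"id\":{},\"content\":{},\"special\":{}}}",
        t.id,
        json_string(&t.content),
        t.special
    )
}

/// Pretty outer structure, compact inner values: one token per line,
/// and the vocab map as a single-line value.
pub fn render(vocab: &[(&str, u32)], tokens: &[AddedToken]) -> String {
    let token_lines: Vec<String> = tokens
        .iter()
        .map(|t| format!("    {}", compact_token(t)))
        .collect();
    format!(
        "{{\n  \"added_tokens\": [\n{}\n  ],\n  \"model\": {{\n    \"vocab\": {}\n  }}\n}}",
        token_lines.join(",\n"),
        compact_vocab(vocab)
    )
}
```

Calling `render` with a three-entry vocab and two tokens yields nine lines regardless of vocab size, since only the number of added tokens adds lines; that is the property that keeps the file small while the outer structure stays diffable.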
Same trick as vocab/added_tokens: each `[left, right]` merge pair is emitted as a single-line RawValue. The outer array still respects the pretty flag, so pretty tokenizer.json shows one merge per line instead of 4-5 lines per pair.
Emit the whole added_tokens and merges arrays as one compact RawValue each, instead of one element per line. Matches Transformers.js output and maximises the size savings for large tokenizers.
- Inline lib test exercises the pre-compaction layout (multi-line vocab, multi-line added_tokens, legacy space-separated merges) and asserts the new compact serializer round-trips it.
- Integration test loads deepseek-ai/DeepSeek-V4-Flash-Base/tokenizer.json (downloaded by the Makefile alongside other test fixtures) and verifies that the file deserializes, that the re-serialized output is strictly smaller than the original (legacy pretty = 6.4 MB, compact pretty = 4.9 MB, ~23% savings), and that the result round-trips identically.
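Two of the checks above reduce to small computations, sketched here under stated assumptions. `parse_legacy_merge` assumes a legacy merge entry is a single string with the two halves separated by one space, and `savings_percent` simply reproduces the arithmetic behind the quoted figure. Both function names are hypothetical and are not taken from the PR's test code.

```rust
/// Split a legacy merge entry such as "a b" into its (left, right) pair.
/// Returns None when the entry contains no space separator.
pub fn parse_legacy_merge(entry: &str) -> Option<(&str, &str)> {
    entry.split_once(' ')
}

/// Size reduction expressed as a percentage of the original size.
pub fn savings_percent(original: f64, compact: f64) -> f64 {
    (original - compact) / original * 100.0
}
```

With the sizes reported above, `savings_percent(6.4, 4.9)` comes out at roughly 23.4, which matches the "~23% savings" in the commit message.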
The Rust core uses serde_json::value::RawValue to keep vocab, merges, and added_tokens on single lines when saving tokenizer.json. The pyo3 __repr__/__str__ goes through a custom serde Serializer that didn't recognise the RawValue smuggling protocol, so it was leaking $serde_json::private::RawValue(...) markers into Tokenizer reprs. Teach the custom serializer to intercept the magic struct name, parse the embedded JSON, and recurse so the inner shape renders as a normal Python repr. Enable serde_json's preserve_order feature so the keys inside the recursed Value::Object follow the insertion order the BPE serialize_struct produced rather than alphabetical.
Newer napi-rs builds throw "Failed to get Array length" when an undefined arg is read as an array, replacing the older "Given napi value is not an array". Loosen the matcher to accept either so CI passes on both versions.
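The loosened matcher amounts to accepting either message. The real assertion lives in the Node bindings' test suite and is not shown in this PR text, so the following is only a Rust rendering of the same logic, with a hypothetical function name and the two message strings copied from the commit message.

```rust
/// Message thrown by newer napi-rs builds.
pub const NEW_MESSAGE: &str = "Failed to get Array length";
/// Message thrown by older napi-rs builds.
pub const OLD_MESSAGE: &str = "Given napi value is not an array";

/// True when an error message matches either napi-rs wording for
/// "an undefined argument was read as an array".
pub fn is_non_array_arg_error(message: &str) -> bool {
    message.contains(NEW_MESSAGE) || message.contains(OLD_MESSAGE)
}
```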
Summary
- `OrderedVocabIter` (used by BPE / WordPiece / WordLevel) now serializes the whole vocab map as a single line.
- `AddedVocabulary` now emits each entry of `added_tokens` as a single line, one token per array element; the outer array still wraps so diffs stay clean.
- `serde_json`'s `raw_value` feature is enabled to support `RawValue::from_string` / `RawValue::serialize`.

Result: pretty output now looks like
{
  "version": "1.0",
  ...
  "added_tokens": [
    {"id":0,"content":"[SPECIAL_0]","single_word":false,"lstrip":false,"rstrip":false,"normalized":false,"special":true},
    ...
  ],
  ...
  "model": {
    "type": "BPE",
    "vocab": {"<unk>":0,"a":1,"b":2,...},
    "merges": [...]
  }
}