Add chat-template rendering and special-token tokenize path#6
Merged
Conversation
The engine only exposed a raw tokenize (parse_special=false) and no way to
render a model's built-in chat template. Callers feeding instruct models
(Qwen, Gemma, Llama) therefore had to send unstructured text, which pushes
those models off-distribution and makes them echo prompt scaffolding.
New API (all additive; existing tokenize/detokenize unchanged):
- tokenizeWithOptions(text, len, add_special, parse_special): lets the
chat-template path tokenize rendered control markers as real token IDs.
Plain tokenize() now delegates here with parse_special=false, preserving
the historical contract byte-for-byte.
- hasChatTemplate(): true when the loaded model ships a template in GGUF
metadata, so callers can fall back to the raw path for base models.
- applyChatTemplate(messages, count, add_assistant): renders a conversation
through llama_model_chat_template + llama_chat_apply_template, returning
the formatted prompt (empty string signals "fall back to raw").
- ChatMessage { role, content } value type for the message list.
Tests cover the no-model guards for all three new entry points and, when
COTABBY_TEST_MODEL_PATH is set, assert a templated render tokenizes to a
non-empty list.
3b7f24d to
e8a7049
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The engine only exposed a raw
tokenize(withparse_special=false) and no way to render a model's built-in chat template. Callers feeding instruct models (Qwen, Gemma, Llama) therefore had to send unstructured labeled text, which pushes those models off-distribution and makes them echo prompt scaffolding (e.g. repeatingApp:/Text before caret:section headers). This adds the C++ surface a caller needs to render the model's own template and tokenize the result correctly, with a clean fallback for base models that ship no template.All changes are additive. Existing
tokenize/detokenizebehavior is unchanged.What's new
tokenizeWithOptions(text, len, add_special, parse_special)— lets the chat-template path tokenize rendered control markers as real token IDs. Plaintokenize()now delegates here withparse_special=false, preserving the historical contract byte-for-byte (BOS still added per model metadata).hasChatTemplate()— true when the loaded model ships a template in GGUF metadata, so callers fall back to the raw path for base models.applyChatTemplate(messages, count, add_assistant)— renders a conversation viallama_model_chat_template+llama_chat_apply_template, returning the formatted prompt. Empty string signals "fall back to raw."ChatMessage { role, content }value type for the message list.Validation
swift test— Executed 13 tests, with 0 failures (10 pre-existing + 3 new no-model guard tests).COTABBY_TEST_MODEL_PATHis set: a templated render must tokenize (withparse_special) to a non-empty token list.swift buildclean.Notes / limitations
llama_chat_apply_templatein llama.cpp b9310 uses a predefined template list, not a jinja parser, and exposes noenable_thinkingparameter. So generation-time "thinking off" for reasoning models is not reachable through this API; that would need a separate model-specific approach. This PR is scoped to template rendering + the tokenize path that unblocks it.Consumer
The Cotabby app (
tabby-1) will adopt this in a paired PR: pin bump + aLlamaPromptRendererrewrite that emits role-structured messages whenhasChatTemplate()is true and falls back to the current raw prompt otherwise.