Generation-time quality controls: token masks, single-line, mid-word continuation#488
Draft
FuJacob wants to merge 2 commits into
Point the CotabbyInference package at the engine branch that adds the token masks, mid-word continuation, and KV snapshot APIs, and use them:
- The always-on nonprintable token mask now applies automatically (control, chat-template, and unused tokens can no longer be emitted as visible text), with no app code beyond the pin bump.
- single_line is set from the focused field (LlamaGenerationOptions.singleLine = !isMultiLineEnabled), so single-line fields never receive a multi-line completion at the source instead of being truncated after the fact.
- forceWordContinuation fires only when the caret sits strictly inside a word (MidWordContinuationPolicy), so the engine constrains the first token to continue that word without affecting ordinary next-word predictions.
Threads singleLine / forceWordContinuation through LlamaGenerationOptions into LlamaRuntimeCore (sampling config + setForceWordContinuation before each decodePrompt, fresh and reuse paths). Adds MidWordContinuationPolicy + tests.
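To make the two app-side decisions in this commit concrete, here is a minimal sketch of how they could be derived. It is not the PR's code: MidWordPolicySketch, GenerationOptionsSketch, and makeOptions are hypothetical names, and the definition of "strictly inside a word" (a word character on both sides of the caret, where letters, digits, and apostrophes count as word characters) is an assumption.

```swift
/// Hypothetical stand-in for MidWordContinuationPolicy; the real type's API
/// is not shown in this PR description.
struct MidWordPolicySketch {
    /// Returns true only when the caret has a word character on both sides.
    /// `caretOffset` counts Characters from the start of `text`.
    static func shouldForceContinuation(text: String, caretOffset: Int) -> Bool {
        let chars = Array(text)
        // At the very start or end of the text the caret cannot be inside a word.
        guard caretOffset > 0, caretOffset < chars.count else { return false }
        return isWordCharacter(chars[caretOffset - 1]) && isWordCharacter(chars[caretOffset])
    }

    /// Assumption: letters, digits, and apostrophes are word characters.
    private static func isWordCharacter(_ c: Character) -> Bool {
        c.isLetter || c.isNumber || c == "'"
    }
}

/// Hypothetical subset of LlamaGenerationOptions, limited to the two flags
/// this commit threads through.
struct GenerationOptionsSketch {
    var singleLine: Bool
    var forceWordContinuation: Bool
}

/// Derives both flags from the focused field's state.
func makeOptions(isMultiLineEnabled: Bool, text: String, caretOffset: Int) -> GenerationOptionsSketch {
    GenerationOptionsSketch(
        // Single-line fields ask the sampler to stay on one line.
        singleLine: !isMultiLineEnabled,
        // Only constrain the first token when the caret splits a word.
        forceWordContinuation: MidWordPolicySketch.shouldForceContinuation(
            text: text, caretOffset: caretOffset)
    )
}
```

Under these assumptions, a caret at the end of "hello" (a normal word end) leaves forceWordContinuation false, while a caret between "hel" and "lo" sets it to true.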
Use the engine's new per-token logprob to drop completions the model itself was unsure about. LlamaRuntimeCore accumulates the average per-token log-probability and, when LlamaGenerationOptions.confidenceFloor is raised above its default of -infinity, suppresses completions below the floor. ConfidenceSuppressionPolicy holds the pure decision and is unit-tested. Disabled by default, so behavior is unchanged until a caller opts in; wiring a Settings control and full multi-candidate N-best ranking remain follow-ups.
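The suppression decision described in this commit can be sketched as a pure function. This is an illustration only: ConfidenceSuppressionSketch and its method names are hypothetical, the use of Double for log-probabilities is an assumption (the engine's actual numeric type is not stated), and treating an empty token list as "do not suppress" is also an assumption.

```swift
/// Hypothetical stand-in for ConfidenceSuppressionPolicy.
struct ConfidenceSuppressionSketch {
    /// Average per-token log-probability, or nil when no tokens were sampled.
    static func averageLogprob(_ logprobs: [Double]) -> Double? {
        guard !logprobs.isEmpty else { return nil }
        return logprobs.reduce(0, +) / Double(logprobs.count)
    }

    /// Returns true when the completion should be dropped.
    /// The default floor of -infinity disables suppression entirely.
    static func shouldSuppress(logprobs: [Double],
                               confidenceFloor: Double = -Double.infinity) -> Bool {
        // With the default floor no average can fall below it, so skip the work.
        guard confidenceFloor > -Double.infinity,
              let average = averageLogprob(logprobs) else { return false }
        return average < confidenceFloor
    }
}
```

With a floor of -1.5, a completion whose tokens scored [-2.0, -4.0] (average -3.0) would be suppressed, while one that scored [-0.5, -1.0] (average -0.75) would pass. Without an explicit floor, nothing is suppressed, which matches the "disabled by default" behavior described above.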
Summary
Wires generation-time quality controls into the app so unwanted tokens are stopped at the sampler instead of cleaned up after the fact. Builds on the engine changes in FuJacob/cotabbyinference#8. Second of two stacked PRs, based on the feat/output-safety-gates branch (#485).
- single_line: set from the focused field (LlamaGenerationOptions.singleLine = !request.isMultiLineEnabled). Multi-line output is blocked at the source, so single-line fields never receive a multi-line completion.
- forceWordContinuation: MidWordContinuationPolicy fires only when the caret sits strictly inside a word, so the engine constrains the first sampled token to continue the current word. At a normal word end it does not fire, so ordinary next-word predictions are unchanged.
- Per-token logprob: LlamaRuntimeCore averages the engine's new per-token logprob and, when LlamaGenerationOptions.confidenceFloor is raised above its default of -infinity, suppresses low-confidence completions via the pure ConfidenceSuppressionPolicy. Disabled by default, so no behavior change until a caller opts in.
- Threading: singleLine / forceWordContinuation / confidenceFloor flow through LlamaGenerationOptions into LlamaRuntimeCore (sampling config, setForceWordContinuation before each decodePrompt, and the logprob accumulation in the sample loop).
Validation
Package pin resolves to cotabbyinference @ feat/generation-quality-controls (be64365). The engine side was separately validated with swift test against a local gemma-3-1b model: 20 tests, 0 failures (see the engine PR).
Linked issues
Depends on FuJacob/cotabbyinference#8 (engine).
Risk / rollout notes
- project.yml pins the engine feature branch. Merge the engine PR to cotabbyinference main first, then flip this PR's project.yml back to branch: main, re-resolve, and merge. Kept as a draft until then.
- single_line only affects single-line fields. forceWordContinuation uses a narrow trigger. Confidence suppression is off by default.
- Follow-ups: a Settings control for confidenceFloor, full multi-candidate N-best ranking, and the base-model prompt path (covered by the in-flight #438, "Feed instruct models their own chat template; write both prompt paths as prose"). These need on-device tuning that green CI cannot stand in for.