Increase llama context window safely by Jam-Cai · Pull Request #215 · FuJacob/cotabby

Jam-Cai · 2026-05-25T03:24:54Z

Summary

use a 4096-token default llama context for autocomplete across build configurations
cap visual-context summarization at a 2048-token temporary context and bound summary batch size to that context

Why

Autocomplete benefits from the larger KV window, and tests should exercise the same default runtime configuration instead of getting a build-specific shortcut. The memory guard belongs in summarization because that path creates an extra temporary llama context; keeping that auxiliary context at 2048 avoids doubling peak memory when visual-context summaries run alongside autocomplete.

Validation

xcodebuild -project tabby.xcodeproj -scheme tabby -destination 'platform=macOS' build
xcodebuild -project tabby.xcodeproj -scheme tabby -destination 'platform=macOS' build-for-testing
xcodebuild -project tabby.xcodeproj -scheme tabby -configuration Release -destination 'platform=macOS' build

Greptile Summary

This PR increases the default llama context window from 2048 to 4096 tokens for autocomplete, and adds a 2048-token cap for the summarize() path so the auxiliary visual-context sequence can't monopolize more than half of the shared KV pool when both paths run concurrently.

LlamaRuntimeModels.swift: contextWindowTokens in LlamaRuntimeConfiguration.default is bumped unconditionally from 2048 → 4096.
LlamaRuntimeCore.swift: Introduces summarizeContextWindowCap = 2048 and derives summarizeContextWindow = min(contextWindowTokens, cap); clamps maxPredictionTokens to summarizeContextWindow - 1 and maxPromptTokens to max(1, summarizeContextWindow - maxPredictionTokens), then uses the clamped value in the generation loop (previously the loop ran uncapped at options.maxPredictionTokens).

Confidence Score: 5/5

Safe to merge. The KV-budget arithmetic is correct and the generation loop is now properly bounded to the capped window in the summarize path.

Both changes are narrow and well-reasoned. The context bump in LlamaRuntimeModels is a one-liner with no conditional logic. The summarize capping in LlamaRuntimeCore correctly derives summarizeContextWindow, clamps both maxPredictionTokens and maxPromptTokens to stay within that window, and fixes the pre-existing gap where the generation loop ran for the raw options.maxPredictionTokens without any bound check against available KV slots. No new invariant is broken across the two paths.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| Cotabby/Models/LlamaRuntimeModels.swift | Single-line bump of default `contextWindowTokens` from 2048 to 4096 in `LlamaRuntimeConfiguration.default`; no other structural changes. |
| Cotabby/Services/Runtime/LlamaRuntimeCore.swift | Adds `summarizeContextWindowCap = 2048`, derives `summarizeContextWindow = min(contextWindowTokens, cap)`, clamps both `maxPredictionTokens` and `maxPromptTokens` to that window, and fixes the generation loop to use the clamped value instead of the raw `options.maxPredictionTokens`. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[summarize called] --> B[tokenize prompt]
    B --> C{tokens empty?}
    C -- yes --> ERR[throw generationFailed]
    C -- no --> D["summarizeContextWindow = min(contextWindowTokens, cap=2048)"]
    D --> E["maxPredictionTokens = min(options.maxPredictionTokens, summarizeContextWindow - 1)"]
    E --> F["maxPromptTokens = max(1, summarizeContextWindow - maxPredictionTokens)"]
    F --> G{prompt count exceeds maxPromptTokens?}
    G -- yes --> H[truncate to suffix]
    G -- no --> I[use all tokens]
    H --> J[decodePrompt]
    I --> J
    J --> K[generation loop bounded by maxPredictionTokens]
    K --> L[return generatedText]
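
The budget arithmetic in the flowchart can be sketched in Swift roughly as follows. This is a minimal illustration, not the code from LlamaRuntimeCore.swift: `SummarizeBudget`, `summarizeBudget(...)`, and `truncateToSuffix(...)` are hypothetical names, and only the `min`/`max` expressions are taken from the summary above.

```swift
// Hypothetical types and function names; only the arithmetic mirrors the flowchart.
struct SummarizeBudget {
    let contextWindow: Int
    let maxPredictionTokens: Int
    let maxPromptTokens: Int
}

/// Derives the capped window and splits it between prompt and prediction.
/// Assumes contextWindowTokens >= 2 and requestedPredictionTokens >= 1.
func summarizeBudget(contextWindowTokens: Int,
                     requestedPredictionTokens: Int,
                     cap: Int = 2048) -> SummarizeBudget {
    let window = min(contextWindowTokens, cap)
    // Reserve at least one slot for the prompt.
    let prediction = min(requestedPredictionTokens, window - 1)
    let prompt = max(1, window - prediction)
    return SummarizeBudget(contextWindow: window,
                           maxPredictionTokens: prediction,
                           maxPromptTokens: prompt)
}

/// Keeps the most recent tokens when the prompt exceeds its budget.
func truncateToSuffix(_ tokens: [Int32], limit: Int) -> [Int32] {
    return tokens.count > limit ? Array(tokens.suffix(limit)) : tokens
}

let budget = summarizeBudget(contextWindowTokens: 4096, requestedPredictionTokens: 256)
print(budget.contextWindow, budget.maxPredictionTokens, budget.maxPromptTokens)
```

With these clamps, the prompt budget plus the prediction budget never exceeds the capped window, even when a caller requests more prediction tokens than the window holds.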

Reviews (6): Last reviewed commit: "Clamp summarize prediction tokens to KV ..."

FuJacob · 2026-05-25T03:33:11Z

@greptileai

Bound the summarize generation loop by the same cap as the prompt budget so prompt + generated tokens stay within the capped window even if a future caller passes a larger maxPredictionTokens. Reword the rationale to describe the shared KV slot budget rather than peak memory, since the KV cache is a single pool allocated once at model load.

FuJacob · 2026-05-25T13:04:16Z

Analyzed this against the actual CotabbyInference engine before merging — TL;DR it's safe, but the rationale doesn't match how the engine allocates KV, and the real-world benefit is narrower than the description implies.

Safety: ✅ no overflow/crash risk

The comment + summary describe the KV cache as "a single pool allocated once at model load." That's not what happens. In CotabbyInferenceEngine.cpp, createSequence() builds a fresh llama_context per sequence, each with its own n_ctx and n_seq_max = 1:

ctx_params.n_ctx     = context_window_tokens;
ctx_params.n_seq_max = 1;
llama_context* ctx = llama_init_from_model(model, ctx_params);

So autocomplete and summarize run in isolated contexts with separate KV caches — there's no shared pool, no contention, no eviction. The worst case isn't "4096 + 2048 > pool"; it's two independent contexts, each correctly sized. That's precisely why it can't overflow.

Caveat: the summarize cap does not save memory

Because createSequence() always allocates n_ctx = context_window_tokens (4096), the summarize context still allocates a 4096-cell KV cache regardless of the Swift-side min(ctx, 2048). That cap only bounds how many tokens summarize feeds/generates — it does not reduce the allocation. So "avoids doubling peak memory when summaries run alongside autocomplete" isn't accurate: peak KV roughly doubles for the summarize context too. The cap's only real value is the defensive loop bound (a legit small correctness tidy-up); the memory-guard framing should be dropped or reworked.

Does the 4096 bump actually help?

Only situationally. The autocomplete prompt is the focused field's preceding text, so the larger window only changes output when that exceeds ~2048 tokens (long docs/emails). For short fields (Slack, iMessage, etc.) the prompt was already under 2048, so output is identical — but llama.cpp reserves the full n_ctx KV upfront, so every user pays ~2× autocomplete KV memory whether they benefit or not. Fine for the bundled 0.6–1B Q4 models; more noticeable on a large user-supplied GGUF / low-RAM Mac.
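
To put rough numbers on the "~2×" claim, here is a back-of-the-envelope sketch of how KV memory scales with n_ctx. It assumes an unquantized f16 cache (2 bytes per element) with full attention on every layer, and the model dimensions are hypothetical placeholders rather than values from the bundled models; `kvCacheBytes` is an illustrative helper, not an engine API.

```swift
/// Rough KV-cache size for one llama context: a K and a V tensor per layer,
/// each holding nCtx * nKVHeads * headDim elements.
/// Assumes an f16 cache (2 bytes per element) and full attention on every layer.
func kvCacheBytes(nCtx: Int, nLayers: Int, nKVHeads: Int, headDim: Int,
                  bytesPerElement: Int = 2) -> Int {
    return 2 * nLayers * nCtx * nKVHeads * headDim * bytesPerElement
}

// Hypothetical small model (not taken from the PR): 28 layers, 8 KV heads, head dim 128.
let before = kvCacheBytes(nCtx: 2048, nLayers: 28, nKVHeads: 8, headDim: 128)
let after = kvCacheBytes(nCtx: 4096, nLayers: 28, nKVHeads: 8, headDim: 128)
let mib = 1024 * 1024

print("autocomplete KV: \(before / mib) MiB -> \(after / mib) MiB")
// While a summary runs, a second context with the same n_ctx exists alongside autocomplete.
print("peak with summarize context: \(2 * after / mib) MiB")
```

Because the size is linear in n_ctx, the exact dimensions do not change the conclusion: doubling the window doubles the cache for every context created with that window.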

Suggestion

If the intent is genuinely "bigger window without the memory hit," the fix is to let createSequence() take a per-sequence context size so summary contexts allocate 2048 (not 4096) — the Swift cap is currently powerless to do that. Otherwise this is effectively just "give autocomplete a 4096 window," which is a real but narrow win. Safe to merge either way; just worth correcting the description so the tradeoff is clear.
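
One possible shape for that suggestion, sketched in Swift for illustration only. The real createSequence() lives in CotabbyInferenceEngine.cpp and its signature is not shown in this thread, so `InferenceEngineSketch`, `SequenceHandle`, and the optional `contextWindowTokens` parameter are all assumptions; clamping the request to the configured default is likewise a design choice made here, not something the thread specifies.

```swift
// Hypothetical Swift-side model of the engine, not the real C++ API.
struct SequenceHandle {
    // Stands in for the n_ctx the underlying llama_context would be created with.
    let allocatedContextTokens: Int
}

final class InferenceEngineSketch {
    let defaultContextWindowTokens: Int

    init(defaultContextWindowTokens: Int) {
        self.defaultContextWindowTokens = defaultContextWindowTokens
    }

    /// Passing nil keeps the current behavior: the sequence gets the default n_ctx.
    /// A non-nil value would become ctx_params.n_ctx for that sequence only,
    /// clamped here so a caller cannot exceed the configured default.
    func createSequence(contextWindowTokens requested: Int? = nil) -> SequenceHandle {
        let nCtx = min(requested ?? defaultContextWindowTokens, defaultContextWindowTokens)
        return SequenceHandle(allocatedContextTokens: max(1, nCtx))
    }
}

let engine = InferenceEngineSketch(defaultContextWindowTokens: 4096)
let autocomplete = engine.createSequence()
let summary = engine.createSequence(contextWindowTokens: 2048)
print(autocomplete.allocatedContextTokens, summary.allocatedContextTokens)
```

Under this shape, the Swift-side `summarizeContextWindow` value could be passed through to the engine, so the 2048 cap would bound the allocation as well as the token count.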

Jam-Cai force-pushed the codex/summarize-context-window branch from b51e588 to 637d98b Compare May 25, 2026 03:33

Jam-Cai force-pushed the codex/summarize-context-window branch from 637d98b to 35a618c Compare May 25, 2026 03:39

greptile-apps Bot reviewed May 25, 2026

Comment thread Cotabby/Models/LlamaRuntimeModels.swift

Increase llama context window safely

993f245

FuJacob force-pushed the codex/summarize-context-window branch from 35a618c to 993f245 Compare May 25, 2026 04:24

FuJacob added 2 commits May 25, 2026 04:02

Merge remote-tracking branch 'origin/main' into codex/summarize-context-window

050ee89

Jam-Cai wants to merge 3 commits into main from codex/summarize-context-window
