
Increase llama context window safely #215

Open: Jam-Cai wants to merge 3 commits into main from codex/summarize-context-window


Jam-Cai (Collaborator) commented May 25, 2026

Summary

  • use a 4096-token default llama context for autocomplete across build configurations
  • cap visual-context summarization at a 2048-token temporary context and bound summary batch size to that context

Why

Autocomplete benefits from the larger KV window, and tests should exercise the same default runtime configuration instead of getting a build-specific shortcut. The memory guard belongs in summarization because that path creates an extra temporary llama context; keeping that auxiliary context at 2048 avoids doubling peak memory when visual-context summaries run alongside autocomplete.

Validation

  • xcodebuild -project tabby.xcodeproj -scheme tabby -destination 'platform=macOS' build
  • xcodebuild -project tabby.xcodeproj -scheme tabby -destination 'platform=macOS' build-for-testing
  • xcodebuild -project tabby.xcodeproj -scheme tabby -configuration Release -destination 'platform=macOS' build

Greptile Summary

This PR increases the default llama context window from 2048 to 4096 tokens for autocomplete, and adds a 2048-token cap for the summarize() path so the auxiliary visual-context sequence can't monopolize more than half of the shared KV pool when both paths run concurrently.

  • LlamaRuntimeModels.swift: contextWindowTokens in LlamaRuntimeConfiguration.default is bumped unconditionally from 2048 → 4096.
  • LlamaRuntimeCore.swift: Introduces summarizeContextWindowCap = 2048 and derives summarizeContextWindow = min(contextWindowTokens, cap); clamps maxPredictionTokens to summarizeContextWindow - 1 and maxPromptTokens to max(1, summarizeContextWindow - maxPredictionTokens), then uses the clamped value in the generation loop (previously the loop ran uncapped at options.maxPredictionTokens).
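
The clamping arithmetic described above can be restated as a small standalone sketch. This is not the Swift code from the PR: it is a C++ paraphrase, and the struct, function, and constant names are hypothetical. It also assumes both inputs are positive.

```cpp
#include <algorithm>

// Hypothetical container for the three derived values named in the summary.
struct SummarizeBudget {
    int context_window;         // min(contextWindowTokens, cap)
    int max_prediction_tokens;  // clamped to context_window - 1
    int max_prompt_tokens;      // at least 1, never more than what is left
};

// Mirrors summarizeContextWindowCap = 2048 from the PR description.
constexpr int kSummarizeContextWindowCap = 2048;

// Assumes context_window_tokens >= 1 and requested_prediction_tokens >= 1.
inline SummarizeBudget compute_summarize_budget(int context_window_tokens,
                                                int requested_prediction_tokens) {
    const int window = std::min(context_window_tokens, kSummarizeContextWindowCap);
    // Leave at least one slot for the prompt.
    const int prediction = std::min(requested_prediction_tokens, window - 1);
    const int prompt = std::max(1, window - prediction);
    return {window, prediction, prompt};
}
```

With a 4096-token default and a 256-token prediction request, this yields a 2048-token window split into 1792 prompt tokens and 256 prediction tokens, so prompt plus prediction never exceeds the capped window.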

Confidence Score: 5/5

Safe to merge. The KV-budget arithmetic is correct and the generation loop is now properly bounded to the capped window in the summarize path.

Both changes are narrow and well-reasoned. The context bump in LlamaRuntimeModels is a one-liner with no conditional logic. The summarize capping in LlamaRuntimeCore correctly derives summarizeContextWindow, clamps both maxPredictionTokens and maxPromptTokens to stay within that window, and fixes the pre-existing gap where the generation loop ran for the raw options.maxPredictionTokens without any bound check against available KV slots. No new invariant is broken across the two paths.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| Cotabby/Models/LlamaRuntimeModels.swift | Single-line bump of default contextWindowTokens from 2048 to 4096 in LlamaRuntimeConfiguration.default; no other structural changes. |
| Cotabby/Services/Runtime/LlamaRuntimeCore.swift | Adds summarizeContextWindowCap = 2048, derives summarizeContextWindow = min(contextWindowTokens, cap), clamps both maxPredictionTokens and maxPromptTokens to that window, and fixes the generation loop to use the clamped value instead of the raw options.maxPredictionTokens. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[summarize called] --> B[tokenize prompt]
    B --> C{tokens empty?}
    C -- yes --> ERR[throw generationFailed]
    C -- no --> D["summarizeContextWindow = min(contextWindowTokens, cap=2048)"]
    D --> E["maxPredictionTokens = min(options.maxPredictionTokens, summarizeContextWindow - 1)"]
    E --> F["maxPromptTokens = max(1, summarizeContextWindow - maxPredictionTokens)"]
    F --> G{prompt count exceeds maxPromptTokens?}
    G -- yes --> H[truncate to suffix]
    G -- no --> I[use all tokens]
    H --> J[decodePrompt]
    I --> J
    J --> K[generation loop bounded by maxPredictionTokens]
    K --> L[return generatedText]
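
The "truncate to suffix" and "generation loop bounded by maxPredictionTokens" steps in the flowchart could look roughly like the sketch below. It is a C++ illustration under assumptions, not the PR's Swift code: keep_suffix, generate_bounded, and the sampler callback are hypothetical names, and tokens are modeled as plain ints.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Keep only the last max_prompt_tokens tokens, so the text nearest the
// end of the prompt survives truncation.
inline std::vector<int> keep_suffix(const std::vector<int>& tokens,
                                    std::size_t max_prompt_tokens) {
    if (tokens.size() <= max_prompt_tokens) {
        return tokens;  // fits already: use all tokens
    }
    const auto start = tokens.end() - static_cast<std::ptrdiff_t>(max_prompt_tokens);
    return std::vector<int>(start, tokens.end());
}

// Run at most max_prediction_tokens sampling steps. The sampler returns
// std::nullopt to signal end of generation (hypothetical convention).
template <typename Sampler>
std::vector<int> generate_bounded(int max_prediction_tokens, Sampler sample_next) {
    std::vector<int> generated;
    for (int i = 0; i < max_prediction_tokens; ++i) {
        std::optional<int> token = sample_next();
        if (!token) {
            break;
        }
        generated.push_back(*token);
    }
    return generated;
}
```

The point of the bound is that the loop count comes from the clamped budget, not from whatever the caller requested.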

Reviews (6): Last reviewed commit: "Clamp summarize prediction tokens to KV ..."

Jam-Cai force-pushed the codex/summarize-context-window branch from b51e588 to 637d98b on May 25, 2026 03:33

FuJacob (Owner) commented May 25, 2026

@greptileai

Jam-Cai force-pushed the codex/summarize-context-window branch from 637d98b to 35a618c on May 25, 2026 03:39
Comment thread Cotabby/Models/LlamaRuntimeModels.swift
FuJacob force-pushed the codex/summarize-context-window branch from 35a618c to 993f245 on May 25, 2026 04:24
FuJacob added 2 commits May 25, 2026 04:02
Bound the summarize generation loop by the same cap as the prompt budget so
prompt + generated tokens stay within the capped window even if a future
caller passes a larger maxPredictionTokens. Reword the rationale to describe
the shared KV slot budget rather than peak memory, since the KV cache is a
single pool allocated once at model load.

FuJacob (Owner) commented May 25, 2026

Analyzed this against the actual CotabbyInference engine before merging — TL;DR it's safe, but the rationale doesn't match how the engine allocates KV, and the real-world benefit is narrower than the description implies.

Safety: ✅ no overflow/crash risk

The comment + summary describe the KV cache as "a single pool allocated once at model load." That's not what happens. In CotabbyInferenceEngine.cpp, createSequence() builds a fresh llama_context per sequence, each with its own n_ctx and n_seq_max = 1:

ctx_params.n_ctx     = context_window_tokens;
ctx_params.n_seq_max = 1;
llama_context* ctx = llama_init_from_model(model, ctx_params);

So autocomplete and summarize run in isolated contexts with separate KV caches — there's no shared pool, no contention, no eviction. The worst case isn't "4096 + 2048 > pool"; it's two independent contexts, each correctly sized. That's precisely why it can't overflow.

Caveat: the summarize cap does not save memory

Because createSequence() always allocates n_ctx = context_window_tokens (4096), the summarize context still allocates a 4096-cell KV cache regardless of the Swift-side min(ctx, 2048). That cap only bounds how many tokens summarize feeds and generates; it does not reduce the allocation. So "avoids doubling peak memory when summaries run alongside autocomplete" isn't accurate: while a summarize context is alive, peak KV is still roughly double the autocomplete-only footprint (two 4096-cell caches). The cap's only real value is the defensive loop bound (a legitimate, small correctness tidy-up); the memory-guard framing should be dropped or reworked.

Does the 4096 bump actually help?

Only situationally. The autocomplete prompt is the focused field's preceding text, so the larger window only changes output when that exceeds ~2048 tokens (long docs/emails). For short fields (Slack, iMessage, etc.) the prompt was already under 2048, so output is identical — but llama.cpp reserves the full n_ctx KV upfront, so every user pays ~2× autocomplete KV memory whether they benefit or not. Fine for the bundled 0.6–1B Q4 models; more noticeable on a large user-supplied GGUF / low-RAM Mac.
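
To put a rough number on the "~2×" claim, a back-of-the-envelope estimate of KV size is sketched below. It uses the common approximation of one K and one V tensor per layer, each n_ctx × (KV heads × head dimension) elements. The KvShape struct and the example dimensions are illustrative assumptions, not the bundled models' actual shapes, and llama.cpp's real allocation can differ with cache type and padding.

```cpp
#include <cstdint>

// Hypothetical description of the model dimensions that drive KV size.
struct KvShape {
    std::uint64_t n_layer;            // transformer layers
    std::uint64_t n_kv_heads;         // key/value heads per layer
    std::uint64_t head_dim;           // dimension per head
    std::uint64_t bytes_per_element;  // e.g. 2 for an f16 cache
};

// Approximate KV bytes: 2 tensors (K and V) per layer, n_ctx rows each.
constexpr std::uint64_t kv_cache_bytes(const KvShape& shape, std::uint64_t n_ctx) {
    return 2 * shape.n_layer * n_ctx * shape.n_kv_heads * shape.head_dim *
           shape.bytes_per_element;
}
```

For an illustrative shape of 28 layers, 8 KV heads, head dimension 128, and 2-byte elements, this gives 224 MiB at n_ctx = 2048 and 448 MiB at n_ctx = 4096. The estimate is linear in n_ctx, which is why doubling the window doubles the reservation whether or not the prompt ever fills it.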

Suggestion

If the intent is genuinely "bigger window without the memory hit," the fix is to let createSequence() take a per-sequence context size so summary contexts allocate 2048 (not 4096) — the Swift cap is currently powerless to do that. Otherwise this is effectively just "give autocomplete a 4096 window," which is a real but narrow win. Safe to merge either way; just worth correcting the description so the tradeoff is clear.
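
One possible shape for that suggestion is sketched below. It is a self-contained stand-in, not the engine's code: ContextParams only mimics the two llama_context_params fields quoted earlier in this comment, make_sequence_params is a hypothetical helper, and clamping the override to the model-wide default is an assumption about the desired policy.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>

// Stand-in for the two fields quoted from CotabbyInferenceEngine.cpp.
struct ContextParams {
    std::uint32_t n_ctx = 0;
    std::uint32_t n_seq_max = 1;
};

// Hypothetical: choose n_ctx per sequence instead of always using the
// model-wide context_window_tokens. An override larger than the default
// is clamped down (assumed policy); the result is never below 1.
inline ContextParams make_sequence_params(
    std::uint32_t context_window_tokens,
    std::optional<std::uint32_t> per_sequence_tokens = std::nullopt) {
    const std::uint32_t requested = per_sequence_tokens.value_or(context_window_tokens);
    ContextParams params;
    params.n_ctx = std::max<std::uint32_t>(1, std::min(requested, context_window_tokens));
    params.n_seq_max = 1;  // one sequence per context, as in the quoted snippet
    return params;
}
```

Under this sketch, autocomplete would call make_sequence_params(4096) and summarize would call make_sequence_params(4096, 2048), so the summarize context would reserve a 2048-cell cache instead of 4096.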
