Skip to content

Run visual-context summarizer on a reduced CPU thread budget#309

Closed
FuJacob wants to merge 1 commit into
mainfrom
fix/concurrent-summary-thread-budget
Closed

Run visual-context summarizer on a reduced CPU thread budget#309
FuJacob wants to merge 1 commit into
mainfrom
fix/concurrent-summary-thread-budget

Conversation

@FuJacob
Copy link
Copy Markdown
Owner

@FuJacob FuJacob commented May 27, 2026

Summary

Makes the OCR/visual-context summarizer actually decode alongside autocomplete on CPU-only machines instead of stalling it. The two already run on independent llama sequences with no shared lock, but every llama_context was created with n_threads = all cores, so a background summary spun up a second full set of compute threads that oversubscribed the cores and starved latency-critical autocomplete. This caps the summarizer's sequence to a smaller thread budget so the two share cores instead of fighting for them.

Validation

xcodegen generate
xcodebuild -project Cotabby.xcodeproj -scheme Cotabby -destination 'platform=macOS' build   # BUILD SUCCEEDED
swiftlint lint --quiet                                                                       # exit 0, 0 violations
xcodebuild test … CODE_SIGNING_ALLOWED=NO                                                     # 354 tests, 0 failures, 1 skipped

Behavioral tuning (autocomplete tokens/sec solo vs while OCR runs) is not yet measured — shipping the default split and leaving the knob easy to tune.

Linked issues

Depends on FuJacob/cotabbyinference#1 (adds the per-sequence thread_count field this PR sets).

Risk / rollout notes

  • Dependency pinned to a branch: project.yml temporarily points CotabbyInference at feat/per-sequence-thread-budget (c58a938). Move it back to branch: main once cotabbyinference#1 merges, then regenerate. Do not merge this before that dependency lands on the middleware's main.
  • Thread split: autocomplete keeps 0 (all cores); the summarizer uses max(2, activeProcessorCount / 4) via LlamaGenerationOptions.backgroundThreadBudget. One-line tuning point if measurements say otherwise.
  • Expectation: this is about latency hiding, not a 2× speedup — memory bandwidth still caps total CPU tokens/sec. The win is the background summary no longer degrading autocomplete latency. Tune the split after measuring solo-vs-concurrent autocomplete throughput.
  • No change to sampling output or KV-cache logic; thread_count only sets a context's execution width, and 0 reproduces prior behavior.

Greptile Summary

This PR caps the OCR/visual-context summarizer to a reduced CPU thread budget (max(2, activeProcessorCount / 4)) so it decodes concurrently with autocomplete instead of oversubscribing every core. The change adds threadCount to LlamaGenerationOptions, wires it through to SamplingConfig, and introduces a backgroundThreadBudget static property on the model; CotabbyInference is temporarily pinned to a feature branch that supplies the new thread_count field.

  • LlamaGenerationOptions gains a threadCount: Int = 0 field; the summary factory sets it via backgroundThreadBudget, while autocomplete leaves it at 0 (runtime default = all cores).
  • LlamaRuntimeCore.samplingConfig passes thread_count: Int32(options.threadCount) to the downstream SamplingConfig; the summary path uses an ephemeral sequence so no KV-cache interactions are introduced.
  • Both project.yml and project.pbxproj are temporarily pinned to feat/per-sequence-thread-budget and must be restored to main after cotabbyinference#1 merges — this is the only hard prerequisite before landing.

Confidence Score: 4/5

Safe to merge only after the CotabbyInference upstream PR lands on main and both project.yml and project.pbxproj are updated back to branch: main; the code changes themselves are correct and well-scoped.

The thread-budget split logic is straightforward and the summary sequence already uses an ephemeral context, so there is no KV-cache interaction to worry about. The one real concern is the pinned feature-branch dependency: merging before the upstream lands would ship a ref that could disappear or change, breaking the build. Everything else — the backgroundThreadBudget computation, the Int32 conversion, the SamplingFingerprint intentionally omitting threadCount — is correct as written.

project.yml and Cotabby.xcodeproj/project.pbxproj both need their branch reference updated before this PR can be safely merged.

Important Files Changed

Filename Overview
Cotabby/Models/LlamaRuntimeModels.swift Adds threadCount to LlamaGenerationOptions with a default of 0 (all cores) and a backgroundThreadBudget static computed property that caps the summarizer to max(2, activeProcessorCount / 4) threads; logic is clean, though SamplingFingerprint silently omits threadCount without a comment.
Cotabby/Services/Runtime/LlamaRuntimeCore.swift Passes thread_count: Int32(options.threadCount) through to SamplingConfig; the summary path uses an ephemeral sequence so no KV-cache concerns arise; autocomplete continues to use 0 (runtime default).
project.yml Pins CotabbyInference to feat/per-sequence-thread-budget — this is a hard merge-order dependency; restoring branch: main after the upstream PR lands is a required step before this PR can be safely merged.
Cotabby.xcodeproj/project.pbxproj Updates the Xcode package reference to match the project.yml branch change; must be reverted to main alongside project.yml once the upstream dependency merges.

Sequence Diagram

sequenceDiagram
    participant AC as Autocomplete
    participant SC as Summarizer
    participant Core as LlamaRuntimeCore
    participant Inf as CotabbyInference

    AC->>Core: "generate(options: threadCount=0)"
    Core->>Inf: "createSequence(SamplingConfig(thread_count=0))"
    Note over Inf: Uses all CPU cores

    SC->>Core: "summarize(options: threadCount=backgroundThreadBudget)"
    Core->>Inf: "createSequence(SamplingConfig(thread_count=max(2, cores/4)))"
    Note over Inf: Uses reduced thread budget

    par Concurrent execution
        Core-->>AC: decode tokens (full cores)
        Core-->>SC: decode tokens (capped cores)
    end

    Inf-->>Core: destroySequence (summary ephemeral)
Loading

Comments Outside Diff (2)

  1. Cotabby/Models/LlamaRuntimeModels.swift, line 413-421 (link)

    P2 SamplingFingerprint silently drops threadCount from the KV-cache key

    SamplingFingerprint.init copies every field from LlamaGenerationOptions except threadCount. Since the fingerprint is the sole guard for KV-state reuse in obtainAutocompleteSequence, a future caller that passes different threadCount values across successive autocomplete requests would happily reuse cached KV state built under a different execution-width context. Today this is safe because autocomplete always uses the default 0, but the omission isn't documented — a comment noting that threadCount is intentionally excluded (it doesn't affect KV state or token probabilities) would prevent a future maintainer from accidentally treating a budget change as a safe no-op for the cache.

    Fix in Codex Fix in Claude Code

  2. Cotabby/Services/Runtime/LlamaRuntimeCore.swift, line 403-404 (link)

    P2 SamplingFingerprint omits threadCount because it intentionally only captures fields that affect KV state and token probabilities. A small note here makes the exclusion explicit and guards against a future maintainer accidentally adding it.

    Fix in Codex Fix in Claude Code

Fix All in Codex Fix All in Claude Code

Reviews (1): Last reviewed commit: "Run visual-context summarizer on a reduc..." | Re-trigger Greptile

Greptile also left 1 inline comment on this PR.

The summarizer and autocomplete already run on separate llama sequences
with no shared lock, but every context was created using all cores, so on
CPU-only machines a background summary oversubscribed the cores and
starved latency-critical autocomplete instead of decoding alongside it.

Plumb a per-sequence thread budget through LlamaGenerationOptions ->
SamplingConfig (new CotabbyInference field). Autocomplete stays at 0 (all
cores); the summary path uses max(2, cores/4) so it makes background
progress without fighting autocomplete for every core.

Depends on cotabbyinference#1 — the package is temporarily pinned to that
branch; move project.yml back to branch: main once it merges.
Comment thread project.yml
Comment on lines 15 to +17
url: https://github.com/FuJacob/cotabbyinference.git
branch: main
# Temporarily tracks the per-sequence thread-budget branch (cotabbyinference#1).
# Move back to `branch: main` once that PR merges.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Upstream branch dependency must land before merge

CotabbyInference is currently pinned to feat/per-sequence-thread-budget at commit c58a938. If cotabbyinference#1 is force-pushed, rebased, or deleted before this PR merges, CI will resolve a different (or missing) tree — and the thread_count field this PR depends on would disappear, breaking the build silently. The same pinned ref appears in project.pbxproj. Make sure to restore branch: main in both files after the upstream dependency merges, and treat those edits as a hard merge-order gate.

Fix in Codex Fix in Claude Code

@FuJacob FuJacob closed this May 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant