Run visual-context summarizer on a reduced CPU thread budget by FuJacob · Pull Request #309 · FuJacob/cotabby

FuJacob · 2026-05-27T08:14:40Z

Summary

Makes the OCR/visual-context summarizer actually decode alongside autocomplete on CPU-only machines instead of stalling it. The two already run on independent llama sequences with no shared lock, but every llama_context was created with n_threads = all cores, so a background summary spun up a second full set of compute threads that oversubscribed the cores and starved latency-critical autocomplete. This caps the summarizer's sequence to a smaller thread budget so the two share cores instead of fighting for them.

Validation

xcodegen generate
xcodebuild -project Cotabby.xcodeproj -scheme Cotabby -destination 'platform=macOS' build   # BUILD SUCCEEDED
swiftlint lint --quiet                                                                       # exit 0, 0 violations
xcodebuild test … CODE_SIGNING_ALLOWED=NO                                                     # 354 tests, 0 failures, 1 skipped

Behavioral tuning (autocomplete tokens/sec solo vs while OCR runs) is not yet measured — shipping the default split and leaving the knob easy to tune.

Linked issues

Depends on FuJacob/cotabbyinference#1 (adds the per-sequence thread_count field this PR sets).

Risk / rollout notes

Dependency pinned to a branch: project.yml temporarily points CotabbyInference at feat/per-sequence-thread-budget (c58a938). Move it back to branch: main once cotabbyinference#1 merges, then regenerate. Do not merge this before that dependency lands on the middleware's main.
Thread split: autocomplete keeps 0 (all cores); the summarizer uses max(2, activeProcessorCount / 4) via LlamaGenerationOptions.backgroundThreadBudget. One-line tuning point if measurements say otherwise.
Expectation: this is about latency hiding, not a 2× speedup — memory bandwidth still caps total CPU tokens/sec. The win is the background summary no longer degrading autocomplete latency. Tune the split after measuring solo-vs-concurrent autocomplete throughput.
No change to sampling output or KV-cache logic; thread_count only sets a context's execution width, and 0 reproduces prior behavior.

Greptile Summary

This PR caps the OCR/visual-context summarizer to a reduced CPU thread budget (max(2, activeProcessorCount / 4)) so it decodes concurrently with autocomplete instead of oversubscribing every core. The change adds threadCount to LlamaGenerationOptions, wires it through to SamplingConfig, and introduces a backgroundThreadBudget static property on the model; CotabbyInference is temporarily pinned to a feature branch that supplies the new thread_count field.

LlamaGenerationOptions gains a threadCount: Int = 0 field; the summary factory sets it via backgroundThreadBudget, while autocomplete leaves it at 0 (runtime default = all cores).
LlamaRuntimeCore.samplingConfig passes thread_count: Int32(options.threadCount) to the downstream SamplingConfig; the summary path uses an ephemeral sequence so no KV-cache interactions are introduced.
Both project.yml and project.pbxproj are temporarily pinned to feat/per-sequence-thread-budget and must be restored to main after cotabbyinference#1 merges — this is the only hard prerequisite before landing.

Confidence Score: 4/5

Safe to merge only after the CotabbyInference upstream PR lands on main and both project.yml and project.pbxproj are updated back to branch: main; the code changes themselves are correct and well-scoped.

The thread-budget split logic is straightforward and the summary sequence already uses an ephemeral context, so there is no KV-cache interaction to worry about. The one real concern is the pinned feature-branch dependency: merging before the upstream lands would ship a ref that could disappear or change, breaking the build. Everything else — the backgroundThreadBudget computation, the Int32 conversion, the SamplingFingerprint intentionally omitting threadCount — is correct as written.

project.yml and Cotabby.xcodeproj/project.pbxproj both need their branch reference updated before this PR can be safely merged.

Important Files Changed

Filename	Overview
Cotabby/Models/LlamaRuntimeModels.swift	Adds `threadCount` to `LlamaGenerationOptions` with a default of 0 (all cores) and a `backgroundThreadBudget` static computed property that caps the summarizer to `max(2, activeProcessorCount / 4)` threads; logic is clean, though `SamplingFingerprint` silently omits `threadCount` without a comment.
Cotabby/Services/Runtime/LlamaRuntimeCore.swift	Passes `thread_count: Int32(options.threadCount)` through to `SamplingConfig`; the summary path uses an ephemeral sequence so no KV-cache concerns arise; autocomplete continues to use 0 (runtime default).
project.yml	Pins `CotabbyInference` to `feat/per-sequence-thread-budget` — this is a hard merge-order dependency; restoring `branch: main` after the upstream PR lands is a required step before this PR can be safely merged.
Cotabby.xcodeproj/project.pbxproj	Updates the Xcode package reference to match the `project.yml` branch change; must be reverted to `main` alongside `project.yml` once the upstream dependency merges.

Sequence Diagram

sequenceDiagram
    participant AC as Autocomplete
    participant SC as Summarizer
    participant Core as LlamaRuntimeCore
    participant Inf as CotabbyInference

    AC->>Core: "generate(options: threadCount=0)"
    Core->>Inf: "createSequence(SamplingConfig(thread_count=0))"
    Note over Inf: Uses all CPU cores

    SC->>Core: "summarize(options: threadCount=backgroundThreadBudget)"
    Core->>Inf: "createSequence(SamplingConfig(thread_count=max(2, cores/4)))"
    Note over Inf: Uses reduced thread budget

    par Concurrent execution
        Core-->>AC: decode tokens (full cores)
        Core-->>SC: decode tokens (capped cores)
    end

    Inf-->>Core: destroySequence (summary ephemeral)

Comments Outside Diff (2)

Cotabby/Models/LlamaRuntimeModels.swift, line 413-421 (link)

SamplingFingerprint silently drops threadCount from the KV-cache key

SamplingFingerprint.init copies every field from LlamaGenerationOptions except threadCount. Since the fingerprint is the sole guard for KV-state reuse in obtainAutocompleteSequence, a future caller that passes different threadCount values across successive autocomplete requests would happily reuse cached KV state built under a different execution-width context. Today this is safe because autocomplete always uses the default 0, but the omission isn't documented — a comment noting that threadCount is intentionally excluded (it doesn't affect KV state or token probabilities) would prevent a future maintainer from accidentally treating a budget change as a safe no-op for the cache.
Cotabby/Services/Runtime/LlamaRuntimeCore.swift, line 403-404 (link)

SamplingFingerprint omits threadCount because it intentionally only captures fields that affect KV state and token probabilities. A small note here makes the exclusion explicit and guards against a future maintainer accidentally adding it.

_{Reviews (1): Last reviewed commit: "Run visual-context summarizer on a reduc..." | Re-trigger Greptile}

Greptile also left 1 inline comment on this PR.

The summarizer and autocomplete already run on separate llama sequences with no shared lock, but every context was created using all cores, so on CPU-only machines a background summary oversubscribed the cores and starved latency-critical autocomplete instead of decoding alongside it. Plumb a per-sequence thread budget through LlamaGenerationOptions -> SamplingConfig (new CotabbyInference field). Autocomplete stays at 0 (all cores); the summary path uses max(2, cores/4) so it makes background progress without fighting autocomplete for every core. Depends on cotabbyinference#1 — the package is temporarily pinned to that branch; move project.yml back to branch: main once it merges.

greptile-apps · 2026-05-27T08:23:20Z

    url: https://github.com/FuJacob/cotabbyinference.git
-    branch: main
+    # Temporarily tracks the per-sequence thread-budget branch (cotabbyinference#1).
+    # Move back to `branch: main` once that PR merges.


Upstream branch dependency must land before merge

CotabbyInference is currently pinned to feat/per-sequence-thread-budget at commit c58a938. If cotabbyinference#1 is force-pushed, rebased, or deleted before this PR merges, CI will resolve a different (or missing) tree — and the thread_count field this PR depends on would disappear, breaking the build silently. The same pinned ref appears in project.pbxproj. Make sure to restore branch: main in both files after the upstream dependency merges, and treat those edits as a hard merge-order gate.

greptile-apps Bot reviewed May 27, 2026

View reviewed changes

FuJacob closed this May 27, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Run visual-context summarizer on a reduced CPU thread budget#309

Run visual-context summarizer on a reduced CPU thread budget#309
FuJacob wants to merge 1 commit into
mainfrom
fix/concurrent-summary-thread-budget

FuJacob commented May 27, 2026 •

edited by greptile-apps Bot

Loading

Uh oh!

greptile-apps Bot May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

FuJacob commented May 27, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Validation

Linked issues

Risk / rollout notes

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Comments Outside Diff (2)

Uh oh!

greptile-apps Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

FuJacob commented May 27, 2026 •

edited by greptile-apps Bot

Loading