Run visual-context summarizer on a reduced CPU thread budget#309
Closed
FuJacob wants to merge 1 commit into
Closed
Run visual-context summarizer on a reduced CPU thread budget#309FuJacob wants to merge 1 commit into
FuJacob wants to merge 1 commit into
Conversation
The summarizer and autocomplete already run on separate llama sequences with no shared lock, but every context was created using all cores, so on CPU-only machines a background summary oversubscribed the cores and starved latency-critical autocomplete instead of decoding alongside it. Plumb a per-sequence thread budget through LlamaGenerationOptions -> SamplingConfig (new CotabbyInference field). Autocomplete stays at 0 (all cores); the summary path uses max(2, cores/4) so it makes background progress without fighting autocomplete for every core. Depends on cotabbyinference#1 — the package is temporarily pinned to that branch; move project.yml back to branch: main once it merges.
Comment on lines
15
to
+17
| url: https://github.com/FuJacob/cotabbyinference.git | ||
| branch: main | ||
| # Temporarily tracks the per-sequence thread-budget branch (cotabbyinference#1). | ||
| # Move back to `branch: main` once that PR merges. |
Contributor
There was a problem hiding this comment.
Upstream branch dependency must land before merge
CotabbyInference is currently pinned to feat/per-sequence-thread-budget at commit c58a938. If cotabbyinference#1 is force-pushed, rebased, or deleted before this PR merges, CI will resolve a different (or missing) tree — and the thread_count field this PR depends on would disappear, breaking the build silently. The same pinned ref appears in project.pbxproj. Make sure to restore branch: main in both files after the upstream dependency merges, and treat those edits as a hard merge-order gate.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Makes the OCR/visual-context summarizer actually decode alongside autocomplete on CPU-only machines instead of stalling it. The two already run on independent llama sequences with no shared lock, but every
llama_contextwas created withn_threads = all cores, so a background summary spun up a second full set of compute threads that oversubscribed the cores and starved latency-critical autocomplete. This caps the summarizer's sequence to a smaller thread budget so the two share cores instead of fighting for them.Validation
Behavioral tuning (autocomplete tokens/sec solo vs while OCR runs) is not yet measured — shipping the default split and leaving the knob easy to tune.
Linked issues
Depends on FuJacob/cotabbyinference#1 (adds the per-sequence
thread_countfield this PR sets).Risk / rollout notes
project.ymltemporarily pointsCotabbyInferenceatfeat/per-sequence-thread-budget(c58a938). Move it back tobranch: mainonce cotabbyinference#1 merges, then regenerate. Do not merge this before that dependency lands on the middleware's main.0(all cores); the summarizer usesmax(2, activeProcessorCount / 4)viaLlamaGenerationOptions.backgroundThreadBudget. One-line tuning point if measurements say otherwise.thread_countonly sets a context's execution width, and0reproduces prior behavior.Greptile Summary
This PR caps the OCR/visual-context summarizer to a reduced CPU thread budget (
max(2, activeProcessorCount / 4)) so it decodes concurrently with autocomplete instead of oversubscribing every core. The change addsthreadCounttoLlamaGenerationOptions, wires it through toSamplingConfig, and introduces abackgroundThreadBudgetstatic property on the model;CotabbyInferenceis temporarily pinned to a feature branch that supplies the newthread_countfield.LlamaGenerationOptionsgains athreadCount: Int = 0field; thesummaryfactory sets it viabackgroundThreadBudget, while autocomplete leaves it at0(runtime default = all cores).LlamaRuntimeCore.samplingConfigpassesthread_count: Int32(options.threadCount)to the downstreamSamplingConfig; the summary path uses an ephemeral sequence so no KV-cache interactions are introduced.project.ymlandproject.pbxprojare temporarily pinned tofeat/per-sequence-thread-budgetand must be restored tomainaftercotabbyinference#1merges — this is the only hard prerequisite before landing.Confidence Score: 4/5
Safe to merge only after the
CotabbyInferenceupstream PR lands onmainand bothproject.ymlandproject.pbxprojare updated back tobranch: main; the code changes themselves are correct and well-scoped.The thread-budget split logic is straightforward and the summary sequence already uses an ephemeral context, so there is no KV-cache interaction to worry about. The one real concern is the pinned feature-branch dependency: merging before the upstream lands would ship a ref that could disappear or change, breaking the build. Everything else — the
backgroundThreadBudgetcomputation, theInt32conversion, theSamplingFingerprintintentionally omittingthreadCount— is correct as written.project.ymlandCotabby.xcodeproj/project.pbxprojboth need their branch reference updated before this PR can be safely merged.Important Files Changed
threadCounttoLlamaGenerationOptionswith a default of 0 (all cores) and abackgroundThreadBudgetstatic computed property that caps the summarizer tomax(2, activeProcessorCount / 4)threads; logic is clean, thoughSamplingFingerprintsilently omitsthreadCountwithout a comment.thread_count: Int32(options.threadCount)through toSamplingConfig; the summary path uses an ephemeral sequence so no KV-cache concerns arise; autocomplete continues to use 0 (runtime default).CotabbyInferencetofeat/per-sequence-thread-budget— this is a hard merge-order dependency; restoringbranch: mainafter the upstream PR lands is a required step before this PR can be safely merged.project.ymlbranch change; must be reverted tomainalongsideproject.ymlonce the upstream dependency merges.Sequence Diagram
sequenceDiagram participant AC as Autocomplete participant SC as Summarizer participant Core as LlamaRuntimeCore participant Inf as CotabbyInference AC->>Core: "generate(options: threadCount=0)" Core->>Inf: "createSequence(SamplingConfig(thread_count=0))" Note over Inf: Uses all CPU cores SC->>Core: "summarize(options: threadCount=backgroundThreadBudget)" Core->>Inf: "createSequence(SamplingConfig(thread_count=max(2, cores/4)))" Note over Inf: Uses reduced thread budget par Concurrent execution Core-->>AC: decode tokens (full cores) Core-->>SC: decode tokens (capped cores) end Inf-->>Core: destroySequence (summary ephemeral)Comments Outside Diff (2)
Cotabby/Models/LlamaRuntimeModels.swift, line 413-421 (link)SamplingFingerprintsilently dropsthreadCountfrom the KV-cache keySamplingFingerprint.initcopies every field fromLlamaGenerationOptionsexceptthreadCount. Since the fingerprint is the sole guard for KV-state reuse inobtainAutocompleteSequence, a future caller that passes differentthreadCountvalues across successive autocomplete requests would happily reuse cached KV state built under a different execution-width context. Today this is safe because autocomplete always uses the default0, but the omission isn't documented — a comment noting thatthreadCountis intentionally excluded (it doesn't affect KV state or token probabilities) would prevent a future maintainer from accidentally treating a budget change as a safe no-op for the cache.Cotabby/Services/Runtime/LlamaRuntimeCore.swift, line 403-404 (link)SamplingFingerprintomitsthreadCountbecause it intentionally only captures fields that affect KV state and token probabilities. A small note here makes the exclusion explicit and guards against a future maintainer accidentally adding it.Reviews (1): Last reviewed commit: "Run visual-context summarizer on a reduc..." | Re-trigger Greptile