Add per-sequence thread budget to SamplingConfig by FuJacob · Pull Request #1 · FuJacob/cotabbyinference

FuJacob · 2026-05-27T08:07:10Z

Summary

Adds a thread_count field to SamplingConfig so that each sequence's llama_context can be created with its own n_threads / n_threads_batch budget. createSequence uses thread_count when it is positive and falls back to the all-cores default when it is <= 0.
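The fallback rule described above can be sketched as follows. This is a minimal illustration, not the code from the PR: SamplingConfig and ContextThreads here are hypothetical stand-ins that model only the fields named in the summary, and resolveThreads is an invented helper name. Clamping a hardware_concurrency() result of 0 up to 1 is an assumption of this sketch.

```cpp
#include <cstdint>
#include <thread>

// Hypothetical stand-in for the part of SamplingConfig that matters here.
struct SamplingConfig {
    int32_t thread_count = 0;  // <= 0 means "use the all-cores default"
};

// Hypothetical stand-in for the two thread fields set on a context.
struct ContextThreads {
    int32_t n_threads;
    int32_t n_threads_batch;
};

// Positive budgets are used as-is; anything else falls back to all cores.
inline ContextThreads resolveThreads(const SamplingConfig& config) {
    // hardware_concurrency() may return 0 when the value is unknown;
    // this sketch clamps that case to 1 (an assumption, not from the PR).
    unsigned hw = std::thread::hardware_concurrency();
    int32_t all_cores = hw > 0 ? static_cast<int32_t>(hw) : 1;
    int32_t n = config.thread_count > 0 ? config.thread_count : all_cores;
    return ContextThreads{n, n};
}
```

Under these assumptions, a config with thread_count set to 2 resolves to two threads for both fields, while 0 or a negative value resolves to the all-cores default.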

Why: previously every context was created with n_threads = hardware_concurrency() (all cores). Running two sequences at once (e.g. autocomplete plus the visual-context summarizer) spawned two full sets of compute threads contending for the same physical cores. On CPU-only machines the two streams oversubscribed the cores and showed no real concurrency, and a background summary degraded autocomplete latency. With this change, a caller can give the background sequence a smaller budget so the two sequences decode concurrently on disjoint cores.
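To make the oversubscription argument concrete, here is a small thread-counting sketch. The helper names and the 8-core machine in the note below are illustrative assumptions, not measurements from this PR, and the sketch only counts requested threads; actual core placement is up to the OS scheduler.

```cpp
#include <numeric>
#include <vector>

// Total compute threads requested across all live sequences.
inline int totalThreads(const std::vector<int>& per_sequence) {
    return std::accumulate(per_sequence.begin(), per_sequence.end(), 0);
}

// Threads requested beyond the available cores (0 when everything fits).
inline int excessThreads(const std::vector<int>& per_sequence, int cores) {
    int total = totalThreads(per_sequence);
    return total > cores ? total - cores : 0;
}
```

On a hypothetical 8-core machine, two all-cores sequences request {8, 8} = 16 threads, 8 more than there are cores. Capping the background sequence at 2 while autocomplete keeps the default gives {8, 2} = 10 threads, 2 in excess. A caller that also capped autocomplete at 6 would request {6, 2} = 8 threads with none in excess.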

Validation

swift build --package-path .     # Build complete
swift test --package-path .      # Executed 10 tests, 1 skipped (E2E needs COTABBY_TEST_MODEL_PATH), 0 failures

The concurrent-sequences test now creates its background "summary" sequence with thread_count: 2 to exercise the reduced-budget path; the autocomplete sequence keeps the default (0).

Linked issues

Risk / rollout notes

Source-breaking struct change: SamplingConfig gains a trailing thread_count field, so every Swift call site of the bridged memberwise initializer must pass it. Consumers (the Cotabby app's LlamaRuntimeCore.samplingConfig(from:)) need a companion change + a dependency pin bump before this takes effect.
Behavior is opt-in: thread_count == 0 reproduces prior all-cores behavior exactly. No change until a caller sets a positive budget.
No ABI/runtime change to decode or sampling logic; only context creation thread counts.
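The opt-in default noted above can be illustrated with a hypothetical struct layout. Only thread_count comes from this PR; temperature, top_k, and both helper functions are placeholders invented for the sketch, since the real field list is not shown here.

```cpp
#include <cstdint>

// Hypothetical layout: only thread_count is taken from this PR;
// temperature and top_k are illustrative placeholders.
struct SamplingConfig {
    float   temperature;
    int32_t top_k;
    int32_t thread_count;  // new trailing field; 0 keeps the all-cores default
};

// Value-initialization zeroes every field, so thread_count defaults to 0.
inline SamplingConfig defaultConfig() {
    return SamplingConfig{};
}

// Opt a background sequence into a smaller budget, leaving other fields alone.
inline SamplingConfig withThreadBudget(SamplingConfig base, int32_t threads) {
    base.thread_count = threads;
    return base;
}
```

In C++, an aggregate initializer that omits the trailing field still compiles and leaves thread_count at 0. A Swift memberwise initializer takes every field, which is consistent with the source break described in the first note.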

createSequence now reads SamplingConfig.thread_count for a context's n_threads / n_threads_batch, falling back to the all-cores default when it is <= 0. This lets a caller cap a background sequence (e.g. the visual-context summarizer) to a smaller thread budget so it decodes concurrently with latency-critical autocomplete instead of oversubscribing every core and starving it — the dominant reason two CPU sequences showed no real concurrency before. Backward compatible: thread_count == 0 preserves prior behavior.

FuJacob mentioned this pull request May 27, 2026

Run visual-context summarizer on a reduced CPU thread budget FuJacob/cotabby#309

Closed

Revert "Add per-sequence thread budget to SamplingConfig"

97bcc2d

This reverts commit c58a938.

FuJacob closed this May 27, 2026

FuJacob wants to merge 2 commits into main from feat/per-sequence-thread-budget
