
Add per-sequence thread budget to SamplingConfig#1

Closed
FuJacob wants to merge 2 commits into main from feat/per-sequence-thread-budget


@FuJacob FuJacob commented May 27, 2026

Summary

Adds a thread_count field to SamplingConfig so that each sequence's llama_context can be created with its own n_threads / n_threads_batch budget. createSequence uses the value when it is positive and falls back to the all-cores default when it is <= 0.
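The fallback rule described above can be sketched as a small helper. This is an illustration only, not the code in this PR: resolveThreadCount is a hypothetical name, and the guard for hardware_concurrency() returning 0 is an assumption about how an unknown core count might be handled.

```cpp
#include <cstdint>
#include <thread>

// Hypothetical helper (not taken from the PR): turns a requested budget into
// the value that would be assigned to both n_threads and n_threads_batch.
inline int32_t resolveThreadCount(int32_t requested) {
    if (requested > 0) {
        return requested;  // caller supplied an explicit per-sequence budget
    }
    // requested <= 0: fall back to the all-cores default.
    // std::thread::hardware_concurrency() may return 0 when the core count
    // is unknown; clamping to 1 in that case is an assumption of this sketch.
    const unsigned hw = std::thread::hardware_concurrency();
    return hw > 0 ? static_cast<int32_t>(hw) : 1;
}
```

Under these assumptions, resolveThreadCount(2) yields 2, while resolveThreadCount(0) and any negative value yield the machine's reported core count.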

Why: previously every context was created with n_threads = hardware_concurrency(), i.e. all cores. Running two sequences at once (e.g. autocomplete plus the visual-context summarizer) spawned two full sets of compute threads contending for the same physical cores. On CPU-only machines the two streams oversubscribed the CPU and showed no real concurrency, and a background summary degraded autocomplete latency. With this change, a caller can give the background sequence a smaller budget so that the two sequences decode concurrently on largely separate cores instead of competing for all of them.
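One way a caller might pick budgets that do not oversubscribe the machine is sketched below. The PR does not prescribe a policy, so everything here is hypothetical: the ThreadBudgets type, the splitBudget name, and the choice to shrink the foreground budget by whatever the background sequence receives. Thread counts alone do not pin threads to cores, so this only keeps the total at or below the core count.

```cpp
#include <algorithm>

// Hypothetical result type: one budget per sequence.
struct ThreadBudgets {
    int foreground;  // e.g. latency-critical autocomplete
    int background;  // e.g. the visual-context summarizer
};

// Hypothetical policy: give the background sequence at most what it asks for,
// always leave at least one core for the foreground, and keep the sum of the
// two budgets equal to totalCores whenever totalCores >= 2.
inline ThreadBudgets splitBudget(int totalCores, int backgroundWanted) {
    if (totalCores < 2) {
        return {1, 1};  // a single core cannot be split; the two must share
    }
    const int bg = std::clamp(backgroundWanted, 1, totalCores - 1);
    return {totalCores - bg, bg};
}
```

On an 8-core machine, splitBudget(8, 2) gives the foreground 6 threads and the background 2. The concurrent-sequences test in this PR takes a simpler route: it caps only the background sequence and leaves the autocomplete sequence at the default.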

Validation

swift build --package-path .     # Build complete
swift test --package-path .      # Executed 10 tests, 1 skipped (E2E needs COTABBY_TEST_MODEL_PATH), 0 failures

The concurrent-sequences test now creates its background "summary" sequence with thread_count: 2 to exercise the reduced-budget path; the autocomplete sequence keeps the default (0).

Linked issues

Risk / rollout notes

  • Source-breaking struct change: SamplingConfig gains a trailing thread_count field, so every Swift call site of the bridged memberwise initializer must pass it. Consumers (the Cotabby app's LlamaRuntimeCore.samplingConfig(from:)) need a companion change and a dependency pin bump before this takes effect.
  • Behavior is opt-in: thread_count == 0 reproduces prior all-cores behavior exactly. No change until a caller sets a positive budget.
  • No ABI/runtime change to decode or sampling logic; only context creation thread counts.
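The notes above can be illustrated from the C/C++ side, where a trailing field is less disruptive than it is for the Swift memberwise initializer. The struct below is a hypothetical, trimmed-down stand-in: only thread_count comes from this PR, while temperature and top_k are placeholder fields chosen for the sketch.

```cpp
#include <cstdint>

// Hypothetical stand-in for the bridged struct; not the real field list.
struct SamplingConfig {
    float   temperature;   // illustrative placeholder
    int32_t top_k;         // illustrative placeholder
    int32_t thread_count;  // new trailing field; <= 0 means all-cores default
};

// Aggregate initialization zero-fills omitted trailing members, so C/C++ code
// written before the field existed still compiles and gets thread_count == 0.
inline SamplingConfig legacyStyleConfig() {
    SamplingConfig cfg = {0.8f, 40};
    return cfg;
}

// A caller that opts in passes the budget explicitly.
inline SamplingConfig backgroundConfig() {
    SamplingConfig cfg = {0.8f, 40, 2};
    return cfg;
}
```

Swift call sites do not get this leniency: as the first note says, the bridged memberwise initializer takes every field, which is why consumers need the companion change.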

createSequence now reads SamplingConfig.thread_count for a context's
n_threads / n_threads_batch, falling back to the all-cores default when
it is <= 0. This lets a caller cap a background sequence (e.g. the
visual-context summarizer) to a smaller thread budget so it decodes
concurrently with latency-critical autocomplete instead of
oversubscribing every core and starving it — the dominant reason two
CPU sequences showed no real concurrency before.

Backward compatible: thread_count == 0 preserves prior behavior.
@FuJacob FuJacob closed this May 27, 2026