Phase 0 spike: batched-decode vs separate-context throughput benchmark #3
Merged
CotabbyInferenceBench is a standalone executable target that compares
two architectures on the same model and workload:
a_two_contexts: N separate llama_context instances, one thread each
(what CotabbyInferenceEngine does today).
b_shared_context: One shared llama_context with n_seq_max=N, a single
batched llama_decode per step.
Both scenarios exclude prompt decode and seed sample from the timed
section so the numerator is comparable across runs. Output is one JSON
line per invocation; callers script multiple runs and compare.
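The throughput numerator can be sketched as below. This is a minimal illustration, not code from CotabbyInferenceBench: the helper names are hypothetical, and the formula (sample_tokens - 1) * num_sequences is the one given in the PR summary for the timed section.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical helpers, not taken from the benchmark source.
// The seed sample is excluded from the timed section, so each sequence
// contributes (sample_tokens - 1) timed decode steps.
inline std::int64_t timed_token_count(std::int64_t sample_tokens,
                                      std::int64_t num_sequences) {
    return (sample_tokens - 1) * num_sequences;
}

// Aggregate throughput in tokens per second across all sequences.
inline double aggregate_tok_per_sec(std::int64_t sample_tokens,
                                    std::int64_t num_sequences,
                                    std::chrono::duration<double> elapsed) {
    return static_cast<double>(timed_token_count(sample_tokens, num_sequences)) /
           elapsed.count();
}
```

With sample 200 and two sequences, the timed section covers 398 tokens, i.e. 199 iterations per sequence.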
Initial release-build numbers on M-series + Metal, Gemma 3 1B Q4_K_M,
prompt 256 / sample 200:
n=2: A 130-141 tok/s vs B 190-201 tok/s -> 1.43x (median)
n=3: A 132 tok/s vs B 254 tok/s -> 1.92x
n=4: A 113 tok/s* vs B 266 tok/s -> 2.35x
* At n=4, baseline A's threads bail after ~126 of 199 sampling
iterations each (llama_decode returning non-zero, likely resource
pressure from 4 concurrent Metal contexts). Baseline B's single
shared context completes all 4 sequences cleanly.
The 1.4x gate set in the implementation plan is met at every n we care
about. Phase 1 batched-decode refactor is GO.
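The go/no-go arithmetic is simple enough to check mechanically. The sketch below uses hypothetical helper names and takes the 1.4x gate and the tok/s figures from the text above; it is an illustration, not part of the benchmark.

```cpp
#include <cassert>

// Hypothetical helpers for the decision-gate arithmetic.
constexpr double kDecisionGate = 1.4;  // gate quoted from the implementation plan

// Ratio of candidate (B) to baseline (A) aggregate throughput.
constexpr double speedup(double baseline_tok_s, double candidate_tok_s) {
    return candidate_tok_s / baseline_tok_s;
}

// True when the candidate clears the gate.
constexpr bool meets_gate(double baseline_tok_s, double candidate_tok_s,
                          double gate = kDecisionGate) {
    return speedup(baseline_tok_s, candidate_tok_s) >= gate;
}

// The n=3 and n=4 rows reported above both clear the gate.
static_assert(meets_gate(132.0, 254.0), "n=3 should clear the 1.4x gate");
static_assert(meets_gate(113.0, 266.0), "n=4 should clear the 1.4x gate");
```

The n=2 row is reported as a median over ranges, so it cannot be recomputed from the figures shown here.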
Not linked into the CotabbyInference library. Run with --help for the
CLI.
FuJacob added a commit that referenced this pull request on May 28, 2026
Replaces the per-sequence llama_context architecture with a single
shared context (n_seq_max = MAX_SEQUENCES) and a dedicated decoder
thread that coalesces sample-step requests from multiple sequences into
one llama_decode call. Public C++ API (CotabbyInferenceEngine.h) is
unchanged; Cotabby's Swift code does not need to be modified.

Why
---
Phase 0 spike (see PR #3) showed that on M-series Metal, batched decode
delivers 1.43x aggregate throughput at N=2 and up to 2.35x at N=4 vs the
current "separate llama_context per sequence" design. The win comes from
fusing matmul weight reads across sequences in a single llama_decode
call: per-token decode is memory-bound on Apple Silicon, so a single
decode that serves two sequences reuses the same weight read. The "Metal
command queue serializes everything" pessimism does not survive
empirically.

Design
------
- Impl owns one llama_context with n_ctx = configured_ctx *
  MAX_SEQUENCES and n_seq_max = MAX_SEQUENCES. Each SequenceState
  carries a llama_seq_id slot (0..MAX_SEQUENCES-1) used to tag tokens
  in the shared KV cache.
- Decoder thread loop: wait for at least one pending request, wait an
  additional BATCH_WINDOW_MICROS (200 µs by default) for siblings to
  pile in, then build one llama_batch carrying all pending tokens with
  their respective seq_ids, llama_decode once, sample each sequence's
  next token using its own sampler chain at its assigned batch index,
  and resolve every request's promise.
- sampleNext fast path: deliver the seed token sampled at decodePrompt
  time. This avoids the decoder round-trip for the very first sample
  after a prompt, where there is no input token to feedback-decode.
- sampleNext steady-state path: queue a PendingRequest (input token =
  previously-sampled token, position = current KV count, sampler = this
  sequence's chain) and wait on a std::promise resolved by the decoder
  thread.
- decodePrompt holds decode_mutex for the prompt's chunk decode and
  takes the seed sample inline while the prompt's logits are still
  resident in the shared context.
- trimKV holds decode_mutex, calls llama_memory_seq_rm for this
  sequence's seq_id, and invalidates any pending seed/input so the
  caller has to re-prime via decodePrompt before the next sampleNext.

The 200 µs window is the throughput knob. Multi-sequence workloads
naturally fall into lockstep because each sequence resubmits as soon as
its sample returns, so successive requests usually arrive within the
window without any caller-side coordination. Single-sequence callers pay
one window's worth of latency per token (~2% of a ~10 ms decode);
acceptable. Tunable later via a setter if needed.

Cancellation
------------
- Existing one-way atomic flag preserved.
- Checked at sampleNext entry (returns immediately) and again in
  processBatch after llama_decode but before sampling (skips wasted
  sample work, returns was_cancelled=true). The decode slot for a
  cancelled token is still consumed, which is fine: the slot is cheap;
  the win is not running the sampler.

Tests
-----
Added two integration tests gated on COTABBY_TEST_MODEL_PATH:
- testInterleavedMultiSequenceSampling: alternates sampleNext between
  two sequences with greedy sampling and identical prompts, asserts
  identical output (validates the seed-token / feedback-decode handoff
  and per-sequence sampler isolation in the shared context).
- testCancellationStopsSamplingPromptly: verifies sampleNext after
  cancelSequence returns was_cancelled=true without model work.
Existing testEndToEndWithModel passes unchanged.

Follow-ups
----------
- Bench scenario c_engine_threaded that exercises the full engine via
  its public API from two threads, for end-to-end throughput validation
  (Phase 0 numbers above are at the raw llama.cpp level).
- README update: the "no shared decode mutex, no contention" claim is
  no longer accurate. The new design has a single decode_mutex
  serializing access to one llama_context. The contention is
  productive, since it enables batching, but the README should reflect
  the new model.
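The decoder thread loop described under Design can be sketched without llama.cpp. Every name below (PendingRequest's fields, DecodeFn, BatchingDecoder) is a hypothetical stand-in, not the engine's real code. DecodeFn in particular replaces the "build one llama_batch, llama_decode once, sample per sequence" step with a caller-supplied function, assumed to return one token per input. The sketch only illustrates the wait, window, drain and resolve pattern.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical request record: one sample step for one sequence.
struct PendingRequest {
    int seq_id;                // slot in the shared context
    int input_token;           // previously sampled token to feed back
    std::promise<int> result;  // resolved by the decoder thread
};

// Stand-in for one batched decode plus per-sequence sampling:
// takes (seq_id, token) pairs, returns one next token per entry.
using DecodeFn =
    std::function<std::vector<int>(const std::vector<std::pair<int, int>>&)>;

class BatchingDecoder {
public:
    BatchingDecoder(DecodeFn decode, std::chrono::microseconds window)
        : decode_(std::move(decode)), window_(window), worker_([this] { run(); }) {}

    ~BatchingDecoder() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            stop_ = true;
        }
        cv_.notify_all();
        worker_.join();
    }

    // Queue one sample step; the future resolves after the batched decode.
    std::future<int> submit(int seq_id, int input_token) {
        PendingRequest req;
        req.seq_id = seq_id;
        req.input_token = input_token;
        std::future<int> fut = req.result.get_future();
        {
            std::lock_guard<std::mutex> lock(mu_);
            pending_.push_back(std::move(req));
        }
        cv_.notify_one();
        return fut;
    }

private:
    void run() {
        for (;;) {
            std::vector<PendingRequest> batch;
            {
                std::unique_lock<std::mutex> lock(mu_);
                // Wait for at least one pending request (or shutdown).
                cv_.wait(lock, [this] { return stop_ || !pending_.empty(); });
                if (pending_.empty()) return;  // shutdown with nothing queued
                // Hold the batch window open so sibling requests can pile in.
                cv_.wait_for(lock, window_, [this] { return stop_; });
                batch.swap(pending_);
            }
            std::vector<std::pair<int, int>> inputs;
            inputs.reserve(batch.size());
            for (const PendingRequest& r : batch)
                inputs.emplace_back(r.seq_id, r.input_token);
            // One decode call serves every sequence in the batch.
            const std::vector<int> next = decode_(inputs);
            for (std::size_t i = 0; i < batch.size(); ++i)
                batch[i].result.set_value(next[i]);
        }
    }

    DecodeFn decode_;
    std::chrono::microseconds window_;
    std::mutex mu_;
    std::condition_variable cv_;
    std::vector<PendingRequest> pending_;
    bool stop_ = false;
    std::thread worker_;  // declared last so the other members exist before it starts
};
```

Two callers that submit within the window share a single decode_ call. A lone caller still waits out the full window, which is the per-token latency cost the commit message accepts for single-sequence workloads.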
Summary
Adds CotabbyInferenceBench, a standalone executable target that measures aggregate decode tok/s for two architectures on the same model and workload. It exists to answer one question before the Phase 1 refactor: does llama.cpp's batched-decode API actually beat our current "one llama_context per sequence" design on M-series + Metal, or does the Metal command queue serialize everything regardless?
What it does
Two scenarios, same warmup, same total token count:
- a_two_contexts: N separate llama_context instances, one thread each. This is what CotabbyInferenceEngine does today.
- b_shared_context: one shared llama_context with n_seq_max = N, a single batched llama_decode per step. Candidate Phase 1 architecture.
Both exclude prompt decode and the seed sample from the timed section so the numerator is (sample_tokens - 1) * num_sequences and is directly comparable. Output is one line of JSON per invocation; the caller scripts multiple runs.
Findings on Gemma 3 1B Q4_K_M (release build, M-series)
Prompt 256 tokens, sample 200 tokens:
N=2: A 130-141 tok/s vs B 190-201 tok/s -> 1.43x (median)
N=3: A 132 tok/s vs B 254 tok/s -> 1.92x
N=4: A 113 tok/s* vs B 266 tok/s -> 2.35x
* At N=4, baseline A's worker threads bail at ~126 of 199 iterations each (llama_decode returning non-zero, most likely Metal resource pressure from 4 concurrent contexts). Baseline B's single shared context completes all four sequences cleanly. So the N=4 number understates the gap; on equal completed-token-count terms baseline A simply does not finish the workload.
The implementation plan's 1.4x decision gate is met at every realistic N. Phase 1 batched-decode refactor is GO.
Build & run
--help lists all options.
Not in scope
CotabbyInferenceEngine.cpp is untouched; the new target is a separate binary and not linked into the CotabbyInference library.
Next steps if this lands
CotabbyInferenceEngine to a single shared context with leader-follower batched decode, per the implementation plan.