Phase 0 spike: batched-decode vs separate-context throughput benchmark by FuJacob · Pull Request #3 · FuJacob/cotabbyinference

FuJacob · 2026-05-28T09:45:08Z

Summary

Adds CotabbyInferenceBench, a standalone executable target that measures aggregate decode tok/s for two architectures on the same model and workload. It exists to answer one question before the Phase 1 refactor: does llama.cpp's batched-decode API actually beat our current "one llama_context per sequence" design on M-series + Metal, or does the Metal command queue serialize everything regardless?

What it does

Two scenarios, same warmup, same total token count:

a_two_contexts — N separate llama_context instances, one thread each. This is what CotabbyInferenceEngine does today.
b_shared_context — one shared llama_context with n_seq_max = N, a single batched llama_decode per step. Candidate Phase 1 architecture.

Both exclude prompt decode and the seed sample from the timed section so the numerator is (sample_tokens - 1) * num_sequences and is directly comparable. Output is one line of JSON per invocation; the caller scripts multiple runs.

Findings on Gemma 3 1B Q4_K_M (release build, M-series)

Prompt 256 tokens, sample 200 tokens:

N	A aggregate tok/s	B aggregate tok/s	Speedup
2	130-141 (median 134)	190-201 (median 194)	1.43x
3	132	254	1.92x
4	113*	266	2.35x

* At N=4, scenario A's worker threads bail at ~126 of 199 iterations each (llama_decode returns non-zero, most likely Metal resource pressure from 4 concurrent contexts). Scenario B's single shared context completes all four sequences cleanly. So the N=4 number understates the gap: on equal completed-token-count terms, scenario A does not finish the workload.

The implementation plan's 1.4x decision gate is met at every realistic N. Phase 1 batched-decode refactor is GO.

Build & run

swift build -c release --product CotabbyInferenceBench

./.build/release/CotabbyInferenceBench \
  --model /path/to/model.gguf \
  --scenario b_shared_context \
  --num-sequences 2 \
  --prompt-tokens 256 \
  --sample-tokens 200

--help lists all options.

Not in scope

This PR is the spike, not the refactor. CotabbyInferenceEngine.cpp is untouched; the new target is a separate binary and not linked into the CotabbyInference library.
No CI integration. The benchmark needs a model file on disk, which CI does not have. The target builds in CI; running is on humans.
Numbers vary between runs by ~5%. Statistical rigor (median over repeated runs, variance) is left to whoever runs the spike; the binary's per-run output is the raw material.

Next steps if this lands

Merge.
Run on additional hardware variants (M1/M2/M3/M4 Pro/Max if available) to confirm the gate holds across the fleet.
Kick off Phase 1: refactor CotabbyInferenceEngine to a single shared context with leader-follower batched decode, per the implementation plan.

CotabbyInferenceBench is a standalone executable target that compares two architectures on the same model and workload:

a_two_contexts: N separate llama_context instances, one thread each (what CotabbyInferenceEngine does today).
b_shared_context: One shared llama_context with n_seq_max=N, a single batched llama_decode per step.

Both scenarios exclude prompt decode and seed sample from the timed section so the numerator is comparable across runs. Output is one JSON line per invocation; callers script multiple runs and compare.

Initial release-build numbers on M-series + Metal, Gemma 3 1B Q4_K_M, prompt 256 / sample 200:

n=2: A 130-141 tok/s vs B 190-201 tok/s -> 1.43x (median)
n=3: A 132 tok/s vs B 254 tok/s -> 1.92x
n=4: A 113 tok/s* vs B 266 tok/s -> 2.35x

* At n=4, baseline A's threads bail after ~126 of 199 sampling iterations each (llama_decode returning non-zero, likely resource pressure from 4 concurrent Metal contexts). Baseline B's single shared context completes all 4 sequences cleanly.

The 1.4x gate set in the implementation plan is met at every n we care about. Phase 1 batched-decode refactor is GO.

Not linked into the CotabbyInference library. Run with --help for the CLI.

Replaces the per-sequence llama_context architecture with a single shared context (n_seq_max = MAX_SEQUENCES) and a dedicated decoder thread that coalesces sample-step requests from multiple sequences into one llama_decode call. Public C++ API (CotabbyInferenceEngine.h) is unchanged; Cotabby's Swift code does not need to be modified.

Why
---
Phase 0 spike (see PR #3) showed that on M-series Metal, batched decode delivers 1.43x aggregate throughput at N=2 and up to 2.35x at N=4 vs the current "separate llama_context per sequence" design. The win comes from fusing matmul weight reads across sequences in a single llama_decode call: per-token decode is memory-bound on Apple Silicon, so a single decode that serves two sequences reuses the same weight read. The "Metal command queue serializes everything" pessimism does not survive empirically.

Design
------
- Impl owns one llama_context with n_ctx = configured_ctx * MAX_SEQUENCES and n_seq_max = MAX_SEQUENCES. Each SequenceState carries a llama_seq_id slot (0..MAX_SEQUENCES-1) used to tag tokens in the shared KV cache.
- Decoder thread loop: wait for at least one pending request, wait an additional BATCH_WINDOW_MICROS (200 µs by default) for siblings to pile in, then build one llama_batch carrying all pending tokens with their respective seq_ids, llama_decode once, sample each sequence's next token using its own sampler chain at its assigned batch index, and resolve every request's promise.
- sampleNext fast path: deliver the seed token sampled at decodePrompt time. This avoids the decoder round-trip for the very first sample after a prompt, where there is no input token to feedback-decode.
- sampleNext steady-state path: queue a PendingRequest (input token = previously-sampled token, position = current KV count, sampler = this sequence's chain) and wait on a std::promise resolved by the decoder thread.
- decodePrompt holds decode_mutex for the prompt's chunk decode and takes the seed sample inline while the prompt's logits are still resident in the shared context.
- trimKV holds decode_mutex, calls llama_memory_seq_rm for this sequence's seq_id, and invalidates any pending seed/input so the caller has to re-prime via decodePrompt before the next sampleNext.

The 200 µs window is the throughput knob. Multi-sequence workloads naturally fall into lockstep because each sequence resubmits as soon as its sample returns, so successive requests usually arrive within the window without any caller-side coordination. Single-sequence callers pay one window's worth of latency per token (~2% of a ~10 ms decode); acceptable. Tunable later via a setter if needed.

Cancellation
------------
- Existing one-way atomic flag preserved.
- Checked at sampleNext entry (returns immediately) and again in processBatch after llama_decode but before sampling (skips wasted sample work, returns was_cancelled=true). The decode slot for a cancelled token is still consumed, which is fine: the slot is cheap; the win is not running the sampler.

Tests
-----
Added two integration tests gated on COTABBY_TEST_MODEL_PATH:
- testInterleavedMultiSequenceSampling: alternates sampleNext between two sequences with greedy sampling and identical prompts, asserts identical output (validates the seed-token / feedback-decode handoff and per-sequence sampler isolation in the shared context).
- testCancellationStopsSamplingPromptly: verifies sampleNext after cancelSequence returns was_cancelled=true without model work.

Existing testEndToEndWithModel passes unchanged.

Follow-ups
----------
- Bench scenario c_engine_threaded that exercises the full engine via its public API from two threads, for end-to-end throughput validation (Phase 0 numbers above are at the raw llama.cpp level).
- README update: the "no shared decode mutex, no contention" claim is no longer accurate. The new design has a single decode_mutex serializing access to one llama_context. The contention is productive (it enables batching), but the README should reflect the new model.

FuJacob mentioned this pull request May 28, 2026

Phase 1: refactor engine to one shared context with batched decode #4

Closed

FuJacob merged commit 6769335 into main May 28, 2026
1 check passed

FuJacob deleted the feat/batched-decode-bench branch May 28, 2026 10:49

FuJacob mentioned this pull request May 28, 2026

Phase 1: refactor engine to one shared context with batched decode #5

Merged

FuJacob merged 1 commit into main from feat/batched-decode-bench
