
Phase 0 spike: batched-decode vs separate-context throughput benchmark#3

Merged
FuJacob merged 1 commit into
mainfrom
feat/batched-decode-bench
May 28, 2026
@FuJacob FuJacob commented May 28, 2026

Summary

Adds CotabbyInferenceBench, a standalone executable target that measures aggregate decode tok/s for two architectures on the same model and workload. It exists to answer one question before the Phase 1 refactor: does llama.cpp's batched-decode API actually beat our current "one llama_context per sequence" design on M-series + Metal, or does the Metal command queue serialize everything regardless?

What it does

Two scenarios, same warmup, same total token count:

  • a_two_contexts — N separate llama_context instances, one thread each. This is what CotabbyInferenceEngine does today.
  • b_shared_context — one shared llama_context with n_seq_max = N, a single batched llama_decode per step. Candidate Phase 1 architecture.

Both exclude prompt decode and the seed sample from the timed section so the numerator is (sample_tokens - 1) * num_sequences and is directly comparable. Output is one line of JSON per invocation; the caller scripts multiple runs.
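The throughput arithmetic above can be written out as a short helper. This is a minimal sketch: the function name and the choice of seconds as a `double` are assumptions for illustration, not code taken from CotabbyInferenceBench.

```cpp
#include <cassert>
#include <cstdint>

// Aggregate decode throughput for one benchmark run.
// The seed sample is excluded from the timed section, so each sequence
// contributes (sample_tokens - 1) timed decode steps.
// Hypothetical helper; not taken from CotabbyInferenceBench.
inline double aggregate_tok_per_sec(std::int64_t sample_tokens,
                                    std::int64_t num_sequences,
                                    double timed_seconds) {
    assert(sample_tokens >= 1 && num_sequences >= 1 && timed_seconds > 0.0);
    const std::int64_t timed_tokens = (sample_tokens - 1) * num_sequences;
    return static_cast<double>(timed_tokens) / timed_seconds;
}
```

With `--sample-tokens 200` and `--num-sequences 2` the numerator is 398 tokens, so a timed section of 2.0 s would correspond to 199 tok/s.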

Findings on Gemma 3 1B Q4_K_M (release build, M-series)

Prompt 256 tokens, sample 200 tokens:

| N | A aggregate tok/s | B aggregate tok/s | Speedup |
|---|---|---|---|
| 2 | 130-141 (median 134) | 190-201 (median 194) | 1.43x |
| 3 | 132 | 254 | 1.92x |
| 4 | 113* | 266 | 2.35x |

* At N=4, scenario A's worker threads bail at ~126 of 199 iterations each (llama_decode returns non-zero, most likely from Metal resource pressure with 4 concurrent contexts). Scenario B's single shared context completes all four sequences cleanly. So the N=4 number understates the gap: on equal completed-token-count terms, scenario A simply does not finish the workload.

The implementation plan's 1.4x decision gate is met at every realistic N. Phase 1 batched-decode refactor is GO.

Build & run

swift build -c release --product CotabbyInferenceBench

./.build/release/CotabbyInferenceBench \
  --model /path/to/model.gguf \
  --scenario b_shared_context \
  --num-sequences 2 \
  --prompt-tokens 256 \
  --sample-tokens 200

--help lists all options.

Not in scope

  • This PR is the spike, not the refactor. CotabbyInferenceEngine.cpp is untouched; the new target is a separate binary and not linked into the CotabbyInference library.
  • No CI integration. The benchmark needs a model file on disk, which CI does not have. The target builds in CI; running is on humans.
  • Numbers vary between runs by ~5%. Statistical rigor (median over repeated runs, variance) is left to whoever runs the spike; the binary's per-run output is the raw material.
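For whoever post-processes repeated runs, a sketch of the median and a crude spread check might look like the following. The helper names are hypothetical, and the inputs are assumed to be the per-run tok/s values already parsed from the JSON output.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of per-run tok/s values. The vector is taken by value so the
// caller's ordering is left untouched. Hypothetical post-processing helper.
inline double median_tok_per_sec(std::vector<double> runs) {
    assert(!runs.empty());
    std::sort(runs.begin(), runs.end());
    const std::size_t mid = runs.size() / 2;
    return runs.size() % 2 == 1 ? runs[mid]
                                : (runs[mid - 1] + runs[mid]) / 2.0;
}

// Relative spread, (max - min) / median: a rough check against run-to-run noise.
inline double relative_spread(const std::vector<double>& runs) {
    assert(!runs.empty());
    const auto mm = std::minmax_element(runs.begin(), runs.end());
    return (*mm.second - *mm.first) / median_tok_per_sec(runs);
}
```

As an illustration only (these are not recorded per-run values), three runs of 130, 134 and 141 tok/s give a median of 134 and a relative spread of about 8%.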

Next steps if this lands

  1. Merge.
  2. Run on additional hardware variants (M1/M2/M3/M4 Pro/Max if available) to confirm the gate holds across the fleet.
  3. Kick off Phase 1: refactor CotabbyInferenceEngine to a single shared context with leader-follower batched decode, per the implementation plan.

CotabbyInferenceBench is a standalone executable target that compares
two architectures on the same model and workload:

  a_two_contexts: N separate llama_context instances, one thread each
                  (what CotabbyInferenceEngine does today).
  b_shared_context: One shared llama_context with n_seq_max=N, a single
                    batched llama_decode per step.

Both scenarios exclude prompt decode and seed sample from the timed
section so the numerator is comparable across runs. Output is one JSON
line per invocation; callers script multiple runs and compare.

Initial release-build numbers on M-series + Metal, Gemma 3 1B Q4_K_M,
prompt 256 / sample 200:

  n=2: A 130-141 tok/s vs B 190-201 tok/s  -> 1.43x (median)
  n=3: A 132 tok/s     vs B 254 tok/s      -> 1.92x
  n=4: A 113 tok/s*    vs B 266 tok/s      -> 2.35x

  * At n=4, scenario A's threads bail after ~126 of 199 sampling
    iterations each (llama_decode returning non-zero, likely resource
    pressure from 4 concurrent Metal contexts). Scenario B's single
    shared context completes all 4 sequences cleanly.

The 1.4x gate set in the implementation plan is met at every n we care
about. Phase 1 batched-decode refactor is GO.

Not linked into the CotabbyInference library. Run with --help for the
CLI.
@FuJacob FuJacob merged commit 6769335 into main May 28, 2026
1 check passed
@FuJacob FuJacob deleted the feat/batched-decode-bench branch May 28, 2026 10:49
FuJacob added a commit that referenced this pull request May 28, 2026
Replaces the per-sequence llama_context architecture with a single
shared context (n_seq_max = MAX_SEQUENCES) and a dedicated decoder
thread that coalesces sample-step requests from multiple sequences into
one llama_decode call. Public C++ API (CotabbyInferenceEngine.h) is
unchanged; Cotabby's Swift code does not need to be modified.

Why
---
Phase 0 spike (see PR #3) showed that on M-series Metal, batched decode
delivers 1.43x aggregate throughput at N=2 and up to 2.35x at N=4
vs the current "separate llama_context per sequence" design. The win
comes from fusing matmul weight reads across sequences in a single
llama_decode call: per-token decode is memory-bound on Apple Silicon,
so a single decode that serves two sequences reuses the same weight
read. The "Metal command queue serializes everything" pessimism does
not survive empirically.

Design
------
- Impl owns one llama_context with n_ctx = configured_ctx * MAX_SEQUENCES
  and n_seq_max = MAX_SEQUENCES. Each SequenceState carries a
  llama_seq_id slot (0..MAX_SEQUENCES-1) used to tag tokens in the
  shared KV cache.
- Decoder thread loop: wait for at least one pending request, wait an
  additional BATCH_WINDOW_MICROS (200 µs by default) for siblings to
  pile in, then build one llama_batch carrying all pending tokens with
  their respective seq_ids, llama_decode once, sample each sequence's
  next token using its own sampler chain at its assigned batch index,
  and resolve every request's promise.
- sampleNext fast path: deliver the seed token sampled at decodePrompt
  time. This avoids the decoder round-trip for the very first sample
  after a prompt, where there is no input token to feedback-decode.
- sampleNext steady-state path: queue a PendingRequest (input token =
  previously-sampled token, position = current KV count, sampler =
  this sequence's chain) and wait on a std::promise resolved by the
  decoder thread.
- decodePrompt holds decode_mutex for the prompt's chunk decode and
  takes the seed sample inline while the prompt's logits are still
  resident in the shared context.
- trimKV holds decode_mutex, calls llama_memory_seq_rm for this
  sequence's seq_id, and invalidates any pending seed/input so the
  caller has to re-prime via decodePrompt before the next sampleNext.
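
The decoder loop described above can be sketched without llama.cpp as a small request coalescer. Everything here is an assumption for illustration: the class name BatchCoalescer, the DecodeFn callback (standing in for building one llama_batch, calling llama_decode once, and sampling per sequence), and the plain sleep used for the batch window. It is not the code in CotabbyInferenceEngine.cpp.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// One queued sample-step request. Field names are illustrative.
struct PendingRequest {
    int seq_id;                // slot in the shared KV cache
    int input_token;           // previously sampled token to feed back
    std::promise<int> result;  // resolved with the next sampled token
};

// Stand-in for "one batched decode plus per-sequence sampling".
// Receives (seq_id, input_token) pairs, returns one next token per entry.
using DecodeFn =
    std::function<std::vector<int>(const std::vector<std::pair<int, int>>&)>;

class BatchCoalescer {
public:
    BatchCoalescer(DecodeFn decode, std::chrono::microseconds window)
        : decode_(std::move(decode)), window_(window),
          worker_([this] { run(); }) {}

    ~BatchCoalescer() {
        { std::lock_guard<std::mutex> lk(mu_); stop_ = true; }
        cv_.notify_all();
        worker_.join();
    }

    // Called from any sequence's thread; the caller waits on the future.
    std::future<int> submit(int seq_id, int input_token) {
        PendingRequest req{seq_id, input_token, {}};
        std::future<int> fut = req.result.get_future();
        { std::lock_guard<std::mutex> lk(mu_); pending_.push_back(std::move(req)); }
        cv_.notify_one();
        return fut;
    }

private:
    void run() {
        for (;;) {
            std::vector<PendingRequest> batch;
            {
                std::unique_lock<std::mutex> lk(mu_);
                cv_.wait(lk, [this] { return stop_ || !pending_.empty(); });
                if (stop_ && pending_.empty()) return;
                // Batch window: give sibling sequences a moment to pile in.
                lk.unlock();
                std::this_thread::sleep_for(window_);
                lk.lock();
                batch.swap(pending_);
            }
            std::vector<std::pair<int, int>> inputs;
            for (const auto& r : batch) inputs.emplace_back(r.seq_id, r.input_token);
            const std::vector<int> next = decode_(inputs);  // one decode per batch
            assert(next.size() == batch.size());
            for (std::size_t i = 0; i < batch.size(); ++i)
                batch[i].result.set_value(next[i]);
        }
    }

    DecodeFn decode_;
    std::chrono::microseconds window_;
    std::mutex mu_;
    std::condition_variable cv_;
    std::vector<PendingRequest> pending_;
    bool stop_ = false;
    std::thread worker_;  // declared last so the other members exist first
};
```

Two callers that submit within the same window are served by one DecodeFn call. A caller that arrives later simply lands in the next batch.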

The 200 µs window is the throughput knob. Multi-sequence workloads
naturally fall into lockstep because each sequence resubmits as soon as
its sample returns, so successive requests usually arrive within the
window without any caller-side coordination. Single-sequence callers
pay one window's worth of latency per token (~2% of a ~10 ms decode);
acceptable. Tunable later via a setter if needed.
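
The single-sequence cost quoted above is simple arithmetic, sketched here with hypothetical helper names. It assumes a lone sequence always waits out one full window before its decode starts, as the loop description implies.

```cpp
#include <cassert>

// Fraction of per-token latency added when a lone sequence waits out one
// full batch window before its decode starts. Units: microseconds.
inline double window_overhead_fraction(double window_us, double decode_us) {
    assert(window_us >= 0.0 && decode_us > 0.0);
    return window_us / decode_us;
}

// Single-sequence throughput once the window is added to every step.
inline double single_sequence_tok_per_sec(double window_us, double decode_us) {
    assert(window_us >= 0.0 && decode_us > 0.0);
    return 1e6 / (decode_us + window_us);
}
```

With a 200 µs window and a ~10 ms decode the fraction is 0.02, and a nominal 100 tok/s single-sequence rate would drop to roughly 98 tok/s.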

Cancellation
------------
- Existing one-way atomic flag preserved.
- Checked at sampleNext entry (returns immediately) and again in
  processBatch after llama_decode but before sampling (skips wasted
  sample work, returns was_cancelled=true). The decode slot for a
  cancelled token is still consumed, which is fine — the slot is
  cheap; the win is not running the sampler.
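
A minimal sketch of the two check points follows. The struct and function names are assumptions, and the sampler is a stand-in callable; the engine's actual types may differ.

```cpp
#include <atomic>
#include <cassert>

struct SampleResult {
    int token;           // meaningful only when was_cancelled is false
    bool was_cancelled;
};

// Per-sequence state. The flag is one-way: set once, never cleared here.
struct SequenceState {
    std::atomic<bool> cancelled{false};
    int samples_run = 0;  // counts sampler invocations, for illustration
};

inline void cancel_sequence(SequenceState& s) {
    s.cancelled.store(true, std::memory_order_release);
}

// Check 1: at entry, before any request is queued.
inline bool cancelled_at_entry(const SequenceState& s) {
    return s.cancelled.load(std::memory_order_acquire);
}

// Check 2: after the shared decode has run, before sampling this sequence.
// The decode slot was already spent; only the sampler work is skipped.
template <typename Sampler>
SampleResult sample_after_decode(SequenceState& s, Sampler sample) {
    if (s.cancelled.load(std::memory_order_acquire)) return {-1, true};
    ++s.samples_run;
    return {sample(), false};
}
```

Once cancel_sequence has been called, both checks report cancellation and the sampler is never invoked again for that sequence.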

Tests
-----
Added two integration tests gated on COTABBY_TEST_MODEL_PATH:

- testInterleavedMultiSequenceSampling: alternates sampleNext between
  two sequences with greedy sampling and identical prompts, asserts
  identical output (validates the seed-token / feedback-decode handoff
  and per-sequence sampler isolation in the shared context).
- testCancellationStopsSamplingPromptly: verifies sampleNext after
  cancelSequence returns was_cancelled=true without model work.

Existing testEndToEndWithModel passes unchanged.
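
The environment gate can be sketched as below. The helper name is hypothetical, and the project's tests may be written in a different language or framework; C++ is used here only for consistency with the engine.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Reads an environment variable into `out`. Returns false when the variable
// is unset or empty, in which case a model-dependent test would skip itself.
inline bool env_path(const char* name, std::string& out) {
    assert(name != nullptr);
    const char* value = std::getenv(name);
    if (value == nullptr || *value == '\0') return false;
    out = value;
    return true;
}
```

A test would call `env_path("COTABBY_TEST_MODEL_PATH", path)` first and return early when it yields false.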

Follow-ups
----------
- Bench scenario c_engine_threaded that exercises the full engine via
  its public API from two threads, for end-to-end throughput
  validation (Phase 0 numbers above are at the raw llama.cpp level).
- README update: the "no shared decode mutex, no contention" claim is
  no longer accurate. The new design has a single decode_mutex
  serializing access to one llama_context. The contention is
  productive — it enables batching — but the README should reflect the
  new model.