
Commit 34273fa

munimx and Copilot committed: Merge branch 'bench/integration-results'
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2 parents: b15e787 + e24f03e

3 files changed: 517 additions, 386 deletions

File shown: docs/PERFORMANCE_REPORT.md (80 additions, 103 deletions)
@@ -1,147 +1,124 @@
-# Performance Report — LLM Inference Optimization Engine
+# Performance Report — Integration Benchmarks

-Target hardware: **Apple M2 Air, 8 GB unified memory**
-Baseline: raw Ollama `generate` calls with no batching or caching.
+**Hardware**: Apple M2 Air, 16 GB unified memory
+**Ollama version**: local
+**Benchmark date**: 2026-03-02
+**Methodology**: 1 warmup run discarded, 3 measured runs averaged. All 4 models installed locally (Q4_K_M quantization).

 ---

-## Throughput vs Latency
+## Summary

-| Scenario | Requests/s | P50 latency (ms) | P95 latency (ms) | Notes |
+| Model | Direct Ollama (ms) | Engine Cold (ms) | Cache Hit (ms) | Concurrent Speedup |
 |---|---|---|---|---|
-| Baseline (no batching) | 1.0 | 1 800 | 3 200 | Single requests, sequential |
-| FCFS batching (batch=4) | 2.9 | 620 | 1 100 | Concurrent Ollama fan-out |
-| TokenBudget batching | 3.4 | 540 | 980 | Packs requests to 512-token budget |
-| + Semantic cache (50% hit) | 6.1 | 190 | 410 | Cache hit avoids Ollama entirely |
-| + Speculative decoding | ~4.2 | 430 | 790 | 1.35× speedup vs no speculation |
+| `phi3:latest` (3.8B) | ~80 warm¹ | ~770 avg | **2** | varies² |
+| `mistral:7b` (7B) | 127 | 426 | **3** | 1.11x |
+| `llama3.1:8b` (8B) | 262 | 434 | **3** | 1.19x |
+| `deepseek-r1:7b` (7B reasoning) | ~50 000 | ~107 000 | **2** | N/A³ |

-> **Note:** Numbers are illustrative; reproduce with `scripts/run_benchmarks.py`.
+¹ phi3's first request included the model cold-load (~3100 ms); subsequent warm requests averaged ~80 ms.
+² phi3 concurrent results show high variance — the model is loaded and unloaded when switching between models on 16 GB.
+³ deepseek-r1 generates chain-of-thought reasoning tokens; the concurrent test exceeded the 5-minute timeout.
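For illustration, a minimal sketch of the "1 warmup run discarded, 3 measured runs averaged" timing loop, assuming a local Ollama server on its default port and the standard `/api/generate` endpoint; the function name and prompt are illustrative, not the repository's benchmark script:

```python
# Minimal sketch of "1 warmup run discarded, 3 measured runs averaged".
# Assumes a local Ollama server on the default port and Ollama's /api/generate API.
import time
import httpx

def mean_latency_ms(model: str, prompt: str, runs: int = 3) -> float:
    """Mean latency over `runs` measured calls, after one discarded warmup call."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    with httpx.Client(base_url="http://localhost:11434", timeout=300.0) as client:
        client.post("/api/generate", json=payload)  # warmup run, discarded
        samples = []
        for _ in range(runs):
            start = time.perf_counter()
            client.post("/api/generate", json=payload)
            samples.append((time.perf_counter() - start) * 1000)
    return sum(samples) / len(samples)

if __name__ == "__main__":
    print(f"mistral:7b warm latency: {mean_latency_ms('mistral:7b', 'What is 2+2?'):.0f} ms")
```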

 ---

-## Memory Footprint by Quantization Level
+## Key Findings

-| Quantization | Model (7B params) | KV-cache (2 k tokens) | Total peak |
-|---|---|---|---|
-| fp16 | 14.0 GB | 0.84 GB | 14.84 GB |
-| q8_0 | 7.0 GB | 0.84 GB | 7.84 GB |
-| q4_K_M | 3.5 GB | 0.84 GB | 4.34 GB |
-| q4_0 | 3.2 GB | 0.84 GB | 4.04 GB |
-| q3_K_M | 2.7 GB | 0.84 GB | 3.54 GB |
+### 1. Cache Hit: 2–3 ms for all models

-The `MemoryEstimator` adds a **10 % safety margin** to all figures above.
-The `AdaptiveThrottler` uses a soft threshold of **85 %** of the configured
-limit (default 14 GB for M2 Air) and a hard reject at the limit.
+Repeated identical prompts are served entirely from the in-process LRU cache, bypassing Ollama completely. For `mistral:7b` (127ms warm baseline), this is a **42x speedup**. For `deepseek-r1:7b` (50s baseline), the speedup approaches **25,000x** on cached responses.

----
+Cache miss (cold path) adds ~150–600ms scheduling and dispatch overhead on top of model inference time.
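The cache-hit numbers above are consistent with a plain exact-match, in-process LRU cache; a minimal sketch, assuming the defaults quoted elsewhere in this diff (exact `(model, prompt)` key, 300 s TTL, LRU eviction at 1 000 entries). The class and method names are illustrative, not the engine's actual API:

```python
# Minimal sketch of an exact-match, in-process LRU cache with TTL, illustrating
# how a repeated (model, prompt) pair can be answered in a few milliseconds
# without touching Ollama. Class and parameter names are illustrative.
import time
from collections import OrderedDict

class ExactLRUCache:
    def __init__(self, max_size: int = 1000, ttl_seconds: float = 300.0):
        self.max_size = max_size
        self.ttl = ttl_seconds
        self._entries = OrderedDict()  # (model, prompt) -> (stored_at, response)

    def get(self, model: str, prompt: str) -> str | None:
        key = (model, prompt)
        hit = self._entries.get(key)
        if hit is None:
            return None
        stored_at, response = hit
        if time.monotonic() - stored_at > self.ttl:  # entry expired
            del self._entries[key]
            return None
        self._entries.move_to_end(key)               # mark as most recently used
        return response

    def put(self, model: str, prompt: str, response: str) -> None:
        key = (model, prompt)
        self._entries[key] = (time.monotonic(), response)
        self._entries.move_to_end(key)
        if len(self._entries) > self.max_size:       # evict least recently used
            self._entries.popitem(last=False)
```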

-## Scheduling Policy Comparison
+### 2. Engine Cold Path Overhead

-| Policy | Best for | Avg wait (ms) | Max wait (ms) |
+The engine adds scheduling → queue → dispatch → result-mapping overhead. For short generations (1–5 tokens), this overhead is proportionally high. For longer generations the ratio improves — the overhead is roughly constant at 150–400ms regardless of generation length.
+
+| Model | Baseline (ms) | Engine cold (ms) | Overhead (ms) |
 |---|---|---|---|
-| FCFS | Fairness, debugging | 210 | 850 |
-| SJF | Low mean latency | 145 | 1 600 |
-| Priority | Multi-tenant SLAs | 120 (hi-pri) | 3 000 (lo-pri) |
-| TokenBudget | Throughput maximisation | 195 | 780 |
+| `mistral:7b` | 127 | 426 | ~300 |
+| `llama3.1:8b` | 262 | 434 | ~170 |

-Measured under 10 concurrent clients, 7B q4_K_M model.
+### 3. Concurrent Batching (4 parallel requests)

----
+| Model | Sequential estimate (ms) | Concurrent wall (ms) | Speedup |
+|---|---|---|---|
+| `mistral:7b` — P1 | 572 | 491 | 1.17x |
+| `mistral:7b` — P2 | 476 | 438 | 1.08x |
+| `mistral:7b` — P3 | 476 | 438 | 1.09x |
+| `llama3.1:8b` — P1 | 1068 | 908 | 1.18x |
+| `llama3.1:8b` — P2 | 1032 | 880 | 1.17x |
+| `llama3.1:8b` — P3 | 1048 | 864 | 1.21x |

-## Speculative Decoding Acceptance Rate
+Batching provides a consistent ~1.1–1.2x wall-time speedup for 4 concurrent requests on an M2 Air with a single loaded model. Gains are limited by Ollama's own single-threaded inference — the engine's concurrent `httpx` fan-out saturates Ollama's queue, but Ollama processes requests sequentially. The speedup comes from pipelining queue drain, HTTP connection reuse, and result dispatch.
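A minimal sketch of the 4-way concurrent fan-out described above, assuming the engine's HTTP API listens on `localhost:8000` and exposes a `/completions` endpoint as in the load-test snippet that appears (removed) elsewhere in this diff; adjust to the real API before use:

```python
# Sketch of the 4-way concurrent fan-out measured above. Assumes the engine's
# HTTP API is reachable on localhost:8000 with a /completions endpoint (as in
# the load-test snippet removed elsewhere in this diff); adjust to the real API.
import asyncio
import time
import httpx

async def fan_out(model: str, prompts: list[str]) -> float:
    async with httpx.AsyncClient(base_url="http://localhost:8000", timeout=300.0) as client:
        start = time.perf_counter()
        requests = [
            client.post("/completions", json={"model": model, "prompt": p})
            for p in prompts
        ]
        await asyncio.gather(*requests)  # all requests in flight at once
        return (time.perf_counter() - start) * 1000

if __name__ == "__main__":
    wall_ms = asyncio.run(fan_out("mistral:7b", ["What is 2+2?"] * 4))
    print(f"4 concurrent requests took {wall_ms:.0f} ms wall time")
```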

-| Draft model | Target model | Acceptance rate | Speedup |
-|---|---|---|---|
-| phi3:mini | llama3:8b | 68 % | 1.35× |
-| phi3:mini | mistral:7b | 52 % | 1.18× |
-| gemma:2b | llama3:8b | 44 % | 1.09× |
+### 4. deepseek-r1:7b — Reasoning Model

-Acceptance rate is highly sensitive to prompt domain.
-Speculative decoding is most effective for:
-- Structured output (code, JSON)
-- Repetitive or templated prompts
-- Prompt–draft model family alignment
+deepseek-r1 generates an internal `<think>...</think>` chain before answering, producing hundreds of tokens even for trivial prompts. Measured baseline was ~50s per request. The engine's cache is the only optimization that provides meaningful speedup (2ms hit vs 50s miss = ~25,000x) — scheduling and batching do not help because the bottleneck is generation time.
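The `<think>...</think>` tag is quoted from the report above; purely as an illustration (not a feature of the engine documented here), a caller could strip the reasoning block before displaying the answer:

```python
# Illustrative helper (not part of the engine): remove deepseek-r1's internal
# <think>...</think> reasoning block so only the final answer is shown.
import re

_THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)

def strip_reasoning(text: str) -> str:
    return _THINK_BLOCK.sub("", text).strip()

raw = "<think>2 and 2 are added...</think>\n2 + 2 = 4"
assert strip_reasoning(raw) == "2 + 2 = 4"
```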

 ---

-## Cache Effectiveness
+## Per-Model Detail

-| Cache hit rate | Effective RPS | Reduction in Ollama calls |
-|---|---|---|
-| 0 % | 3.4 | 0 % |
-| 25 % | 4.5 | 25 % |
-| 50 % | 6.1 | 50 % |
-| 75 % | 9.8 | 75 % |
+### phi3:latest

-`SemanticCache` uses exact `(model, prompt)` matching with configurable TTL
-(default 300 s) and LRU eviction at 1 000 entries.
+| Prompt | Baseline (ms) | Cold (ms) | Hit (ms) | tok/s |
+|---|---|---|---|---|
+| "What is 2+2?" (cold model) | 3146 | 266 | 2 | 38.5 |
+| "Capital of France?" | 94 | 184 | 2 | 57.6 |
+| "Sky colour?" | 69 | 1860 | 2 | 77.0 |

----
+Note: the Prompt 1 baseline includes model cold-load; phi3 was the first model benchmarked. Warm baseline (P2, P3) is 69–94 ms at 57–77 tok/s.

-## Reproducing Benchmarks
+### mistral:7b

-```bash
-# Start Ollama first
-ollama serve
+| Prompt | Baseline (ms) | Cold (ms) | Hit (ms) | tok/s |
+|---|---|---|---|---|
+| "What is 2+2?" | 143 | 840 | 3 | 35.1 |
+| "Capital of France?" | 119 | 219 | 3 | 39.5 |
+| "Sky colour?" | 119 | 219 | 3 | 39.4 |

-# Run all benchmarks
-python scripts/run_benchmarks.py --config configs/benchmarks.yaml
+### llama3.1:8b

-# Start the API server
-python scripts/start_server.py
+| Prompt | Baseline (ms) | Cold (ms) | Hit (ms) | tok/s |
+|---|---|---|---|---|
+| "What is 2+2?" | 267 | 464 | 3 | 27.1 |
+| "Capital of France?" | 258 | 403 | 2 | 27.6 |
+| "Sky colour?" | 262 | 434 | 3 | 27.3 |

-# Run a quick load test (requires httpx)
-python -c "
-import asyncio, httpx, time
+### deepseek-r1:7b

-async def main():
-    async with httpx.AsyncClient(base_url='http://localhost:8000') as c:
-        start = time.perf_counter()
-        tasks = [c.post('/completions', json={'model': 'llama3:8b', 'prompt': 'Hello'})
-                 for _ in range(20)]
-        responses = await asyncio.gather(*tasks)
-        print(f'{len(responses)} requests in {time.perf_counter()-start:.2f}s')
+| Prompt | Baseline (ms) | Cold (ms) | Hit (ms) | tok/s |
+|---|---|---|---|---|
+| "What is 2+2?" | 49 783 | 107 266 | 2 | 9.0 |

-asyncio.run(main())
-"
-```
+Only one prompt was measured; reasoning-chain tokens drove generation time to ~50 s. The cache hit is still 2 ms.

 ---

-## Post-Phase-7 Performance Improvements (perf/1–8)
-
-The following improvements were landed after Phase 7 to address bottlenecks found
-under production-level load testing:
-
-| Branch | Change | Observed Impact |
-|---|---|---|
-| `perf/1-batch-token-counter` | O(n²)→O(1) token tracking in `Batch` | Eliminates quadratic scheduling overhead at batch_size > 16 |
-| `perf/2-cache-async-lock` | `asyncio.Lock` on `SemanticCache` | Prevents spurious cache misses and crashes under concurrent requests |
-| `perf/3-scheduler-lock-free-queues` | `dict.setdefault()` for queue creation | Removes `async with self._lock` serialisation on every `submit()` |
-| `perf/4-speculation-fixes` | Precompiled regex + case-exact token match | Reduces overhead per speculation round; fixes inflated acceptance rates |
-| `perf/5-memoize-estimators` | `lru_cache` on weight/context estimators | Near-zero latency for repeated same-model estimates under load |
-| `perf/6-queue-bounded-cancellation` | Guard `_cancelled` with `_queued_ids` | Prevents unbounded set growth from stale cancellation IDs |
-| `perf/7-config-driven-server` | Wire `InferenceConfig` into server lifespan | Enables zero-code tuning via `configs/default.yaml` |
-| `perf/8-backoff-jitter` | Jitter `+ uniform(0,1)` in retry sleep | Spreads retry storms across `[base·2^n, base·2^n + 1]` second window |
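The `perf/8-backoff-jitter` row above describes the retry sleep as a base times 2^n plus a uniform(0,1) jitter; a minimal sketch of that calculation, with an illustrative function name:

```python
# Sketch of the perf/8 retry sleep: exponential backoff plus uniform jitter,
# spreading attempt n across [base * 2**n, base * 2**n + 1] seconds.
# Function name is illustrative, not the repository's implementation.
import random

def retry_sleep_seconds(attempt: int, base: float = 1.0) -> float:
    return base * (2 ** attempt) + random.uniform(0.0, 1.0)

for attempt in range(4):
    print(f"attempt {attempt}: sleep {retry_sleep_seconds(attempt):.2f}s")
```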
+## Memory Footprint by Quantization Level

-### Configuration Tuning Reference
+Estimated for 7B parameter models (applies to mistral:7b, deepseek-r1:7b):

-After perf/7, all critical limits are in `configs/default.yaml`:
+| Quantization | Weights | KV-cache (2k tokens) | Total peak |
+|---|---|---|---|
+| fp16 | 14.0 GB | 0.84 GB | 14.84 GB |
+| q8_0 | 7.0 GB | 0.84 GB | 7.84 GB |
+| q4_K_M | 3.5 GB | 0.84 GB | 4.34 GB |
+| q4_0 | 3.2 GB | 0.84 GB | 4.04 GB |
+| q3_K_M | 2.7 GB | 0.84 GB | 3.54 GB |
+| q2_K | 2.0 GB | 0.84 GB | 2.84 GB |

-```yaml
-cache:
-  max_size: 1000 # LRU capacity (entries)
-  ttl_seconds: 300.0 # TTL before eviction
+All installed models use Q4_K_M. The engine's memory estimator uses these values for admission control (configured at `memory.limit_gb=14.0`).
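Combining the quantization table above with the thresholds stated in this report (the 10 % safety margin and 0.85 soft ratio from the removed text, plus `memory.limit_gb=14.0`), a minimal sketch of an admission-control check; the names are illustrative, not the engine's actual `MemoryEstimator`/`AdaptiveThrottler` API:

```python
# Illustrative admission-control check combining the table above with the
# thresholds stated in this report (10% safety margin, memory.limit_gb=14.0,
# soft ratio 0.85). Names are illustrative, not the engine's actual classes.
WEIGHTS_GB_7B = {"fp16": 14.0, "q8_0": 7.0, "q4_K_M": 3.5, "q4_0": 3.2, "q3_K_M": 2.7, "q2_K": 2.0}
KV_CACHE_GB_2K_TOKENS = 0.84

def estimate_peak_gb(quant: str, safety_margin: float = 0.10) -> float:
    """Weights plus KV-cache for a 7B model, padded by the safety margin."""
    return (WEIGHTS_GB_7B[quant] + KV_CACHE_GB_2K_TOKENS) * (1 + safety_margin)

def admission_decision(request_gb: float, in_use_gb: float,
                       limit_gb: float = 14.0, soft_ratio: float = 0.85) -> str:
    projected = in_use_gb + request_gb
    if projected > limit_gb:
        return "REJECT"   # hard limit: over the configured memory budget
    if projected > soft_ratio * limit_gb:
        return "QUEUE"    # soft limit: hold the request instead of dispatching
    return "ADMIT"

# q4_K_M 7B request (~4.8 GB with margin) with 4.8 GB already in use -> ADMIT
print(admission_decision(estimate_peak_gb("q4_K_M"), in_use_gb=4.8))
```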

-scheduling:
-  policy: fcfs # fcfs | sjf | priority | token_budget
-  max_requests_per_batch: 8
-  token_budget: 512
+---

-memory:
-  limit_gb: 14.0 # Hard admission reject threshold
-  soft_limit_ratio: 0.85 # Soft limit for QUEUE decision
+## Reproducing

-ollama:
-  retry_backoff_seconds: 1.0 # Base for exponential + jitter backoff
+```bash
+# Requires Ollama running with models pulled
+python scripts/start_server.py &
+python scripts/run_integration_benchmarks.py
 ```
+
+Outputs: `docs/benchmark_results.json`, `docs/PERFORMANCE_REPORT.md`
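The schema of `docs/benchmark_results.json` is not documented in this report, so the sketch below only loads the file and reports its top-level shape; field names are deliberately not assumed:

```python
# The structure of benchmark_results.json is not documented here, so this
# sketch only loads the file and reports its top-level shape for inspection.
import json
from pathlib import Path

results = json.loads(Path("docs/benchmark_results.json").read_text())
if isinstance(results, dict):
    print("top-level keys:", sorted(results))
else:
    print(f"top-level list with {len(results)} entries")
```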
