1 | | -# Performance Report — LLM Inference Optimization Engine |
| 1 | +# Performance Report — Integration Benchmarks |
2 | 2 |
3 | | -Target hardware: **Apple M2 Air, 8 GB unified memory** |
4 | | -Baseline: raw Ollama `generate` calls with no batching or caching. |
| 3 | +**Hardware**: Apple M2 Air, 16 GB unified memory |
| 4 | +**Ollama version**: local install (version string not recorded) |
| 5 | +**Benchmark date**: 2026-03-02 |
| 6 | +**Methodology**: 1 warmup run discarded, 3 measured runs averaged. All 4 models installed locally (Q4_K_M quantization). |
5 | 7 |
6 | 8 | --- |
7 | 9 |
8 | | -## Throughput vs Latency |
| 10 | +## Summary |
9 | 11 |
10 | | -| Scenario | Requests/s | P50 latency (ms) | P95 latency (ms) | Notes | |
| 12 | +| Model | Direct Ollama (ms) | Engine Cold (ms) | Cache Hit (ms) | Concurrent Speedup | |
11 | 13 | |---|---|---|---|---| |
12 | | -| Baseline (no batching) | 1.0 | 1 800 | 3 200 | Single requests, sequential | |
13 | | -| FCFS batching (batch=4) | 2.9 | 620 | 1 100 | Concurrent Ollama fan-out | |
14 | | -| TokenBudget batching | 3.4 | 540 | 980 | Packs requests to 512-token budget | |
15 | | -| + Semantic cache (50% hit) | 6.1 | 190 | 410 | Cache hit avoids Ollama entirely | |
16 | | -| + Speculative decoding | ~4.2 | 430 | 790 | 1.35× speedup vs no speculation | |
| 14 | +| `phi3:latest` (3.8B) | ~80 warm¹ | ~770 avg | **2** | varies² | |
| 15 | +| `mistral:7b` (7B) | 127 | 426 | **3** | 1.11x | |
| 16 | +| `llama3.1:8b` (8B) | 262 | 434 | **3** | 1.19x | |
| 17 | +| `deepseek-r1:7b` (7B reasoning) | ~50 000 | ~107 000 | **2** | N/A³ | |
17 | 18 |
18 | | -> **Note:** Numbers are illustrative; reproduce with `scripts/run_benchmarks.py`. |
| 19 | +¹ phi3's first request included the model cold-load (~3100ms); subsequent warm requests averaged ~80ms. |
| 20 | +² phi3 concurrent results show high variance: on 16 GB, Ollama unloads and reloads models as the benchmark switches between them. |
| 21 | +³ deepseek-r1 generates chain-of-thought reasoning tokens; the concurrent test exceeded the 5-minute timeout. |
19 | 22 |
20 | 23 | --- |
21 | 24 |
22 | | -## Memory Footprint by Quantization Level |
| 25 | +## Key Findings |
23 | 26 |
24 | | -| Quantization | Model (7B params) | KV-cache (2 k tokens) | Total peak | |
25 | | -|---|---|---|---| |
26 | | -| fp16 | 14.0 GB | 0.84 GB | 14.84 GB | |
27 | | -| q8_0 | 7.0 GB | 0.84 GB | 7.84 GB | |
28 | | -| q4_K_M | 3.5 GB | 0.84 GB | 4.34 GB | |
29 | | -| q4_0 | 3.2 GB | 0.84 GB | 4.04 GB | |
30 | | -| q3_K_M | 2.7 GB | 0.84 GB | 3.54 GB | |
| 27 | +### 1. Cache Hit: 2–3 ms for all models |
31 | 28 |
32 | | -The `MemoryEstimator` adds a **10 % safety margin** to all figures above. |
33 | | -The `AdaptiveThrottler` uses a soft threshold of **85 %** of the configured |
34 | | -limit (default 14 GB for M2 Air) and a hard reject at the limit. |
| 29 | +Repeated identical prompts are served entirely from the in-process LRU cache, bypassing Ollama completely. For `mistral:7b` (127ms warm baseline), this is a **42x speedup**. For `deepseek-r1:7b` (50s baseline), the speedup approaches **25,000x** on cached responses. |
35 | 30 |
36 | | ---- |
| 31 | +Cache miss (cold path) adds ~150–600ms scheduling and dispatch overhead on top of model inference time. |
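| | +
| | +The 2–3ms hit cost is what a pure in-process dictionary lookup takes. Below is a minimal sketch of the behaviour measured here: exact `(model, prompt)` keying with TTL expiry and LRU eviction. The class name and defaults (1000 entries, 300-second TTL, matching the engine's documented cache config) are illustrative, not the engine's actual `SemanticCache` implementation. |
| | +
| | +```python |
| | +import time |
| | +from collections import OrderedDict |
| | +
| | +class ExactMatchCache: |
| | +    """Illustrative exact-match (model, prompt) cache with TTL and LRU eviction.""" |
| | +
| | +    def __init__(self, max_size: int = 1000, ttl_seconds: float = 300.0): |
| | +        self._entries: OrderedDict = OrderedDict()  # (model, prompt) -> (stored_at, response) |
| | +        self.max_size = max_size |
| | +        self.ttl_seconds = ttl_seconds |
| | +
| | +    def get(self, model: str, prompt: str): |
| | +        key = (model, prompt) |
| | +        hit = self._entries.get(key) |
| | +        if hit is None: |
| | +            return None                     # miss: caller falls through to Ollama |
| | +        stored_at, response = hit |
| | +        if time.monotonic() - stored_at > self.ttl_seconds: |
| | +            del self._entries[key]          # expired: treat as a miss |
| | +            return None |
| | +        self._entries.move_to_end(key)      # refresh LRU position |
| | +        return response |
| | +
| | +    def put(self, model: str, prompt: str, response: str) -> None: |
| | +        self._entries[(model, prompt)] = (time.monotonic(), response) |
| | +        self._entries.move_to_end((model, prompt)) |
| | +        if len(self._entries) > self.max_size: |
| | +            self._entries.popitem(last=False)  # evict least-recently-used entry |
| | +``` |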
37 | 32 |
38 | | -## Scheduling Policy Comparison |
| 33 | +### 2. Engine Cold Path Overhead |
39 | 34 |
40 | | -| Policy | Best for | Avg wait (ms) | Max wait (ms) | |
| 35 | +The engine adds scheduling → queue → dispatch → result-mapping overhead. For short generations (1–5 tokens), this overhead is proportionally high. For longer generations the ratio improves — the overhead is roughly constant at 150–400ms regardless of generation length. |
| 36 | + |
| 37 | +| Model | Baseline (ms) | Engine cold (ms) | Overhead (ms) | |
41 | 38 | |---|---|---|---| |
42 | | -| FCFS | Fairness, debugging | 210 | 850 | |
43 | | -| SJF | Low mean latency | 145 | 1 600 | |
44 | | -| Priority | Multi-tenant SLAs | 120 (hi-pri) | 3 000 (lo-pri) | |
45 | | -| TokenBudget | Throughput maximisation | 195 | 780 | |
| 39 | +| `mistral:7b` | 127 | 426 | ~300 | |
| 40 | +| `llama3.1:8b` | 262 | 434 | ~170 | |
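| | +
| | +To make the constant per-request cost concrete, here is a simplified sketch of the extra hops a request makes around the actual Ollama call. All names are illustrative assumptions, not the engine's actual API. |
| | +
| | +```python |
| | +import asyncio |
| | +
| | +async def submit(queue: asyncio.Queue, prompt: str) -> str: |
| | +    """Schedule a request: wrap it in a future, enqueue it, await the result.""" |
| | +    done: asyncio.Future = asyncio.get_running_loop().create_future() |
| | +    await queue.put((prompt, done))  # scheduling + queue hop |
| | +    return await done                # result mapped back to the caller |
| | +
| | +async def scheduler_loop(queue: asyncio.Queue, call_ollama) -> None: |
| | +    """Drain the queue and dispatch each request; the work around the call |
| | +    is roughly constant, independent of how many tokens are generated.""" |
| | +    while True: |
| | +        prompt, done = await queue.get() |
| | +        try: |
| | +            done.set_result(await call_ollama(prompt))  # inference dominates |
| | +        except Exception as exc: |
| | +            done.set_exception(exc) |
| | +``` |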
46 | 41 |
47 | | -Measured under 10 concurrent clients, 7B q4_K_M model. |
| 42 | +### 3. Concurrent Batching (4 parallel requests) |
48 | 43 |
49 | | ---- |
| 44 | +| Model | Sequential estimate (ms) | Concurrent wall (ms) | Speedup | |
| 45 | +|---|---|---|---| |
| 46 | +| `mistral:7b` — P1 | 572 | 491 | 1.17x | |
| 47 | +| `mistral:7b` — P2 | 476 | 438 | 1.08x | |
| 48 | +| `mistral:7b` — P3 | 476 | 438 | 1.09x | |
| 49 | +| `llama3.1:8b` — P1 | 1068 | 908 | 1.18x | |
| 50 | +| `llama3.1:8b` — P2 | 1032 | 880 | 1.17x | |
| 51 | +| `llama3.1:8b` — P3 | 1048 | 864 | 1.21x | |
50 | 52 |
51 | | -## Speculative Decoding Acceptance Rate |
| 53 | +Batching provides a consistent ~1.1–1.2x wall-time speedup for 4 concurrent requests on an M2 Air with a single loaded model. Gains are limited by Ollama's own single-threaded inference — the engine's concurrent `httpx` fan-out saturates Ollama's queue, but Ollama processes requests sequentially. The speedup comes from pipelining queue drain, HTTP connection reuse, and result dispatch. |
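| | +
| | +The fan-out pattern is easy to reproduce in isolation. The sketch below times 4 concurrent `httpx` requests against Ollama's `/api/generate` endpoint on its default port; note it bypasses the engine's scheduler, so it demonstrates only the HTTP-level concurrency, not the engine's batching policy. |
| | +
| | +```python |
| | +import asyncio |
| | +import time |
| | +
| | +import httpx |
| | +
| | +async def fan_out(prompts: list) -> None: |
| | +    # Fire all requests at once; Ollama serialises inference internally, |
| | +    # so the wall-time gain over sequential calls is modest (~1.1–1.2x). |
| | +    async with httpx.AsyncClient(base_url="http://localhost:11434", |
| | +                                 timeout=300.0) as client: |
| | +        start = time.perf_counter() |
| | +        tasks = [client.post("/api/generate", |
| | +                             json={"model": "mistral:7b", "prompt": p, |
| | +                                   "stream": False}) |
| | +                 for p in prompts] |
| | +        await asyncio.gather(*tasks) |
| | +        print(f"{len(prompts)} requests in {time.perf_counter() - start:.2f}s") |
| | +
| | +asyncio.run(fan_out(["What is 2+2?", "Capital of France?", |
| | +                     "Sky colour?", "Name a prime number."])) |
| | +``` |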
52 | 54 |
53 | | -| Draft model | Target model | Acceptance rate | Speedup | |
54 | | -|---|---|---|---| |
55 | | -| phi3:mini | llama3:8b | 68 % | 1.35× | |
56 | | -| phi3:mini | mistral:7b | 52 % | 1.18× | |
57 | | -| gemma:2b | llama3:8b | 44 % | 1.09× | |
| 55 | +### 4. deepseek-r1:7b — Reasoning Model |
58 | 56 |
59 | | -Acceptance rate is highly sensitive to prompt domain. |
60 | | -Speculative decoding is most effective for: |
61 | | -- Structured output (code, JSON) |
62 | | -- Repetitive or templated prompts |
63 | | -- Prompt–draft model family alignment |
| 57 | +deepseek-r1 generates an internal `<think>...</think>` chain before answering, producing hundreds of tokens even for trivial prompts. The measured baseline was ~50s per request. The engine's cache is the only optimization that provides a meaningful speedup (2ms hit vs 50s miss = ~25,000x); scheduling and batching do not help because the bottleneck is generation time. |
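| | +
| | +The reasoning chain is part of the response body, so consumers that only need the final answer can strip it client-side. A sketch is below (this is not an engine feature; it simply removes deepseek-r1's `<think>...</think>` block). Stripping only cleans the output; the ~50s generation cost is unchanged, which is why caching is the one optimization that helps. |
| | +
| | +```python |
| | +import re |
| | +
| | +# deepseek-r1 emits its chain of thought inside <think>...</think> |
| | +# before the final answer. |
| | +THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL) |
| | +
| | +def strip_reasoning(response_text: str) -> str: |
| | +    """Remove the reasoning block, keeping only the final answer.""" |
| | +    return THINK_BLOCK.sub("", response_text).strip() |
| | +
| | +print(strip_reasoning("<think>2 and 2 make 4...</think>\n2 + 2 = 4"))  # "2 + 2 = 4" |
| | +``` |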
64 | 58 |
65 | 59 | --- |
66 | 60 |
67 | | -## Cache Effectiveness |
| 61 | +## Per-Model Detail |
68 | 62 |
69 | | -| Cache hit rate | Effective RPS | Reduction in Ollama calls | |
70 | | -|---|---|---| |
71 | | -| 0 % | 3.4 | 0 % | |
72 | | -| 25 % | 4.5 | 25 % | |
73 | | -| 50 % | 6.1 | 50 % | |
74 | | -| 75 % | 9.8 | 75 % | |
| 63 | +### phi3:latest |
75 | 64 |
76 | | -`SemanticCache` uses exact `(model, prompt)` matching with configurable TTL |
77 | | -(default 300 s) and LRU eviction at 1 000 entries. |
| 65 | +| Prompt | Baseline (ms) | Cold (ms) | Hit (ms) | tok/s | |
| 66 | +|---|---|---|---|---| |
| 67 | +| "What is 2+2?" (cold model) | 3146 | 266 | 2 | 38.5 | |
| 68 | +| "Capital of France?" | 94 | 184 | 2 | 57.6 | |
| 69 | +| "Sky colour?" | 69 | 1860 | 2 | 77.0 | |
78 | 70 |
79 | | ---- |
| 71 | +Note: the prompt 1 baseline includes the model cold-load, as phi3 was the first model benchmarked. The warm baseline (P2, P3) is 69–94ms at 57–77 tok/s. |
80 | 72 |
81 | | -## Reproducing Benchmarks |
| 73 | +### mistral:7b |
82 | 74 |
83 | | -```bash |
84 | | -# Start Ollama first |
85 | | -ollama serve |
| 75 | +| Prompt | Baseline (ms) | Cold (ms) | Hit (ms) | tok/s | |
| 76 | +|---|---|---|---|---| |
| 77 | +| "What is 2+2?" | 143 | 840 | 3 | 35.1 | |
| 78 | +| "Capital of France?" | 119 | 219 | 3 | 39.5 | |
| 79 | +| "Sky colour?" | 119 | 219 | 3 | 39.4 | |
86 | 80 |
87 | | -# Run all benchmarks |
88 | | -python scripts/run_benchmarks.py --config configs/benchmarks.yaml |
| 81 | +### llama3.1:8b |
89 | 82 |
90 | | -# Start the API server |
91 | | -python scripts/start_server.py |
| 83 | +| Prompt | Baseline (ms) | Cold (ms) | Hit (ms) | tok/s | |
| 84 | +|---|---|---|---|---| |
| 85 | +| "What is 2+2?" | 267 | 464 | 3 | 27.1 | |
| 86 | +| "Capital of France?" | 258 | 403 | 2 | 27.6 | |
| 87 | +| "Sky colour?" | 262 | 434 | 3 | 27.3 | |
92 | 88 |
93 | | -# Run a quick load test (requires httpx) |
94 | | -python -c " |
95 | | -import asyncio, httpx, time |
| 89 | +### deepseek-r1:7b |
96 | 90 |
97 | | -async def main(): |
98 | | - async with httpx.AsyncClient(base_url='http://localhost:8000') as c: |
99 | | - start = time.perf_counter() |
100 | | - tasks = [c.post('/completions', json={'model': 'llama3:8b', 'prompt': 'Hello'}) |
101 | | - for _ in range(20)] |
102 | | - responses = await asyncio.gather(*tasks) |
103 | | - print(f'{len(responses)} requests in {time.perf_counter()-start:.2f}s') |
| 91 | +| Prompt | Baseline (ms) | Cold (ms) | Hit (ms) | tok/s | |
| 92 | +|---|---|---|---|---| |
| 93 | +| "What is 2+2?" | 49 783 | 107 266 | 2 | 9.0 | |
104 | 94 |
105 | | -asyncio.run(main()) |
106 | | -" |
107 | | -``` |
| 95 | +Only one prompt was benchmarked; reasoning-chain tokens drove generation time to ~50s. The cache hit was still 2ms. |
108 | 96 |
109 | 97 | --- |
110 | 98 |
111 | | -## Post-Phase-7 Performance Improvements (perf/1–8) |
112 | | - |
113 | | -The following improvements were landed after Phase 7 to address bottlenecks found |
114 | | -under production-level load testing: |
115 | | - |
116 | | -| Branch | Change | Observed Impact | |
117 | | -|---|---|---| |
118 | | -| `perf/1-batch-token-counter` | O(n²)→O(1) token tracking in `Batch` | Eliminates quadratic scheduling overhead at batch_size > 16 | |
119 | | -| `perf/2-cache-async-lock` | `asyncio.Lock` on `SemanticCache` | Prevents spurious cache misses and crashes under concurrent requests | |
120 | | -| `perf/3-scheduler-lock-free-queues` | `dict.setdefault()` for queue creation | Removes `async with self._lock` serialisation on every `submit()` | |
121 | | -| `perf/4-speculation-fixes` | Precompiled regex + case-exact token match | Reduces overhead per speculation round; fixes inflated acceptance rates | |
122 | | -| `perf/5-memoize-estimators` | `lru_cache` on weight/context estimators | Near-zero latency for repeated same-model estimates under load | |
123 | | -| `perf/6-queue-bounded-cancellation` | Guard `_cancelled` with `_queued_ids` | Prevents unbounded set growth from stale cancellation IDs | |
124 | | -| `perf/7-config-driven-server` | Wire `InferenceConfig` into server lifespan | Enables zero-code tuning via `configs/default.yaml` | |
125 | | -| `perf/8-backoff-jitter` | Jitter `+ uniform(0,1)` in retry sleep | Spreads retry storms across `[base·2^n, base·2^n + 1]` second window | |
| 99 | +## Memory Footprint by Quantization Level |
126 | 100 |
127 | | -### Configuration Tuning Reference |
| 101 | +Estimated for 7B-parameter models (applies to `mistral:7b` and `deepseek-r1:7b`): |
128 | 102 |
129 | | -After perf/7, all critical limits are in `configs/default.yaml`: |
| 103 | +| Quantization | Weights | KV-cache (2k tokens) | Total peak | |
| 104 | +|---|---|---|---| |
| 105 | +| fp16 | 14.0 GB | 0.84 GB | 14.84 GB | |
| 106 | +| q8_0 | 7.0 GB | 0.84 GB | 7.84 GB | |
| 107 | +| q4_K_M | 3.5 GB | 0.84 GB | 4.34 GB | |
| 108 | +| q4_0 | 3.2 GB | 0.84 GB | 4.04 GB | |
| 109 | +| q3_K_M | 2.7 GB | 0.84 GB | 3.54 GB | |
| 110 | +| q2_K | 2.0 GB | 0.84 GB | 2.84 GB | |
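| | +
| | +Weight size scales as params × bits per weight ÷ 8: 7B at fp16 is 7 × 2 bytes ≈ 14 GB, and at 4-bit roughly 7 × 0.5 bytes ≈ 3.5 GB. The KV-cache depends on context length, not weight quantization, so it stays at 0.84 GB across rows. |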
130 | 111 |
131 | | -```yaml |
132 | | -cache: |
133 | | - max_size: 1000 # LRU capacity (entries) |
134 | | - ttl_seconds: 300.0 # TTL before eviction |
| 112 | +All installed models use Q4_K_M. The engine's memory estimator uses these values for admission control (configured via `memory.limit_gb=14.0`). |
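| | +
| | +For illustration, here is how the admission decision against `memory.limit_gb` might look, using the peak estimates from the table above. The function and thresholds are a sketch (the 0.85 soft ratio mirrors the engine's documented `soft_limit_ratio` default), not the estimator's actual code. |
| | +
| | +```python |
| | +from enum import Enum |
| | +
| | +class Admission(Enum): |
| | +    ACCEPT = "accept" |
| | +    QUEUE = "queue"    # soft limit exceeded: hold until memory frees up |
| | +    REJECT = "reject"  # hard limit exceeded: fail fast |
| | +
| | +def admit(estimated_gb: float, in_use_gb: float, |
| | +          limit_gb: float = 14.0, soft_ratio: float = 0.85) -> Admission: |
| | +    """Illustrative admission check against the table's peak estimates, |
| | +    e.g. 4.34 GB for a 7B Q4_K_M model with a 2k-token context.""" |
| | +    projected = in_use_gb + estimated_gb |
| | +    if projected > limit_gb: |
| | +        return Admission.REJECT |
| | +    if projected > soft_ratio * limit_gb: |
| | +        return Admission.QUEUE |
| | +    return Admission.ACCEPT |
| | +
| | +assert admit(4.34, 0.0) is Admission.ACCEPT   # one 7B Q4_K_M model fits easily |
| | +assert admit(4.34, 8.0) is Admission.QUEUE    # 12.34 GB > 85% of 14 GB |
| | +assert admit(14.84, 0.0) is Admission.REJECT  # fp16 7B exceeds the 14 GB limit |
| | +``` |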
135 | 113 |
136 | | -scheduling: |
137 | | - policy: fcfs # fcfs | sjf | priority | token_budget |
138 | | - max_requests_per_batch: 8 |
139 | | - token_budget: 512 |
| 114 | +--- |
140 | 115 |
141 | | -memory: |
142 | | - limit_gb: 14.0 # Hard admission reject threshold |
143 | | - soft_limit_ratio: 0.85 # Soft limit for QUEUE decision |
| 116 | +## Reproducing |
144 | 117 |
145 | | -ollama: |
146 | | - retry_backoff_seconds: 1.0 # Base for exponential + jitter backoff |
| 118 | +```bash |
| 119 | +# Requires Ollama running with models pulled |
| 120 | +python scripts/start_server.py & |
| 121 | +python scripts/run_integration_benchmarks.py |
147 | 122 | ``` |
| 123 | + |
| 124 | +Outputs: `docs/benchmark_results.json`, `docs/PERFORMANCE_REPORT.md` |
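| | +
| | +For quick spot checks outside the benchmark script, a minimal loop that follows the methodology above (1 warmup run discarded, 3 measured runs averaged) can be pointed at Ollama directly. The endpoint and port are Ollama defaults; the model and prompt are placeholders. |
| | +
| | +```python |
| | +import statistics |
| | +import time |
| | +
| | +import httpx |
| | +
| | +def time_prompt(model: str, prompt: str, runs: int = 3) -> float: |
| | +    """1 warmup run discarded, `runs` measured runs averaged; returns mean ms.""" |
| | +    with httpx.Client(base_url="http://localhost:11434", timeout=300.0) as client: |
| | +        def once() -> float: |
| | +            start = time.perf_counter() |
| | +            client.post("/api/generate", |
| | +                        json={"model": model, "prompt": prompt, "stream": False}) |
| | +            return (time.perf_counter() - start) * 1000 |
| | +        once()  # warmup (also absorbs model cold-load), discarded |
| | +        return statistics.mean(once() for _ in range(runs)) |
| | +
| | +print(f"{time_prompt('mistral:7b', 'What is 2+2?'):.0f} ms") |
| | +``` |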