1 | | -# Performance Report — LLM Inference Optimization Engine |
| 1 | +# Performance Report — Integration Benchmarks |
2 | 2 |
3 | | -Target hardware: **Apple M2 Air, 8 GB unified memory** |
4 | | -Baseline: raw Ollama `generate` calls with no batching or caching. |
| 3 | +**Hardware**: Apple M2 Air, 16 GB unified memory |
| 4 | +**Ollama version**: local install (version string not recorded) |
| 5 | +**Benchmark date**: 2026-03-02 |
| 6 | +**Methodology**: 1 warmup run discarded, 3 measured runs averaged. All 4 models installed locally (Q4_K_M quantization). |
5 | 7 |
6 | 8 | --- |
7 | 9 |
8 | | -## Throughput vs Latency |
| 10 | +## Summary |
9 | 11 |
10 | | -| Scenario | Requests/s | P50 latency (ms) | P95 latency (ms) | Notes | |
| 12 | +| Model | Direct Ollama (ms) | Engine Cold (ms) | Cache Hit (ms) | Concurrent Speedup | |
11 | 13 | |---|---|---|---|---| |
12 | | -| Baseline (no batching) | 1.0 | 1 800 | 3 200 | Single requests, sequential | |
13 | | -| FCFS batching (batch=4) | 2.9 | 620 | 1 100 | Concurrent Ollama fan-out | |
14 | | -| TokenBudget batching | 3.4 | 540 | 980 | Packs requests to 512-token budget | |
15 | | -| + Semantic cache (50% hit) | 6.1 | 190 | 410 | Cache hit avoids Ollama entirely | |
16 | | -| + Speculative decoding | ~4.2 | 430 | 790 | 1.35× speedup vs no speculation | |
| 14 | +| `phi3:latest` (3.8B) | ~80 warm¹ | ~770 avg | **2** | varies² | |
| 15 | +| `mistral:7b` (7B) | 127 | 426 | **3** | 1.11x | |
| 16 | +| `llama3.1:8b` (8B) | 262 | 434 | **3** | 1.19x | |
| 17 | +| `deepseek-r1:7b` (7B reasoning) | ~50 000 | ~107 000 | **2** | N/A³ | |
17 | 18 |
18 | | -> **Note:** Numbers are illustrative; reproduce with `scripts/run_benchmarks.py`. |
| 19 | +¹ phi3's first request included the model cold-load (~3100ms); subsequent warm requests averaged ~80ms. |
| 20 | +² phi3 concurrent results show high variance: on 16 GB, Ollama unloads and reloads models as the benchmark switches between them. |
| 21 | +³ deepseek-r1 generates chain-of-thought reasoning tokens; the concurrent test exceeded the 5-minute timeout. |
19 | 22 |
20 | 23 | --- |
21 | 24 |
22 | | -## Memory Footprint by Quantization Level |
| 25 | +## Key Findings |
23 | 26 |
24 | | -| Quantization | Model (7B params) | KV-cache (2 k tokens) | Total peak | |
25 | | -|---|---|---|---| |
26 | | -| fp16 | 14.0 GB | 0.84 GB | 14.84 GB | |
27 | | -| q8_0 | 7.0 GB | 0.84 GB | 7.84 GB | |
28 | | -| q4_K_M | 3.5 GB | 0.84 GB | 4.34 GB | |
29 | | -| q4_0 | 3.2 GB | 0.84 GB | 4.04 GB | |
30 | | -| q3_K_M | 2.7 GB | 0.84 GB | 3.54 GB | |
| 27 | +### 1. Cache Hit: 2–3 ms for all models |
31 | 28 |
32 | | -The `MemoryEstimator` adds a **10 % safety margin** to all figures above. |
33 | | -The `AdaptiveThrottler` uses a soft threshold of **85 %** of the configured |
34 | | -limit (default 14 GB for M2 Air) and a hard reject at the limit. |
| 29 | +Repeated identical prompts are served entirely from the in-process LRU cache, bypassing Ollama completely. For `mistral:7b` (127ms warm baseline), this is a **42x speedup**. For `deepseek-r1:7b` (50s baseline), the speedup approaches **25,000x** on cached responses. |
35 | 30 |
36 | | ---- |
| 31 | +Cache miss (cold path) adds ~150–600ms scheduling and dispatch overhead on top of model inference time. |
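| | +
| | +The 2–3ms hit cost is what a pure in-process dictionary lookup takes. Below is a minimal sketch of the behaviour measured here: exact `(model, prompt)` keying with TTL expiry and LRU eviction. The class name and defaults (1000 entries, 300-second TTL, matching the engine's documented cache config) are illustrative, not the engine's actual `SemanticCache` implementation. |
| | +
| | +```python |
| | +import time |
| | +from collections import OrderedDict |
| | +
| | +class ExactMatchCache: |
| | +    """Illustrative exact-match (model, prompt) cache with TTL and LRU eviction.""" |
| | +
| | +    def __init__(self, max_size: int = 1000, ttl_seconds: float = 300.0): |
| | +        self._entries: OrderedDict = OrderedDict()  # (model, prompt) -> (stored_at, response) |
| | +        self.max_size = max_size |
| | +        self.ttl_seconds = ttl_seconds |
| | +
| | +    def get(self, model: str, prompt: str): |
| | +        key = (model, prompt) |
| | +        hit = self._entries.get(key) |
| | +        if hit is None: |
| | +            return None                     # miss: caller falls through to Ollama |
| | +        stored_at, response = hit |
| | +        if time.monotonic() - stored_at > self.ttl_seconds: |
| | +            del self._entries[key]          # expired: treat as a miss |
| | +            return None |
| | +        self._entries.move_to_end(key)      # refresh LRU position |
| | +        return response |
| | +
| | +    def put(self, model: str, prompt: str, response: str) -> None: |
| | +        self._entries[(model, prompt)] = (time.monotonic(), response) |
| | +        self._entries.move_to_end((model, prompt)) |
| | +        if len(self._entries) > self.max_size: |
| | +            self._entries.popitem(last=False)  # evict least-recently-used entry |
| | +``` |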
37 | 32 |
38 | | -## Scheduling Policy Comparison |
| 33 | +### 2. Engine Cold Path Overhead |
39 | 34 |
40 | | -| Policy | Best for | Avg wait (ms) | Max wait (ms) | |
| 35 | +The engine adds scheduling → queue → dispatch → result-mapping overhead. For short generations (1–5 tokens), this overhead is proportionally high. For longer generations the ratio improves — the overhead is roughly constant at 150–400ms regardless of generation length. |
| 36 | + |
| 37 | +| Model | Baseline (ms) | Engine cold (ms) | Overhead (ms) | |
41 | 38 | |---|---|---|---| |
42 | | -| FCFS | Fairness, debugging | 210 | 850 | |
43 | | -| SJF | Low mean latency | 145 | 1 600 | |
44 | | -| Priority | Multi-tenant SLAs | 120 (hi-pri) | 3 000 (lo-pri) | |
45 | | -| TokenBudget | Throughput maximisation | 195 | 780 | |
| 39 | +| `mistral:7b` | 127 | 426 | ~300 | |
| 40 | +| `llama3.1:8b` | 262 | 434 | ~170 | |
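| | +
| | +To make the constant per-request cost concrete, here is a simplified sketch of the extra hops a request makes around the actual Ollama call. All names are illustrative assumptions, not the engine's actual API. |
| | +
| | +```python |
| | +import asyncio |
| | +
| | +async def submit(queue: asyncio.Queue, prompt: str) -> str: |
| | +    """Schedule a request: wrap it in a future, enqueue it, await the result.""" |
| | +    done: asyncio.Future = asyncio.get_running_loop().create_future() |
| | +    await queue.put((prompt, done))  # scheduling + queue hop |
| | +    return await done                # result mapped back to the caller |
| | +
| | +async def scheduler_loop(queue: asyncio.Queue, call_ollama) -> None: |
| | +    """Drain the queue and dispatch each request; the work around the call |
| | +    is roughly constant, independent of how many tokens are generated.""" |
| | +    while True: |
| | +        prompt, done = await queue.get() |
| | +        try: |
| | +            done.set_result(await call_ollama(prompt))  # inference dominates |
| | +        except Exception as exc: |
| | +            done.set_exception(exc) |
| | +``` |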
46 | 41 |
47 | | -Measured under 10 concurrent clients, 7B q4_K_M model. |
| 42 | +### 3. Concurrent Batching (4 parallel requests) |
48 | 43 |
49 | | ---- |
| 44 | +| Model | Sequential estimate (ms) | Concurrent wall (ms) | Speedup | |
| 45 | +|---|---|---|---| |
| 46 | +| `mistral:7b` — P1 | 572 | 491 | 1.17x | |
| 47 | +| `mistral:7b` — P2 | 476 | 438 | 1.08x | |
| 48 | +| `mistral:7b` — P3 | 476 | 438 | 1.09x | |
| 49 | +| `llama3.1:8b` — P1 | 1068 | 908 | 1.18x | |
| 50 | +| `llama3.1:8b` — P2 | 1032 | 880 | 1.17x | |
| 51 | +| `llama3.1:8b` — P3 | 1048 | 864 | 1.21x | |
50 | 52 |
51 | | -## Speculative Decoding Acceptance Rate |
| 53 | +Batching provides a consistent ~1.1–1.2x wall-time speedup for 4 concurrent requests on an M2 Air with a single loaded model. Gains are limited by Ollama's own single-threaded inference — the engine's concurrent `httpx` fan-out saturates Ollama's queue, but Ollama processes requests sequentially. The speedup comes from pipelining queue drain, HTTP connection reuse, and result dispatch. |
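| | +
| | +The fan-out pattern is easy to reproduce in isolation. The sketch below times 4 concurrent `httpx` requests against Ollama's `/api/generate` endpoint on its default port; note it bypasses the engine's scheduler, so it demonstrates only the HTTP-level concurrency, not the engine's batching policy. |
| | +
| | +```python |
| | +import asyncio |
| | +import time |
| | +
| | +import httpx |
| | +
| | +async def fan_out(prompts: list) -> None: |
| | +    # Fire all requests at once; Ollama serialises inference internally, |
| | +    # so the wall-time gain over sequential calls is modest (~1.1–1.2x). |
| | +    async with httpx.AsyncClient(base_url="http://localhost:11434", |
| | +                                 timeout=300.0) as client: |
| | +        start = time.perf_counter() |
| | +        tasks = [client.post("/api/generate", |
| | +                             json={"model": "mistral:7b", "prompt": p, |
| | +                                   "stream": False}) |
| | +                 for p in prompts] |
| | +        await asyncio.gather(*tasks) |
| | +        print(f"{len(prompts)} requests in {time.perf_counter() - start:.2f}s") |
| | +
| | +asyncio.run(fan_out(["What is 2+2?", "Capital of France?", |
| | +                     "Sky colour?", "Name a prime number."])) |
| | +``` |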
52 | 54 |
53 | | -| Draft model | Target model | Acceptance rate | Speedup | |
54 | | -|---|---|---|---| |
55 | | -| phi3:mini | llama3:8b | 68 % | 1.35× | |
56 | | -| phi3:mini | mistral:7b | 52 % | 1.18× | |
57 | | -| gemma:2b | llama3:8b | 44 % | 1.09× | |
| 55 | +### 4. deepseek-r1:7b — Reasoning Model |
58 | 56 |
59 | | -Acceptance rate is highly sensitive to prompt domain. |
60 | | -Speculative decoding is most effective for: |
61 | | -- Structured output (code, JSON) |
62 | | -- Repetitive or templated prompts |
63 | | -- Prompt–draft model family alignment |
| 57 | +deepseek-r1 generates an internal `<think>...</think>` chain before answering, producing hundreds of tokens even for trivial prompts. The measured baseline was ~50s per request. The engine's cache is the only optimization that provides a meaningful speedup (2ms hit vs 50s miss = ~25,000x); scheduling and batching do not help because the bottleneck is generation time. |
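| | +
| | +The reasoning chain is part of the response body, so consumers that only need the final answer can strip it client-side. A sketch is below (this is not an engine feature; it simply removes deepseek-r1's `<think>...</think>` block). Stripping only cleans the output; the ~50s generation cost is unchanged, which is why caching is the one optimization that helps. |
| | +
| | +```python |
| | +import re |
| | +
| | +# deepseek-r1 emits its chain of thought inside <think>...</think> |
| | +# before the final answer. |
| | +THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL) |
| | +
| | +def strip_reasoning(response_text: str) -> str: |
| | +    """Remove the reasoning block, keeping only the final answer.""" |
| | +    return THINK_BLOCK.sub("", response_text).strip() |
| | +
| | +print(strip_reasoning("<think>2 and 2 make 4...</think>\n2 + 2 = 4"))  # "2 + 2 = 4" |
| | +``` |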
64 | 58 |
65 | 59 | --- |
66 | 60 |
67 | | -## Cache Effectiveness |
| 61 | +## Per-Model Detail |
68 | 62 |
69 | | -| Cache hit rate | Effective RPS | Reduction in Ollama calls | |
70 | | -|---|---|---| |
71 | | -| 0 % | 3.4 | 0 % | |
72 | | -| 25 % | 4.5 | 25 % | |
73 | | -| 50 % | 6.1 | 50 % | |
74 | | -| 75 % | 9.8 | 75 % | |
| 63 | +### phi3:latest |
75 | 64 |
76 | | -`SemanticCache` uses exact `(model, prompt)` matching with configurable TTL |
77 | | -(default 300 s) and LRU eviction at 1 000 entries. |
| 65 | +| Prompt | Baseline (ms) | Cold (ms) | Hit (ms) | tok/s | |
| 66 | +|---|---|---|---|---| |
| 67 | +| "What is 2+2?" (cold model) | 3146 | 266 | 2 | 38.5 | |
| 68 | +| "Capital of France?" | 94 | 184 | 2 | 57.6 | |
| 69 | +| "Sky colour?" | 69 | 1860 | 2 | 77.0 | |
78 | 70 |
79 | | ---- |
| 71 | +Note: the prompt 1 baseline includes the model cold-load, as phi3 was the first model benchmarked. The warm baseline (P2, P3) is 69–94ms at 57–77 tok/s. |
80 | 72 |
81 | | -## Reproducing Benchmarks |
| 73 | +### mistral:7b |
82 | 74 |
83 | | -```bash |
84 | | -# Start Ollama first |
85 | | -ollama serve |
| 75 | +| Prompt | Baseline (ms) | Cold (ms) | Hit (ms) | tok/s | |
| 76 | +|---|---|---|---|---| |
| 77 | +| "What is 2+2?" | 143 | 840 | 3 | 35.1 | |
| 78 | +| "Capital of France?" | 119 | 219 | 3 | 39.5 | |
| 79 | +| "Sky colour?" | 119 | 219 | 3 | 39.4 | |
86 | 80 |
87 | | -# Run all benchmarks |
88 | | -python scripts/run_benchmarks.py --config configs/benchmarks.yaml |
| 81 | +### llama3.1:8b |
89 | 82 |
90 | | -# Start the API server |
91 | | -python scripts/start_server.py |
| 83 | +| Prompt | Baseline (ms) | Cold (ms) | Hit (ms) | tok/s | |
| 84 | +|---|---|---|---|---| |
| 85 | +| "What is 2+2?" | 267 | 464 | 3 | 27.1 | |
| 86 | +| "Capital of France?" | 258 | 403 | 2 | 27.6 | |
| 87 | +| "Sky colour?" | 262 | 434 | 3 | 27.3 | |
92 | 88 |
93 | | -# Run a quick load test (requires httpx) |
94 | | -python -c " |
95 | | -import asyncio, httpx, time |
| 89 | +### deepseek-r1:7b |
96 | 90 |
97 | | -async def main(): |
98 | | - async with httpx.AsyncClient(base_url='http://localhost:8000') as c: |
99 | | - start = time.perf_counter() |
100 | | - tasks = [c.post('/completions', json={'model': 'llama3:8b', 'prompt': 'Hello'}) |
101 | | - for _ in range(20)] |
102 | | - responses = await asyncio.gather(*tasks) |
103 | | - print(f'{len(responses)} requests in {time.perf_counter()-start:.2f}s') |
| 91 | +| Prompt | Baseline (ms) | Cold (ms) | Hit (ms) | tok/s | |
| 92 | +|---|---|---|---|---| |
| 93 | +| "What is 2+2?" | 49 783 | 107 266 | 2 | 9.0 | |
104 | 94 |
105 | | -asyncio.run(main()) |
106 | | -" |
107 | | -``` |
| 95 | +Only one prompt was benchmarked; reasoning-chain tokens drove generation time to ~50s. The cache hit was still 2ms. |
108 | 96 |
109 | 97 | --- |
110 | 98 |
111 | | -## Post-Phase-7 Performance Improvements (perf/1–8) |
112 | | - |
113 | | -The following improvements were landed after Phase 7 to address bottlenecks found |
114 | | -under production-level load testing: |
115 | | - |
116 | | -| Branch | Change | Observed Impact | |
117 | | -|---|---|---| |
118 | | -| `perf/1-batch-token-counter` | O(n²)→O(1) token tracking in `Batch` | Eliminates quadratic scheduling overhead at batch_size > 16 | |
119 | | -| `perf/2-cache-async-lock` | `asyncio.Lock` on `SemanticCache` | Prevents spurious cache misses and crashes under concurrent requests | |
120 | | -| `perf/3-scheduler-lock-free-queues` | `dict.setdefault()` for queue creation | Removes `async with self._lock` serialisation on every `submit()` | |
121 | | -| `perf/4-speculation-fixes` | Precompiled regex + case-exact token match | Reduces overhead per speculation round; fixes inflated acceptance rates | |
122 | | -| `perf/5-memoize-estimators` | `lru_cache` on weight/context estimators | Near-zero latency for repeated same-model estimates under load | |
123 | | -| `perf/6-queue-bounded-cancellation` | Guard `_cancelled` with `_queued_ids` | Prevents unbounded set growth from stale cancellation IDs | |
124 | | -| `perf/7-config-driven-server` | Wire `InferenceConfig` into server lifespan | Enables zero-code tuning via `configs/default.yaml` | |
125 | | -| `perf/8-backoff-jitter` | Jitter `+ uniform(0,1)` in retry sleep | Spreads retry storms across `[base·2^n, base·2^n + 1]` second window | |
| 99 | +## Memory Footprint by Quantization Level |
126 | 100 |
127 | | -### Configuration Tuning Reference |
| 101 | +Estimated for 7B-parameter models (applies to `mistral:7b` and `deepseek-r1:7b`): |
128 | 102 |
129 | | -After perf/7, all critical limits are in `configs/default.yaml`: |
| 103 | +| Quantization | Weights | KV-cache (2k tokens) | Total peak | |
| 104 | +|---|---|---|---| |
| 105 | +| fp16 | 14.0 GB | 0.84 GB | 14.84 GB | |
| 106 | +| q8_0 | 7.0 GB | 0.84 GB | 7.84 GB | |
| 107 | +| q4_K_M | 3.5 GB | 0.84 GB | 4.34 GB | |
| 108 | +| q4_0 | 3.2 GB | 0.84 GB | 4.04 GB | |
| 109 | +| q3_K_M | 2.7 GB | 0.84 GB | 3.54 GB | |
| 110 | +| q2_K | 2.0 GB | 0.84 GB | 2.84 GB | |
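| | +
| | +Weight size scales as params × bits per weight ÷ 8: 7B at fp16 is 7 × 2 bytes ≈ 14 GB, and at 4-bit roughly 7 × 0.5 bytes ≈ 3.5 GB. The KV-cache depends on context length, not weight quantization, so it stays at 0.84 GB across rows. |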
130 | 111 |
131 | | -```yaml |
132 | | -cache: |
133 | | - max_size: 1000 # LRU capacity (entries) |
134 | | - ttl_seconds: 300.0 # TTL before eviction |
| 112 | +All installed models use Q4_K_M. The engine's memory estimator uses these values for admission control (configured via `memory.limit_gb=14.0`). |
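| | +
| | +For illustration, here is how the admission decision against `memory.limit_gb` might look, using the peak estimates from the table above. The function and thresholds are a sketch (the 0.85 soft ratio mirrors the engine's documented `soft_limit_ratio` default), not the estimator's actual code. |
| | +
| | +```python |
| | +from enum import Enum |
| | +
| | +class Admission(Enum): |
| | +    ACCEPT = "accept" |
| | +    QUEUE = "queue"    # soft limit exceeded: hold until memory frees up |
| | +    REJECT = "reject"  # hard limit exceeded: fail fast |
| | +
| | +def admit(estimated_gb: float, in_use_gb: float, |
| | +          limit_gb: float = 14.0, soft_ratio: float = 0.85) -> Admission: |
| | +    """Illustrative admission check against the table's peak estimates, |
| | +    e.g. 4.34 GB for a 7B Q4_K_M model with a 2k-token context.""" |
| | +    projected = in_use_gb + estimated_gb |
| | +    if projected > limit_gb: |
| | +        return Admission.REJECT |
| | +    if projected > soft_ratio * limit_gb: |
| | +        return Admission.QUEUE |
| | +    return Admission.ACCEPT |
| | +
| | +assert admit(4.34, 0.0) is Admission.ACCEPT   # one 7B Q4_K_M model fits easily |
| | +assert admit(4.34, 8.0) is Admission.QUEUE    # 12.34 GB > 85% of 14 GB |
| | +assert admit(14.84, 0.0) is Admission.REJECT  # fp16 7B exceeds the 14 GB limit |
| | +``` |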
135 | 113 |
136 | | -scheduling: |
137 | | - policy: fcfs # fcfs | sjf | priority | token_budget |
138 | | - max_requests_per_batch: 8 |
139 | | - token_budget: 512 |
| 114 | +--- |
140 | 115 |
141 | | -memory: |
142 | | - limit_gb: 14.0 # Hard admission reject threshold |
143 | | - soft_limit_ratio: 0.85 # Soft limit for QUEUE decision |
| 116 | +## Reproducing |
144 | 117 |
145 | | -ollama: |
146 | | - retry_backoff_seconds: 1.0 # Base for exponential + jitter backoff |
| 118 | +```bash |
| 119 | +# Requires Ollama running with models pulled |
| 120 | +python scripts/start_server.py & |
| 121 | +python scripts/run_integration_benchmarks.py |
147 | 122 | ``` |
| 123 | + |
| 124 | +Outputs: `docs/benchmark_results.json`, `docs/PERFORMANCE_REPORT.md` |
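| | +
| | +For quick spot checks outside the benchmark script, a minimal loop that follows the methodology above (1 warmup run discarded, 3 measured runs averaged) can be pointed at Ollama directly. The endpoint and port are Ollama defaults; the model and prompt are placeholders. |
| | +
| | +```python |
| | +import statistics |
| | +import time |
| | +
| | +import httpx |
| | +
| | +def time_prompt(model: str, prompt: str, runs: int = 3) -> float: |
| | +    """1 warmup run discarded, `runs` measured runs averaged; returns mean ms.""" |
| | +    with httpx.Client(base_url="http://localhost:11434", timeout=300.0) as client: |
| | +        def once() -> float: |
| | +            start = time.perf_counter() |
| | +            client.post("/api/generate", |
| | +                        json={"model": model, "prompt": prompt, "stream": False}) |
| | +            return (time.perf_counter() - start) * 1000 |
| | +        once()  # warmup (also absorbs model cold-load), discarded |
| | +        return statistics.mean(once() for _ in range(runs)) |
| | +
| | +print(f"{time_prompt('mistral:7b', 'What is 2+2?'):.0f} ms") |
| | +``` |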