
Commit 080d0cb (parent 5db2a13)

munimx and Copilot committed:

docs: update README with new features and architecture

Update architecture diagram, add Key Features section (streaming, auth, Prometheus, semantic cache, multi-backend, coalescing), add auth config knobs, update test count to 570.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

1 file changed: README.md (27 additions, 7 deletions)
@@ -1,22 +1,28 @@
 # LLM Inference Optimization Engine
 
-Request scheduling, semantic caching, and speculative decoding middleware for [Ollama](https://ollama.ai/), exposing an OpenAI-compatible HTTP API.
+Request scheduling, caching, streaming, and inference orchestration middleware for [Ollama](https://ollama.ai/) (with an extensible multi-backend interface), exposing an OpenAI-compatible HTTP API.
 
 [![CI](https://github.com/munimx/LLM-Inference-Optimization-Engine/actions/workflows/ci.yml/badge.svg)](https://github.com/munimx/LLM-Inference-Optimization-Engine/actions/workflows/ci.yml)
 [![Python 3.11+](https://img.shields.io/badge/python-3.11%2B-blue)](https://www.python.org/downloads/)
 [![License: Apache 2.0](https://img.shields.io/badge/license-Apache%202.0-green)](LICENSE)
 
 ---
 
-Sits between your application and Ollama. Incoming completions requests are checked against a semantic cache, queued by configurable scheduling policy, dispatched to Ollama in concurrent batches, and returned via per-request futures. A draft-model speculation loop and adaptive memory throttler are available as opt-in layers.
+Sits between your application and Ollama (or other inference backends). Incoming requests are checked against an exact-match or embedding-based semantic cache, queued by configurable scheduling policy, dispatched to the backend in concurrent batches, and streamed back via SSE or returned as a complete response. Features include API-key authentication, Prometheus metrics, request coalescing, and adaptive memory throttling.
 
 ## Architecture
 
 ```
-POST /completions
+POST /completions or /chat/completions
        │
        │
-SemanticCache ──── hit ───────────────────────────▶ response
+API-Key Auth (optional)
+       │
+       │
+RequestCoalescer ── dedup identical in-flight requests
+       │
+       │
+ExactMatchCache / EmbeddingCache ── hit ─────────────▶ response
        │ miss
        │
 RequestAggregator
@@ -25,13 +31,13 @@ RequestAggregator
 Scheduler (per-model RequestQueue + SchedulingPolicy)
        │
        │
-dispatch_batch() ── concurrent httpx ──▶ Ollama
+dispatch_batch() ── concurrent httpx ──▶ Ollama / InferenceBackend
        │
        │
 ResultMapper (asyncio.Future per request)
        │
        │
-CompletionResponse
+CompletionResponse or SSE stream
 ```
 
 See [docs/architecture.md](docs/architecture.md) for component details.
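To make the new streamed path concrete: a client opts in with `"stream": true` and reads Server-Sent Events back from the OpenAI-compatible endpoint. The sketch below is illustrative only; the base URL `http://localhost:8000` and the model name `llama3` are assumptions, not values taken from this repository.

```python
# Client sketch: stream tokens over SSE from the engine's
# OpenAI-compatible /chat/completions endpoint ("stream": true).
# Assumptions not in the diff: base URL http://localhost:8000 and
# model name "llama3" are illustrative placeholders.
import json

import httpx

payload = {
    "model": "llama3",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "stream": True,
}

with httpx.stream("POST", "http://localhost:8000/chat/completions",
                  json=payload, timeout=None) as response:
    for line in response.iter_lines():
        # SSE frames arrive as "data: {...}" lines; "[DONE]" ends the stream.
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        print(json.loads(data), flush=True)
```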
@@ -82,13 +88,27 @@ All settings are in `configs/default.yaml`. The key knobs:
 | `cache.max_size` | `256` | LRU capacity (entries) |
 | `cache.ttl_seconds` | `300` | Seconds before a cache entry is treated as a miss |
 | `memory.limit_gb` | `14.0` | Hard admission reject threshold (M2 Air default) |
+| `auth.enabled` | `false` | Enable API-key authentication |
+| `auth.api_keys` | `[]` | List of valid Bearer tokens |
 | `ollama.retry_count` | `3` | Retries on Ollama transport errors |
 | `ollama.retry_backoff_seconds` | `1.0` | Base for exponential + jitter backoff |
 
+## Key Features
+
+- **SSE Streaming** — `"stream": true` proxies Ollama's token-by-token output via Server-Sent Events
+- **Chat completions** — `/chat/completions` uses Ollama's native `/api/chat` with structured messages
+- **Exact-match cache** — fast LRU cache keyed on `(model, prompt)`
+- **Semantic cache** — embedding-based similarity matching via Ollama's `/api/embed` (opt-in)
+- **Request coalescing** — identical in-flight requests are deduplicated
+- **API-key auth** — optional Bearer-token authentication middleware
+- **Prometheus metrics** — scrapable at `GET /metrics/prometheus`
+- **Multi-backend interface** — abstract `InferenceBackend` ABC; Ollama adapter included, extensible to vLLM/TGI/llama.cpp
+- **Prompt token counting** — uses Ollama's `prompt_eval_count` with char/4 fallback
+
 ## Development
 
 ```bash
-# Tests (no Ollama required, ~2 s, 539 tests)
+# Tests (no Ollama required, ~2 s, 570 tests)
 pytest tests/unit/ --no-cov
 
 # Coverage report