# LLM Inference Optimization Engine

Request scheduling, caching, streaming, and inference orchestration middleware for [Ollama](https://ollama.ai/) (with an extensible multi-backend interface), exposing an OpenAI-compatible HTTP API.

Sits between your application and Ollama (or other inference backends). Incoming requests are checked against an exact-match or embedding-based semantic cache, queued by configurable scheduling policy, dispatched to the backend in concurrent batches, and streamed back via SSE or returned as a complete response. Features include API-key authentication, Prometheus metrics, request coalescing, and adaptive memory throttling.
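
Because the server speaks the OpenAI wire format, any OpenAI-compatible client can drive it. Below is a minimal sketch using the official `openai` Python SDK; the listen address (`localhost:8000`), API key (`sk-local`), and model name (`llama3`) are placeholder assumptions, not values taken from this repository:

```python
# Minimal client sketch against the middleware's OpenAI-compatible API.
# Assumptions (not from this README): the proxy listens on localhost:8000
# and accepts the API key "sk-local"; adjust both to your deployment.
from openai import OpenAI

# No /v1 prefix: the architecture diagram shows bare /chat/completions routes.
client = OpenAI(base_url="http://localhost:8000", api_key="sk-local")

# Non-streaming: the scheduler queues and dispatches the request, and the
# middleware returns one complete response.
resp = client.chat.completions.create(
    model="llama3",  # any model your Ollama instance serves
    messages=[{"role": "user", "content": "Say hello."}],
)
print(resp.choices[0].message.content)

# Streaming: tokens arrive incrementally over SSE.
stream = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Count to five."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```

On a repeated or semantically similar prompt, the second call should be served from the cache rather than re-dispatched to the backend.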

## Architecture

```
POST /completions or /chat/completions
        │
        ▼
SemanticCache ──── hit ───────────────────────────▶ response