⚠️ DEPRECATED — This project is archived and no longer maintained. Development has moved to munimx/llm-semantic-cache.
An OpenAI-compatible proxy middleware for vLLM. It sat between an application and one or more vLLM instances, adding Redis-backed response caching, cross-worker request coalescing, token-count-based model routing, and KV-cache-pressure-aware admission control.
This project went through two iterations. The first was an Ollama middleware — a foundation that turned out to be wrong. The second (this repo) was a vLLM proxy, which was cleaner but ultimately a worse version of LiteLLM. LiteLLM already exists, is production-grade, has enterprise backing, and covers every feature here plus hundreds more. There is no credible answer to "why not just use LiteLLM?" for a generic proxy layer.
The successor, munimx/llm-semantic-cache, is a focused Python library that adds semantic caching in front of any OpenAI-compatible LLM API. It does one thing well: understand whether two prompts are asking the same thing, and skip the redundant LLM call if they are.
```shell
# Prerequisites: Python 3.11+, vLLM on localhost:8080, Redis on localhost:6379
pip install -e .
uvicorn llm_inference_engine.api.server:app --host 0.0.0.0 --port 8000
```

```shell
# Text completion
curl http://localhost:8000/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistralai/Mistral-7B-Instruct-v0.2", "prompt": "Explain quicksort", "max_tokens": 128}'

# Chat completion (streaming)
curl http://localhost:8000/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistralai/Mistral-7B-Instruct-v0.2", "messages": [{"role": "user", "content": "Explain quicksort"}], "stream": true}'
```

```
┌─────────────┐     ┌──────────────────────────────────────────────────────────────┐     ┌────────────┐
│ Application │────▶│              LLM Inference Optimization Engine               │────▶│  vLLM(s)   │
│             │◀────│  ModelRouter → Cache → Coalescer → Throttler → BackendPool   │◀────│            │
└─────────────┘     └──────────────────────────────────────────────────────────────┘     └────────────┘
                                            │   ▲
                                            ▼   │
                                         ┌────────┐
                                         │ Redis  │
                                         └────────┘
```
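The Coalescer stage shown in the diagram deduplicates identical in-flight requests so that a single backend call serves all of them. Below is a single-process asyncio analogue of the idea; the real RequestCoalescer coordinates across workers via Redis SET NX and pub/sub, and every name in this sketch is illustrative rather than the engine's actual API.

```python
import asyncio

# Illustrative single-process analogue of cross-worker coalescing.
class Coalescer:
    def __init__(self) -> None:
        self._inflight: dict[str, asyncio.Future] = {}

    async def fetch(self, key: str, compute):
        if key in self._inflight:
            # Same key already in flight: wait for its result instead of
            # issuing a duplicate backend call.
            return await self._inflight[key]
        fut = asyncio.get_running_loop().create_future()
        self._inflight[key] = fut
        try:
            result = await compute()
            fut.set_result(result)
            return result
        except Exception as exc:
            fut.set_exception(exc)  # propagate the failure to waiters too
            raise
        finally:
            del self._inflight[key]

async def demo():
    calls = 0

    async def backend():
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)  # stand-in for LLM latency
        return "response"

    c = Coalescer()
    results = await asyncio.gather(
        *(c.fetch("prompt-hash", backend) for _ in range(5))
    )
    return calls, results

calls, results = asyncio.run(demo())
print(calls)  # 1 — five identical requests, one backend call
```

The Redis-backed version follows the same shape, with SET NX electing one worker to call the backend and pub/sub broadcasting the result to the rest.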
| Layer | Component | Purpose |
|---|---|---|
| API | FastAPI server, Pydantic models | OpenAI-compatible HTTP endpoints |
| Routing | ModelRouter, FallbackRouter | Token-count-based fast/large model selection; stale-cache fallback |
| Cache | RedisCache | Redis-backed LRU response cache with TTL |
| Coalescing | RequestCoalescer | Cross-worker deduplication via Redis SET NX + pub/sub |
| Admission | AdaptiveThrottler | ACCEPT / QUEUE / REJECT based on live `vllm:kv_cache_usage_perc` |
| Reliability | BackendPool, CircuitBreaker | Round-robin pool with per-backend circuit breaker |
| Integration | VLLMBackend | Async httpx client targeting vLLM's OpenAI-compatible API |
See docs/architecture.md for component details and the full request lifecycle.
- OpenAI-compatible API — `/completions` and `/chat/completions` with the same request/response schema used by the OpenAI SDK
- SSE streaming — real-time token-by-token delivery with `"stream": true`
- Redis response cache — LRU eviction + TTL; shared across all workers and replicas
- Cross-worker coalescing — identical concurrent requests share one backend call via Redis pub/sub
- Automatic model routing — short prompts go to the fast model, long prompts to the large model; an explicit `model` field overrides routing
- KV-cache admission control — polls `vllm:kv_cache_usage_perc`; queues or rejects when GPU memory is under pressure
- Backend pool + circuit breaker — round-robin across multiple vLLM instances; open circuits are skipped
- Fallback chain — on full pool failure: fallback model → stale cache → 503
- Prometheus metrics — latency histograms, token counters, KV-cache gauge, healthy backend count at `/metrics/prometheus`
- API key authentication — optional Bearer token validation via `auth.enabled`
- Docker support — compose file brings up Redis, vLLM, and the engine with health checks
| Method | Path | Description |
|---|---|---|
| GET | `/health` | Liveness/readiness — backend availability, version |
| GET | `/metrics` | JSON snapshot — KV-cache usage, healthy backends, cache hit/miss |
| GET | `/metrics/prometheus` | Prometheus-format metrics for scraping |
| POST | `/completions` | Text completion (streaming or non-streaming) |
| POST | `/chat/completions` | Chat completion with message history (streaming or non-streaming) |
See docs/integration_guide.md for full API reference with request/response schemas.
The engine reads `configs/default.yaml`. Two environment variables override the most common deployment settings:
| Variable | Default | Purpose |
|---|---|---|
| `VLLM_URL` | `http://localhost:8080` | vLLM base URL (overrides `vllm.instances[0].url`) |
| `REDIS_URL` | `redis://localhost:6379/0` | Redis connection URL |
```yaml
vllm:
  instances:
    - url: "http://localhost:8080"   # add more for multi-instance pools
  timeout_seconds: 120
  retry_count: 2

redis:
  url: "redis://localhost:6379/0"

cache:
  enabled: true
  max_size: 256        # LRU eviction after this many entries
  ttl_seconds: 300.0   # entries older than this are treated as misses

admission_control:
  soft_limit: 0.70     # QUEUE above this KV-cache fraction
  hard_limit: 0.90     # REJECT above this KV-cache fraction

model_registry:
  fast_model: "mistralai/Mistral-7B-Instruct-v0.2"
  large_model: "meta-llama/Meta-Llama-3-70B-Instruct"
  fast_model_token_threshold: 512   # prompts shorter than this → fast model
  fallback_model: "mistralai/Mistral-7B-Instruct-v0.2"
```

See docs/usage_guide.md for all options and tuning recommendations.
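The `cache` settings above describe LRU eviction at `max_size` plus TTL expiry. A minimal in-process sketch of those semantics; the real RedisCache keeps this state in Redis so it is shared across workers, and the class and method names here are illustrative:

```python
import time
from collections import OrderedDict

# In-process sketch: LRU eviction once max_size is exceeded,
# and TTL-expired entries treated as misses.
class LRUTTLCache:
    def __init__(self, max_size: int = 256, ttl_seconds: float = 300.0) -> None:
        self.max_size = max_size
        self.ttl = ttl_seconds
        self._entries: OrderedDict[str, tuple[float, object]] = OrderedDict()

    def get(self, key: str):
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._entries[key]          # expired: treat as a miss
            return None
        self._entries.move_to_end(key)      # refresh LRU recency
        return value

    def put(self, key: str, value) -> None:
        self._entries[key] = (time.monotonic(), value)
        self._entries.move_to_end(key)
        if len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict least recently used

cache = LRUTTLCache(max_size=2, ttl_seconds=60.0)
cache.put("a", "resp-a")
cache.put("b", "resp-b")
cache.put("c", "resp-c")               # exceeds max_size: "a" is evicted
print(cache.get("a"), cache.get("c"))  # None resp-c
```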
```shell
# Set your Hugging Face token and model name, then:
HF_TOKEN=hf_... VLLM_MODEL=mistralai/Mistral-7B-Instruct-v0.2 docker compose up --build
# Engine on :8000, vLLM on :8080, Redis on :6379
```

The compose file starts Redis and vLLM with health checks, then starts the engine once both are healthy.
```shell
pip install -e ".[dev]"
python3 -m pytest tests/unit/ -q         # 234 tests
ruff check src/ tests/                   # linting
mypy src/llm_inference_engine --strict   # type checking
```

| Document | Purpose |
|---|---|
| Architecture | Component design, request lifecycle, design decisions |
| Usage Guide | Configuration reference, tuning, troubleshooting |
| Integration Guide | API reference, streaming, error handling, Docker deployment |
| Performance Report | Benchmark methodology and results |
See CONTRIBUTING.md for setup instructions, code style, and PR workflow.