⚠️ DEPRECATED — This project is archived and no longer maintained. Development has moved to munimx/llm-semantic-cache.
An OpenAI-compatible proxy middleware for vLLM. It sat between an application and one or more vLLM instances, adding Redis-backed response caching, cross-worker request coalescing, token-count-based model routing, and KV-cache-pressure-aware admission control.
This project went through two iterations. The first was an Ollama middleware — a foundation that turned out to be wrong. The second (this repo) was a vLLM proxy, which was cleaner but ultimately a worse version of LiteLLM. LiteLLM already exists, is production-grade, has enterprise backing, and covers every feature here plus hundreds more. There is no credible answer to "why not just use LiteLLM?" for a generic proxy layer.
The successor, munimx/llm-semantic-cache, is a focused Python library that adds semantic caching in front of any OpenAI-compatible LLM API. It does one thing well: understand whether two prompts are asking the same thing, and skip the redundant LLM call if they are.
```shell
# Prerequisites: Python 3.11+, vLLM on localhost:8080, Redis on localhost:6379
pip install -e .
uvicorn llm_inference_engine.api.server:app --host 0.0.0.0 --port 8000
```

```shell
# Text completion
curl http://localhost:8000/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistralai/Mistral-7B-Instruct-v0.2", "prompt": "Explain quicksort", "max_tokens": 128}'

# Chat completion (streaming)
curl http://localhost:8000/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistralai/Mistral-7B-Instruct-v0.2", "messages": [{"role": "user", "content": "Explain quicksort"}], "stream": true}'
```

```
┌─────────────┐     ┌──────────────────────────────────────────────────────────────┐     ┌────────────┐
│ Application │────▶│              LLM Inference Optimization Engine               │────▶│  vLLM(s)   │
│             │◀────│  ModelRouter → Cache → Coalescer → Throttler → BackendPool   │◀────│            │
└─────────────┘     └──────────────────────────────────────────────────────────────┘     └────────────┘
                                            │   ▲
                                            ▼   │
                                         ┌────────┐
                                         │ Redis  │
                                         └────────┘
```
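The Coalescer stage shown in the diagram deduplicates identical in-flight requests so that a single backend call serves all of them. Below is a single-process asyncio analogue of the idea; the real RequestCoalescer coordinates across workers via Redis SET NX and pub/sub, and every name in this sketch is illustrative rather than the engine's actual API.

```python
import asyncio

# Illustrative single-process analogue of cross-worker coalescing.
class Coalescer:
    def __init__(self) -> None:
        self._inflight: dict[str, asyncio.Future] = {}

    async def fetch(self, key: str, compute):
        if key in self._inflight:
            # Same key already in flight: wait for its result instead of
            # issuing a duplicate backend call.
            return await self._inflight[key]
        fut = asyncio.get_running_loop().create_future()
        self._inflight[key] = fut
        try:
            result = await compute()
            fut.set_result(result)
            return result
        except Exception as exc:
            fut.set_exception(exc)  # propagate the failure to waiters too
            raise
        finally:
            del self._inflight[key]

async def demo():
    calls = 0

    async def backend():
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)  # stand-in for LLM latency
        return "response"

    c = Coalescer()
    results = await asyncio.gather(
        *(c.fetch("prompt-hash", backend) for _ in range(5))
    )
    return calls, results

calls, results = asyncio.run(demo())
print(calls)  # 1 — five identical requests, one backend call
```

The Redis-backed version follows the same shape, with SET NX electing one worker to call the backend and pub/sub broadcasting the result to the rest.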
| Layer | Component | Purpose |
|---|---|---|
| API | FastAPI server, Pydantic models | OpenAI-compatible HTTP endpoints |
| Routing | ModelRouter, FallbackRouter | Token-count-based fast/large model selection; stale-cache fallback |
| Cache | RedisCache | Redis-backed LRU response cache with TTL |
| Coalescing | RequestCoalescer | Cross-worker deduplication via Redis SET NX + pub/sub |
| Admission | AdaptiveThrottler | ACCEPT / QUEUE / REJECT based on live `vllm:kv_cache_usage_perc` |
| Reliability | BackendPool, CircuitBreaker | Round-robin pool with per-backend circuit breaker |
| Integration | VLLMBackend | Async httpx client targeting vLLM's OpenAI-compatible API |
See docs/architecture.md for component details and the full request lifecycle.
- OpenAI-compatible API — `/completions` and `/chat/completions` with the same request/response schema used by the OpenAI SDK
- SSE streaming — real-time token-by-token delivery with `"stream": true`
- Redis response cache — LRU eviction + TTL; shared across all workers and replicas
- Cross-worker coalescing — identical concurrent requests share one backend call via Redis pub/sub
- Automatic model routing — short prompts go to the fast model, long prompts to the large model; an explicit `model` field overrides routing
- KV-cache admission control — polls `vllm:kv_cache_usage_perc`; queues or rejects when GPU memory is under pressure
- Backend pool + circuit breaker — round-robin across multiple vLLM instances; open circuits are skipped
- Fallback chain — on full pool failure: fallback model → stale cache → 503
- Prometheus metrics — latency histograms, token counters, KV-cache gauge, healthy backend count at `/metrics/prometheus`
- API key authentication — optional Bearer token validation via `auth.enabled`
- Docker support — compose file brings up Redis, vLLM, and the engine with health checks
| Method | Path | Description |
|---|---|---|
| GET | `/health` | Liveness/readiness — backend availability, version |
| GET | `/metrics` | JSON snapshot — KV-cache usage, healthy backends, cache hit/miss |
| GET | `/metrics/prometheus` | Prometheus-format metrics for scraping |
| POST | `/completions` | Text completion (streaming or non-streaming) |
| POST | `/chat/completions` | Chat completion with message history (streaming or non-streaming) |
See docs/integration_guide.md for full API reference with request/response schemas.
The engine reads `configs/default.yaml`. Two environment variables override the most common deployment settings:
| Variable | Default | Purpose |
|---|---|---|
| `VLLM_URL` | `http://localhost:8080` | vLLM base URL (overrides `vllm.instances[0].url`) |
| `REDIS_URL` | `redis://localhost:6379/0` | Redis connection URL |
```yaml
vllm:
  instances:
    - url: "http://localhost:8080"   # add more for multi-instance pools
  timeout_seconds: 120
  retry_count: 2

redis:
  url: "redis://localhost:6379/0"

cache:
  enabled: true
  max_size: 256        # LRU eviction after this many entries
  ttl_seconds: 300.0   # entries older than this are treated as misses

admission_control:
  soft_limit: 0.70     # QUEUE above this KV-cache fraction
  hard_limit: 0.90     # REJECT above this KV-cache fraction

model_registry:
  fast_model: "mistralai/Mistral-7B-Instruct-v0.2"
  large_model: "meta-llama/Meta-Llama-3-70B-Instruct"
  fast_model_token_threshold: 512   # prompts shorter than this → fast model
  fallback_model: "mistralai/Mistral-7B-Instruct-v0.2"
```

See docs/usage_guide.md for all options and tuning recommendations.
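The `cache` settings above describe LRU eviction at `max_size` plus TTL expiry. A minimal in-process sketch of those semantics; the real RedisCache keeps this state in Redis so it is shared across workers, and the class and method names here are illustrative:

```python
import time
from collections import OrderedDict

# In-process sketch: LRU eviction once max_size is exceeded,
# and TTL-expired entries treated as misses.
class LRUTTLCache:
    def __init__(self, max_size: int = 256, ttl_seconds: float = 300.0) -> None:
        self.max_size = max_size
        self.ttl = ttl_seconds
        self._entries: OrderedDict[str, tuple[float, object]] = OrderedDict()

    def get(self, key: str):
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._entries[key]          # expired: treat as a miss
            return None
        self._entries.move_to_end(key)      # refresh LRU recency
        return value

    def put(self, key: str, value) -> None:
        self._entries[key] = (time.monotonic(), value)
        self._entries.move_to_end(key)
        if len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict least recently used

cache = LRUTTLCache(max_size=2, ttl_seconds=60.0)
cache.put("a", "resp-a")
cache.put("b", "resp-b")
cache.put("c", "resp-c")               # exceeds max_size: "a" is evicted
print(cache.get("a"), cache.get("c"))  # None resp-c
```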
```shell
# Set your Hugging Face token and model name, then:
HF_TOKEN=hf_... VLLM_MODEL=mistralai/Mistral-7B-Instruct-v0.2 docker compose up --build
# Engine on :8000, vLLM on :8080, Redis on :6379
```

The compose file starts Redis and vLLM with health checks, then starts the engine once both are healthy.
```shell
pip install -e ".[dev]"
python3 -m pytest tests/unit/ -q         # 234 tests
ruff check src/ tests/                   # linting
mypy src/llm_inference_engine --strict   # type checking
```

| Document | Purpose |
|---|---|
| Architecture | Component design, request lifecycle, design decisions |
| Usage Guide | Configuration reference, tuning, troubleshooting |
| Integration Guide | API reference, streaming, error handling, Docker deployment |
| Performance Report | Benchmark methodology and results |
See CONTRIBUTING.md for setup instructions, code style, and PR workflow.