
Olla - Smart LLM Load Balancer & Proxy

Native support: Ollama, LM Studio, llama.cpp, vLLM, vLLM-MLX, SGLang, LiteLLM, Lemonade SDK and Docker Model Runner.
OpenAI-compatible: LM Deploy and others.


Recorded with VHS - see demo tape

Documentation   Issues   Releases

Olla is a high-performance, low-overhead, low-latency proxy and load balancer for managing LLM infrastructure. It intelligently routes LLM requests across local and remote inference nodes, with native support for a wide variety of endpoints and the extensibility to support others. Olla provides model discovery and unified model catalogues within each provider, enabling seamless routing to available models on compatible endpoints.

Olla works alongside API gateways like LiteLLM or orchestration platforms like GPUStack, focusing on making your existing LLM infrastructure reliable through intelligent routing and failover. You can choose between two proxy engines: Sherpa for simplicity and maintainability or Olla for maximum performance with advanced features like circuit breakers and connection pooling.
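As a rough illustration of how the engine choice and endpoints fit together, here is a hypothetical `olla.yaml` sketch. The key names below are assumptions for illustration only; consult the documentation for the actual schema.

```yaml
# Illustrative config sketch - key names are assumptions, not the real schema.
proxy:
  engine: sherpa          # "sherpa" for simplicity, or "olla" for circuit
                          # breakers and connection pooling

discovery:
  static:
    endpoints:
      - name: local-ollama
        url: http://localhost:11434
        type: ollama
      - name: gpu-box
        url: http://192.168.1.50:8000
        type: vllm
```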

Olla Single OpenAI

A single CLI application and config file is all you need to get going with Olla!

For large GPU deployments and enterprise & data-centre use, see TensorFoundry FoundryOS.

Key Features

Platform Support

Olla runs on multiple platforms and architectures:

| Platform | AMD64 | ARM64 | Notes |
|----------|-------|-------|-------|
| Linux | ✅ | ✅ | Full support including Raspberry Pi 4+ |
| macOS | ✅ | ✅ | Intel and Apple Silicon (M1/M2/M3/M4) |
| Windows | ✅ | ✅ | Windows 10/11 and Windows on ARM |
| Docker | ✅ | ✅ | Multi-architecture images (amd64/arm64) |

Quick Start

Installation

```bash
# Download latest release (auto-detects your platform)
bash <(curl -s https://raw.githubusercontent.com/thushan/olla/main/install.sh)

# Docker (automatically pulls correct architecture)
docker run -t \
  --name olla \
  -p 40114:40114 \
  ghcr.io/thushan/olla:latest

# Or explicitly specify platform (e.g., for ARM64)
docker run --platform linux/arm64 -t \
  --name olla \
  -p 40114:40114 \
  ghcr.io/thushan/olla:latest

# Install via Go
go install github.com/thushan/olla@latest

# Build from source
git clone https://github.com/thushan/olla.git && cd olla && make build-release

# Run Olla
./bin/olla
```

Verification

When you have everything running, you can check it's all working with:

```bash
# Check health of Olla
curl http://localhost:40114/internal/health

# Check endpoints
curl http://localhost:40114/internal/status/endpoints

# Check models available
curl http://localhost:40114/internal/status/models
```

For detailed installation and deployment options, see Getting Started Guide.

Querying Olla

Olla exposes multiple API paths depending on your use case:

| Path | Format | Use Case |
|------|--------|----------|
| `/olla/proxy/` | OpenAI | Routes to any backend (universal endpoint) |
| `/olla/openai/` | OpenAI | Routes to any backend (universal endpoint) |
| `/olla/anthropic/` | Anthropic | Claude-compatible clients (passthrough or translated) |
| `/olla/{provider}/` | OpenAI | Target a specific backend type (e.g. `/olla/vllm/`, `/olla/ollama/`) |

OpenAI-Compatible (Universal Proxy)

You can use either `/olla/openai` or `/olla/proxy`; both route requests to any compatible backend:

```bash
# Chat completion (routes to best available backend)
curl http://localhost:40114/olla/proxy/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 100}'

# Streaming
curl http://localhost:40114/olla/proxy/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 100, "stream": true}'

# List all models across backends
curl http://localhost:40114/olla/proxy/v1/models
```
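When `stream: true` is set, the response arrives as Server-Sent Events, each `data:` line carrying a JSON chunk in the OpenAI streaming format, terminated by a `data: [DONE]` sentinel. A minimal stdlib-only sketch (not part of Olla) of extracting the streamed text from such lines:

```python
import json

def collect_stream_text(sse_lines):
    """Concatenate content deltas from OpenAI-style SSE chat chunks."""
    parts = []
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alives and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        parts.append(delta.get("content", ""))
    return "".join(parts)

# Canned chunks, showing the shape of a streaming response body:
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    'data: [DONE]',
]
print(collect_stream_text(sample))  # -> Hello
```

A real client would read these lines incrementally off the HTTP response rather than from a list.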

Anthropic Messages API

```bash
# Chat completion (passthrough for supported backends, translated for others)
curl http://localhost:40114/olla/anthropic/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: not-needed" \
  -H "anthropic-version: 2023-06-01" \
  -d '{"model": "llama3.2", "max_tokens": 100, "messages": [{"role": "user", "content": "Hello"}]}'

# Streaming
curl http://localhost:40114/olla/anthropic/v1/messages \
  -H "Content-Type: application/json" \
  -H "x-api-key: not-needed" \
  -H "anthropic-version: 2023-06-01" \
  -d '{"model": "llama3.2", "max_tokens": 100, "messages": [{"role": "user", "content": "Hello"}], "stream": true}'
```

Provider-Specific Endpoints

```bash
# Target a specific backend type directly
curl http://localhost:40114/olla/ollama/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 100}'

# Other providers: /olla/vllm/, /olla/vllm-mlx/, /olla/lm-studio/, /olla/llamacpp/, etc.
```
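The routing pattern is uniform: the segment after `/olla/` selects the backend type (or the universal `proxy` route), and the remainder is the provider's usual API path. A tiny hypothetical helper (not part of Olla) that builds these URLs:

```python
OLLA_BASE = "http://localhost:40114"

def olla_url(api_path, provider=None):
    """Build an Olla URL: universal proxy by default, or pinned to a provider."""
    segment = provider if provider else "proxy"
    return f"{OLLA_BASE}/olla/{segment}/{api_path.lstrip('/')}"

print(olla_url("v1/chat/completions"))            # universal endpoint
print(olla_url("v1/chat/completions", "ollama"))  # pinned to Ollama backends
```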

Examples

We've also got ready-to-use Docker Compose setups for common scenarios:

Common Architectures

  • Home Lab: Olla → multiple Ollama (or OpenAI-compatible, e.g. vLLM) instances across your machines
  • Hybrid Cloud: Olla → Local endpoints + LiteLLM → Cloud APIs (OpenAI, Anthropic, Bedrock, etc.)
  • Enterprise: Olla → GPUStack cluster + vLLM servers + LiteLLM (cloud overflow)
  • Development: Olla → Local + Shared team endpoints + LiteLLM (API access)

See integration patterns for detailed architectures.

For a robust enterprise setup, consider TensorFoundry FoundryOS.

🌐 OpenWebUI Integration

A complete setup using OpenWebUI + Olla to load balance multiple Ollama instances, or to unify all your OpenAI-compatible models behind one interface.

  • See: examples/ollama-openwebui/
  • Services: OpenWebUI (web UI) + Olla (proxy/load balancer)
  • Use Case: Web interface with intelligent load balancing across multiple Ollama servers
  • Quick Start:
    ```bash
    cd examples/ollama-openwebui
    # Edit olla.yaml to configure your Ollama endpoints
    docker compose up -d
    # Access OpenWebUI at http://localhost:3000
    ```
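The wiring boils down to pointing OpenWebUI at Olla's OpenAI-compatible route. The fragment below is an illustrative sketch, not the actual compose file from `examples/ollama-openwebui` (service names and the config mount path are assumptions); `OPENAI_API_BASE_URL` is OpenWebUI's standard setting for an OpenAI-compatible server.

```yaml
# Illustrative docker-compose fragment - see examples/ollama-openwebui for the real one.
services:
  olla:
    image: ghcr.io/thushan/olla:latest
    ports:
      - "40114:40114"
    volumes:
      - ./olla.yaml:/app/config.yaml   # config path inside the image is an assumption
  open-webui:
    image: ghcr.io/open-webui/open-webui:latest
    ports:
      - "3000:8080"
    environment:
      - OPENAI_API_BASE_URL=http://olla:40114/olla/openai/v1
```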

You can learn more about OpenWebUI Ollama with Olla or see OpenWebUI OpenAI with Olla.

🤖 Anthropic Message API / CLI Tools - Claude Code, OpenCode, Crush

Olla's Anthropic Messages API support (v0.0.20+) is enabled by default, allowing you to use CLI tools like Claude Code with local AI models on your machine via /olla/anthropic. It operates in two modes depending on your backend:

  • Passthrough: requests are forwarded as-is for backends with native Anthropic support (vLLM, llama.cpp, Ollama, LM Studio, Lemonade)
  • 🔄 Translation: Anthropic ↔ OpenAI format conversion for backends that don't natively support the Anthropic Messages API

This feature is still being actively improved; please report any issues or feedback.
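To make the translation mode concrete, here is a rough, stdlib-only sketch of the kind of request mapping involved, converting an Anthropic Messages payload into an OpenAI chat-completions payload. This is an illustration, not Olla's actual translator, which handles far more (tool use, content block types, streaming event conversion, etc.):

```python
def anthropic_to_openai(req):
    """Map the common fields of an Anthropic Messages request to OpenAI format."""
    messages = []
    if "system" in req:  # Anthropic carries the system prompt as a top-level field
        messages.append({"role": "system", "content": req["system"]})
    for msg in req["messages"]:
        content = msg["content"]
        if isinstance(content, list):  # Anthropic allows a list of content blocks
            content = "".join(b["text"] for b in content if b.get("type") == "text")
        messages.append({"role": msg["role"], "content": content})
    return {
        "model": req["model"],
        "messages": messages,
        "max_tokens": req["max_tokens"],  # required by Anthropic, optional in OpenAI
        "stream": req.get("stream", False),
    }

converted = anthropic_to_openai({
    "model": "llama3.2",
    "max_tokens": 100,
    "system": "Be brief.",
    "messages": [{"role": "user", "content": "Hello"}],
})
```

The reverse direction (OpenAI response back to an Anthropic-shaped response) follows the same idea.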

We have examples for CLI tools like Claude Code, OpenCode and Crush.

Learn more about Anthropic API Translation.

Documentation

Full documentation is available at https://thushan.github.io/olla/

🤝 Contributing

We welcome contributions! Please open an issue first to discuss major changes.

🤖 AI Disclosure

This project has been built with the assistance of AI tools. We've utilised GitHub Copilot, Anthropic Claude, JetBrains Junie, Codex & TensorFoundry Kaizen for documentation, code reviews, test refinement and troubleshooting.

We also use CodeRabbit for AI-driven code reviews on PRs prior to human review.

🙏 Acknowledgements

📄 License

Licensed under the Apache License 2.0. See LICENSE for details.


Made with ❤️ for the LLM community

🏠 Homepage · 📖 Documentation · 🐛 Issues · 🚀 Releases
