Cake is a multimodal AI inference server written in Rust. It can run models on a single node, or shard them across a heterogeneous cluster of devices (iOS, Android, macOS, Linux, Windows) to run workloads that wouldn't fit on a single GPU, effectively leveraging planned obsolescence to make AI more accessible and democratic.
This is experimental code that's being actively developed and changed very quickly.
- Multi Modal — Text generation, image generation (Stable Diffusion, FLUX), and voice synthesis (VibeVoice TTS with voice cloning).
- Multi Model — 15 text model families, 6 image model variants, and 2 TTS models. Architecture auto-detected from HuggingFace checkpoints.
- Multi Platform — CUDA, Metal, and CPU backends across Linux, macOS, Windows, iOS, and Android.
- Multi Node — Shard transformer blocks across devices with zero-config mDNS clustering or manual topology. Also runs entirely on a single machine.
- OpenAI-Compatible API — REST API with streaming, plus a built-in web UI and TUI chat client.
- Docker — Container builds for Linux/NVIDIA with docker-compose cluster support.
cargo build --release --features cuda # Linux (NVIDIA)
cargo build --release --features metal # macOS (Apple Silicon)
cargo build --release # CPU only

Download models from HuggingFace with cake pull. Models are stored in the standard HuggingFace cache directory (~/.cache/huggingface/hub/) and are shared with any other tools that use the same cache (transformers, huggingface-cli, etc.).
cake pull evilsocket/Qwen3-0.6B # text model (600M params)
cake pull evilsocket/flux1-dev # image model (FLUX.1-dev FP8)
cake pull evilsocket/VibeVoice-1.5B # voice synthesis model
cake list # show all locally available models

Models are also downloaded automatically on first use if not already cached.
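Because the cache is shared, any tool that follows the standard HuggingFace lookup conventions finds the same files. The sketch below shows that lookup order as documented for huggingface_hub (HF_HUB_CACHE, then HF_HOME/hub, then ~/.cache/huggingface/hub); Cake's own resolution logic may differ in detail.

```rust
use std::env;
use std::path::PathBuf;

/// Resolve the HuggingFace hub cache directory using the precedence
/// documented for huggingface_hub: HF_HUB_CACHE, then HF_HOME/hub,
/// then ~/.cache/huggingface/hub. Illustrative sketch only.
fn hf_hub_cache() -> PathBuf {
    if let Ok(dir) = env::var("HF_HUB_CACHE") {
        return PathBuf::from(dir);
    }
    if let Ok(home) = env::var("HF_HOME") {
        return PathBuf::from(home).join("hub");
    }
    let home = env::var("HOME").unwrap_or_else(|_| String::from("."));
    PathBuf::from(home).join(".cache/huggingface/hub")
}

fn main() {
    println!("models cached under: {}", hf_hub_cache().display());
}
```

Deleting a model from this directory frees it for every tool at once, which is the point of sharing the cache.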
Run any model locally on a single machine — architecture is auto-detected from the model's config.json:
# Text generation
cake run evilsocket/Qwen3-0.6B "Explain quantum computing in simple terms"
# Start an API server + web UI
cake serve evilsocket/Qwen3-0.6B
# Image generation (FLUX.1-dev FP8)
cake run evilsocket/flux1-dev --model-type image-model --image-model-arch flux1 \
--sd-image-prompt "a cyberpunk cityscape at night"
# Voice synthesis with voice cloning
cake run evilsocket/VibeVoice-1.5B --model-type audio-model \
  --voice-prompt voice.wav "Hello world"

Shard a model across multiple machines using --cluster-key. Workers don't need the model data: the master automatically streams the required tensor weights over the network (compressed with zstd, verified with CRC32 checksums). Workers cache received data locally for subsequent runs.
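For a sense of what the CRC32 verification involves, here is the standard bitwise CRC-32 (IEEE, reflected polynomial 0xEDB88320) in Rust. This is an illustrative sketch; Cake's actual implementation presumably uses an optimized crate rather than this bit-at-a-time loop.

```rust
/// Bitwise CRC-32 (IEEE, reflected polynomial 0xEDB88320), the checksum
/// family used to verify streamed weight shards. Sketch for illustration.
fn crc32(data: &[u8]) -> u32 {
    let mut crc: u32 = 0xFFFF_FFFF;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // mask is all ones when the low bit is set, else zero
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

fn main() {
    // Well-known CRC-32 check value for the ASCII string "123456789"
    assert_eq!(crc32(b"123456789"), 0xCBF4_3926);
    println!("shard checksum: {:08x}", crc32(b"fake tensor bytes"));
}
```

A worker recomputes the checksum over each decompressed shard and rejects it on mismatch, so a corrupted transfer is retried rather than silently loaded.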
# Start workers on any machines (no model needed)
cake run --cluster-key mysecret --name gpu-server-1 # machine A
cake run --cluster-key mysecret --name macbook # machine B
# Run inference from the master (has the model)
cake run evilsocket/Qwen3-0.6B "Hello" --cluster-key mysecret
# Or start an API server as the master
cake serve evilsocket/Qwen3-0.6B --cluster-key mysecret

The master discovers workers via mDNS, assigns layers proportionally to each device's VRAM/compute, and pushes only the required weight shards. See the clustering documentation for manual topology files and advanced configuration.
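Proportional assignment can be sketched as a simple split of the transformer blocks by reported memory. The helper below is hypothetical (Cake's real scheduler also weighs compute, not just memory), but it shows the shape of the calculation:

```rust
/// Split `n_layers` transformer blocks across workers in proportion to
/// their reported memory. Hypothetical sketch of proportional sharding;
/// the real scheduler also accounts for compute capability.
fn assign_layers(n_layers: usize, worker_mem_gb: &[f64]) -> Vec<usize> {
    let total: f64 = worker_mem_gb.iter().sum();
    let mut counts: Vec<usize> = worker_mem_gb
        .iter()
        .map(|m| ((m / total) * n_layers as f64).floor() as usize)
        .collect();
    // Hand leftover layers (lost to flooring) to the largest workers first.
    let mut leftover = n_layers - counts.iter().sum::<usize>();
    let mut order: Vec<usize> = (0..worker_mem_gb.len()).collect();
    order.sort_by(|&a, &b| worker_mem_gb[b].partial_cmp(&worker_mem_gb[a]).unwrap());
    for &i in &order {
        if leftover == 0 {
            break;
        }
        counts[i] += 1;
        leftover -= 1;
    }
    counts
}

fn main() {
    // e.g. 28 blocks across a 24 GB GPU server and two 8 GB phones
    let split = assign_layers(28, &[24.0, 8.0, 8.0]);
    assert_eq!(split.iter().sum::<usize>(), 28);
    println!("layer split: {:?}", split);
}
```

Every layer is assigned exactly once, so the devices together hold the full model even though none of them could hold it alone.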
For the full usage guide and API reference, check the project documentation.
Released under the GPL 3 license. To see the licenses of the project dependencies, install cargo-license with cargo install cargo-license and then run cargo license.