
cake


Cake is a multimodal AI inference server written in Rust. It can run models on a single node, or shard them across a heterogeneous cluster of devices (iOS, Android, macOS, Linux, Windows) to run workloads that wouldn't fit on a single GPU, effectively leveraging planned obsolescence to make AI more accessible and democratic.

This is experimental code under active development; expect rapid and breaking changes.

Key Features

- Multimodal: text generation, image generation (FLUX.1-dev), and voice synthesis from a single binary
- Runs on a single node, or sharded across a heterogeneous cluster (iOS, Android, macOS, Linux, Windows)
- CUDA, Metal, and CPU-only backends, selectable at build time
- Uses the standard HuggingFace cache shared with other tooling; models auto-download on first use
- Workers need no local model data: weights are streamed zstd-compressed, CRC32-verified, and cached
- Automatic worker discovery over mDNS, with layers assigned proportionally to each device's VRAM/compute
- Built-in API server and web UI via cake serve

Quick Start

Build

cargo build --release --features cuda  # Linux (NVIDIA)
cargo build --release --features metal # macOS (Apple Silicon)
cargo build --release                  # CPU only
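After a successful build, the binary lands at target/release/cake (standard Cargo layout). As a quick smoke test, assuming a conventional --help flag:

./target/release/cake --help    # list available subcommands (pull, list, run, serve)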

Models

Download models from HuggingFace with cake pull. Models are stored in the standard HuggingFace cache directory (~/.cache/huggingface/hub/) and are shared with any other tools that use the same cache (transformers, huggingface-cli, etc.).

cake pull evilsocket/Qwen3-0.6B        # text model (600M params)
cake pull evilsocket/flux1-dev         # image model (FLUX.1-dev FP8)
cake pull evilsocket/VibeVoice-1.5B    # voice synthesis model

cake list                              # show all locally available models

Models are also downloaded automatically on first use if not already cached.
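Because the standard hub cache is used, you can inspect it directly; directory names follow the hub's models--{org}--{name} convention (contents shown are illustrative):

ls ~/.cache/huggingface/hub/
# models--evilsocket--Qwen3-0.6B  models--evilsocket--VibeVoice-1.5B  models--evilsocket--flux1-dev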

Single Node

Run any model locally on a single machine; the architecture is auto-detected from the model's config.json:

# Text generation
cake run evilsocket/Qwen3-0.6B "Explain quantum computing in simple terms"

# Start an API server + web UI
cake serve evilsocket/Qwen3-0.6B

# Image generation (FLUX.1-dev FP8)
cake run evilsocket/flux1-dev --model-type image-model --image-model-arch flux1 \
  --sd-image-prompt "a cyberpunk cityscape at night"

# Voice synthesis with voice cloning
cake run evilsocket/VibeVoice-1.5B --model-type audio-model \
  --voice-prompt voice.wav "Hello world"
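The auto-detection mentioned above reads the model's config.json. To see what a model declares, you can inspect the cached config; the architectures and model_type fields are standard in HuggingFace-format configs, though exactly which field Cake keys off is an assumption:

# print the architecture identifiers from the cached config (standard hub cache path)
jq '{architectures, model_type}' ~/.cache/huggingface/hub/models--evilsocket--Qwen3-0.6B/snapshots/*/config.json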

Distributed

Shard a model across multiple machines using --cluster-key. Workers don't need the model data — the master automatically streams the required tensor weights over the network (compressed with zstd, verified with CRC32 checksums). Workers cache received data locally for subsequent runs.

# Start workers on any machines (no model needed)
cake run --cluster-key mysecret --name gpu-server-1    # machine A
cake run --cluster-key mysecret --name macbook         # machine B

# Run inference from the master (has the model)
cake run evilsocket/Qwen3-0.6B "Hello" --cluster-key mysecret

# Or start an API server as the master
cake serve evilsocket/Qwen3-0.6B --cluster-key mysecret

The master discovers workers via mDNS, assigns layers proportionally to each device's VRAM/compute (for example, a 24 GB GPU and an 8 GB laptop would split a 32-layer model roughly 24/8), and pushes only the required weight shards. See the clustering documentation for manual topology files and advanced configuration.
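If workers aren't discovered, a generic mDNS browser can confirm that multicast DNS traffic is visible on your LAN at all (Cake's exact mDNS service type isn't documented here):

avahi-browse --all --terminate      # Linux: list mDNS services on the local network
dns-sd -B _services._dns-sd._udp    # macOS: enumerate advertised service types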

For the full usage guide and API reference, check the project documentation.


License

Released under the GPL 3 license. To see the licenses of the project dependencies, install cargo-license with cargo install cargo-license and then run cargo license.
