🧠 Claude Code Local

Run a 122-billion-parameter AI on your MacBook.
No cloud. No fees. No data leaves your machine.



🤔 What Is This?

Your MacBook has a powerful GPU built right into the chip. This project uses that GPU to run a massive AI model — the same kind that powers ChatGPT and Claude — entirely on your computer.

  • 🚫 No internet needed
  • 💰 No monthly subscription
  • 🔒 No one sees your code or data
  • ✅ Full Claude Code experience — write code, edit files, manage projects, control your browser

         📱 You (Mac or Phone)
          │
     🤖 Claude Code         ← the AI coding tool you know
          │
     ⚡ MLX Native Server    ← our server (200 lines of Python)
          │
     🧠 Qwen 3.5 122B       ← 122 billion parameter brain
          │
     🖥️ Apple Silicon GPU    ← your M-series chip does all the work

📱 Control From Your Phone

You don't have to be at your Mac to use this. We built a remote control pipeline:

📱 Your iPhone                    💻 Your Mac
     │                                │
     │── iMessage ──────────────────>│
     │                                │── Claude Code
     │                                │── MLX Server
     │                                │── Qwen 3.5 122B
     │                                │── (does the work)
     │<── iMessage response ────────│
     │                                │
   🛋️ From your couch            🖥️ At your desk

How it works:

  • 📲 Send a message from your phone via iMessage
  • 🤖 Claude Code receives it and runs the task on your local AI
  • 💬 Response comes back to your phone
  • ✈️ Works anywhere your Mac has power — even offline for the AI part

We built this before Anthropic shipped their Dispatch feature. Same concept, but ours uses iMessage and runs on your local model instead of the cloud.

💡 Pro tip: Anthropic's Dispatch doesn't read your CLAUDE.md. Mention it in your message or it'll miss your custom setup. Our iMessage system doesn't have this problem.
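
For the curious, here is a minimal sketch of what such a loop can look like. It is illustrative, not the repo's actual pipeline: it assumes Full Disk Access (to read ~/Library/Messages/chat.db), Automation permission for Messages.app, the claude CLI on PATH, and a hypothetical sender number.

```python
# Illustrative sketch of an iMessage -> local Claude Code -> iMessage loop.
# Assumptions (not from this repo): Full Disk Access for chat.db, Automation
# permission for Messages.app, and the `claude` CLI on PATH with -p (print mode).
import sqlite3
import subprocess
import time
from pathlib import Path

CHAT_DB = Path.home() / "Library/Messages/chat.db"
SENDER = "+15551234567"  # hypothetical phone number to listen for


def latest_incoming(conn):
    # `text` can be NULL on newer macOS (the body moves to attributedBody);
    # this sketch only handles the plain-text case.
    return conn.execute(
        "SELECT message.ROWID, message.text FROM message "
        "JOIN handle ON message.handle_id = handle.ROWID "
        "WHERE handle.id = ? AND message.is_from_me = 0 "
        "AND message.text IS NOT NULL "
        "ORDER BY message.date DESC LIMIT 1",
        (SENDER,),
    ).fetchone()


def send_imessage(to, body):
    body = body.replace("\\", "\\\\").replace('"', '\\"')
    script = (
        f'tell application "Messages" to send "{body}" to buddy "{to}" '
        'of (service 1 whose service type is iMessage)'
    )
    subprocess.run(["osascript", "-e", script], check=True)


conn = sqlite3.connect(CHAT_DB)
seen = None
while True:
    row = latest_incoming(conn)
    if row and row[0] != seen:
        seen = row[0]
        # Run the task against the local model via Claude Code's non-interactive mode.
        result = subprocess.run(["claude", "-p", row[1]], capture_output=True, text=True)
        send_imessage(SENDER, result.stdout.strip() or "(no output)")
    time.sleep(5)
```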


📊 Benchmarks

We built and tested three different approaches. Each one got faster.

⚡ Speed Comparison

                         Tokens per Second
  🐌 Ollama (Gen 1)      ██████████████████████████████ 30 tok/s
  🏃 llama.cpp (Gen 2)   █████████████████████████████████████████ 41 tok/s
  🚀 MLX Native (Gen 3)  ████████████████████████████████████████████████████████████████ 65 tok/s

⏱️ Real-World Claude Code Task

Time for Claude Code to complete a "write a function" request:

  😴 Ollama + Proxy          ████████████████████████████████████████████ 133 seconds
  😐 llama.cpp + Proxy       ████████████████████████████████████████████ 133 seconds
  🔥 MLX Native (no proxy)   ██████ 17.6 seconds

                              7.5x faster ⚡

📋 Side-by-Side

|                  | 🐌 Ollama  | 🏃 llama.cpp + TurboQuant | 🚀 MLX Native (ours) |
|------------------|------------|---------------------------|----------------------|
| Speed            | 30 tok/s   | 41 tok/s                  | 65 tok/s             |
| Claude Code task | 133s       | 133s                      | 17.6s                |
| Needs a proxy?   | ❌ Yes     | ❌ Yes                    | No                   |
| Lines of code    | N/A        | N/A (C++ fork)            | ~200 Python          |
| Apple native?    | ❌ Generic | ❌ Ported                 | MLX                  |

☁️ vs Cloud APIs

|                      | 🖥️ Our Local Setup | ☁️ Claude Sonnet | ☁️ Claude Opus |
|----------------------|--------------------|------------------|----------------|
| Speed                | 65 tok/s           | ~80 tok/s        | ~40 tok/s      |
| Monthly cost         | $0 🎉              | $20-100+         | $20-100+       |
| Privacy              | 100% local 🔒      | Cloud            | Cloud          |
| Works offline        | Yes ✈️             | No               | No             |
| Data leaves your Mac | Never              | Always           | Always         |

💡 Our local setup beats cloud Opus on raw speed (65 vs 40 tok/s) at $0/month.
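
If you want to sanity-check the MLX number on your own hardware, mlx_lm's verbose mode prints a tokens-per-second figure. This is a rough sketch (single prompt, no chat template), so treat it as a ballpark rather than the repo's benchmark script:

```python
# Rough tok/s check: verbose=True makes mlx_lm print prompt and generation speed.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3.5-122B-A10B-4bit")
generate(model, tokenizer,
         prompt="Write a Python function that reverses a string.",
         max_tokens=256, verbose=True)
```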


💡 How We Got Here

Most people trying to run Claude Code locally hit the same wall:

Claude Code speaks the Anthropic API. Local model servers speak the OpenAI API. Different languages. 🤷

So everyone builds a proxy to translate between them. That proxy adds latency and complexity, and it breaks things.

We took a different approach:

| 🐌 What everyone else does           | 🚀 What we did                   |
|--------------------------------------|----------------------------------|
| Claude Code → Proxy → Ollama → Model | Claude Code → Our Server → Model |
| 3 processes, 2 API translations      | 1 process, 0 translations        |
| 133 seconds per task                 | 17.6 seconds per task            |

🎯 That one change — eliminating the proxy — made it 7.5x faster.


💻 What You Need

| Your Mac           | RAM        | What You Can Run         |
|--------------------|------------|--------------------------|
| M1/M2/M3/M4 (base) | 8-16 GB    | 🟡 Small models (4B)     |
| M1/M2/M3/M4 Pro    | 18-36 GB   | 🟠 Medium models (32B)   |
| M2/M3/M4/M5 Max    | 64-128 GB  | 🟢 Large models (122B)   |
| M2/M3/M4 Ultra     | 128-192 GB | 🔵 Multiple large models |
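
Not sure which row you're in? Here's a quick check. It's a sketch: the thresholds just mirror the table above, and what actually fits depends on quantization and whatever else is using memory.

```python
# Report chip and unified memory on macOS, with rough model-size guidance.
import subprocess

def sysctl(name: str) -> str:
    return subprocess.run(["sysctl", "-n", name],
                          capture_output=True, text=True).stdout.strip()

chip = sysctl("machdep.cpu.brand_string")   # e.g. "Apple M3 Max"
ram_gb = int(sysctl("hw.memsize")) / 1024**3
print(f"{chip}, {ram_gb:.0f} GB unified memory")

if ram_gb >= 64:
    print("🟢 Large models (122B, 4-bit) should fit.")
elif ram_gb >= 18:
    print("🟠 Stick to medium models (~32B).")
else:
    print("🟡 Stick to small models (~4B).")
```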

Also need:

  • 🐍 Python 3.12+ (for MLX)
  • 🤖 Claude Code (npm install -g @anthropic-ai/claude-code)

🚀 Quick Start (4 Steps)

1️⃣ Set up Python environment

python3.12 -m venv ~/.local/mlx-server
~/.local/mlx-server/bin/pip install mlx-lm

2️⃣ Download the AI model

First run downloads ~50 GB (one time only):

~/.local/mlx-server/bin/python3 -c "
from mlx_lm.utils import load
load('mlx-community/Qwen3.5-122B-A10B-4bit')
print('Done!')
"

3️⃣ Start the server

~/.local/mlx-server/bin/python3 proxy/server.py

4️⃣ Launch Claude Code

ANTHROPIC_BASE_URL=http://localhost:4000 \
ANTHROPIC_API_KEY=sk-local \
claude --model claude-sonnet-4-6

💡 Or just double-click Claude Local.command on your Desktop. It does all of this automatically.
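
To check that the server is actually answering before you point Claude Code at it, you can hit the endpoint directly. This is a sketch: it assumes the server implements a plain, non-streaming /v1/messages route on port 4000 and returns the Anthropic response shape.

```python
# Smoke test: send one Anthropic-style /v1/messages request to the local server.
# Assumes the server on port 4000 accepts this subset of the Messages API.
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:4000/v1/messages",
    data=json.dumps({
        "model": "claude-sonnet-4-6",   # name is passed through; the local model answers
        "max_tokens": 128,
        "messages": [{"role": "user", "content": "Say hello in five words."}],
    }).encode(),
    headers={"content-type": "application/json", "x-api-key": "sk-local"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
print(body["content"][0]["text"])
```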


🔧 How It Works

┌──────────────────────────────────────────────────┐
│              Your MacBook (M5 Max)               │
│                                                  │
│  📱 You type ──> 🤖 Claude Code                  │
│                      │                           │
│                      ▼                           │
│                 ⚡ MLX Server (port 4000)         │
│                      │                           │
│                      ▼                           │
│                 🧠 Qwen 3.5 122B ──> 🖥️ GPU      │
│                      │                           │
│                      ▼                           │
│  📱 Answer <─── ✨ Clean response                │
│                                                  │
│         🔒 Nothing leaves this box. Ever.        │
└──────────────────────────────────────────────────┘

The server (proxy/server.py) is one file, ~200 lines. It does three things (sketched below):

  1. 📦 Loads the model — Apple's MLX framework, native Metal GPU, unified memory
  2. 🔌 Speaks Anthropic API — Claude Code thinks it's talking to Anthropic's cloud. It's not.
  3. 🧹 Cleans the output — Qwen thinks out loud in <think> tags. We strip those.
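
The real code lives in proxy/server.py; a stripped-down sketch of the same idea looks roughly like this. The names, route handling, and defaults here are illustrative, and only the non-streaming path with plain string messages is shown (streaming SSE, system prompts, content blocks, and tool use are omitted).

```python
# Minimal sketch of an Anthropic-API-compatible MLX server (illustrative, not proxy/server.py).
import json
import re
from http.server import BaseHTTPRequestHandler, HTTPServer
from mlx_lm import load, generate

MODEL_ID = "mlx-community/Qwen3.5-122B-A10B-4bit"
model, tokenizer = load(MODEL_ID)            # 1. load the weights onto the GPU via MLX


class Handler(BaseHTTPRequestHandler):
    def do_POST(self):                       # 2. speak (a subset of) the Anthropic Messages API
        if self.path != "/v1/messages":
            self.send_error(404)
            return
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        # Claude Code may send content as lists of blocks; this sketch assumes plain strings.
        messages = [{"role": m["role"], "content": m["content"]} for m in body["messages"]]
        prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
        text = generate(model, tokenizer, prompt=prompt,
                        max_tokens=body.get("max_tokens", 1024))
        # 3. strip the model's think-aloud tags before replying
        text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
        reply = {
            "id": "msg_local", "type": "message", "role": "assistant",
            "model": body.get("model", MODEL_ID),
            "content": [{"type": "text", "text": text}],
            "stop_reason": "end_turn",
            "usage": {"input_tokens": 0, "output_tokens": 0},
        }
        data = json.dumps(reply).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


HTTPServer(("127.0.0.1", 4000), Handler).serve_forever()
```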

🌐 Browser Control

Claude Code can control your real web browser — not a sandbox. Your actual browser with all your logins. 🔓

|            | 🟢 Chrome DevTools (CDP) | 🔵 Playwright              |
|------------|--------------------------|----------------------------|
| Controls   | Your real Brave/Chrome   | Separate sandboxed browser |
| Logged in? | ✅ All your sessions     | ❌ Starts fresh            |
| Speed      | ⚡ Fast                  | 🐌 Slower                  |
| Best for   | Daily tasks              | Automated jobs             |

💡 Example: "Go to my GitHub and check which PRs need review" — it opens your actual browser, already logged in, and does it. No re-authenticating. Ever.
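
For the CDP route, the browser has to be started with remote debugging enabled. A quick way to confirm it's reachable (a sketch; port 9222 is the conventional default, not something this repo mandates):

```python
# Check that a Chrome/Brave instance is exposing the DevTools Protocol.
# Assumes the browser was launched with --remote-debugging-port=9222.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:9222/json/version") as resp:
    info = json.load(resp)
print(info["Browser"], "reachable at", info["webSocketDebuggerUrl"])
```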


✈️ When To Use This

| Situation                   | Use This? | Why                                   |
|-----------------------------|-----------|---------------------------------------|
| On a plane                  | ✅        | Full AI coding, no internet needed    |
| Sensitive client code       | ✅        | Nothing leaves your machine           |
| Don't want API fees         | ✅        | $0/month forever                      |
| Want fastest possible       | ☁️        | Cloud API is still faster             |
| Need Claude-level reasoning | ☁️        | Local model is good, not Claude-level |
| Controlling from phone      | ✅        | iMessage pipeline works offline       |

📁 What's In This Repo

📦 claude-code-local/
 ├── ⚡ proxy/
 │   └── server.py              ← The entire server. 200 lines. This IS the project.
 ├── 🚀 launchers/
 │   ├── Claude Local.command    ← Double-click to start everything
 │   └── Browser Agent.command   ← Double-click for browser control
 ├── 🛠️ scripts/
 │   ├── download-and-import.sh  ← Download models
 │   ├── persistent-download.sh  ← Auto-retry downloader
 │   └── start-mlx-server.sh    ← Alternative config
 ├── 📊 docs/
 │   ├── BENCHMARKS.md           ← Detailed speed comparisons
 │   └── TWITTER-THREAD.md       ← Social media content
 └── setup.sh                    ← One-command installer

🔒 Security

We audited every component before running it:

| Component      | Source                 | Network Calls | Verdict |
|----------------|------------------------|---------------|---------|
| server.py      | We wrote it            | 0             | ✅ Safe |
| MLX framework  | Apple                  | 0             | ✅ Safe |
| Qwen 3.5 model | HuggingFace (verified) | 0             | ✅ Safe |

🚫 No telemetry 🚫 No analytics 🚫 No phone-home 🚫 No sketchy pip packages

⚠️ We removed LiteLLM after supply chain attack concerns were raised. Every dependency was audited.
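
One way to keep that audit honest is to check that server.py only imports what you expect. This is a sketch; the allowlist below is an example, not the repo's actual policy.

```python
# List every module server.py imports and flag anything outside an allowlist.
import ast
import pathlib

ALLOWED = {"json", "re", "http", "mlx_lm"}   # example allowlist, adjust to taste

tree = ast.parse(pathlib.Path("proxy/server.py").read_text())
imports = set()
for node in ast.walk(tree):
    if isinstance(node, ast.Import):
        imports.update(alias.name.split(".")[0] for alias in node.names)
    elif isinstance(node, ast.ImportFrom) and node.module:
        imports.add(node.module.split(".")[0])

unexpected = imports - ALLOWED
print("imports:", sorted(imports))
print("unexpected:", sorted(unexpected) or "none")
```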


🛤️ The Journey

We didn't start here. We went through three generations in one night:

| Gen | What We Tried                | Speed    | 💡 What We Learned                                                       |
|-----|------------------------------|----------|--------------------------------------------------------------------------|
| 1️⃣  | Ollama + custom proxy        | 30 tok/s | Ollama works but Claude Code can't talk to it directly                   |
| 2️⃣  | llama.cpp TurboQuant + proxy | 41 tok/s | TurboQuant compresses KV cache 4.9x, but the proxy is the bottleneck     |
| 3️⃣  | MLX native server            | 65 tok/s | Kill the proxy. Speak Anthropic API directly. 7.5x faster.               |

🎯 Each generation taught us something. The final insight — the proxy was the bottleneck, not the model — changed everything.


🙏 Credits

Built on the shoulders of giants:

| Project        | What It Does                  | By              |
|----------------|-------------------------------|-----------------|
| 🤖 Claude Code | AI coding agent               | Anthropic       |
| 🍎 MLX         | Apple Silicon ML framework    | Apple           |
| 📦 mlx-lm      | Model loading + inference     | Apple           |
| 🧠 Qwen 3.5    | The 122B model                | Alibaba         |
| TurboQuant     | KV cache compression research | Google Research |

Tested on Apple M5 Max with 128 GB unified memory.


📜 MIT License — Use it however you want.

Star this repo if it helped you!