Upload a screenshot, diagram, or CLI output — get a structured markdown description.
Built for on-device inference — with an API fallback for machines without a GPU.
Built as part of an evaluation of local agentic coding setups. The entire codebase was written using a local model running on an RTX 3090.
| | Full (Local) | Lite (Online) |
|---|---|---|
| Model | Gemma-3-4B (on-device) | Claude Sonnet (Anthropic API) |
| GPU | NVIDIA, 4 GB+ VRAM required | None — runs on any machine |
| Offline | Yes | No (API calls) |
| Image size | ~12 GB | ~600 MB |
| Cost | Free after setup | Per-request Anthropic pricing |
Fastest way to try caption-engine. No GPU required. Requires an Anthropic API key.
```bash
# Pull compose file
curl -O https://raw.githubusercontent.com/cubebecu/caption-engine/main/docker-compose-lite.yml

# Set your API key
export ANTHROPIC_API_KEY=sk-ant-...

# Start the service
docker compose -f docker-compose-lite.yml up -d
```

Open http://localhost:8000 in your browser.
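A quick sanity check before opening the UI:

```bash
# Confirm the service is up
docker compose -f docker-compose-lite.yml ps

# Follow startup logs (Ctrl-C to stop following)
docker compose -f docker-compose-lite.yml logs -f
```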
For workflows where screenshots can't leave your network or where API costs add up at scale.
- Docker + Docker Compose
- NVIDIA GPU (4 GB VRAM minimum): Ada, Ampere, Hopper, Blackwell, Turing
- NVIDIA Driver 570+
- NVIDIA Container Toolkit
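Before installing the toolkit, it's worth confirming the driver already meets the 570+ requirement:

```bash
# Driver version must be 570 or newer; also confirms the GPU and its VRAM
nvidia-smi --query-gpu=driver_version,name,memory.total --format=csv
```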
```bash
# Ubuntu
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```
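To confirm Docker can now see the GPU, run a throwaway CUDA container (the image tag is just an example; any recent `nvidia/cuda` tag will do):

```bash
# Should print the same table as running nvidia-smi on the host
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```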
```bash
# Pull compose file
curl -O https://raw.githubusercontent.com/cubebecu/caption-engine/main/docker-compose.yml

# Start the service
docker compose up -d
```

Open http://localhost:8000 in your browser.
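The first start pulls the ~12 GB image and loads the model into VRAM, so expect a delay; the logs show progress:

```bash
# Follow startup logs until the model finishes loading
docker compose logs -f
```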
Upload an image, click Generate Caption, get markdown output.
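The UI is backed by an HTTP API (documented in the technical reference below). As a sketch only, the route and form field here are assumptions, not the documented contract:

```bash
# Hypothetical route and field names; see the API reference for the real ones
curl -s -X POST http://localhost:8000/api/caption \
  -F "image=@screenshot.png"
```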
| Component (Full mode) | Details |
|---|---|
| Model | Gemma-3-4B (multimodal vision, quantized Q4_K_M) |
| Backend | llama.cpp + FastAPI |
| GPU | NVIDIA CUDA, all 32 layers offloaded |
| VRAM | 4 GB minimum (no CPU fallback) |
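"All 32 layers offloaded" maps to llama.cpp's `--n-gpu-layers` flag. A minimal sketch of an equivalent standalone launch, assuming a locally downloaded GGUF file (the model path and port are assumptions, not the container's actual command):

```bash
# Sketch: serve Gemma-3-4B Q4_K_M with every layer on the GPU.
# Model path and port are illustrative, not what the image ships with.
llama-server \
  --model /models/gemma-3-4b-it-Q4_K_M.gguf \
  --n-gpu-layers 32 \
  --port 8080
```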
| Component (Lite mode) | Details |
|---|---|
| Model | Auto-detected latest Claude Sonnet |
| Backend | Anthropic Messages API + FastAPI |
| Config | LLM_BACKEND=anthropic, ANTHROPIC_API_KEY required |
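The two variables in the table are the whole Lite configuration. A minimal sketch running the container directly (the image name is a placeholder; use the one referenced in docker-compose-lite.yml):

```bash
# Image name is a placeholder; the env vars are the documented ones
docker run -d -p 8000:8000 \
  -e LLM_BACKEND=anthropic \
  -e ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" \
  caption-engine:lite
```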
Full technical reference — API endpoints, configuration, GPU requirements, building from source:
> [!NOTE]
> Built by cubebecu as part of an evaluation of local agentic coding setups. The entire codebase was written using a local model running on a single NVIDIA GPU.
> [!NOTE]
> **Code — Apache License 2.0**
>
> The application code in this repository is licensed under Apache 2.0. See `LICENSE` for the full text.
> [!NOTE]
> **Model weights (Local mode only) — Google Gemma Terms of Use**
>
> The Docker image bundles Gemma-3-4B model weights, which are NOT under Apache 2.0. Gemma is governed by Google's Gemma Terms of Use and Prohibited Use Policy.
>
> By pulling the Docker image or using the bundled model in any form, you agree to those terms. The Prohibited Use Policy contains binding restrictions on what Gemma may be used for — read it before deploying this in production.
>
> Lite mode does not bundle any model weights and is not subject to the Gemma Terms of Use.
See NOTICE and third_party/ for full attribution and license texts of bundled dependencies.


