Upload a screenshot, diagram, or CLI output — get a structured markdown description.
Built for on-device inference — with an API fallback for machines without a GPU.
Built as part of an evaluation of local agentic coding setups. The entire codebase was written using a local model running on an RTX 3090.
| | Full (Local) | Lite (Online) |
|---|---|---|
| Model | Gemma-3-4B (on-device) | Claude Sonnet (Anthropic API) |
| GPU | NVIDIA, 4 GB+ VRAM required | None — runs on any machine |
| Offline | Yes | No (API calls) |
| Image size | ~12 GB | ~600 MB |
| Cost | Free after setup | Per-request Anthropic pricing |
Fastest way to try caption-engine. No GPU required. Requires an Anthropic API key.
```bash
# Pull compose file
curl -O https://raw.githubusercontent.com/cubebecu/caption-engine/main/docker-compose-lite.yml

# Set your API key
export ANTHROPIC_API_KEY=sk-ant-...

# Start the service
docker compose -f docker-compose-lite.yml up -d
```

Open http://localhost:8000 in your browser.
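A quick sanity check before opening the UI:

```bash
# Confirm the service is up
docker compose -f docker-compose-lite.yml ps

# Follow startup logs (Ctrl-C to stop following)
docker compose -f docker-compose-lite.yml logs -f
```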
For workflows where screenshots can't leave your network or where API costs add up at scale.
- Docker + Docker Compose
- NVIDIA GPU (4 GB VRAM minimum): Ada, Ampere, Hopper, Blackwell, Turing
- NVIDIA Driver 570+
- NVIDIA Container Toolkit
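Before installing the toolkit, it's worth confirming the driver already meets the 570+ requirement:

```bash
# Driver version must be 570 or newer; also confirms the GPU and its VRAM
nvidia-smi --query-gpu=driver_version,name,memory.total --format=csv
```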
```bash
# Ubuntu
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```
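To confirm Docker can now see the GPU, run a throwaway CUDA container (the image tag is just an example; any recent `nvidia/cuda` tag will do):

```bash
# Should print the same table as running nvidia-smi on the host
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```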
```bash
# Pull compose file
curl -O https://raw.githubusercontent.com/cubebecu/caption-engine/main/docker-compose.yml

# Start the service
docker compose up -d
```

Open http://localhost:8000 in your browser.
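The first start pulls the ~12 GB image and loads the model into VRAM, so expect a delay; the logs show progress:

```bash
# Follow startup logs until the model finishes loading
docker compose logs -f
```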
Upload an image, click Generate Caption, get markdown output.
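The UI is backed by an HTTP API (documented in the technical reference below). As a sketch only, the route and form field here are assumptions, not the documented contract:

```bash
# Hypothetical route and field names; see the API reference for the real ones
curl -s -X POST http://localhost:8000/api/caption \
  -F "image=@screenshot.png"
```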
| Component (Full mode) | Details |
|---|---|
| Model | Gemma-3-4B (multimodal vision, quantized Q4_K_M) |
| Backend | llama.cpp + FastAPI |
| GPU | NVIDIA CUDA, all 32 layers offloaded |
| VRAM | 4 GB minimum (no CPU fallback) |
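"All 32 layers offloaded" maps to llama.cpp's `--n-gpu-layers` flag. A minimal sketch of an equivalent standalone launch, assuming a locally downloaded GGUF file (the model path and port are assumptions, not the container's actual command):

```bash
# Sketch: serve Gemma-3-4B Q4_K_M with every layer on the GPU.
# Model path and port are illustrative, not what the image ships with.
llama-server \
  --model /models/gemma-3-4b-it-Q4_K_M.gguf \
  --n-gpu-layers 32 \
  --port 8080
```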
| Component (Lite mode) | Details |
|---|---|
| Model | Auto-detected latest Claude Sonnet |
| Backend | Anthropic Messages API + FastAPI |
| Config | LLM_BACKEND=anthropic, ANTHROPIC_API_KEY required |
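The two variables in the table are the whole Lite configuration. A minimal sketch running the container directly (the image name is a placeholder; use the one referenced in docker-compose-lite.yml):

```bash
# Image name is a placeholder; the env vars are the documented ones
docker run -d -p 8000:8000 \
  -e LLM_BACKEND=anthropic \
  -e ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" \
  caption-engine:lite
```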
Full technical reference — API endpoints, configuration, GPU requirements, building from source:
> [!NOTE]
> Built by cubebecu as part of an evaluation of local agentic coding setups. The entire codebase was written using a local model running on a single NVIDIA GPU.
> [!NOTE]
> **Code — Apache License 2.0**
>
> The application code in this repository is licensed under Apache 2.0. See `LICENSE` for the full text.
> [!NOTE]
> **Model weights (Local mode only) — Google Gemma Terms of Use**
>
> The Docker image bundles Gemma-3-4B model weights, which are NOT under Apache 2.0. Gemma is governed by Google's Gemma Terms of Use and Prohibited Use Policy.
>
> By pulling the Docker image or using the bundled model in any form, you agree to those terms. The Prohibited Use Policy contains binding restrictions on what Gemma may be used for — read it before deploying this in production.
>
> Lite mode does not bundle any model weights and is not subject to the Gemma Terms of Use.
See NOTICE and third_party/ for full attribution and license texts of bundled dependencies.


