
caption-engine

Image captioning for technical documentation

Upload a screenshot, diagram, or CLI output — get a structured markdown description.
Built for on-device inference — with an API fallback for machines without a GPU.

Built as part of an evaluation of local agentic coding setups. The entire codebase was written using a local model running on an NVIDIA RTX 3090.


Web UI (screenshot)

Two Modes

|            | Full (Local)                | Lite (Online)                 |
|------------|-----------------------------|-------------------------------|
| Model      | Gemma-3-4B (on-device)      | Claude Sonnet (Anthropic API) |
| GPU        | NVIDIA, 4 GB+ VRAM required | None — runs on any machine    |
| Offline    | Yes                         | No (API calls)                |
| Image size | ~12 GB                      | ~600 MB                       |
| Cost       | Free after setup            | Per-request Anthropic pricing |

Quick Start — Lite (no GPU)

Fastest way to try caption-engine. No GPU required. Requires an Anthropic API key.

# Pull compose file
curl -O https://raw.githubusercontent.com/cubebecu/caption-engine/main/docker-compose-lite.yml

# Set your API key
export ANTHROPIC_API_KEY=sk-ant-...

# Start the service
docker compose -f docker-compose-lite.yml up -d

Open http://localhost:8000 in your browser.
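If you prefer not to export the key in your shell (where it can end up in history), Docker Compose also substitutes variables from a `.env` file in the working directory. A minimal sketch:

```shell
# Alternative to `export`: docker compose reads a .env file placed next
# to the compose file and substitutes its variables automatically.
cat > .env <<'EOF'
ANTHROPIC_API_KEY=sk-ant-...
EOF
chmod 600 .env   # the key should be readable only by you
```

Then start the service as above; no `export` step is needed.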

Quick Start — Local (GPU)

For workflows where screenshots can't leave your network or where API costs add up at scale.

Prerequisites

  • Docker + Docker Compose
  • NVIDIA GPU (4 GB VRAM minimum): Turing, Ampere, Ada, Hopper, or Blackwell
  • NVIDIA Driver 570+
  • NVIDIA Container Toolkit
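Before installing the toolkit, it's worth confirming the driver actually sees the GPU. A quick sketch (nvidia-smi ships with the NVIDIA driver):

```shell
# Print GPU name, driver version, and total VRAM; fall back to a hint
# if the driver (and nvidia-smi with it) is not installed yet.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv
else
  echo "nvidia-smi not found: install NVIDIA driver 570+ first"
fi
```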

Install NVIDIA Container Toolkit

# Ubuntu
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
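After the restart, you can check that containers actually reach the GPU by running nvidia-smi inside a throwaway CUDA container. A sketch — the `nvidia/cuda` image tag here is an assumption; any recent base tag works:

```shell
# Success means the container toolkit is wired up correctly.
# Guarded so the script degrades gracefully instead of aborting.
if command -v docker >/dev/null 2>&1; then
  docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi \
    || echo "GPU not visible to Docker - check the toolkit configuration"
else
  echo "docker not found"
fi
```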

Install Caption Engine

# Pull compose file
curl -O https://raw.githubusercontent.com/cubebecu/caption-engine/main/docker-compose.yml

# Start the service
docker compose up -d

Open http://localhost:8000 in your browser.

Upload an image, click Generate Caption, get markdown output.

Under the Hood

Local Mode

| Component | Details                                          |
|-----------|--------------------------------------------------|
| Model     | Gemma-3-4B (multimodal vision, quantized Q4_K_M) |
| Backend   | llama.cpp + FastAPI                              |
| GPU       | NVIDIA CUDA, all 32 layers offloaded             |
| VRAM      | 4 GB minimum (no CPU fallback)                   |

Lite Mode

| Component | Details                                           |
|-----------|---------------------------------------------------|
| Model     | Auto-detected latest Claude Sonnet                |
| Backend   | Anthropic Messages API + FastAPI                  |
| Config    | LLM_BACKEND=anthropic, ANTHROPIC_API_KEY required |
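Put together, a minimal Lite-mode environment looks like this, assuming both values from the table above are read from the environment (the Lite compose file in the Quick Start wires them through for you):

```shell
# Minimal environment for Lite mode: select the Anthropic backend and
# supply the API key; the service consumes both at startup.
export LLM_BACKEND=anthropic
export ANTHROPIC_API_KEY=sk-ant-...   # replace with a real key
```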

Results

Batch processing (web UI screenshot)

Caption repository (web UI screenshot)

Documentation

Full technical reference — API endpoints, configuration, GPU requirements, building from source:

DOCS.md

Author

Note

Built by cubebecu as part of an evaluation of local agentic coding setups. The entire codebase was written using a local model running on a single NVIDIA GPU.

License

Note

Code — Apache License 2.0
The application code in this repository is licensed under Apache 2.0. See LICENSE for full text.

Note

Model weights (Local mode only) — Google Gemma Terms of Use
The Docker image bundles Gemma-3-4B model weights, which are NOT under Apache 2.0. Gemma is governed by Google's Gemma Terms of Use and Prohibited Use Policy. By pulling the Docker image or using the bundled model in any form, you agree to those terms. The Prohibited Use Policy contains binding restrictions on what Gemma may be used for — read it before deploying this in production.

Lite mode does not bundle any model weights and is not subject to Gemma ToU.

Third-party components

See NOTICE and third_party/ for full attribution and license texts of bundled dependencies.
