Early Alpha Release — VoxWatch is under active development and being released early for community feedback. Expect bugs, breaking changes, and rough edges. Issues and pull requests are very welcome. See Known Limitations below.
AI-powered security deterrent that makes your cameras talk back.
VoxWatch turns passive security cameras into active deterrents. When Frigate detects a person, VoxWatch instantly warns them over the camera speaker, then escalates with AI-generated descriptions of their appearance and behavior — all in real-time.
"All units, 10-97 at 482 Elm Street. One suspect, dark hoodie, approaching the front gate." [radio pause] "Copy dispatch. Unit 7 en route."
That's what an intruder hears. Not a beep. Not silence. A specific, real-time callout that makes it obvious someone is watching.
Grab the docker-compose.yml and deploy it however you normally run containers — Portainer, docker compose up -d, Dockge, whatever. Images are on GHCR so no building required.
Open http://your-host:33344 — the setup wizard auto-discovers Frigate, MQTT, and your cameras, then walks you through AI provider, TTS engine, and response mode selection. No config files to edit.
Prerequisites: Frigate NVR with MQTT + go2rtc + a camera with two-way audio.
```
Frigate NVR    MQTT            VoxWatch Service        MQTT         Home Assistant
 ┌──────┐      Event         ┌──────────────────┐     Events      ┌─────────────────┐
 │Detect│ ──────────────>    │ 1. Initial       │ ──────────────> │ Lights, Locks,  │
 │Person│                    │    Response      │                 │ Notifications,  │
 └──────┘                    │                  │                 │ Automations     │
                             │ 2. Escalation    │                 └────────┬────────┘
                             │    (AI analysis) │    voxwatch/announce     │
                             │                  │ <────────────────────────┘
                             │ 3. Persistent    │  (TTS on camera speakers)
                             │    Deterrence    │
                             │                  │
                             │ 4. Resolution    │
                             └──────┬───────────┘
                                    │
                     ┌──────────────┴──────────────┐
                     v                             v
              Audio Pipeline               go2rtc Audio Push
              (TTS + Effects)              (RTSP Backchannel)
                     │                             │
                     └──────────────┬──────────────┘
                                    v
                             Camera Speaker
```
| Stage | Timing | What Happens |
|---|---|---|
| 1. Initial Response | 0-2 seconds | Pre-cached warning plays immediately. AI analysis starts in parallel. |
| 2. Escalation | 5-15 seconds | AI analyzes snapshots. Describes appearance, clothing, carried items, actions. The intruder hears themselves described in real-time. |
| 3. Persistent Deterrence | 30s+ loops | If person stays, VoxWatch keeps warning them with fresh AI descriptions every N seconds. Tone escalates with each iteration. Configurable max iterations. |
| 4. Resolution | After person leaves | Optional "all clear" message. Disabled by default. |
Each stage only fires if the person is still detected (Frigate re-check). AI adapts automatically for nightvision — no color descriptions from IR footage.
Home Assistant integration: Every stage publishes an MQTT event (voxwatch/events/stage) with the stage number, camera name, and AI description. Build HA automations that escalate with each stage — lights on at Stage 1, phone notification at Stage 2, sirens and door locks at Stage 3. See Home Assistant Integration for examples.
Persistent Deterrence (optional): When enabled, Stage 3 loops. Each iteration waits a configurable delay, re-checks whether the person is still present, generates a fresh AI description, and escalates the tone. The loop ends when the person leaves or a configurable max iteration count is reached. Configure in the Pipeline tab under Persistent Deterrence.
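The loop described above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not VoxWatch's actual code: `person_present`, `describe`, and `speak` are hypothetical callables standing in for the Frigate re-check, the AI description, and the audio push, and the default delay and iteration count are placeholders.

```python
import time
from typing import Callable


def persistent_deterrence(
    person_present: Callable[[], bool],   # hypothetical: Frigate re-check
    describe: Callable[[int], str],       # hypothetical: fresh AI description
    speak: Callable[[str], None],         # hypothetical: TTS + audio push
    delay_seconds: float = 30.0,
    max_iterations: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run the deterrence loop and return how many warnings were played."""
    played = 0
    for iteration in range(1, max_iterations + 1):
        sleep(delay_seconds)            # wait the configured delay
        if not person_present():        # re-check before every warning
            break                       # person left: stop looping
        # Passing the iteration number lets the description escalate in tone.
        speak(describe(iteration))
        played += 1
    return played
```

Injecting `sleep` keeps the sketch easy to test without real delays; a real service would more likely use an async sleep inside its event loop.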
Control not just what is said, but how it's said. 10 built-in response modes + custom mode.
| Mode | Style | Example |
|---|---|---|
| police_dispatch | Full radio dispatch with 10-codes, officer response, radio effects | "All units, 10-97 at 482 Elm. One suspect, dark hoodie..." |
| live_operator | Simulates a real person watching live | "I can see you moving to the left." |
| private_security | Corporate security firm, firm and professional | "You are on private property and under surveillance." |
| recorded_evidence | Cold, system-driven, forensic | "Subject recorded. Entry logged. Authorities notified." |
| homeowner | Personal, calm, direct | "Hey. I can see you. You need to leave." |
| automated_surveillance | Neutral AI monitoring voice | "Movement detected. Behavior flagged." |
| standard | Clear, authoritative default | "Attention. You are being recorded. Leave immediately." |
| Mode | Use Case |
|---|---|
| guard_dog | Imply dog threat. "They haven't been fed yet." |
| neighborhood_watch | Community pressure. "Neighbors have been alerted." |
| silent_pressure | Delayed, tension-building response |
Write your own persona via response_mode.custom_prompt in config. Full control over tone, vocabulary, and escalation style.
The flagship feature. Simulates a complete police radio transmission with authentic radio effects, 10-codes, and an officer response.
Full sequence:
- Channel Intro — Clean voice: "Connecting to County Sheriff dispatch frequency..." + radio tuning static + tail end of another call
- Main Dispatch — [beep] Female dispatcher (radio-processed): "All units, 10-97 at 482 Elm Street. One suspect on property. Subject wearing dark hoodie, estimated six feet tall."
- Officer Response — [beep] Male officer (different voice, radio-processed): "Copy dispatch. Unit 7 en route. ETA two minutes."
All customizable — address, agency, callsign, officer voice, radio intensity, channel intro toggle. ElevenLabs voices sound the most realistic for dispatch — the voice quality makes a big difference when you're simulating a real radio call.
| Preset | Sound | Use Case |
|---|---|---|
| low | Natural, conversational | Casual radio chatter |
| medium | Standard police radio (default) | Realistic dispatch |
| high | Gritty scanner sound | Maximum intimidation |
Fine-grained control: bandpass frequency, compression, noise level, squelch toggle.
Group cameras by physical area so that one detection triggers one speaker and all cameras in the zone share a single cooldown timer.
Useful when multiple cameras cover the same area (e.g. two angles on the front gate). Without zones, each camera fires independently and the intruder hears duplicate audio.
```yaml
zones:
  front_yard:
    cameras: [frontdoor, driveway]
    speaker: frontdoor        # audio always plays on this camera
    cooldown_seconds: 90      # optional per-zone cooldown override
  back_yard:
    cameras: [backgate, patio]
    speaker: backgate
```

| Behavior | Detail |
|---|---|
| Shared cooldown | Keyed by zone name ("zone:front_yard"). First camera to fire blocks the rest. |
| Speaker routing | Audio pushed to the zone speaker stream, not the triggering camera stream. |
| Zone cooldown | Overrides global cooldown when set; falls back to global when omitted. |
| Per-camera override | A camera not in any zone still supports audio_output for individual speaker routing. |
Configure zones in the Camera Zones tab of the Config editor.
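The shared-cooldown and speaker-routing rules in the table can be sketched as below. This is an illustrative sketch, not the project's implementation: the `CooldownTracker` class, the `camera:` key prefix for unzoned cameras, and the 60-second global default are assumptions, and the per-camera `audio_output` override is not modeled.

```python
import time
from typing import Callable, Dict, Optional


class CooldownTracker:
    """Tracks cooldowns keyed by zone, or by camera when no zone matches."""

    def __init__(self, zones: Dict[str, dict], global_cooldown: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.zones = zones
        self.global_cooldown = global_cooldown
        self.clock = clock
        self._last_fired: Dict[str, float] = {}

    def _zone_for(self, camera: str) -> Optional[str]:
        for name, zone in self.zones.items():
            if camera in zone["cameras"]:
                return name
        return None

    def try_fire(self, camera: str) -> Optional[str]:
        """Return the camera whose speaker should play, or None if cooling down."""
        zone_name = self._zone_for(camera)
        if zone_name is None:
            # Unzoned camera: own key, global cooldown, own speaker.
            key, cooldown, speaker = f"camera:{camera}", self.global_cooldown, camera
        else:
            zone = self.zones[zone_name]
            key = f"zone:{zone_name}"  # shared by every camera in the zone
            cooldown = zone.get("cooldown_seconds", self.global_cooldown)
            speaker = zone["speaker"]
        now = self.clock()
        last = self._last_fired.get(key)
        if last is not None and now - last < cooldown:
            return None                # another camera in the zone already fired
        self._last_fired[key] = now
        return speaker
```

With the `front_yard` zone from the example config, a detection on `driveway` would route audio to `frontdoor`, and a detection on `frontdoor` a few seconds later would be suppressed until the 90-second zone cooldown expires.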
7 providers with automatic fallback chain. Primary fails? Secondary kicks in seamlessly.
| Provider | Latency | Cost/Detection | Video Support | Local/Cloud |
|---|---|---|---|---|
| Google Gemini Flash | 2-5s | ~$0.001 | Yes (native) | Cloud |
| OpenAI GPT-4o | 3-5s | ~$0.005-0.012 | Snapshots | Cloud |
| Anthropic Claude Haiku | 3-5s | ~$0.003 | Snapshots | Cloud |
| xAI Grok | 3-5s | ~$0.005 | Snapshots | Cloud |
| Ollama (LLaVA) | 5-15s | Free | Snapshots | Local |
| Custom OpenAI-compatible | Varies | Varies | Varies | Either |
| Fallback error handling | <1s | Free | N/A | Local |
Nightvision-aware: Automatically adapts prompts for IR footage — focuses on silhouette, build, and clothing type instead of unreliable colors.
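A fallback chain of this kind can be sketched in a few lines of Python. The sketch below is an assumption about the general shape, not the project's code: `describe_with_fallback` and the provider callables are hypothetical, and the canned last-resort text is borrowed from the `standard` response mode example purely for illustration.

```python
from typing import Callable, Iterable, Tuple

# (provider name, callable that turns a snapshot into a description)
Provider = Tuple[str, Callable[[bytes], str]]

# Assumed last-resort message when every provider fails.
FALLBACK_TEXT = "Attention. You are being recorded. Leave immediately."


def describe_with_fallback(snapshot: bytes,
                           providers: Iterable[Provider]) -> Tuple[str, str]:
    """Try each provider in order; return (provider_name, description)."""
    for name, call in providers:
        try:
            description = call(snapshot)
        except Exception:
            continue                   # provider failed: try the next one
        if description.strip():        # ignore empty answers
            return name, description
    return "fallback", FALLBACK_TEXT
```

Catching every exception is deliberately blunt here; a production chain would probably distinguish timeouts, authentication errors, and rate limits, and log each failure.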
7 text-to-speech engines with automatic fallback chain.
| Provider | Quality | Latency | Cost | Local/Cloud |
|---|---|---|---|---|
| Kokoro-82M | Near-human | 1-3s | Free | Local |
| Piper | Natural | <1s | Free | Local (bundled) |
| ElevenLabs | Highest | 1-3s | $5-99/mo | Cloud |
| Cartesia Sonic | Excellent | 0.5-1s | Paid | Cloud |
| Amazon Polly | Good | 1-3s | $0.02/1k chars | Cloud |
| OpenAI TTS | Good | 1-3s | $0.015/1k chars | Cloud |
| espeak-ng | Robotic | <1s | Free | Local (always available) |
Natural cadence speech: AI responses are broken into phrases with human-like pauses between thoughts, not read as a single flat script. Punctuation-aware timing and optional per-phrase speed variation across all providers.
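One way to picture punctuation-aware phrasing is the sketch below, which splits a response after punctuation marks and attaches a pause to each phrase. The pause lengths and the `split_into_phrases` helper are hypothetical; the actual cadence system may segment text differently.

```python
import re
from typing import List, Tuple

# Hypothetical pause lengths in seconds, keyed by the mark that ends a phrase.
PAUSES = {".": 0.6, "!": 0.6, "?": 0.6, ",": 0.25, ";": 0.35, ":": 0.35}


def split_into_phrases(text: str) -> List[Tuple[str, float]]:
    """Split text after punctuation and pair each phrase with a pause."""
    phrases = []
    # The lookbehind keeps the punctuation attached to the phrase it ends.
    for chunk in re.split(r"(?<=[.!?,;:])\s+", text.strip()):
        if not chunk:
            continue
        pause = PAUSES.get(chunk[-1], 0.0)   # no pause after an unpunctuated tail
        phrases.append((chunk, pause))
    return phrases
```

Each phrase could then be synthesized separately and joined with silence of the given length, which is what would make the result sound less like a single flat script.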
Two-way MQTT integration — no custom components needed.
| Direction | Topic | Purpose |
|---|---|---|
| VoxWatch → HA | voxwatch/events/detection | Person detected: trigger lights, notifications |
| VoxWatch → HA | voxwatch/events/stage | Stage fired: escalating automations |
| VoxWatch → HA | voxwatch/events/ended | Detection over: restore normal state |
| VoxWatch → HA | voxwatch/status | Online/offline (LWT): availability sensor |
| HA → VoxWatch | voxwatch/announce | Play TTS on camera speakers on demand |
Use VoxWatch as a general-purpose announcement system for any camera with a speaker:
```yaml
automation:
  - alias: "Doorbell announcement"
    trigger:
      - platform: state
        entity_id: binary_sensor.doorbell
        to: "on"
    action:
      - service: mqtt.publish
        data:
          topic: "voxwatch/announce"
          payload: '{"camera": "driveway", "message": "Someone is at the front door.", "tone": "short"}'
```

Supports: camera, message, voice, provider, speed, tone. Also available via REST at POST /api/audio/announce.
Full docs with automation examples: docs/HOME_ASSISTANT.md
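If you publish announcements from a script rather than a Home Assistant automation, a small helper can build the JSON payload and catch typos in field names early. This is a sketch under assumptions: `build_announce_payload` is a hypothetical helper, and treating `camera` and `message` as required is a guess based on the example above, not documented behavior.

```python
import json
from typing import Any

# Field names listed as supported for the announce payload.
ALLOWED_FIELDS = {"camera", "message", "voice", "provider", "speed", "tone"}


def build_announce_payload(camera: str, message: str, **options: Any) -> str:
    """Return a JSON string suitable for the voxwatch/announce topic."""
    unknown = set(options) - ALLOWED_FIELDS
    if unknown:
        raise ValueError(f"unsupported fields: {sorted(unknown)}")
    if not camera or not message:
        # Assumption: both fields are needed for an announcement to make sense.
        raise ValueError("camera and message are required")
    return json.dumps({"camera": camera, "message": message, **options})
```

The resulting string can be passed as the payload of any MQTT client's publish call, or sent as the body of the REST endpoint mentioned above.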
VoxWatch pushes audio through camera backchannels via go2rtc. One-way outbound only — no recording.
| Camera | Codec | Speaker | Status |
|---|---|---|---|
| Reolink CX410 | PCMU/8000 | Built-in | Working |
| Reolink CX420 | PCMU/8000 | Built-in | Working |
| Reolink E1 Zoom | PCMU/8000 | Built-in | Working |
| Dahua IPC-Color4K-T180 | PCMA/8000 | Built-in | Working |
| Dahua IPC-T54IR | PCMA/8000 | RCA out | Compatible |
| Dahua IPC-B54IR | PCMA/8000 | RCA out | Compatible |
Per-camera codec override supported. The setup wizard auto-detects backchannel codec.
Latency: Stage 1 in 0-2s (pre-cached), Stage 2 in 5-8s (AI hidden behind Stage 1).
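The codecs in the table map onto standard ffmpeg encoders: PCMU is G.711 mu-law (`pcm_mulaw`) and PCMA is G.711 A-law (`pcm_alaw`), both at 8 kHz mono. The sketch below builds an ffmpeg command line for that conversion. It is an illustration only; the `ffmpeg_args` helper and the choice of raw output are assumptions, and VoxWatch's own conversion step may use different options.

```python
from typing import List

# Codec names from the table mapped to ffmpeg (encoder, raw muxer) names.
CODECS = {
    "PCMU/8000": ("pcm_mulaw", "mulaw"),
    "PCMA/8000": ("pcm_alaw", "alaw"),
}


def ffmpeg_args(src: str, dst: str, codec: str = "PCMU/8000") -> List[str]:
    """Build an ffmpeg command that converts src to the camera's codec."""
    if codec not in CODECS:
        raise ValueError(f"unsupported codec: {codec}")
    encoder, fmt = CODECS[codec]
    rate = codec.split("/")[1]          # sample rate, e.g. "8000"
    return [
        "ffmpeg", "-y", "-i", src,
        "-ar", rate,                    # resample to the backchannel rate
        "-ac", "1",                     # mono
        "-c:a", encoder,
        "-f", fmt,
        dst,
    ]
```

The list can be passed directly to `subprocess.run(...)` on a host where ffmpeg is installed.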
Full-featured React + TypeScript + Tailwind dashboard at http://your-host:33344.
- Setup Wizard — 5-step guided flow: discover cameras, detect codecs, test audio, configure, save
- Camera Management — Backchannel status, last detection timestamps, ONVIF identification
- Configuration Editor — Form-based with dropdowns, connection testing, in-browser voice preview
- Audio Test Player — Push test audio to any camera speaker (rate-limited, mobile-friendly)
- System Status — Real-time connectivity to Frigate, go2rtc, MQTT, AI providers
- Dark Mode — Full dark theme support
- Hot-Reload — Config changes apply in ~10 seconds without restart
Single config.yaml with environment variable substitution (${GEMINI_API_KEY}).
```yaml
frigate:
  host: "localhost"
  mqtt_host: "localhost"

go2rtc:
  host: "localhost"

cameras:
  frontdoor:
    enabled: true

conditions:
  min_score: 0.7
  cooldown_seconds: 60
  active_hours:
    mode: "sunset_sunrise"  # or "fixed" or "always"

ai:
  primary:
    provider: "gemini"
    model: "gemini-2.5-flash"
    api_key: "${GEMINI_API_KEY}"

tts:
  provider: "kokoro"
  fallback_chain: ["piper", "espeak"]

response_mode:
  name: "police_dispatch"
  dispatch:
    address: "123 Main Street"
    agency: "County Sheriff"

mqtt_publish:
  enabled: true
  topic_prefix: "voxwatch"
  announce_enabled: true
```

Active hours: Always, sunset-to-sunrise (solar calculation via astral), or fixed time window with automatic midnight crossing.
Hot-reload: Service polls config every 10 seconds. Changes apply without restart — in-flight detections continue on old config.
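Environment variable substitution of the `${GEMINI_API_KEY}` form can be sketched as a regex pass over the raw config text before it is parsed. This is a minimal sketch, not the loader VoxWatch ships: `substitute_env` is a hypothetical helper, and raising an error on an unset variable is an assumption (the real service might leave the placeholder or use an empty string).

```python
import os
import re
from typing import Mapping

# Matches ${NAME} where NAME is a conventional environment variable name.
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def substitute_env(text: str, env: Mapping[str, str] = os.environ) -> str:
    """Replace every ${NAME} in text with the value of NAME from env."""
    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in env:
            # Assumed behavior: fail loudly rather than send a blank API key.
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _VAR.sub(replace, text)
```

Running the substitution on the whole file before YAML parsing keeps secrets out of `config.yaml` while leaving the rest of the file untouched.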
```yaml
# docker-compose.yml (simplified)
services:
  voxwatch:
    image: voxwatch:latest
    network_mode: host
    volumes: [./config:/config, ./data:/data]
    mem_limit: 512m
    restart: unless-stopped

  voxwatch-dashboard:
    image: voxwatch-dashboard:latest
    network_mode: host  # Dashboard on port 33344
    volumes: [./config:/config, ./data:/data:ro]
    mem_limit: 256m
    restart: unless-stopped
```

- Docker image: 911MB (optimized from 1769MB, a 49% reduction)
- Network: Host mode for direct camera/MQTT/go2rtc access
- Dashboard is optional after setup — stop it to save resources, deterrent keeps running
- Data directory: status.json (real-time, 5s interval), events.jsonl (detection log), voxwatch.log
Core Service — Python 3.11, ~24k LOC
- MQTT listener for Frigate events + announce topic
- Four-stage async detection pipeline with concurrent warmup
- 7 TTS providers with automatic fallback chain
- Natural cadence speech system (phrase-level pauses + speed variation)
- Full radio dispatch audio composition (multi-segment, multi-voice, radio effects)
- Audio codec conversion via ffmpeg + go2rtc backchannel push
- MQTT event publishing for Home Assistant
- 10-second config hot-reload with environment variable substitution
Dashboard — React 18 + TypeScript + FastAPI, ~21k LOC
- Interactive setup wizard with camera auto-discovery
- Form-based config editor with live voice preview
- Camera ONVIF identification cross-referenced against compatibility database
- REST API with Bearer token auth, rate limiting, SSRF protection
Security: Camera name validation (strict allowlist pattern), API key authentication, per-camera rate limiting on audio push, input sanitization on all TTS inputs.
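An allowlist check for camera names might look like the sketch below. The exact pattern VoxWatch enforces is not stated here, so the character set and the 64-character limit are assumptions chosen for illustration.

```python
import re

# Hypothetical allowlist: letters, digits, underscore, hyphen; 1 to 64 characters.
CAMERA_NAME = re.compile(r"[A-Za-z0-9_-]{1,64}")


def is_valid_camera_name(name: str) -> bool:
    """Accept only names made entirely of allowlisted characters."""
    # fullmatch rejects anything with extra characters, including a trailing newline.
    return CAMERA_NAME.fullmatch(name) is not None
```

Validating against an allowlist, rather than stripping known-bad characters, means path separators, shell metacharacters, and whitespace are all rejected without having to enumerate them.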
| Mode | Config | Behavior |
|---|---|---|
| Always | mode: "always" | 24/7 active |
| Sunset to Sunrise | mode: "sunset_sunrise" | Solar calculation via astral library |
| Fixed Window | mode: "fixed" | Custom start/end times (handles midnight crossing) |
Per-camera schedules: Each camera can override the global active hours with its own schedule. Set a camera's schedule to always, a fixed time window, or sunset-to-sunrise, independently of every other camera. Cameras with no per-camera schedule fall back to the global setting. Configure per-camera schedules in the Detection tab of the Config editor, or in the camera detail panel.
You can also specify a city name (e.g. city: "Seattle") instead of explicit latitude/longitude for sunset/sunrise calculations. VoxWatch uses the astral library's built-in geocoder database for city name resolution.
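The midnight-crossing rule for fixed windows and the per-camera fallback can be sketched as two small functions. Both helpers are hypothetical and only illustrate the logic; treating the end time as exclusive is an assumption.

```python
from datetime import time
from typing import Optional


def in_fixed_window(now: time, start: time, end: time) -> bool:
    """Return True if now falls inside [start, end), allowing midnight crossing."""
    if start <= end:
        # Same-day window, e.g. 09:00 to 17:00.
        return start <= now < end
    # Window wraps past midnight, e.g. 22:00 to 06:00.
    return now >= start or now < end


def effective_mode(camera_mode: Optional[str], global_mode: str) -> str:
    """A camera without its own schedule falls back to the global mode."""
    return camera_mode if camera_mode is not None else global_mode
```

For a 22:00 to 06:00 window, 23:30 and 05:59 are inside and 12:00 is outside; a window whose start equals its end is treated as empty in this sketch.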
VoxWatch broadcasts one-way audio deterrents; it records no audio from the intruder.
- Two-party recording consent laws generally do NOT apply (one-directional broadcast)
- Property owners have the right to deter trespassers with reasonable measures
- Signage (e.g., "Audio Deterrent Active") strengthens your legal position
Consult a licensed attorney in your jurisdiction before deployment. See docs/LEGAL.md for guidance covering US, UK, EU (GDPR), and Australia.
We welcome contributions — especially camera compatibility reports, bug fixes, and performance improvements.
```bash
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
cd dashboard/frontend && npm install && npm run dev
# In another terminal:
cd dashboard/backend && uvicorn main:app --reload
```

See CONTRIBUTING.md for code style and PR guidelines.
Recently shipped: Camera zones (shared cooldown + speaker routing), per-camera schedules, persistent deterrence loop (Stage 3), per-camera audio output override, Home Assistant MQTT, natural cadence speech
In progress: Dynamic TTS library loading, custom voice models
Planned: SMS/Telegram notifications
VoxWatch is an early alpha. Here's what to expect:
Latency — There is a delay between detection and audio playback (typically 5-15 seconds for Stage 1, 30-60 seconds for the full dispatch sequence). This is inherent to the pipeline: Frigate detection + snapshot capture + AI analysis + TTS generation + audio push. We're actively working on reducing this. Local TTS providers (Kokoro, Piper) are significantly faster than cloud providers.
False Positives — Frigate may detect "persons" that aren't there (shadows, animals, reflections). VoxWatch uses AI validation to skip escalation when the AI can't identify anyone, but Stage 1 may still fire. Tune your Frigate min_score threshold and zone configuration to reduce false positives.
Camera Compatibility — Audio backchannel (pushing audio to the camera speaker) requires cameras that support two-way audio via ONVIF or RTSP backchannel. Tested primarily with Reolink cameras. See Supported Cameras.
Single Event at a Time — Each camera processes one detection at a time. If the same camera triggers again during an active event, it's queued or dropped depending on cooldown settings. Multiple cameras can fire simultaneously.
Cloud API Costs — If using cloud providers (Gemini, ElevenLabs, OpenAI), each detection event costs a small amount. A busy camera with frequent detections can add up. Use local providers (Kokoro, Piper, Ollama) for zero ongoing cost.
Rough Edges — The dashboard UI, dispatch pipeline, and voice management are functional but not polished. Config options may change between releases. The documentation is a work in progress.
If you find a bug or have a suggestion, please open an issue. Pull requests are welcome.
"What if cameras didn't just detect... but actually confronted?"
Everything here is built around that idea. If it ever becomes bloated, overcomplicated, or loses that core purpose — call it out.
Built using an AI-assisted workflow (primarily Claude) with a focus on making the codebase easy to read, fork, and extend. If you see something that could be better, I'd genuinely appreciate the feedback.
GNU General Public License v3.0 — Free for open-source and personal use. Commercial use in closed-source products requires a commercial license. Contact jason@voxwatch.dev.
If VoxWatch made your setup more powerful (or just more fun): https://buymeacoffee.com/badbread
Built with: Frigate NVR | go2rtc | Google Gemini | Kokoro TTS | Piper TTS | FastAPI | React | Tailwind CSS | Docker
Docs: Home Assistant | Architecture | Supported Cameras | Audio Research | Legal

