
VoxWatch


Early Alpha Release — VoxWatch is under active development and being released early for community feedback. Expect bugs, breaking changes, and rough edges. Issues and pull requests are very welcome. See Known Limitations below.

AI-powered security deterrent that makes your cameras talk back.

VoxWatch turns passive security cameras into active deterrents. When Frigate detects a person, VoxWatch instantly warns them over the camera speaker, then escalates with AI-generated descriptions of their appearance and behavior — all in real-time.

"All units, 10-97 at 482 Elm Street. One suspect, dark hoodie, approaching the front gate." [radio pause] "Copy dispatch. Unit 7 en route."

That's what an intruder hears. Not a beep. Not silence. A specific, real-time callout that makes it obvious someone is watching.


Quick Start

Grab the docker-compose.yml and deploy it however you normally run containers — Portainer, docker compose up -d, Dockge, whatever. Images are on GHCR so no building required.

Open http://your-host:33344 — the setup wizard auto-discovers Frigate, MQTT, and your cameras, then walks you through AI provider, TTS engine, and response mode selection. No config files to edit.

Prerequisites: Frigate NVR with MQTT + go2rtc + a camera with two-way audio.


VoxWatch Dashboard

Detection Pipeline


How It Works

Frigate NVR        MQTT         VoxWatch Service         MQTT        Home Assistant
   ┌──────┐        Event    ┌──────────────────┐       Events    ┌─────────────────┐
   │Detect│ ──────────────> │ 1. Initial       │ ──────────────> │ Lights, Locks,  │
   │Person│                 │    Response      │                 │ Notifications,  │
   └──────┘                 │                  │                 │ Automations     │
                            │ 2. Escalation    │                 └────────┬────────┘
                            │    (AI analysis) │  voxwatch/announce       │
                            │                  │ <───────────────────────┘
                            │ 3. Persistent    │  (TTS on camera speakers)
                            │    Deterrence    │
                            │                  │
                            │ 4. Resolution    │
                            └──────┬───────────┘
                                   │
                    ┌──────────────┴──────────────┐
                    v                             v
              Audio Pipeline              go2rtc Audio Push
              (TTS + Effects)             (RTSP Backchannel)
                    │                             │
                    └──────────────┬──────────────┘
                                   v
                            Camera Speaker

Four-Stage Escalating Deterrent

| Stage | Timing | What Happens |
|---|---|---|
| 1. Initial Response | 0-2 seconds | Pre-cached warning plays immediately. AI analysis starts in parallel. |
| 2. Escalation | 5-15 seconds | AI analyzes snapshots. Describes appearance, clothing, carried items, actions. The intruder hears themselves described in real-time. |
| 3. Persistent Deterrence | 30s+ loops | If person stays, VoxWatch keeps warning them with fresh AI descriptions every N seconds. Tone escalates with each iteration. Configurable max iterations. |
| 4. Resolution | After person leaves | Optional "all clear" message. Disabled by default. |

Each stage only fires if the person is still detected (Frigate re-check). AI adapts automatically for nightvision — no color descriptions from IR footage.

Home Assistant integration: Every stage publishes an MQTT event (voxwatch/events/stage) with the stage number, camera name, and AI description. Build HA automations that escalate with each stage — lights on at Stage 1, phone notification at Stage 2, sirens and door locks at Stage 3. See Home Assistant Integration for examples.
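
For example, a Home Assistant automation can key off the stage number carried in the event. The `voxwatch/events/stage` topic comes from this README, but the payload keys (`stage`, `camera`) and the notify service name are assumptions; check your broker (e.g. `mosquitto_sub -t 'voxwatch/#'`) for the actual field names before relying on them:

```yaml
# Hypothetical HA automation: phone notification when Stage 2 fires.
# Payload keys `stage`/`camera` and the notify target are illustrative.
automation:
  - alias: "VoxWatch Stage 2 escalation"
    trigger:
      - platform: mqtt
        topic: "voxwatch/events/stage"
    condition:
      - condition: template
        value_template: "{{ trigger.payload_json.stage == 2 }}"
    action:
      - service: notify.mobile_app_phone
        data:
          message: "Intruder at {{ trigger.payload_json.camera }}"
```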

Persistent Deterrence (optional): When enabled, Stage 3 loops -- each iteration waits a configurable delay, re-checks whether the person is still present, generates a fresh AI description, and escalates the tone. The loop ends when the person leaves or a configurable max iteration count is reached. Configure in the Pipeline tab under Persistent Deterrence.


Response Modes

Control not just what is said, but how it's said. 10 built-in response modes + custom mode.

Core Modes

| Mode | Style | Example |
|---|---|---|
| `police_dispatch` | Full radio dispatch with 10-codes, officer response, radio effects | "All units, 10-97 at 482 Elm. One suspect, dark hoodie..." |
| `live_operator` | Simulates a real person watching live | "I can see you moving to the left." |
| `private_security` | Corporate security firm, firm and professional | "You are on private property and under surveillance." |
| `recorded_evidence` | Cold, system-driven, forensic | "Subject recorded. Entry logged. Authorities notified." |
| `homeowner` | Personal, calm, direct | "Hey. I can see you. You need to leave." |
| `automated_surveillance` | Neutral AI monitoring voice | "Movement detected. Behavior flagged." |
| `standard` | Clear, authoritative default | "Attention. You are being recorded. Leave immediately." |

Advanced Modes

| Mode | Use Case |
|---|---|
| `guard_dog` | Imply dog threat. "They haven't been fed yet." |
| `neighborhood_watch` | Community pressure. "Neighbors have been alerted." |
| `silent_pressure` | Delayed, tension-building response |

Custom Mode

Write your own persona via response_mode.custom_prompt in config. Full control over tone, vocabulary, and escalation style.
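
A minimal sketch of what that could look like in `config.yaml`. Only the `response_mode.custom_prompt` key comes from this README; the mode name `"custom"` and the prompt text are illustrative:

```yaml
response_mode:
  name: "custom"              # illustrative; verify the exact mode name
  custom_prompt: |
    You are a bored night-shift security guard who has seen it all.
    Describe the person flatly, then tell them to leave.
    Grow more irritated with each escalation iteration.
```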


Police Dispatch: The Crown Jewel

The flagship feature. Simulates a complete police radio transmission with authentic radio effects, 10-codes, and an officer response.

Full sequence:

  1. Channel Intro — Clean voice: "Connecting to County Sheriff dispatch frequency..." + radio tuning static + tail end of another call
  2. Main Dispatch — [beep] Female dispatcher (radio-processed): "All units, 10-97 at 482 Elm Street. One suspect on property. Subject wearing dark hoodie, estimated six feet tall."
  3. Officer Response — [beep] Male officer (different voice, radio-processed): "Copy dispatch. Unit 7 en route. ETA two minutes."

All customizable — address, agency, callsign, officer voice, radio intensity, channel intro toggle. ElevenLabs voices sound the most realistic for dispatch — the voice quality makes a big difference when you're simulating a real radio call.

Radio Effect Presets

| Preset | Sound | Use Case |
|---|---|---|
| low | Natural, conversational | Casual radio chatter |
| medium | Standard police radio (default) | Realistic dispatch |
| high | Gritty scanner sound | Maximum intimidation |

Fine-grained control: bandpass frequency, compression, noise level, squelch toggle.
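
A hypothetical shape for those overrides in config. The four controls are listed in this README, but the key names and nesting below are illustrative stand-ins, not confirmed config keys; verify against the Config editor:

```yaml
response_mode:
  dispatch:
    radio_effect:
      preset: "medium"           # low | medium | high
      bandpass_hz: [300, 3400]   # key names illustrative
      compression: 0.7
      noise_level: 0.2
      squelch: true
```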


Camera Zones

Group cameras by physical area so that one detection triggers one speaker and all cameras in the zone share a single cooldown timer.

Useful when multiple cameras cover the same area (e.g. two angles on the front gate) -- without zones, each camera fires independently and the intruder hears duplicate audio.

zones:
  front_yard:
    cameras: [frontdoor, driveway]
    speaker: frontdoor        # audio always plays on this camera
    cooldown_seconds: 90      # optional per-zone cooldown override

  back_yard:
    cameras: [backgate, patio]
    speaker: backgate

| Behavior | Detail |
|---|---|
| Shared cooldown | Keyed by zone name (`zone:front_yard`). First camera to fire blocks the rest. |
| Speaker routing | Audio pushed to the zone speaker stream, not the triggering camera stream. |
| Zone cooldown | Overrides global cooldown when set; falls back to global when omitted. |
| Per-camera override | A camera not in any zone still supports `audio_output` for individual speaker routing. |
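
The cooldown keying above can be modeled in a few lines of Python. This is a simplified sketch, not VoxWatch's actual code, and the `camera:` key prefix for un-zoned cameras is an assumption (the README only documents the `zone:` prefix):

```python
ZONES = {
    "front_yard": {"cameras": ["frontdoor", "driveway"],
                   "speaker": "frontdoor", "cooldown_seconds": 90},
}
GLOBAL_COOLDOWN = 60
_last_fired: dict[str, float] = {}

def resolve(camera: str) -> tuple[str, str, int]:
    """Return (cooldown_key, speaker, cooldown) for a triggering camera."""
    for name, zone in ZONES.items():
        if camera in zone["cameras"]:
            return (f"zone:{name}", zone["speaker"],
                    zone.get("cooldown_seconds", GLOBAL_COOLDOWN))
    # Un-zoned cameras cool down independently and play on themselves.
    return (f"camera:{camera}", camera, GLOBAL_COOLDOWN)

def should_fire(camera: str, now: float) -> tuple[bool, str]:
    """Decide whether a detection may trigger audio, and on which speaker."""
    key, speaker, cooldown = resolve(camera)
    if now - _last_fired.get(key, float("-inf")) < cooldown:
        return False, speaker    # zone (or camera) is still cooling down
    _last_fired[key] = now
    return True, speaker
```

With this model, `frontdoor` firing at t=0 suppresses `driveway` at t=5 (same `zone:front_yard` key), which is exactly the duplicate-audio problem zones solve.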

Configure zones in the Camera Zones tab of the Config editor.


AI Vision Providers

7 providers with automatic fallback chain. Primary fails? Secondary kicks in seamlessly.

| Provider | Latency | Cost/Detection | Video Support | Local/Cloud |
|---|---|---|---|---|
| Google Gemini Flash | 2-5s | ~$0.001 | Yes (native) | Cloud |
| OpenAI GPT-4o | 3-5s | ~$0.005-0.012 | Snapshots | Cloud |
| Anthropic Claude Haiku | 3-5s | ~$0.003 | Snapshots | Cloud |
| xAI Grok | 3-5s | ~$0.005 | Snapshots | Cloud |
| Ollama (LLaVA) | 5-15s | Free | Snapshots | Local |
| Custom OpenAI-compatible | Varies | Varies | Varies | Either |
| Fallback error handling | <1s | Free | N/A | Local |
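
The fallback chain is easy to picture as a loop over providers. This is a simplified sketch, not the actual implementation; real providers are API clients, and the stub functions stand in for them:

```python
def describe(snapshot: bytes, providers: list) -> str:
    """Try each provider in order; fall back to a canned line if all fail."""
    for provider in providers:
        try:
            return provider(snapshot)
        except Exception:
            continue  # provider down or rate-limited: try the next one
    # Static fallback: no AI description, but Stage 2 still says something.
    return "Person detected on the property."

# Demo stubs: the primary fails, the secondary answers.
def primary(_snapshot):
    raise RuntimeError("rate limited")

def secondary(_snapshot):
    return "one person, dark hoodie, near the gate"
```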

Nightvision-aware: Automatically adapts prompts for IR footage — focuses on silhouette, build, and clothing type instead of unreliable colors.


TTS Providers

7 text-to-speech engines with automatic fallback chain.

| Provider | Quality | Latency | Cost | Local/Cloud |
|---|---|---|---|---|
| Kokoro-82M | Near-human | 1-3s | Free | Local |
| Piper | Natural | <1s | Free | Local (bundled) |
| ElevenLabs | Highest | 1-3s | $5-99/mo | Cloud |
| Cartesia Sonic | Excellent | 0.5-1s | Paid | Cloud |
| Amazon Polly | Good | 1-3s | $0.02/1k chars | Cloud |
| OpenAI TTS | Good | 1-3s | $0.015/1k chars | Cloud |
| espeak-ng | Robotic | <1s | Free | Local (always available) |

Natural cadence speech: AI responses are broken into phrases with human-like pauses between thoughts, not read as a single flat script. Punctuation-aware timing and optional per-phrase speed variation across all providers.
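
A rough model of punctuation-aware phrase splitting. The pause durations are illustrative, not VoxWatch's actual timings:

```python
import re

# Pause length keyed by the punctuation ending a phrase (values illustrative).
PAUSES = {".": 0.6, "!": 0.6, "?": 0.6, ",": 0.3, ";": 0.4}

def phrase_plan(text: str) -> list[tuple[str, float]]:
    """Split a response into phrases, each paired with a trailing pause."""
    plan = []
    for part in re.findall(r"[^.!?,;]+[.!?,;]?", text):
        part = part.strip()
        if not part:
            continue
        pause = PAUSES.get(part[-1], 0.5)  # default pause if no punctuation
        plan.append((part, pause))
    return plan
```

Each `(phrase, pause)` pair would then be synthesized and played with silence inserted between them, rather than rendering the whole response as one flat utterance.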


Home Assistant Integration

Two-way MQTT integration — no custom components needed.

| Direction | Topic | Purpose |
|---|---|---|
| VoxWatch → HA | `voxwatch/events/detection` | Person detected — trigger lights, notifications |
| VoxWatch → HA | `voxwatch/events/stage` | Stage fired — escalating automations |
| VoxWatch → HA | `voxwatch/events/ended` | Detection over — restore normal state |
| VoxWatch → HA | `voxwatch/status` | Online/offline (LWT) — availability sensor |
| HA → VoxWatch | `voxwatch/announce` | Play TTS on camera speakers on demand |

TTS Announcements from HA

Use VoxWatch as a general-purpose announcement system for any camera with a speaker:

automation:
  - alias: "Doorbell announcement"
    trigger:
      - platform: state
        entity_id: binary_sensor.doorbell
        to: "on"
    action:
      - service: mqtt.publish
        data:
          topic: "voxwatch/announce"
          payload: '{"camera": "driveway", "message": "Someone is at the front door.", "tone": "short"}'

Supports: camera, message, voice, provider, speed, tone. Also available via REST at POST /api/audio/announce.
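
Since the field list is fixed, the payload is easy to build and validate programmatically before publishing. The supported field names come from this README; the helper itself is an illustrative sketch, not part of VoxWatch:

```python
import json

SUPPORTED_KEYS = {"camera", "message", "voice", "provider", "speed", "tone"}

def announce_payload(camera: str, message: str, **opts) -> str:
    """Build a voxwatch/announce JSON payload, rejecting unknown fields."""
    unknown = set(opts) - SUPPORTED_KEYS
    if unknown:
        raise ValueError(f"unsupported fields: {sorted(unknown)}")
    return json.dumps({"camera": camera, "message": message, **opts})
```

Publish the result to `voxwatch/announce` with any MQTT client, or send the same JSON to `POST /api/audio/announce`.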

Full docs with automation examples: docs/HOME_ASSISTANT.md


Camera Compatibility

VoxWatch pushes audio through camera backchannels via go2rtc. One-way outbound only — no recording.

| Camera | Codec | Speaker | Status |
|---|---|---|---|
| Reolink CX410 | PCMU/8000 | Built-in | Working |
| Reolink CX420 | PCMU/8000 | Built-in | Working |
| Reolink E1 Zoom | PCMU/8000 | Built-in | Working |
| Dahua IPC-Color4K-T180 | PCMA/8000 | Built-in | Working |
| Dahua IPC-T54IR | PCMA/8000 | RCA out | Compatible |
| Dahua IPC-B54IR | PCMA/8000 | RCA out | Compatible |

Per-camera codec override supported. The setup wizard auto-detects backchannel codec.

Latency: Stage 1 in 0-2s (pre-cached), Stage 2 in 5-8s (AI hidden behind Stage 1).


Web Dashboard

Full-featured React + TypeScript + Tailwind dashboard at http://your-host:33344.

  • Setup Wizard — 5-step guided flow: discover cameras, detect codecs, test audio, configure, save
  • Camera Management — Backchannel status, last detection timestamps, ONVIF identification
  • Configuration Editor — Form-based with dropdowns, connection testing, in-browser voice preview
  • Audio Test Player — Push test audio to any camera speaker (rate-limited, mobile-friendly)
  • System Status — Real-time connectivity to Frigate, go2rtc, MQTT, AI providers
  • Dark Mode — Full dark theme support
  • Hot-Reload — Config changes apply in ~10 seconds without restart

Configuration

Single config.yaml with environment variable substitution (${GEMINI_API_KEY}).

frigate:
  host: "localhost"
  mqtt_host: "localhost"

go2rtc:
  host: "localhost"

cameras:
  frontdoor:
    enabled: true

conditions:
  min_score: 0.7
  cooldown_seconds: 60
  active_hours:
    mode: "sunset_sunrise"    # or "fixed" or "always"

ai:
  primary:
    provider: "gemini"
    model: "gemini-2.5-flash"
    api_key: "${GEMINI_API_KEY}"

tts:
  provider: "kokoro"
  fallback_chain: ["piper", "espeak"]

response_mode:
  name: "police_dispatch"
  dispatch:
    address: "123 Main Street"
    agency: "County Sheriff"

mqtt_publish:
  enabled: true
  topic_prefix: "voxwatch"
  announce_enabled: true

Active hours: Always, sunset-to-sunrise (solar calculation via astral), or fixed time window with automatic midnight crossing.
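
The midnight-crossing logic for fixed windows reduces to one comparison flip. A sketch of the idea, not the actual implementation:

```python
from datetime import time

def in_window(now: time, start: time, end: time) -> bool:
    """True if `now` falls inside [start, end), handling windows that cross midnight."""
    if start <= end:                  # same-day window, e.g. 09:00-17:00
        return start <= now < end
    return now >= start or now < end  # crosses midnight, e.g. 22:00-06:00
```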

Hot-reload: Service polls config every 10 seconds. Changes apply without restart — in-flight detections continue on old config.


Deployment

# docker-compose.yml (simplified)
services:
  voxwatch:
    image: voxwatch:latest
    network_mode: host
    volumes: [./config:/config, ./data:/data]
    mem_limit: 512m
    restart: unless-stopped

  voxwatch-dashboard:
    image: voxwatch-dashboard:latest
    network_mode: host      # Dashboard on port 33344
    volumes: [./config:/config, ./data:/data:ro]
    mem_limit: 256m
    restart: unless-stopped

  • Docker image: 911MB (optimized from 1769MB — 49% reduction)
  • Network: Host mode for direct camera/MQTT/go2rtc access
  • Dashboard is optional after setup — stop it to save resources, deterrent keeps running
  • Data directory: status.json (real-time, 5s interval), events.jsonl (detection log), voxwatch.log

Architecture

Core Service — Python 3.11, ~24k LOC

  • MQTT listener for Frigate events + announce topic
  • Three-stage async detection pipeline with concurrent warmup
  • 7 TTS providers with automatic fallback chain
  • Natural cadence speech system (phrase-level pauses + speed variation)
  • Full radio dispatch audio composition (multi-segment, multi-voice, radio effects)
  • Audio codec conversion via ffmpeg + go2rtc backchannel push
  • MQTT event publishing for Home Assistant
  • 10-second config hot-reload with environment variable substitution

Dashboard — React 18 + TypeScript + FastAPI, ~21k LOC

  • Interactive setup wizard with camera auto-discovery
  • Form-based config editor with live voice preview
  • Camera ONVIF identification cross-referenced against compatibility database
  • REST API with Bearer token auth, rate limiting, SSRF protection

Security: Camera name validation (strict allowlist pattern), API key authentication, per-camera rate limiting on audio push, input sanitization on all TTS inputs.


Active Hours & Scheduling

| Mode | Config | Behavior |
|---|---|---|
| Always | `mode: "always"` | 24/7 active |
| Sunset to Sunrise | `mode: "sunset_sunrise"` | Solar calculation via astral library |
| Fixed Window | `mode: "fixed"` | Custom start/end times (handles midnight crossing) |

Per-camera schedules: Each camera can override the global active hours with its own schedule. Set a camera's schedule to always, a fixed time window, or sunset-to-sunrise -- independently of every other camera. Cameras with no per-camera schedule fall back to the global setting. Configure per-camera schedules in the Detection tab of the Config editor, or in the camera detail panel.
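
The fallback rule is simple to model: use the per-camera schedule if present, else the global one. The config shape below is illustrative, not VoxWatch's actual schema:

```python
GLOBAL_SCHEDULE = {"mode": "sunset_sunrise"}

CAMERAS = {
    "frontdoor": {"schedule": {"mode": "always"}},  # has its own schedule
    "driveway": {},                                 # falls back to global
}

def schedule_for(camera: str) -> dict:
    """Per-camera schedule wins; cameras without one use the global setting."""
    return CAMERAS.get(camera, {}).get("schedule") or GLOBAL_SCHEDULE
```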

You can also specify a city name (e.g. city: "Seattle") instead of explicit latitude/longitude for sunset/sunrise calculations. VoxWatch uses the astral library's built-in geocoder database for city name resolution.


Legal Considerations

VoxWatch broadcasts one-way audio deterrents; it does not record audio from the intruder.

  • Two-party recording consent laws generally do NOT apply (one-directional broadcast)
  • Property owners have the right to deter trespassers with reasonable measures
  • Signage (e.g., "Audio Deterrent Active") strengthens your legal position

Consult a licensed attorney in your jurisdiction before deployment. See docs/LEGAL.md for guidance covering US, UK, EU (GDPR), and Australia.


Contributing

We welcome contributions — especially camera compatibility reports, bug fixes, and performance improvements.

python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

cd dashboard/frontend && npm install && npm run dev
# In another terminal:
cd dashboard/backend && uvicorn main:app --reload

See CONTRIBUTING.md for code style and PR guidelines.


Roadmap

Recently shipped: Camera zones (shared cooldown + speaker routing), per-camera schedules, persistent deterrence loop (Stage 3), per-camera audio output override, Home Assistant MQTT, natural cadence speech

In progress: Dynamic TTS library loading, custom voice models

Planned: SMS/Telegram notifications


Known Limitations

VoxWatch is an early alpha. Here's what to expect:

Latency — There is a delay between detection and audio playback (typically 5-15 seconds for Stage 1, 30-60 seconds for the full dispatch sequence). This is inherent to the pipeline: Frigate detection + snapshot capture + AI analysis + TTS generation + audio push. We're actively working on reducing this. Local TTS providers (Kokoro, Piper) are significantly faster than cloud providers.

False Positives — Frigate may detect "persons" that aren't there (shadows, animals, reflections). VoxWatch uses AI validation to skip escalation when the AI can't identify anyone, but Stage 1 may still fire. Tune your Frigate min_score threshold and zone configuration to reduce false positives.

Camera Compatibility — Audio backchannel (pushing audio to the camera speaker) requires cameras that support two-way audio via ONVIF or RTSP backchannel. Tested primarily with Reolink cameras. See Supported Cameras.

Single Event at a Time — Each camera processes one detection at a time. If the same camera triggers again during an active event, it's queued or dropped depending on cooldown settings. Multiple cameras can fire simultaneously.

Cloud API Costs — If using cloud providers (Gemini, ElevenLabs, OpenAI), each detection event costs a small amount. A busy camera with frequent detections can add up. Use local providers (Kokoro, Piper, Ollama) for zero ongoing cost.

Rough Edges — The dashboard UI, dispatch pipeline, and voice management are functional but not polished. Config options may change between releases. The documentation is a work in progress.

If you find a bug or have a suggestion, please open an issue. Pull requests are welcome.


Why This Exists

"What if cameras didn't just detect... but actually confronted?"

Everything here is built around that idea. If it ever becomes bloated, overcomplicated, or loses that core purpose — call it out.

Built using an AI-assisted workflow (primarily Claude) with a focus on making the codebase easy to read, fork, and extend. If you see something that could be better, I'd genuinely appreciate the feedback.

https://it.badbread.com


License

GNU General Public License v3.0 — Free for open-source and personal use. Commercial use in closed-source products requires a commercial license. Contact jason@voxwatch.dev.


Support VoxWatch

If VoxWatch made your setup more powerful (or just more fun): https://buymeacoffee.com/badbread


Built with: Frigate NVR | go2rtc | Google Gemini | Kokoro TTS | Piper TTS | FastAPI | React | Tailwind CSS | Docker

Docs: Home Assistant | Architecture | Supported Cameras | Audio Research | Legal
