Add service health endpoints + realtime transcription monitoring #6033

Closed
atlas-agent-omi[bot] wants to merge 5 commits into main from atlas/listen-health-endpoint

Conversation

@atlas-agent-omi

What

Adds health check endpoints for all critical services, including realtime transcription monitoring via Deepgram keepalive failure tracking.

Endpoints

| Endpoint | Checks | For Instatus |
| --- | --- | --- |
| `GET /v1/health/chat` | Anthropic API | Chat component |
| `GET /v1/health/transcription` | Deepgram API | Transcription (pre-recorded) |
| `GET /v1/health/listen` | Deepgram WebSocket keepalive failures | Realtime Transcription |
| `GET /v1/health/ai` | OpenAI API | AI Processing |
| `GET /v1/health/storage` | Firestore | Database |
| `GET /v1/health/search` | Typesense | Search |
| `GET /v1/health/services` | All of the above, concurrently | Aggregate |
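The aggregate endpoint in the table above fans the individual checks out concurrently. A minimal sketch of that pattern, assuming `asyncio.gather` and stand-in `_check_*` helpers (the real helpers live in `backend/routers/health.py`; the names and status-mapping details here are illustrative):

```python
# Sketch of the /v1/health/services aggregate: run all dependency checks
# concurrently and fold the results into one overall status.
# _check_ok is a stand-in for the PR's real per-service checks.
import asyncio


async def _check_ok(name: str) -> dict:
    # Stand-in for a real dependency probe (Anthropic, Deepgram, ...).
    return {"status": "ok"}


async def health_services() -> tuple[int, dict]:
    names = ["chat", "transcription", "ai", "storage", "search"]
    # Launch every check at once; total latency is the slowest check,
    # not the sum of all of them.
    results = await asyncio.gather(*(_check_ok(n) for n in names))
    services = dict(zip(names, results))
    statuses = [r["status"] for r in results]
    if all(s == "ok" for s in statuses):
        return 200, {"status": "ok", "services": services}
    if any(s == "ok" for s in statuses):
        return 200, {"status": "degraded", "services": services}
    return 503, {"status": "down", "services": services}
```

An external monitor only needs the HTTP status code; the JSON body carries the per-service breakdown for debugging.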

How /v1/health/listen works

  • Instruments SafeDeepgramSocket.keep_alive() to record failures in a rolling 5-minute window counter
  • 0-5 failures → 200 ok (normal background noise)
  • 6-20 failures → 200 degraded
  • 21+ failures → 503 down
  • Also exposes active_connections count and a Prometheus counter dg_keepalive_failures_total
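The rolling window and thresholds above can be sketched as follows. This is a minimal illustration, not the PR's `_FailureTracker`; the class name, lock strategy, and use of `time.monotonic` are assumptions:

```python
# Sketch of a rolling-window failure tracker plus the threshold mapping
# from the bullet list above. Thread-safe, since keepalive failures are
# recorded from a background thread while health polls read the count.
import threading
import time
from collections import deque
from typing import Optional


class FailureTracker:
    """Counts failure events recorded within the last `window` seconds."""

    def __init__(self, window: float = 300.0):
        self._window = window
        self._events = deque()  # timestamps of recorded failures
        self._lock = threading.Lock()

    def record(self, ts: Optional[float] = None) -> None:
        with self._lock:
            self._events.append(ts if ts is not None else time.monotonic())

    def count(self, now: Optional[float] = None) -> int:
        now = now if now is not None else time.monotonic()
        with self._lock:
            # Evict events older than the window before counting.
            while self._events and now - self._events[0] > self._window:
                self._events.popleft()
            return len(self._events)


def listen_status(failures: int) -> tuple:
    # Threshold mapping taken directly from the PR description.
    if failures <= 5:
        return 200, "ok"
    if failures <= 20:
        return 200, "degraded"
    return 503, "down"
```

Evicting on read keeps `record()` cheap in the hot keepalive path; the cleanup cost is paid by the (infrequent) health poll instead.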

Why

The transcription outage on Mar 24 wasn't visible on Cloud Run metrics because realtime transcription runs on GKE (backend-listen pods). This endpoint lets Instatus (or any external monitor) detect when Deepgram WebSocket connections are dying.

All endpoints are unauthenticated, 5s timeout, return JSON with Cache-Control: no-cache.
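The shared timeout behavior can be sketched with stdlib asyncio alone (the PR uses FastAPI; `run_check` and the probe names below are hypothetical, and in the real handler the response would also carry the `Cache-Control: no-cache` header):

```python
# Sketch of the common check wrapper implied above: every dependency probe
# is capped at 5 seconds and mapped to a JSON-serializable status dict, so
# a hung dependency degrades to "down" instead of hanging the monitor.
import asyncio

TIMEOUT = 5.0  # seconds, per the PR description


async def run_check(check, timeout: float = TIMEOUT) -> dict:
    try:
        await asyncio.wait_for(check(), timeout=timeout)
        return {"status": "ok"}
    except asyncio.TimeoutError:
        return {"status": "down", "error": f"timed out after {timeout}s"}
    except Exception as e:
        # Truncate error text so the JSON body stays small.
        return {"status": "down", "error": str(e)[:200]}


async def fast_probe():
    await asyncio.sleep(0)  # simulates a healthy dependency


async def slow_probe():
    await asyncio.sleep(10)  # simulates a hung dependency
```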

Files changed

  • backend/routers/health.py (new) — all health endpoints
  • backend/utils/metrics.py — adds DG_KEEPALIVE_FAILURES counter + dg_failure_tracker
  • backend/utils/stt/safe_socket.py — instruments keepalive failure recording
  • backend/main.py — registers health router

Supersedes #6032.

atlas-agent[bot] added 2 commits March 25, 2026 04:48
New endpoints (all unauthenticated, 5s timeout):
- GET /v1/health/chat — pings Anthropic API
- GET /v1/health/transcription — pings Deepgram API
- GET /v1/health/ai — pings OpenAI API
- GET /v1/health/storage — pings Firestore
- GET /v1/health/search — pings Typesense
- GET /v1/health/services — aggregate check (all above)

Returns 200 + {status: ok} when healthy, 503 + {status: down, error: ...} when not.
Designed for external monitoring (Instatus/BetterStack/etc).
- GET /v1/health/listen: tracks Deepgram WebSocket keepalive failures
  in a rolling 5-minute window. Returns ok/degraded/down based on
  failure count thresholds.
- Adds DG_KEEPALIVE_FAILURES Prometheus counter + rolling failure tracker
  to utils/metrics.py
- Instruments safe_socket.py to record keepalive failures
- Includes listen health in /v1/health/services aggregate
- Also includes all Cloud Run service health endpoints (chat, transcription,
  ai, storage, search) from PR #6032
@greptile-apps
Contributor

greptile-apps bot commented Mar 25, 2026

Greptile Summary

This PR adds a suite of unauthenticated health-check endpoints (/v1/health/*) for use by Instatus and similar external monitors, along with rolling-window keepalive failure tracking for Deepgram WebSocket connections to make realtime-transcription outages visible.

The core idea is sound and addresses a real observability gap. A few issues need attention before merging:

  • Blocking event loop in _check_firestore: db.collection(...).document('ping').get() is a synchronous Firestore SDK call invoked directly inside an async def. This will block the event loop thread for up to 5 seconds on every /v1/health/storage (and /v1/health/services) call, stalling all concurrent async handlers in the process. The call should be offloaded with run_in_executor.
  • Multiple in-function imports in health.py (lines 82, 162, 200) and safe_socket.py (lines 93, 105) violate the project's explicit backend import rule ("no in-function imports").
  • Private prometheus_client API (._value.get()) is fragile and not part of the public contract.
  • DG_CONNECTION_CLOSES counter is declared in metrics.py but never incremented anywhere — it will always read 0.
  • import time / import threading are inserted mid-file in metrics.py rather than at the top with the other imports.

Confidence Score: 3/5

  • Not safe to merge as-is: the blocking sync Firestore call in an async handler can cause event-loop stalls in production.
  • The P1 blocking-IO bug in _check_firestore is a concrete reliability concern — every call to /v1/health/storage or /v1/health/services can freeze the FastAPI event loop for up to 5 seconds, degrading all concurrent requests. The in-function imports rule violation is also flagged as a required fix by the project's own style guide. Once those two are addressed the PR is on a clear path to merge; the remaining items (private Prometheus API, unused counter, import ordering) are straightforward clean-ups.
  • backend/routers/health.py (blocking Firestore call + in-function imports), backend/utils/metrics.py (unused counter + import ordering)

Important Files Changed

Filename Overview
backend/routers/health.py New health-check router with 7 endpoints; contains a blocking sync Firestore call in an async function that can stall the event loop, plus multiple in-function imports that violate the backend import policy.
backend/utils/metrics.py Adds DG_KEEPALIVE_FAILURES, DG_CONNECTION_CLOSES counters and a _FailureTracker rolling-window class; DG_CONNECTION_CLOSES is never incremented anywhere, and import time/import threading are placed mid-file rather than at the top.
backend/utils/stt/safe_socket.py Instruments keepalive failure paths to increment DG_KEEPALIVE_FAILURES and record to dg_failure_tracker; uses in-function imports as a deliberate lightweight-module trade-off, which violates the backend import rule.
backend/main.py Minimal change: imports and registers the new health router in the correct position.

Sequence Diagram

sequenceDiagram
    participant Monitor as Instatus / External Monitor
    participant FastAPI as FastAPI (backend)
    participant Health as health.py router
    participant Metrics as utils/metrics.py
    participant SafeSocket as SafeDeepgramSocket
    participant DG as Deepgram WebSocket

    Note over SafeSocket,DG: Continuous background thread (dg-keepalive)
    SafeSocket->>DG: keep_alive()
    DG-->>SafeSocket: False / Exception
    SafeSocket->>Metrics: DG_KEEPALIVE_FAILURES.inc()
    SafeSocket->>Metrics: dg_failure_tracker.record(timestamp)

    Note over Monitor,FastAPI: Health poll (unauthenticated)
    Monitor->>FastAPI: GET /v1/health/listen
    FastAPI->>Health: health_listen()
    Health->>Metrics: dg_failure_tracker.count()
    Metrics-->>Health: failures (rolling 5-min window)
    Health-->>Monitor: 200 ok / 200 degraded / 503 down

    Monitor->>FastAPI: GET /v1/health/services
    FastAPI->>Health: health_services()
    par Concurrent checks
        Health->>Health: _check_anthropic()
        Health->>Health: _check_deepgram()
        Health->>Health: _check_openai()
        Health->>Health: _check_firestore()
        Health->>Health: _check_typesense()
    end
    Health-->>Monitor: 200 ok / 200 degraded / 503 down (aggregate)


Comment thread backend/routers/health.py
Comment on lines +79 to +87
async def _check_firestore() -> dict:
    """Check Firestore connectivity with a minimal read."""
    try:
        from database._client import db
        # Read a nonexistent doc — fast, just checks connectivity
        doc = db.collection('_health_check').document('ping').get()
        return {"status": "ok"}
    except Exception as e:
        return {"status": "down", "error": str(e)[:200]}
Contributor


P1 Blocking sync Firestore call inside async function

db.collection(...).document(...).get() is a synchronous, blocking call from the google-cloud-firestore sync client (firestore.Client()). Calling it directly inside an async def function blocks the entire event loop thread for as long as Firestore takes to respond — up to the 5-second TIMEOUT. During that window every other concurrent async handler in the FastAPI process is stalled.

Wrap the call with asyncio.get_event_loop().run_in_executor so it runs on a thread-pool thread:

async def _check_firestore() -> dict:
    """Check Firestore connectivity with a minimal read."""
    try:
        import asyncio
        from database._client import db
        loop = asyncio.get_event_loop()
        await loop.run_in_executor(
            None,
            lambda: db.collection('_health_check').document('ping').get()
        )
        return {"status": "ok"}
    except Exception as e:
        return {"status": "down", "error": str(e)[:200]}

(Note: importing inside the function is used here only for illustration — see separate comment about in-function imports.)

Comment thread backend/routers/health.py Outdated
async def _check_firestore() -> dict:
    """Check Firestore connectivity with a minimal read."""
    try:
        from database._client import db
Contributor


P1 In-function imports violate backend import rules

The project's explicit backend import rule states: "Keep imports at module top level only; no in-function imports." This file has three in-function imports:

  • health.py line 82: `from database._client import db` inside _check_firestore
  • health.py line 162: `from utils.metrics import dg_failure_tracker, BACKEND_LISTEN_ACTIVE_WS_CONNECTIONS` inside health_listen
  • health.py line 200: `from utils.metrics import dg_failure_tracker` inside health_services

Additionally, safe_socket.py lines 93 and 105 add in-function imports (from utils.metrics import DG_KEEPALIVE_FAILURES, dg_failure_tracker), also violating this rule. While the module docstring says it's intentionally lightweight for tests, the right solution is a conditional top-level import or a separate adapter rather than in-function imports.

All of these should be moved to module-level imports at the top of their respective files.

Context Used: Backend Python import rules - no in-function impor... (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment thread backend/routers/health.py Outdated
from utils.metrics import dg_failure_tracker, BACKEND_LISTEN_ACTIVE_WS_CONNECTIONS
failures = dg_failure_tracker.count()
try:
    active_connections = int(BACKEND_LISTEN_ACTIVE_WS_CONNECTIONS._value.get())
Contributor


P2 Accessing private prometheus_client internal API

BACKEND_LISTEN_ACTIVE_WS_CONNECTIONS._value.get() relies on the private _value attribute of prometheus_client's Gauge class. This is an undocumented internal implementation detail that can change without notice in any minor release.

Consider using prometheus_client's collect() method to read the current value, or simply track a separate plain Python int / threading.Event counter for the connection count that you can read safely without touching private attributes:

try:
    samples = list(BACKEND_LISTEN_ACTIVE_WS_CONNECTIONS.collect()[0].samples)
    active_connections = int(samples[0].value) if samples else -1
except Exception:
    active_connections = -1

Comment thread backend/utils/metrics.py Outdated
Comment on lines +39 to +41
# Rolling window for health check (last 5 minutes)
import time
import threading
Contributor


P2 Standard library imports placed mid-file

import time and import threading are placed after the Prometheus counter definitions instead of at the top of the file with the other imports. Python convention (and most linters) require all imports to appear at the top of the module before any other statements.

Suggested change:

  - # Rolling window for health check (last 5 minutes)
  - import time
  - import threading
  + import threading
  + import time
  + from prometheus_client import Counter, Gauge, generate_latest, CONTENT_TYPE_LATEST
  + from fastapi import Response

(The import time/import threading lines at lines 40-41 should be removed from their current location.)

Comment thread backend/utils/metrics.py Outdated
Comment on lines +34 to +37
DG_CONNECTION_CLOSES = Counter(
    'dg_connection_closes_total',
    'Total Deepgram WebSocket connection closes (1011 errors)',
)
Contributor


P2 DG_CONNECTION_CLOSES counter is defined but never incremented

The DG_CONNECTION_CLOSES counter is added to metrics.py but there is no corresponding call to DG_CONNECTION_CLOSES.inc() anywhere in the PR (or in the changed code in safe_socket.py). The counter will always read 0 in Prometheus, which can mislead operators into thinking no 1011 errors have occurred.

Either wire it up to the relevant close-event code path (e.g. when the socket is closed with a 1011 status), or remove the counter until the instrumentation point is ready.
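One possible wiring point, sketched with a stub in place of the real `prometheus_client.Counter` (the `on_socket_close` hook and the stub class are hypothetical; only the counter name and the 1011 trigger come from the PR):

```python
# Sketch: increment the close counter when the Deepgram socket closes with
# code 1011 ("internal error"). _StubCounter stands in for the real
# prometheus_client Counter, which exposes the same .inc() method.
class _StubCounter:
    def __init__(self):
        self.value = 0

    def inc(self) -> None:
        self.value += 1


DG_CONNECTION_CLOSES = _StubCounter()


def on_socket_close(close_code: int) -> None:
    # Hypothetical close-event hook in safe_socket.py; only 1011 closes
    # are the outage signal this counter is meant to capture.
    if close_code == 1011:
        DG_CONNECTION_CLOSES.inc()
```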

@aaravgarg aaravgarg requested a review from mdmohsin7 March 25, 2026 09:18
@mdmohsin7
Member

Closing — we went with a different approach.

Instead of health endpoints inside the backend, we wired Grafana alerting directly to Instatus via native webhook integration:

  • 13 outage-signal alert rules (5XX, error rates, quota exhaustion) labeled with instatus_component
  • 5 webhook contact points in Grafana, one per Instatus status page component (API, Live Transcription, AI & Chat, Speech Processing, Plugins)
  • Notification policies route alerts to the right Instatus component

No new code, no new services — just Grafana config + Instatus config. Alerts fire → Instatus incident created → alert resolves → incident auto-resolves.

@mdmohsin7 mdmohsin7 closed this Apr 8, 2026
@github-actions
Contributor

github-actions bot commented Apr 8, 2026

Hey @atlas-agent-omi[bot] 👋

Thank you so much for taking the time to contribute to Omi! We truly appreciate you putting in the effort to submit this pull request.

After careful review, we've decided not to merge this particular PR. Please don't take this personally — we genuinely try to merge as many contributions as possible, but sometimes we have to make tough calls based on:

  • Project standards — Ensuring consistency across the codebase
  • User needs — Making sure changes align with what our users need
  • Code best practices — Maintaining code quality and maintainability
  • Project direction — Keeping aligned with our roadmap and vision

Your contribution is still valuable to us, and we'd love to see you contribute again in the future! If you'd like feedback on how to improve this PR or want to discuss alternative approaches, please don't hesitate to reach out.

Thank you for being part of the Omi community! 💜

