Add service health endpoints + realtime transcription monitoring #6033

Closed
atlas-agent-omi[bot] wants to merge 5 commits into main from atlas/listen-health-endpoint

Conversation

@atlas-agent-omi

What

Adds health check endpoints for all critical services, including realtime transcription monitoring via Deepgram keepalive failure tracking.

Endpoints

| Endpoint | Checks | For Instatus |
| --- | --- | --- |
| `GET /v1/health/chat` | Anthropic API | Chat component |
| `GET /v1/health/transcription` | Deepgram API | Transcription (pre-recorded) |
| `GET /v1/health/listen` | Deepgram WebSocket keepalive failures | Realtime Transcription |
| `GET /v1/health/ai` | OpenAI API | AI Processing |
| `GET /v1/health/storage` | Firestore | Database |
| `GET /v1/health/search` | Typesense | Search |
| `GET /v1/health/services` | All of the above, concurrently | Aggregate |
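The aggregate endpoint in the table above fans the individual checks out concurrently. A minimal sketch of that pattern, assuming `asyncio.gather` and stand-in `_check_*` helpers (the real helpers live in `backend/routers/health.py`; the names and status-mapping details here are illustrative):

```python
# Sketch of the /v1/health/services aggregate: run all dependency checks
# concurrently and fold the results into one overall status.
# _check_ok is a stand-in for the PR's real per-service checks.
import asyncio


async def _check_ok(name: str) -> dict:
    # Stand-in for a real dependency probe (Anthropic, Deepgram, ...).
    return {"status": "ok"}


async def health_services() -> tuple[int, dict]:
    names = ["chat", "transcription", "ai", "storage", "search"]
    # Launch every check at once; total latency is the slowest check,
    # not the sum of all of them.
    results = await asyncio.gather(*(_check_ok(n) for n in names))
    services = dict(zip(names, results))
    statuses = [r["status"] for r in results]
    if all(s == "ok" for s in statuses):
        return 200, {"status": "ok", "services": services}
    if any(s == "ok" for s in statuses):
        return 200, {"status": "degraded", "services": services}
    return 503, {"status": "down", "services": services}
```

An external monitor only needs the HTTP status code; the JSON body carries the per-service breakdown for debugging.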

How /v1/health/listen works

  • Instruments SafeDeepgramSocket.keep_alive() to record failures in a rolling 5-minute window counter
  • 0-5 failures → 200 ok (normal background noise)
  • 6-20 failures → 200 degraded
  • 21+ failures → 503 down
  • Also exposes active_connections count and a Prometheus counter dg_keepalive_failures_total
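The rolling window and thresholds above can be sketched as follows. This is a minimal illustration, not the PR's `_FailureTracker`; the class name, lock strategy, and use of `time.monotonic` are assumptions:

```python
# Sketch of a rolling-window failure tracker plus the threshold mapping
# from the bullet list above. Thread-safe, since keepalive failures are
# recorded from a background thread while health polls read the count.
import threading
import time
from collections import deque
from typing import Optional


class FailureTracker:
    """Counts failure events recorded within the last `window` seconds."""

    def __init__(self, window: float = 300.0):
        self._window = window
        self._events = deque()  # timestamps of recorded failures
        self._lock = threading.Lock()

    def record(self, ts: Optional[float] = None) -> None:
        with self._lock:
            self._events.append(ts if ts is not None else time.monotonic())

    def count(self, now: Optional[float] = None) -> int:
        now = now if now is not None else time.monotonic()
        with self._lock:
            # Evict events older than the window before counting.
            while self._events and now - self._events[0] > self._window:
                self._events.popleft()
            return len(self._events)


def listen_status(failures: int) -> tuple:
    # Threshold mapping taken directly from the PR description.
    if failures <= 5:
        return 200, "ok"
    if failures <= 20:
        return 200, "degraded"
    return 503, "down"
```

Evicting on read keeps `record()` cheap in the hot keepalive path; the cleanup cost is paid by the (infrequent) health poll instead.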

Why

The transcription outage on Mar 24 wasn't visible on Cloud Run metrics because realtime transcription runs on GKE (backend-listen pods). This endpoint lets Instatus (or any external monitor) detect when Deepgram WebSocket connections are dying.

All endpoints are unauthenticated, 5s timeout, return JSON with Cache-Control: no-cache.
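The shared timeout behavior can be sketched with stdlib asyncio alone (the PR uses FastAPI; `run_check` and the probe names below are hypothetical, and in the real handler the response would also carry the `Cache-Control: no-cache` header):

```python
# Sketch of the common check wrapper implied above: every dependency probe
# is capped at 5 seconds and mapped to a JSON-serializable status dict, so
# a hung dependency degrades to "down" instead of hanging the monitor.
import asyncio

TIMEOUT = 5.0  # seconds, per the PR description


async def run_check(check, timeout: float = TIMEOUT) -> dict:
    try:
        await asyncio.wait_for(check(), timeout=timeout)
        return {"status": "ok"}
    except asyncio.TimeoutError:
        return {"status": "down", "error": f"timed out after {timeout}s"}
    except Exception as e:
        # Truncate error text so the JSON body stays small.
        return {"status": "down", "error": str(e)[:200]}


async def fast_probe():
    await asyncio.sleep(0)  # simulates a healthy dependency


async def slow_probe():
    await asyncio.sleep(10)  # simulates a hung dependency
```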

Files changed

  • backend/routers/health.py (new) — all health endpoints
  • backend/utils/metrics.py — adds DG_KEEPALIVE_FAILURES counter + dg_failure_tracker
  • backend/utils/stt/safe_socket.py — instruments keepalive failure recording
  • backend/main.py — registers health router

Supersedes #6032.

atlas-agent[bot] added 2 commits March 25, 2026 04:48
New endpoints (all unauthenticated, 5s timeout):
- GET /v1/health/chat — pings Anthropic API
- GET /v1/health/transcription — pings Deepgram API
- GET /v1/health/ai — pings OpenAI API
- GET /v1/health/storage — pings Firestore
- GET /v1/health/search — pings Typesense
- GET /v1/health/services — aggregate check (all above)

Returns 200 + {status: ok} when healthy, 503 + {status: down, error: ...} when not.
Designed for external monitoring (Instatus/BetterStack/etc).
- GET /v1/health/listen: tracks Deepgram WebSocket keepalive failures
  in a rolling 5-minute window. Returns ok/degraded/down based on
  failure count thresholds.
- Adds DG_KEEPALIVE_FAILURES Prometheus counter + rolling failure tracker
  to utils/metrics.py
- Instruments safe_socket.py to record keepalive failures
- Includes listen health in /v1/health/services aggregate
- Also includes all Cloud Run service health endpoints (chat, transcription,
  ai, storage, search) from PR #6032
@greptile-apps
Contributor

greptile-apps bot commented Mar 25, 2026

Greptile Summary

This PR adds a suite of unauthenticated health-check endpoints (/v1/health/*) for use by Instatus and similar external monitors, along with rolling-window keepalive failure tracking for Deepgram WebSocket connections to make realtime-transcription outages visible.

The core idea is sound and addresses a real observability gap. A few issues need attention before merging:

  • Blocking event loop in _check_firestore: db.collection(...).document('ping').get() is a synchronous Firestore SDK call invoked directly inside an async def. This will block the event loop thread for up to 5 seconds on every /v1/health/storage (and /v1/health/services) call, stalling all concurrent async handlers in the process. The call should be offloaded with run_in_executor.
  • Multiple in-function imports in health.py (lines 82, 162, 200) and safe_socket.py (lines 93, 105) violate the project's explicit backend import rule ("no in-function imports").
  • Private prometheus_client API (._value.get()) is fragile and not part of the public contract.
  • DG_CONNECTION_CLOSES counter is declared in metrics.py but never incremented anywhere — it will always read 0.
  • import time / import threading are inserted mid-file in metrics.py rather than at the top with the other imports.

Confidence Score: 3/5

  • Not safe to merge as-is: the blocking sync Firestore call in an async handler can cause event-loop stalls in production.
  • The P1 blocking-IO bug in _check_firestore is a concrete reliability concern — every call to /v1/health/storage or /v1/health/services can freeze the FastAPI event loop for up to 5 seconds, degrading all concurrent requests. The in-function imports rule violation is also flagged as a required fix by the project's own style guide. Once those two are addressed the PR is on a clear path to merge; the remaining items (private Prometheus API, unused counter, import ordering) are straightforward clean-ups.
  • backend/routers/health.py (blocking Firestore call + in-function imports), backend/utils/metrics.py (unused counter + import ordering)

Important Files Changed

Filename Overview
backend/routers/health.py New health-check router with 7 endpoints; contains a blocking sync Firestore call in an async function that can stall the event loop, plus multiple in-function imports that violate the backend import policy.
backend/utils/metrics.py Adds DG_KEEPALIVE_FAILURES, DG_CONNECTION_CLOSES counters and a _FailureTracker rolling-window class; DG_CONNECTION_CLOSES is never incremented anywhere, and import time/import threading are placed mid-file rather than at the top.
backend/utils/stt/safe_socket.py Instruments keepalive failure paths to increment DG_KEEPALIVE_FAILURES and record to dg_failure_tracker; uses in-function imports as a deliberate lightweight-module trade-off, which violates the backend import rule.
backend/main.py Minimal change: imports and registers the new health router in the correct position.

Sequence Diagram

sequenceDiagram
    participant Monitor as Instatus / External Monitor
    participant FastAPI as FastAPI (backend)
    participant Health as health.py router
    participant Metrics as utils/metrics.py
    participant SafeSocket as SafeDeepgramSocket
    participant DG as Deepgram WebSocket

    Note over SafeSocket,DG: Continuous background thread (dg-keepalive)
    SafeSocket->>DG: keep_alive()
    DG-->>SafeSocket: False / Exception
    SafeSocket->>Metrics: DG_KEEPALIVE_FAILURES.inc()
    SafeSocket->>Metrics: dg_failure_tracker.record(timestamp)

    Note over Monitor,FastAPI: Health poll (unauthenticated)
    Monitor->>FastAPI: GET /v1/health/listen
    FastAPI->>Health: health_listen()
    Health->>Metrics: dg_failure_tracker.count()
    Metrics-->>Health: failures (rolling 5-min window)
    Health-->>Monitor: 200 ok / 200 degraded / 503 down

    Monitor->>FastAPI: GET /v1/health/services
    FastAPI->>Health: health_services()
    par Concurrent checks
        Health->>Health: _check_anthropic()
        Health->>Health: _check_deepgram()
        Health->>Health: _check_openai()
        Health->>Health: _check_firestore()
        Health->>Health: _check_typesense()
    end
    Health-->>Monitor: 200 ok / 200 degraded / 503 down (aggregate)


Comment thread backend/routers/health.py
Comment on lines +79 to +87
async def _check_firestore() -> dict:
    """Check Firestore connectivity with a minimal read."""
    try:
        from database._client import db
        # Read a nonexistent doc — fast, just checks connectivity
        doc = db.collection('_health_check').document('ping').get()
        return {"status": "ok"}
    except Exception as e:
        return {"status": "down", "error": str(e)[:200]}
Contributor


P1 Blocking sync Firestore call inside async function

db.collection(...).document(...).get() is a synchronous, blocking call from the google-cloud-firestore sync client (firestore.Client()). Calling it directly inside an async def function blocks the entire event loop thread for as long as Firestore takes to respond — up to the 5-second TIMEOUT. During that window every other concurrent async handler in the FastAPI process is stalled.

Wrap the call with asyncio.get_event_loop().run_in_executor so it runs on a thread-pool thread:

async def _check_firestore() -> dict:
    """Check Firestore connectivity with a minimal read."""
    try:
        import asyncio
        from database._client import db
        loop = asyncio.get_event_loop()
        await loop.run_in_executor(
            None,
            lambda: db.collection('_health_check').document('ping').get()
        )
        return {"status": "ok"}
    except Exception as e:
        return {"status": "down", "error": str(e)[:200]}

(Note: importing inside the function is used here only for illustration — see separate comment about in-function imports.)

Comment thread backend/routers/health.py Outdated
async def _check_firestore() -> dict:
    """Check Firestore connectivity with a minimal read."""
    try:
        from database._client import db
Contributor


P1 In-function imports violate backend import rules

The project's explicit backend import rule states: "Keep imports at module top level only; no in-function imports." This file has three in-function imports:

  • health.py line 82: `from database._client import db` inside _check_firestore
  • health.py line 162: `from utils.metrics import dg_failure_tracker, BACKEND_LISTEN_ACTIVE_WS_CONNECTIONS` inside health_listen
  • health.py line 200: `from utils.metrics import dg_failure_tracker` inside health_services

Additionally, safe_socket.py lines 93 and 105 add in-function imports (from utils.metrics import DG_KEEPALIVE_FAILURES, dg_failure_tracker), also violating this rule. While the module docstring says it's intentionally lightweight for tests, the right solution is a conditional top-level import or a separate adapter rather than in-function imports.

All of these should be moved to module-level imports at the top of their respective files.

Context Used: Backend Python import rules - no in-function impor... (source)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment thread backend/routers/health.py Outdated
from utils.metrics import dg_failure_tracker, BACKEND_LISTEN_ACTIVE_WS_CONNECTIONS
failures = dg_failure_tracker.count()
try:
    active_connections = int(BACKEND_LISTEN_ACTIVE_WS_CONNECTIONS._value.get())
Contributor


P2 Accessing private prometheus_client internal API

BACKEND_LISTEN_ACTIVE_WS_CONNECTIONS._value.get() relies on the private _value attribute of prometheus_client's Gauge class. This is an undocumented internal implementation detail that can change without notice in any minor release.

Consider using prometheus_client's collect() method to read the current value, or simply track a separate plain Python int / threading.Event counter for the connection count that you can read safely without touching private attributes:

try:
    samples = list(BACKEND_LISTEN_ACTIVE_WS_CONNECTIONS.collect()[0].samples)
    active_connections = int(samples[0].value) if samples else -1
except Exception:
    active_connections = -1

Comment thread backend/utils/metrics.py Outdated
Comment on lines +39 to +41
# Rolling window for health check (last 5 minutes)
import time
import threading
Contributor


P2 Standard library imports placed mid-file

import time and import threading are placed after the Prometheus counter definitions instead of at the top of the file with the other imports. Python convention (and most linters) require all imports to appear at the top of the module before any other statements.

Suggested change:

  - # Rolling window for health check (last 5 minutes)
  - import time
  - import threading
  + import threading
  + import time
  + from prometheus_client import Counter, Gauge, generate_latest, CONTENT_TYPE_LATEST
  + from fastapi import Response

(The import time/import threading lines at lines 40-41 should be removed from their current location.)

Comment thread backend/utils/metrics.py Outdated
Comment on lines +34 to +37
DG_CONNECTION_CLOSES = Counter(
    'dg_connection_closes_total',
    'Total Deepgram WebSocket connection closes (1011 errors)',
)
Contributor


P2 DG_CONNECTION_CLOSES counter is defined but never incremented

The DG_CONNECTION_CLOSES counter is added to metrics.py but there is no corresponding call to DG_CONNECTION_CLOSES.inc() anywhere in the PR (or in the changed code in safe_socket.py). The counter will always read 0 in Prometheus, which can mislead operators into thinking no 1011 errors have occurred.

Either wire it up to the relevant close-event code path (e.g. when the socket is closed with a 1011 status), or remove the counter until the instrumentation point is ready.
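One possible wiring point, sketched with a stub in place of the real `prometheus_client.Counter` (the `on_socket_close` hook and the stub class are hypothetical; only the counter name and the 1011 trigger come from the PR):

```python
# Sketch: increment the close counter when the Deepgram socket closes with
# code 1011 ("internal error"). _StubCounter stands in for the real
# prometheus_client Counter, which exposes the same .inc() method.
class _StubCounter:
    def __init__(self):
        self.value = 0

    def inc(self) -> None:
        self.value += 1


DG_CONNECTION_CLOSES = _StubCounter()


def on_socket_close(close_code: int) -> None:
    # Hypothetical close-event hook in safe_socket.py; only 1011 closes
    # are the outage signal this counter is meant to capture.
    if close_code == 1011:
        DG_CONNECTION_CLOSES.inc()
```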

@aaravgarg aaravgarg requested a review from mdmohsin7 March 25, 2026 09:18
@mdmohsin7
Member

Closing — we went with a different approach.

Instead of health endpoints inside the backend, we wired Grafana alerting directly to Instatus via native webhook integration:

  • 13 outage-signal alert rules (5XX, error rates, quota exhaustion) labeled with instatus_component
  • 5 webhook contact points in Grafana, one per Instatus status page component (API, Live Transcription, AI & Chat, Speech Processing, Plugins)
  • Notification policies route alerts to the right Instatus component

No new code, no new services — just Grafana config + Instatus config. Alerts fire → Instatus incident created → alert resolves → incident auto-resolves.

@mdmohsin7 mdmohsin7 closed this Apr 8, 2026
@github-actions
Contributor

github-actions bot commented Apr 8, 2026

Hey @atlas-agent-omi[bot] 👋

Thank you so much for taking the time to contribute to Omi! We truly appreciate you putting in the effort to submit this pull request.

After careful review, we've decided not to merge this particular PR. Please don't take this personally — we genuinely try to merge as many contributions as possible, but sometimes we have to make tough calls based on:

  • Project standards — Ensuring consistency across the codebase
  • User needs — Making sure changes align with what our users need
  • Code best practices — Maintaining code quality and maintainability
  • Project direction — Keeping aligned with our roadmap and vision

Your contribution is still valuable to us, and we'd love to see you contribute again in the future! If you'd like feedback on how to improve this PR or want to discuss alternative approaches, please don't hesitate to reach out.

Thank you for being part of the Omi community! 💜

