Feature/stale job monitor #5558

Open
brendankowitz wants to merge 13 commits into main from feature/stale-job-monitor

@brendankowitz (Member) commented May 7, 2026

Description

This pull request introduces a stale job monitor for SQL-backed async job queues. The monitor reports the age of the oldest queued job per queue type so stalled queues can be detected before customers observe delayed operations.

Key changes

Stale job monitor

  • Added StaleJobWatchdog to query active jobs for each QueueType, compute the oldest queued job age per queue, log stale queues, and publish StaleJobMetricsNotification.
  • Added StaleJobMetricsNotification and StaleJobMetricHandler to expose the latest queue-age snapshot through the FhirServer meter as Jobs.OldestQueuedAgeSeconds with a queue_type tag.

Dependency injection and background service integration

  • Registered StaleJobWatchdog as a singleton in SQL Server service registration.
  • Re-registered StaleJobMetricHandler as a singleton MediatR notification handler so the observable gauge reads a stable metric snapshot.
  • Updated WatchdogsBackgroundService to start StaleJobWatchdog with the existing SQL watchdogs.
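
The "stable metric snapshot" point above can be illustrated with a small language-neutral sketch in Python. It is not the PR's C# code: the class name `QueueAgeSnapshotHolder`, its method names, and the queue names used in the example are hypothetical. The idea shown is that one long-lived (singleton) handler instance builds a complete new mapping for every notification and publishes it with a single reference assignment, so a gauge callback that reads the reference once sees either the old snapshot or the new one, never a mixture.

```python
from types import MappingProxyType
from typing import Dict, List, Mapping, Tuple


class QueueAgeSnapshotHolder:
    """Hypothetical stand-in for a singleton metric handler (not the PR's class)."""

    def __init__(self) -> None:
        # Start with an empty, read-only snapshot.
        self._snapshot: Mapping[str, float] = MappingProxyType({})

    def handle(self, ages_by_queue: Mapping[str, float]) -> None:
        # Copy the incoming values, then publish the copy with one
        # reference assignment instead of updating keys one at a time.
        self._snapshot = MappingProxyType(dict(ages_by_queue))

    def observe(self) -> List[Tuple[float, Dict[str, str]]]:
        # Read the reference once so the whole scrape uses one snapshot.
        snapshot = self._snapshot
        return [(age, {"queue_type": queue}) for queue, age in sorted(snapshot.items())]
```

Because the gauge callback and the notification handler must share the same instance for this to work, the handler has to be registered with singleton lifetime, which is what the registration change above provides.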

Testing and documentation

  • Added logic tests for queue age computation and metric snapshot updates.
  • Added a SQL watchdog integration test that verifies notifications include all queue types when the queue is empty.
  • Added ADR documentation at docs/arch/adr-2605-stale-job-monitor.md.

Related issues

Addresses AB#164461.

Testing

  • dotnet build .\src\Microsoft.Health.Fhir.Core\Microsoft.Health.Fhir.Core.csproj -c Release -f net8.0 --no-restore
  • dotnet build .\src\Microsoft.Health.Fhir.Core\Microsoft.Health.Fhir.Core.csproj -c Release -f net9.0 --no-restore
  • dotnet build .\src\Microsoft.Health.Fhir.SqlServer\Microsoft.Health.Fhir.SqlServer.csproj -c Release -f net8.0 --no-restore
  • dotnet build .\src\Microsoft.Health.Fhir.SqlServer\Microsoft.Health.Fhir.SqlServer.csproj -c Release -f net9.0 --no-restore

FHIR Team Checklist

  • Title is succinct and less than 65 characters.
  • Milestone added for the sprint that it is merged.
  • Tagged with the type of update: New Feature.
  • Tagged with release area: Azure Healthcare APIs.
  • Tagged with PaaS compatibility: No-PaaS-breaking-change.
  • ADR included: docs/arch/adr-2605-stale-job-monitor.md.
  • CI is green before merge.
  • Reviewed squash-merge requirements.

Semver Change

Feature

@brendankowitz added the New Feature, Azure Healthcare APIs, No-PaaS-breaking-change, and ADR-Included labels May 7, 2026
@brendankowitz added this to the FY26\Q4\2Wk\2Wk23 milestone May 7, 2026
brendankowitz and others added 12 commits May 19, 2026 12:11
Spec for a StaleJobWatchdog that emits fhir_oldest_queued_job_age_seconds
Prometheus gauge per queue type when no jobs are running.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
6-task plan: notification, metric handler, watchdog, WatchdogsBackgroundService
wiring, DI registration, integration test.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Previously, a single running job in any queue masked staleness in every
  other queue, defeating per-queue-type alerting. ComputeQueueAges now
  evaluates the running check per queue.
- StaleJobMetricHandler swapped from a per-key-updated ConcurrentDictionary
  to a volatile reference swap so ObservableGauge scrapes never observe a
  partial multi-queue update.
- Added logic test asserting a running job in one queue does not suppress
  another queue's staleness.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
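
The per-queue running check described in the commit message above can be sketched as follows. This is a minimal Python illustration, not the C# `ComputeQueueAges` from the PR: the `ActiveJob` shape, the status strings, the queue names, and the choice to report 0 for a queue that has a running job or no queued jobs are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, NamedTuple


class ActiveJob(NamedTuple):
    """Hypothetical row shape for an active job."""
    queue_type: str
    status: str            # assumed values: "Created" (queued) or "Running"
    create_date: datetime


def compute_queue_ages(
    jobs: Iterable[ActiveJob], queue_types: Iterable[str], now: datetime
) -> Dict[str, float]:
    jobs = list(jobs)
    # Every known queue type gets an entry, even when it has no jobs.
    ages = {queue: 0.0 for queue in queue_types}
    # The running check is evaluated per queue, so a running job only
    # suppresses staleness for its own queue type.
    running = {job.queue_type for job in jobs if job.status == "Running"}
    for job in jobs:
        if job.status != "Created" or job.queue_type in running:
            continue
        age = max(0.0, (now - job.create_date).total_seconds())
        ages[job.queue_type] = max(ages.get(job.queue_type, 0.0), age)
    return ages
```

With a running job in one queue and only queued jobs in another, the second queue still reports the age of its oldest queued job, which is the behavior the added logic test asserts.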
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add an ObservableGauge<long> named Jobs.QueueDepth to the existing
StaleJobMetricHandler, using the same per-tick SQL result set already
fetched by StaleJobWatchdog. Reports pending (Created) and running job
counts per QueueType via queue_type and state tags, complementing the
existing Jobs.OldestQueuedAgeSeconds metric for full active-queue
observability. ADR 2605 amended with the depth metric decision.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
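
The depth metric described above amounts to counting active jobs by queue type and state from the rows the watchdog already fetched. The sketch below shows that counting in Python under stated assumptions: the function name, the `(queue_type, status)` row shape, and the tag values `pending` and `running` are hypothetical, and the commit message only confirms that the tags are named queue_type and state.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Assumed mapping from job status to the value of the "state" tag.
STATE_BY_STATUS = {"Created": "pending", "Running": "running"}
STATES = ("pending", "running")


def compute_queue_depths(
    rows: Iterable[Tuple[str, str]], queue_types: Iterable[str]
) -> List[Tuple[int, Dict[str, str]]]:
    # Count only statuses that map to a reported state; others are ignored.
    counts = Counter(
        (queue, STATE_BY_STATUS[status])
        for queue, status in rows
        if status in STATE_BY_STATUS
    )
    # Emit one measurement per queue type and state, including zeros,
    # so every series stays present between scrapes.
    return [
        (counts[(queue, state)], {"queue_type": queue, "state": state})
        for queue in queue_types
        for state in STATES
    ]
```

Emitting explicit zeros is a design choice of this sketch; whether the PR reports zero-depth series is not stated in the commit message.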
The watchdog reports both stale-queue age and queue depth metrics, so
JobMonitorWatchdog more accurately describes its broader monitoring role.

Renames the watchdog and its companion notification and metric handler,
moves StaleJobMetricHandler into the Logging/Metrics/Handlers folder to
match the post-rebase main layout (PR #5555 moved metric handlers there),
and updates the Features/Operations/StaleJob folder to
Features/Operations/JobMonitor.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@brendankowitz force-pushed the feature/stale-job-monitor branch from b3a52ff to 98da39a May 19, 2026 19:26

@codecov-commenter commented May 19, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.33%. Comparing base (5b31cf5) to head (230fbb1).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #5558      +/-   ##
==========================================
- Coverage   77.36%   77.33%   -0.03%     
==========================================
  Files         993      997       +4     
  Lines       36418    36500      +82     
  Branches     5518     5529      +11     
==========================================
+ Hits        28175    28228      +53     
- Misses       6884     6911      +27     
- Partials     1359     1361       +2     

see 15 files with indirect coverage changes

@brendankowitz marked this pull request as ready for review May 19, 2026 21:21
@brendankowitz requested a review from a team as a code owner May 19, 2026 21:21
- Mark _now as readonly in JobMonitorWatchdogLogicTests
- Replace ContainsKey+indexer with TryGetValue in SqlServerWatchdogTests
  to avoid double dictionary lookups

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@brendankowitz force-pushed the feature/stale-job-monitor branch from 4487a78 to 230fbb1 May 19, 2026 21:24
3 participants