Public status pages v1: live today + 90-day history + RSS feed#79
Conversation
CLAUDE.md @-imports AGENTS.md so Claude Code, Codex, and other tools all pick up the same notes.
Adds Upright.configuration.public_status_enabled (off by default) and public_status_custom_domains for CNAME support. Routes constrain on the 'status' subdomain so both status.<hostname> and CNAMEd customer domains hit Upright::Public::* controllers.
Upright::Service is a FrozenRecord loaded from config/services.yml. Probes opt in via 'service: <code>' in their YAML — Probeable defaults probe_service to try(:service), so no probe-class changes are needed. Probeable also self-registers including classes so Service#probes can iterate them without a central registry.
ProbeRollup.aggregate_day reads upright:probe_uptime_daily from Prometheus at end-of-day and upserts one row per probe with the uptime_fraction and a derived status (operational, degraded_performance, partial_outage, major_outage). ServiceRollup.aggregate_day takes min(uptime_fraction) of the day's ProbeRollups per service, so a service is only as healthy as its worst component. DailyAggregationJob just orchestrates per day — all logic lives on the rollup models. Requires the host app's upright:probe_uptime_daily recording rule to preserve probe_service in its 'by' clauses.
- Inline the enum values rather than naming a STATUSES constant nothing else references. - Fold the nil/operational guards into the case/when so status_for is one expression. - Use Time.now to pair with Date.today elsewhere in the rollup path (both Ruby stdlib clock sources, no AS time-zone coupling).
ServiceRollup was a materialized min(uptime_fraction) of the day's ProbeRollups, grouped by probe_service. That's cheap to compute on demand from ProbeRollup, so the extra table, write step, and aggregation-lag window weren't earning their keep. Upright::Service now exposes uptime_for(day), status_for(day), and daily_uptime(days:) — all backed by ProbeRollup queries. The job only aggregates ProbeRollup; ServiceRollup.aggregate_day is gone. Also switches the job to a lookback: Duration kwarg, iterating lookback.ago.to_date..Date.today. Default 1.day mirrors the previous behaviour (yesterday + today).
bin/seed-prometheus now emits a probe_service label on upright:probe_uptime_daily and upright_probe_up, then runs DailyAggregationJob with a 30-day lookback so ProbeRollup is populated against the seeded series. test/dummy/probes/*.yml gain matching service: attributes so a live probe run produces the same probe_service label as the seed, keeping the rollup path consistent end-to-end.
Flips the dummy app's public_status_enabled, gives the status controller a real action that loads services plus a 90-day window of days, and adds the show view rendering each service with a 90-day uptime bar strip. A new public layout keeps the page free of the admin's signed_in? layout chrome.
Adds a status banner that surfaces the worst service's status for today, a per-service row with the current status label, and a 90-bar uptime strip with per-day tooltips and an average uptime % below. CSS lives alongside the existing app/assets/stylesheets/upright files, which the layout's upright_stylesheet_link_tag globs in automatically. Uses the project's OKLCH design tokens; adds a small status color palette (operational/degraded/partial/major) keyed off the rollup status enum.
The previous distribution only generated uptimes >= 0.92, so partial_outage and major_outage bars never appeared on the status page. Updated each probe's distribution to occasionally hit those tiers, plus deterministic incident days (a MultiProxy major outage 22 days ago, a Gmail partial outage 12 days ago) so the service-level min-of-probes rollup visibly reflects them.
status_for(nil) returns :operational so any rollup that's missing for a day was getting the operational green colour, making empty days look identical to perfect-uptime days. Only assign a status class when the day actually has a rollup; otherwise the bar falls back to --status-none (neutral grey).
Three stacked problems were silently dropping seeded uptime samples:
* DailyAggregationJob queries upright:probe_uptime_daily at day.end_of_day,
but the seed emitted samples at NOW - day*86400 — Prometheus's 5-min
lookback rejected anything off by more than a few hours. Anchor each
day's sample at 23:59 UTC, clamping today to NOW.
* Default 15d retention dropped seeded blocks older than 15 days. Bump to
90d to match the public status page window.
* The recording rule grouped by (name, type, probe_target) and stripped
probe_service. Add probe_service to both by clauses so ProbeRollup can
tie rollups back to their Service.
Also wait until the OLDEST seeded block is queryable before invoking the
job — Prometheus loads blocks oldest-last, so racing ahead produced
partial rollups.
Today is still in progress, so persisting an aggregate from an incomplete day just produces a stale value the rest of the day. The public status page now shows today from live Prometheus state instead. Rename the job's `lookback:` keyword to `past:` since it's a Duration, and cap the range at Date.yesterday. Switch ProbeRollup.fetch_uptime_for from Time.now to Time.current so the iso8601 query time matches Prometheus's UTC samples — the comparison was epoch-correct either way, but the rendered timestamp picked up the system offset.
Add a `public:` flag to services.yml so only public-facing services
show up. Render today live from Prometheus (Service#live_status, via a
new LiveStatus concern that reads upright:probe_down_fraction) and past
days from ProbeRollup, presented through a single DailyStatus value
object so the view is agnostic to the source.
Move collection logic onto Service: `overall_status`, `by_history`,
`degraded` (with `current_outage_started_at` for outage duration in the
banner-adjacent list). The controller is now a one-liner.
Extract a StatusHelper and four partials (overall_banner, degraded_list,
service, uptime_bar). Rename the `degraded_performance` enum to
`degraded` and pull the enum order into Status::VALUES/PRIORITY so the
overall_status calculation can reuse it. Swap ProbeRollup tests to
fixtures and `travel_to` for stable dates.
Add a `:month_day` Date format ("%b %-d") so DailyStatus#tooltip can use
to_fs instead of strftime.
The Status concern's helpers and Service collection methods are short enough that an early return obscures rather than clarifies the happy-path expression — wrap them instead. Collapse the three stacked `return nil if` guards in `current_outage_started_at` into a single conditional, leaning on `rindex` returning nil for both empty arrays and no-match.
The public status page renders a collection of services, not a singular status resource. Rename to align with the standard collection→index Rails convention. The status page is now Upright::Public::ServicesController#index, served from `services/index.html.erb`, with helpers in Upright::Public::ServicesHelper.
Status was a Rails concern under Upright::Rollups, which meant the only way to call its `status_for` mapping was through an including class — forcing Service to reach into `Upright::Rollups::ProbeRollup.status_for(...)` for a concept that has nothing to do with rollups. Promote it to a plain module at Upright::Status with VALUES, PRIORITY, and a pure `for(uptime_fraction)`. ProbeRollup declares `enum :status, Upright::Status::VALUES` directly and owns its own `uptime_percentage`. Service and LiveStatus call `Upright::Status.for(fraction)` without the detour. `Upright::Status.for(nil)` now returns nil (a missing rollup is no-data, not :operational) so callers can drop the `fraction && ...` guard. Service#status_for(day) drops out — `live_status` and `daily_status_history` cover every remaining caller.
Drop the dedicated upsert_day method in favor of `find_or_create_by` with a block — inserts the rollup when the (probe_name, period_start) slot is empty and leaves an existing rollup alone. Move the fraction → status derivation into a before_save callback so the rollup's status can never drift from its uptime_fraction. aggregate_day no longer has to spell it out. Rename the per-element variable from `sample` (Prometheus jargon) to `probe_uptime` and the `:name` key to `:probe_name` so the hash matches the rollup column it'll populate. Tests switch to fixtures, dropping the `delete_all` setup and the ad-hoc `create!`, and gain coverage for the no-op-on-existing-rollup path and the before_save callback.
ServicesController#index now sets `Cache-Control: max-age=15, public` (plus the body-derived ETag Rails adds by default) so an outage-driven traffic spike doesn't tear through SQLite. Same TTL on both representations. The same action also responds to RSS at /feed (route defaults format to :rss), listing each currently-degraded service as a feed item keyed on the outage's start time. Channel envelope still renders when nothing is degraded. Layout dispatch is format-driven: `app/views/layouts/upright/public.html.erb` wraps the page, `public.rss.builder` is a one-line passthrough so the RSS body flows through unchanged.
There was a problem hiding this comment.
Pull request overview
Introduces the first public status-page surface (HTML + RSS) served from a dedicated status subdomain (optionally via custom CNAME hosts), backed by a unified 90-day DailyStatus history that combines today’s live Prometheus state with persisted daily rollups.
Changes:
- Added a public status page (HTML) and RSS feed endpoint gated behind
public_status_enabledand a subdomain route constraint. - Introduced service/status domain objects (
Upright::Service,Upright::Status,Upright::Service::DailyStatus) plus Prometheus-backed live status and a persisted daily rollup model/job. - Updated Prometheus rule templates + dummy dev tooling/fixtures to emit and aggregate 90 days of service-labeled uptime.
Tip
If you aren't ready for review, convert to a draft PR.
Click "Convert to draft" or run gh pr ready --undo.
Click "Ready for review" or run gh pr ready to reengage.
Reviewed changes
Copilot reviewed 43 out of 44 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| test/models/upright/status_test.rb | Adds unit tests for uptime-fraction → status mapping. |
| test/models/upright/service_test.rb | Adds tests for service loading, public scope, and rollup-backed uptime queries. |
| test/models/upright/rollups/probe_rollup_test.rb | Tests rollup persistence, derived status, and rollup-day behavior. |
| test/jobs/upright/rollups/daily_aggregation_job_test.rb | Verifies the aggregation job excludes today and handles empty windows. |
| test/integration/public/services_controller_test.rb | Integration coverage for HTML and RSS responses + caching header. |
| test/fixtures/upright_rollups_probe_rollups.yml | Fixtures for probe rollup history used by service tests. |
| test/dummy/probes/traceroute_probes.yml | Adds service mapping for traceroute probes in the dummy app. |
| test/dummy/probes/smtp_probes.yml | Adds service mapping for SMTP probes in the dummy app. |
| test/dummy/probes/http_probes.yml | Adds service mapping for HTTP probes in the dummy app. |
| test/dummy/docker-compose.yml | Extends Prometheus retention to 90 days for the dummy environment. |
| test/dummy/db/schema.rb | Updates dummy schema with new rollups table. |
| test/dummy/config/services.yml | Defines dummy Upright::Service records (public + internal). |
| test/dummy/config/recurring.yml | Schedules rollup aggregation job in dummy recurring config. |
| test/dummy/config/prometheus/rules/upright.yml | Updates dummy Prometheus rule grouping to keep probe_service. |
| test/dummy/config/initializers/upright.rb | Enables public status in dummy initializer for development/testing. |
| test/dummy/bin/seed-prometheus | Seeds 90 days of service-labeled metrics and runs aggregation. |
| lib/upright/engine.rb | Prints public status URL hint in debug callback when enabled. |
| lib/upright/configuration.rb | Adds public-status config + custom domain host allowlisting. |
| lib/generators/upright/install/templates/upright.rules.yml | Updates install template rules to group by probe_service. |
| db/migrate/20260512000001_create_upright_rollups.rb | Creates the upright_rollups_probe_rollups table and indexes. |
| config/routes.rb | Adds public-status constrained routes for root + /feed. |
| config/initializers/mime_types.rb | Registers application/rss+xml MIME type. |
| config/initializers/date_formats.rb | Adds a :month_day date format for tooltips. |
| CLAUDE.md | Points to AGENTS.md. |
| app/views/upright/public/services/index.rss.builder | RSS feed template for degraded services. |
| app/views/upright/public/services/index.html.erb | Public status index page layout. |
| app/views/upright/public/services/_uptime_bar.html.erb | Renders a single day “bar” in the uptime strip. |
| app/views/upright/public/services/_service.html.erb | Renders a service row + 90-day strip + summary uptime. |
| app/views/upright/public/services/_overall_banner.html.erb | Renders the overall status banner. |
| app/views/upright/public/services/_degraded_list.html.erb | Renders the list of currently degraded services. |
| app/views/layouts/upright/public.rss.builder | RSS layout wrapper. |
| app/views/layouts/upright/public.html.erb | Public HTML layout for status pages. |
| app/models/upright/status.rb | Defines status values/priority + uptime-threshold mapping. |
| app/models/upright/service/daily_status.rb | Value object for a single day’s status, fraction, and tooltip. |
| app/models/upright/service.rb | FrozenRecord-backed service model + history/degraded/overall logic. |
| app/models/upright/rollups/probe_rollup.rb | Rollup model + Prometheus fetch and persistence behavior. |
| app/models/concerns/upright/services/live_status.rb | Live Prometheus status + outage start inference. |
| app/models/concerns/upright/probeable.rb | Adds probe-class tracking and maps probes to a service. |
| app/jobs/upright/rollups/daily_aggregation_job.rb | Job to roll up completed days into probe rollups. |
| app/helpers/upright/public/services_helper.rb | Labels, outage phrasing, and uptime-average helpers. |
| app/controllers/upright/public/services_controller.rb | Public controller index with short expires_in caching. |
| app/controllers/upright/public/base_controller.rb | Base controller for public pages with dedicated layout. |
| app/assets/stylesheets/upright/public_status.css | Styling for the public status page components. |
| AGENTS.md | Contributor guidance for engine-local workflows. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def self.overall_status | ||
| Upright::Status::PRIORITY.find { |status| all.any? { |service| service.live_status == status } } || :operational | ||
| end |
There was a problem hiding this comment.
Leaving as-is for now. The action is cached (expires_in 15.seconds, public: true) and there's only a handful of public_facing services, so the per-render Prometheus calls are bounded and cheap. A request-scoped snapshot object would remove the duplication but isn't worth the extra indirection at this size — happy to revisit if it ever shows up in practice.
| def self.degraded | ||
| all.filter_map do |service| | ||
| status = service.live_status | ||
| unless status == :operational | ||
| { service: service, status: status, started_at: service.current_outage_started_at } | ||
| end | ||
| end |
There was a problem hiding this comment.
Same call as the overall_status thread — not worth a shared snapshot object yet given the 15s cache and small service count. Note the extra range query (current_outage_started_at) only runs for services that are actually degraded, which is the rare case.
| def self.rollup_day(day) | ||
| fetch_uptime_for(day).each do |probe_uptime| | ||
| find_or_create_by(probe_name: probe_uptime.fetch(:probe_name), period_start: day.beginning_of_day) do |rollup| | ||
| rollup.probe_service = probe_uptime[:probe_service] | ||
| rollup.uptime_fraction = probe_uptime.fetch(:uptime_fraction) | ||
| end | ||
| end | ||
| end | ||
|
|
||
| def self.fetch_uptime_for(day) | ||
| query_time = [ day.end_of_day, Time.current ].min | ||
|
|
||
| response = prometheus_client.query(query: PROMETHEUS_METRIC, time: query_time.iso8601).deep_symbolize_keys |
There was a problem hiding this comment.
Not a bug: query_time.iso8601 emits an offset-aware timestamp, which Prometheus resolves to an unambiguous absolute instant regardless of the host's Time.zone. And upright:probe_uptime_daily is a sliding avg_over_time(...[1d:]), not a fixed UTC-day bucket, so querying at the app's local end-of-day rolls up "the day" as the app defines it. period_start and the retrieval side (uptime_for/daily_uptime) all use the same Time.zone consistently, so it's internally consistent. Forcing UTC would remove that timezone-awareness rather than add correctness.
| def configure_allowed_hosts | ||
| port_suffix = Rails.env.local? ? "(:\\d+)?" : "" | ||
| Rails.application.config.hosts = [ /.*\.#{Regexp.escape(hostname)}#{port_suffix}/, /#{Regexp.escape(hostname)}#{port_suffix}/ ] | ||
| hosts = [ /.*\.#{Regexp.escape(hostname)}#{port_suffix}/, /#{Regexp.escape(hostname)}#{port_suffix}/ ] | ||
| Array(@public_status_custom_domains).each do |domain| | ||
| hosts << /\A#{Regexp.escape(domain)}#{port_suffix}\z/ | ||
| end | ||
| Rails.application.config.hosts = hosts | ||
| Rails.application.config.action_dispatch.tld_length = 1 | ||
| end |
There was a problem hiding this comment.
Real gap, deferring to a follow-up. Custom-domain routing isn't actually wired up in v1 — the route constraint only matches req.subdomain == public_status_subdomain, and public_status_custom_domains currently just allowlists hosts. The follow-up that turns CNAMEs on will match req.host against the configured domains directly (which also sidesteps the multi-part-TLD tld_length problem you flagged). Until then, custom domains are config-only.
| xml.rss(version: "2.0") do | ||
| xml.channel do | ||
| xml.title "Upright Status" | ||
| xml.link upright.public_services_root_url |
There was a problem hiding this comment.
Fixed in 2fa954b — the feed <link> now uses request.base_url so it targets the requesting status host instead of the admin app.<host> from default_url_options.
| started_at = issue[:started_at] || Time.current | ||
| xml.item do | ||
| xml.title "#{issue[:service].name} — #{status_label(issue[:status])}" | ||
| xml.description "#{issue[:service].name} is currently #{status_label(issue[:status]).downcase} #{outage_duration_phrase(started_at: issue[:started_at])}." | ||
| xml.pubDate started_at.rfc822 | ||
| xml.guid "#{issue[:service].code}-#{started_at.to_i}", isPermaLink: "false" |
There was a problem hiding this comment.
Fixed in 2fa954b — the guid is now stable (feed_item_guid, keyed on service code plus the outage start when known) and pubDate is only emitted when started_at is known, so a sustained >24h outage no longer changes guid/date every request.
The public status page reads live per-service status from upright:probe_down_fraction filtered by probe_service, but that recording rule grouped only by (name, type, probe_target, alert_severity) — the probe_service label was dropped, so the selector matched nothing and live_status was always :operational. Add probe_service to the rule grouping in both the install template and the dummy rules. RSS feed fixes: - <link> used default_url_options (the admin app.<host>); point it at the requesting status host via request.base_url. - guid/pubDate fell back to Time.current when an outage predates the 24h live lookback (started_at nil), so a sustained outage got a fresh guid every request and spammed feed readers. Use a stable guid keyed on service code (+ start time when known) and only emit pubDate when the start is known. Drop a stray .freeze on a constant.
The Prometheus base URL was inlined as ENV.fetch("PROMETHEUS_URL",
"http://localhost:9090") in five places, and four classes each built an
identical Prometheus::ApiClient. Capture the URL as
Upright.configuration.prometheus_url (overridable, env default) and the
client as Upright.prometheus_client, then point every call site at them.
| def configure_allowed_hosts | ||
| port_suffix = Rails.env.local? ? "(:\\d+)?" : "" | ||
| Rails.application.config.hosts = [ /.*\.#{Regexp.escape(hostname)}#{port_suffix}/, /#{Regexp.escape(hostname)}#{port_suffix}/ ] | ||
| hosts = [ /.*\.#{Regexp.escape(hostname)}#{port_suffix}/, /#{Regexp.escape(hostname)}#{port_suffix}/ ] | ||
| Array(@public_status_custom_domains).each do |domain| | ||
| hosts << /\A#{Regexp.escape(domain)}#{port_suffix}\z/ | ||
| end | ||
| Rails.application.config.hosts = hosts |
| <title>Status</title> | ||
| <meta name="viewport" content="width=device-width,initial-scale=1"> | ||
| <%= csrf_meta_tags %> | ||
| <%= csp_meta_tag %> | ||
| <%= upright_stylesheet_link_tag "data-turbo-track": "reload" %> |
| def self.rollup_day(day) | ||
| fetch_uptime_for(day).each do |probe_uptime| | ||
| find_or_create_by(probe_name: probe_uptime.fetch(:probe_name), period_start: day.beginning_of_day) do |rollup| | ||
| rollup.probe_service = probe_uptime[:probe_service] | ||
| rollup.uptime_fraction = probe_uptime.fetch(:uptime_fraction) | ||
| end | ||
| end |
Summary
First PR for the status-page feature — a read-only view that
public users can hit at
status.<host>(or a CNAMEd custom domain) tosee whether each user-facing service is operational. Off by default;
opt in via
Upright.configuration.public_status_enabled = true.The page renders an overall banner, a degraded-services list with outage
durations, and a 90-day uptime bar per service. Today's status comes
live from Prometheus, past days from the daily rollup table — both
flow through a single
DailyStatuscollection so the view is agnosticto the source.
Data model
Upright::Service— FrozenRecord loaded fromconfig/services.yml.New
public:flag drives thepublic_facingscope. Class methods(
overall_status,by_history,degraded) own the collection logicso controllers/views stay thin.
Upright::Status— plain module withVALUES,PRIORITY, and apure
for(uptime_fraction)threshold mapper. Both ProbeRollup andService depend on it; nobody reaches through ProbeRollup to map a
fraction.
Upright::Rollups::ProbeRollup— one row per (probe, day) withuptime_fractionand an enumstatusderived in abefore_savecallback (so status can't drift from fraction).
Upright::Rollups::DailyAggregationJob— recurring hourly jobiterating
past.ago.to_date..Date.yesterday(today is in progressand represented live, not persisted).
Upright::Services::LiveStatus(concern) — readsupright:probe_down_fractionfrom Prometheus for today's status andthe most recent outage's start time.
Upright::Service::DailyStatus— value object representing oneday; carries status + optional fraction + a tooltip helper.
Page UI
Upright::Public::ServicesController#indexpowers everything. Viewsunder
app/views/upright/public/services/:_overall_banner— worst current status across services_degraded_list— currently-degraded services with outage duration_service+_uptime_bar— per-service row + 90-day stripHelpers (
Upright::Public::ServicesHelper) own the status-to-labelmapping and outage-duration phrasing.
RSS feed
Same action serves RSS at
/feed(route forcesformat: :rss,template at
index.rss.builder). One item per currently-degradedservice, keyed on outage start time so feed readers see each new
outage. Empty channel envelope when all clear. The layout dispatches
by format —
public.html.erbfor the page,public.rss.builderforthe feed.
HTTP caching
expires_in 15.seconds, public: trueon the action sendsCache-Control: max-age=15, publicfor both formats; Rails' defaultETag (body-derived) handles conditional GETs. Load-bearing — SQLite +
an outage-driven traffic spike was the failure mode the original todo
flagged.
Routes
Subdomain constraint requires both
public_status_enabledAND therequest to hit
Upright.configuration.public_status_subdomain(or aconfigured CNAME). Routes don't exist on other subdomains, so the
admin app is unaffected.
What's deferred
These were called out in the original UI todo but depend on later work:
ComponentsController#show— per-service 90-day page. Theper-row bars in the index already render the same data; standalone
per-service pages can land later.
IncidentsController#index/show+ richer RSS items — need theincidents domain (separate todo).
Test plan
bin/rails testpasses (174 tests)public_status_enabled=true, hitstatus.<host>/and confirm the page renderscurl -i status.<host>/showsCache-Control: max-age=15, publicand a
text/htmlContent-Typecurl -i status.<host>/feedshowsContent-Type: application/rss+xmland a valid<rss>envelope
config.public_status_enabled = falseand confirm bothURLs 404 (route falls through the subdomain constraint)