ARC runners did not recover automatically after GitHub outage #4396

@rob-howie-depop

Controller Version

0.13.1

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

This is not directly reproducible because it is the byproduct of an incident.
See https://www.githubstatus.com/incidents/g9j4tmfqdd09

Describe the bug

During the GitHub Actions degraded availability incident on 2026-03-05 (githubstatus.com/incidents/g9j4tmfqdd09), runners in our ARC deployment became stuck in a bad state and were not automatically recovered after GitHub's services came back online.

These runners were registered before the incident, lost their registration during GitHub's degradation, and ARC never reconciled them back to a healthy state.

This was resolved by manually deleting all of the stuck ARC runner pods.

Describe the expected behavior

ARC has no garbage collection loop that reconciles GitHub-side runner registrations against actual Kubernetes pod state. The EphemeralRunnerReconciler only handles the forward path (create secret -> create pod -> monitor pod). It does not:

  1. Periodically verify that a runner's GitHub registration is still valid
  2. Detect runners whose registration was invalidated by a GitHub-side incident
  3. Clean up and re-provision runners that are running but no longer recognized by GitHub

I suggest adding a periodic health check to the EphemeralRunnerReconciler that verifies that the GitHub-side registration is still valid for running EphemeralRunners.
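A minimal sketch of the decision such a health check might make, not ARC's actual code. `RunnerPod`, `RegistrationCheck`, `FindStale`, and `ErrGitHubUnavailable` are hypothetical names introduced for illustration. The sketch assumes the controller can ask GitHub whether a given runner ID is still registered, and it treats a failed lookup as "unknown" rather than "stale", so that an outage on GitHub's side does not cause healthy runners to be deleted.

```go
package main

import "errors"

// ErrGitHubUnavailable is a hypothetical sentinel meaning the registration
// lookup could not be answered (for example, during an outage).
var ErrGitHubUnavailable = errors.New("github api unavailable")

// RunnerPod is a hypothetical, simplified view of a running EphemeralRunner pod.
type RunnerPod struct {
	Name     string // Kubernetes pod name
	RunnerID int64  // runner ID assigned by GitHub at registration time
}

// RegistrationCheck reports whether GitHub still recognizes the runner ID.
// A non-nil error means the answer is unknown, not that the runner is gone.
type RegistrationCheck func(runnerID int64) (bool, error)

// FindStale splits pods into those GitHub definitively no longer recognizes
// (stale) and those whose lookup failed (unknown). Only stale pods are
// candidates for cleanup and re-provisioning; unknown pods should be
// re-checked on the next pass.
func FindStale(pods []RunnerPod, check RegistrationCheck) (stale, unknown []RunnerPod) {
	for _, p := range pods {
		registered, err := check(p.RunnerID)
		switch {
		case err != nil:
			unknown = append(unknown, p)
		case !registered:
			stale = append(stale, p)
		}
	}
	return stale, unknown
}
```

In a reconciler this could run on a fixed requeue interval, with the stale pods handed to the existing cleanup path. The interval and the lookup mechanism are left open here.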

Additional Context

1. GitHub's `api.github.com/actions/runner-registration` endpoint became unavailable.
2. The ARC `EphemeralRunnerReconciler` failed to generate JIT configs for new runners, retrying 5 times with backoff before giving up.
3. Already-running runners lost their registrations on the GitHub side (`Registration <uuid> was not found`).
4. Runner pods entered `BrokerServer` backoff loops, unable to communicate with `broker.actions.githubusercontent.com`.
5. **After GitHub recovered at ~23:55 UTC, errors and backoff warnings continued until at least 00:17 UTC**, over 20 minutes of lingering failures. Runners that were mid-registration during the incident remained in a bad state with no automatic recovery.

Controller Logs

Related logs:
https://gist.github.com/rob-howie-depop/27b15fd387ffc5f8f36e838614ffefc0

Runner Pod Logs

Related logs:
https://gist.github.com/rob-howie-depop/27b15fd387ffc5f8f36e838614ffefc0

Metadata

Labels

bug (Something isn't working), gha-runner-scale-set (Related to the gha-runner-scale-set mode), needs triage (Requires review from the maintainers)
