Description
Checks
- I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
- I am using charts that are officially provided
Controller Version
0.13.1
Deployment Method
Helm
Checks
- This isn't a question or user support case (For Q&A and community support, go to Discussions).
- I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes
To Reproduce
This is not reproducible directly as it is the byproduct of an incident.
See https://www.githubstatus.com/incidents/g9j4tmfqdd09
Describe the bug
During the GitHub Actions degraded availability incident on 2026-03-05 (githubstatus.com/incidents/g9j4tmfqdd09), our ARC deployment experienced runners that became stuck in a bad state and were not automatically recovered after GitHub's services came back online.
These runners were registered before the incident, lost their registration during GitHub's degradation, and ARC never reconciled them back to a healthy state.
This was resolved by manually deleting all of the stuck ARC runner pods.
Describe the expected behavior
ARC has no garbage collection loop that reconciles GitHub-side runner registrations against actual Kubernetes pod state. The EphemeralRunnerReconciler only handles the forward path (create secret -> create pod -> monitor pod). It does not:
- Periodically verify that a runner's GitHub registration is still valid
- Detect runners whose registration was invalidated by a GitHub-side incident
- Clean up and re-provision runners that are running but no longer recognized by GitHub
I suggest adding a periodic health check to the EphemeralRunnerReconciler that verifies that the GitHub-side registration is still valid for running EphemeralRunners.
Additional Context
1. GitHub's `api.github.com/actions/runner-registration` endpoint became unavailable.
2. The ARC `EphemeralRunnerReconciler` failed to generate JIT configs for new runners, retrying 5 times with backoff before giving up.
3. Already-running runners lost their registrations on the GitHub side (`Registration <uuid> was not found`).
4. Runner pods entered `BrokerServer` backoff loops, unable to communicate with `broker.actions.githubusercontent.com`.
5. **After GitHub recovered at ~23:55 UTC, errors and backoff warnings continued until at least 00:17 UTC**, over 20 minutes of lingering failures. Runners that were mid-registration during the incident remained in a bad state with no automatic recovery.
Controller Logs
related logs:
https://gist.github.com/rob-howie-depop/27b15fd387ffc5f8f36e838614ffefc0
Runner Pod Logs
related logs:
https://gist.github.com/rob-howie-depop/27b15fd387ffc5f8f36e838614ffefc0