fix(status-reconciler): fail closed on config load errors and add operational metrics by miltalex · Pull Request #628 · kubernetes-sigs/prow

miltalex · 2026-02-20T11:20:59Z

What this PR changes

Makes status-reconciler fail closed on bad config/state loads:
- corrupted saved status now fails Load()
- config-agent load failures are propagated via error handlers
- on load failure, controller is marked unhealthy and stops reconciling
Adds status-reconciler metrics:
- status_reconciler_loaded_presubmit_count (Gauge)
- status_reconciler_contexts_retired_total{org,repo} (Counter)
Refactors config-agent wiring for simplicity:
- ConfigOptions.ConfigAgent(...) now uses options:
  - WithErrorHandler(...)
  - WithAdditionals(...)
  - WithReuseAgent(...)
- migrates call sites in status-reconciler and deck
Keeps health wiring explicit with ServeLive(...):
- status-reconciler uses ServeLive(c.Healthy) for liveness checks
- other binaries keep the existing explicit ServeLive() approach
Keeps status-reconciler liveness/readiness probes in starter manifests.

Outcome

Status-reconciler no longer reconciles from failed/corrupted loads.
On load failures, liveness/metrics clearly signal degraded state and reconciliation is stopped.

Fixes #540

Signed-off-by: Miltiadis Alexis <alexmiltiadis@gmail.com>

netlify · 2026-02-20T11:21:04Z

✅ Deploy Preview for k8s-prow ready!

Name	Link
🔨 Latest commit	`642b6eb`
🔍 Latest deploy log	https://app.netlify.com/projects/k8s-prow/deploys/69b18af0e4bee50008349f91
😎 Deploy Preview	https://deploy-preview-628--k8s-prow.netlify.app
📱 Preview on mobile	Toggle QR Code... Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

k8s-ci-robot · 2026-02-20T11:21:09Z

Hi @miltalex. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

petr-muller · 2026-02-20T21:11:49Z

/ok-to-test

Prucek · 2026-02-23T15:10:13Z

+	if countPresubmits(delta.Before.PresubmitsStatic) > 0 && countPresubmits(delta.After.PresubmitsStatic) == 0 {
+		return fmt.Errorf("refusing to reconcile config update because all presubmits disappeared")
+	}
+


I think this is only one part of the story. Try to add the other recomendations from: #540

Status reconciler should be fixed so never actuate unless it has a provably good config loaded. If its dynamic config loading is failing (like it was in the example above), it should cease reconciling, and strongly signal, through metrics at least, that it is not actuating right now. We may consider flipping its liveness probe endpoint to false, to signal kubernetes that the pod is not healthy and cause it to restart.

I pushed one commit to setup liveness/readiness for status-reconciler, and flipping it to false when the config Agent is failing. I am looking into pushing another commit to add metrics, since currently there are no metrics will try to cover only metrics related to this functionality. Happy to adjust/rework based on your feedback.

Signed-off-by: Miltiadis Alexis <alexmiltiadis@gmail.com>

k8s-ci-robot · 2026-02-26T13:39:22Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: miltalex
Once this PR has been reviewed and has the lgtm label, please assign smg247 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: Miltiadis Alexis <alexmiltiadis@gmail.com>

petr-muller · 2026-02-27T18:17:26Z

@Prucek I'm leaving this one on you, feel free to ping me when this needs approved

Prucek

It is going in the right direction, but we need to improve:

the healthiness handling- I don't think the config agent should handle it. This is a tricky one: the error comes from the configAgent (,"msg":"Error loading config."), but it comes from a goroutine, so we can't handle it directly in the code. I think we could add a callback onError func(error) to the func (ca *Agent) Start and then a func (o *ConfigOptions) ConfigAgentWithErrorHandling(onError func(error), reuse ...*config.Agent) and have a onError method in the controller, that will flip the healthiness to false. WDYT?
unit tests
the metrics - still missing
Also, could you please update the description to reflect what was done in this PR?

Prucek · 2026-03-04T08:13:02Z

+            port: 8081
+          initialDelaySeconds: 10
+          periodSeconds: 3
+          timeoutSeconds: 600


I think 600s is too long for a health check

Prucek · 2026-03-04T08:38:53Z

+
+	livenessLock   sync.RWMutex
+	livenessChecks []LivenessCheck


This is not needed

Prucek · 2026-03-04T08:41:54Z

+	h := &Health{healthMux: http.NewServeMux()}
+	h.healthMux.HandleFunc("/healthz", h.serveLive)
+	server := &http.Server{Addr: ":" + strconv.Itoa(port), Handler: h.healthMux}
 	interrupts.ListenAndServe(server, 5*time.Second)
-	return &Health{
-		healthMux: healthMux,
-	}
+	return h


I wouldn't make changes here

Prucek · 2026-03-04T08:44:29Z

@@ -66,3 +78,17 @@ func (h *Health) ServeReady(readinessChecks ...ReadinessCheck) {
 		fmt.Fprint(w, "OK")
 	})
 }
+
+func (h *Health) serveLive(w http.ResponseWriter, r *http.Request) {
+	h.livenessLock.RLock()
+	livenessChecks := append([]LivenessCheck(nil), h.livenessChecks...)
+	h.livenessLock.RUnlock()
+	for _, livenessCheck := range livenessChecks {
+		if !livenessCheck() {
+			w.WriteHeader(http.StatusServiceUnavailable)
+			fmt.Fprint(w, "LivenessCheck failed")
+			return
+		}
+	}
+	fmt.Fprint(w, "OK")
+}


This is correct, but I would reregister the /healthz endpoint in a single ServeLive method and add the same logic as ServeReady, but with the LivenessCheck
Then this can be called like:

// in a controller var healthy atomic.Bool healthy.Store(true) health := pjutil.NewHealth() health.ServeReady() health.ServeLive(healthy.Load) // passes the method as the check func // anywhere in your code: healthy.Store(false) // liveness probe now returns 503 healthy.Store(true) // recovers

Thank you very much for the tip, Indeed this is much cleaner

Prucek · 2026-03-04T08:44:59Z

I think tests are not needed here.

Prucek · 2026-03-04T08:45:20Z

These tests are also not needed

Prucek · 2026-03-04T08:46:46Z

 	}
 }

+func TestControllerReconcileRefusesDroppingAllPresubmits(t *testing.T) {


Check other tests in this file, and look at how we write unit tests. Usually, you don't want to write just a single test

Prucek · 2026-03-04T08:49:05Z

 type statusClient interface {
 	Load() (chan config.Delta, error)
 	Save() error
+	Healthy() bool


This does not belong here. If you change the code according to the comment in pkg/pjutil/health.go, everything will be simplified.

stevekuznetsov · 2026-03-06T18:23:35Z

+}
+
 func (c *Controller) reconcile(delta config.Delta, log *logrus.Entry) error {
+	if countPresubmits(delta.Before.PresubmitsStatic) > 0 && countPresubmits(delta.After.PresubmitsStatic) == 0 {


This seems like the wrong place - we should not transmit if we cannot load, or if the load is corrupted. You would risk not catching real deletes with this logic. The go-unavailable-on-failed-load logic mentioned below is a good idea, as well as root-causing why corrupted loads are sending deltas.

Thank you very much @stevekuznetsov . I moved the logic to the Load instead

Signed-off-by: Miltiadis Alexis <alexmiltiadis@gmail.com>

…isappear Signed-off-by: Miltiadis Alexis <alexmiltiadis@gmail.com>

Prucek · 2026-03-11T08:22:17Z

+	return ca.StartWatchWithErrorHandler(nil, prowConfig, jobConfig, supplementalProwConfigDirs, supplementalProwConfigsFileNameSuffix, additionals...)
+}
+
+// StartWatchWithErrorHandler behaves like StartWatch and calls onError on config load failures.
+func (ca *Agent) StartWatchWithErrorHandler(onError func(error), prowConfig, jobConfig string, supplementalProwConfigDirs []string, supplementalProwConfigsFileNameSuffix string, additionals ...func(*Config) error) error {


Is the change to StartWatch needed? It is not used anywhere

Prucek · 2026-03-11T08:24:16Z

 		statusClient: sc,
 	}
+	controller.setHealthy(true)
+	controller.setActuationEnabled(false)


I don't understand the purpose of this variable; mostly, it is set to the same as healthy. Could you explain?

Removed it, was trying to measure when the app was indeed not healthy. Initially when I started was trying to calculate the actuation of each config.

Prucek · 2026-03-11T08:25:20Z


 	pprof.Instrument(o.instrumentationOptions)
 	health := pjutil.NewHealthOnPort(o.instrumentationOptions.HealthPort)
+	health.ServeLive()


If other components would like to add live health checks, they can do that on a follow-up PR. I wouldn't put it here.

As discussed in slack, this is due to the changes in the NewHealthOnPort func

Prucek · 2026-03-11T08:39:16Z

+	return o.ConfigAgentWithAdditionalsAndErrorHandling(ca, additionals, nil)
+}
+
+// ConfigAgentWithAdditionalsAndErrorHandling starts the config agent with custom additionals and optional error handling.
+func (o *ConfigOptions) ConfigAgentWithAdditionalsAndErrorHandling(ca *config.Agent, additionals []func(*config.Config) error, onError func(error)) (*config.Agent, error) {
+	return ca, ca.StartWithErrorHandler(onError, o.ConfigPath, o.JobConfigPath, o.SupplementalProwConfigDirs.Strings(), o.SupplementalProwConfigsFileNameSuffix, additionals...)


I think we are adding too many methods here. Also, the name gets pretty long 😄
I'd suggest something like:

type ConfigAgentOption func(*configAgentOptions) type configAgentOptions struct { onError func(error) additionals []func(*config.Config) error reuse *config.Agent } func WithErrorHandler(fn func(error)) ConfigAgentOption { return func(o *configAgentOptions) { o.onError = fn } } func WithAdditionals(a ...func(*config.Config) error) ConfigAgentOption { return func(o *configAgentOptions) { o.additionals = a } } func WithReuseAgent(a *config.Agent) ConfigAgentOption { return func(o *configAgentOptions) { o.reuse = a } }

Then we could just do:

agent, err := opts.ConfigAgent() agent, err := opts.ConfigAgent( WithErrorHandler(handleErr), ) agent, err := opts.ConfigAgent( WithReuseAgent(existing), WithAdditionals(extraFn), )

Thank you very much I find this indeed much cleaner

Prucek · 2026-03-11T08:59:58Z

+var statusReconcilerMetrics = struct {
+	controllerHealthy prometheus.Gauge
+	actuationEnabled  prometheus.Gauge


I think it does not make sense to store healthyness to metrics. With the metrics, we want to measure some value over time. This could be:

status_reconciler_loaded_presubmit_count — Gauge
Set to len(config.PresubmitsStatic) each time a Delta is received.

status_reconciler_contexts_retired_total{org, repo} — Counter
Incremented in retireRemovedContexts() for each context actually retired

Thanks tbh I wasnt sure what kind of metrics we needed to expose

…bmits disappear Signed-off-by: Miltiadis Alexis <alexmiltiadis@gmail.com>

Prucek · 2026-03-17T09:59:19Z

/test pull-prow-integration
/lgtm
/label tide/merge-method-squash

@miltalex could you update the PR title and description? Otherwise, it looks OK.

miltalex · 2026-03-19T07:29:28Z

@Prucek thank you very much for the review. I have updated the title and description, I kept the convo fix(status-reconciler) since the changes are mostly targeted to that package.

fix(status-reconciler): reject reconcile when all presubmits disappear

325eb27

Signed-off-by: Miltiadis Alexis <alexmiltiadis@gmail.com>

k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 20, 2026

k8s-ci-robot added the area/status-reconciler Issues or PRs related to reconciling status when jobs change label Feb 20, 2026

k8s-ci-robot requested review from cjwagner and stevekuznetsov February 20, 2026 11:21

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 20, 2026

k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 20, 2026

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 20, 2026

miltalex marked this pull request as ready for review February 21, 2026 13:33

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 21, 2026

k8s-ci-robot requested a review from matthyx February 21, 2026 13:33

Prucek reviewed Feb 23, 2026

View reviewed changes

feat(status-reconciler): enable liveness check

f4b4e7b

Signed-off-by: Miltiadis Alexis <alexmiltiadis@gmail.com>

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 26, 2026

fixup! feat(status-reconciler): enable liveness check

0f266e8

Signed-off-by: Miltiadis Alexis <alexmiltiadis@gmail.com>

Prucek suggested changes Mar 4, 2026

View reviewed changes

k8s-ci-robot assigned Prucek Mar 4, 2026

k8s-ci-robot added the area/prowcm Issues or PRs related to prow's controller manager component label Mar 6, 2026

miltalex force-pushed the fix/status-reconciler branch from a2e47fb to 8cd377f Compare March 6, 2026 18:16

stevekuznetsov reviewed Mar 6, 2026

View reviewed changes

fixup! fixup! feat(status-reconciler): enable liveness check

2828ed9

Signed-off-by: Miltiadis Alexis <alexmiltiadis@gmail.com>

miltalex force-pushed the fix/status-reconciler branch from 8cd377f to 2828ed9 Compare March 6, 2026 18:42

fixup! fix(status-reconciler): reject reconcile when all presubmits d…

0ae4572

…isappear Signed-off-by: Miltiadis Alexis <alexmiltiadis@gmail.com>

miltalex requested a review from Prucek March 10, 2026 17:49

Prucek reviewed Mar 11, 2026

View reviewed changes

fixup! fixup! fix(status-reconciler): reject reconcile when all presu…

642b6eb

…bmits disappear Signed-off-by: Miltiadis Alexis <alexmiltiadis@gmail.com>

k8s-ci-robot added tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Mar 17, 2026

miltalex changed the title ~~fix(status-reconciler): reject reconcile when all presubmits disappear~~ fix(status-reconciler): fail closed on config load errors and add operational metrics Mar 17, 2026

Conversation

miltalex commented Feb 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this PR changes

Outcome

Uh oh!

netlify Bot commented Feb 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Deploy Preview for k8s-prow ready!

Uh oh!

k8s-ci-robot commented Feb 20, 2026

Uh oh!

petr-muller commented Feb 20, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

k8s-ci-robot commented Feb 26, 2026

Uh oh!

petr-muller commented Feb 27, 2026

Uh oh!

Prucek left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Prucek commented Mar 17, 2026

Uh oh!

miltalex commented Mar 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

miltalex commented Feb 20, 2026 •

edited

Loading

netlify Bot commented Feb 20, 2026 •

edited

Loading