
fix(status-reconciler): fail closed on config load errors and add operational metrics #628

Open

miltalex wants to merge 6 commits into kubernetes-sigs:main from miltalex:fix/status-reconciler

Conversation

@miltalex (Contributor) commented Feb 20, 2026

What this PR changes

  • Makes status-reconciler fail closed on bad config/state loads:

    • corrupted saved status now fails Load()
    • config-agent load failures are propagated via error handlers
    • on load failure, controller is marked unhealthy and stops reconciling
  • Adds status-reconciler metrics:

    • status_reconciler_loaded_presubmit_count (Gauge)
    • status_reconciler_contexts_retired_total{org,repo} (Counter)
  • Refactors config-agent wiring for simplicity:

    • ConfigOptions.ConfigAgent(...) now accepts functional options (see the sketch after this list):
      • WithErrorHandler(...)
      • WithAdditionals(...)
      • WithReuseAgent(...)
    • migrates call sites in status-reconciler and deck
  • Keeps health wiring explicit with ServeLive(...):

    • status-reconciler uses ServeLive(c.Healthy) for liveness checks
    • other binaries keep the existing explicit ServeLive() approach
  • Keeps status-reconciler liveness/readiness probes in starter manifests.
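
For orientation, a minimal sketch of the resulting status-reconciler call site, assuming the names above (o.config, o.instrumentationOptions, and the configflagutil alias are illustrative, not the exact diff):

    var healthy atomic.Bool
    healthy.Store(true)

    health := pjutil.NewHealthOnPort(o.instrumentationOptions.HealthPort)
    health.ServeReady()
    health.ServeLive(healthy.Load) // /healthz returns 503 once healthy flips to false

    configAgent, err := o.config.ConfigAgent(
        configflagutil.WithErrorHandler(func(err error) {
            logrus.WithError(err).Error("Error loading config; marking controller unhealthy.")
            healthy.Store(false) // fail closed: stop reconciling and signal via liveness
        }),
    )
    if err != nil {
        logrus.WithError(err).Fatal("Error starting config agent.")
    }
    // configAgent is then handed to the controller as before.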

Outcome

  • Status-reconciler no longer reconciles from failed/corrupted loads.
  • On load failure, liveness and metrics clearly signal the degraded state, and reconciliation stops.

Fixes #540

Signed-off-by: Miltiadis Alexis <alexmiltiadis@gmail.com>
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 20, 2026
netlify (Bot) commented Feb 20, 2026

Deploy Preview for k8s-prow ready!

Name                  Link
🔨 Latest commit      642b6eb
🔍 Latest deploy log  https://app.netlify.com/projects/k8s-prow/deploys/69b18af0e4bee50008349f91
😎 Deploy Preview     https://deploy-preview-628--k8s-prow.netlify.app

@k8s-ci-robot k8s-ci-robot added the area/status-reconciler Issues or PRs related to reconciling status when jobs change label Feb 20, 2026
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 20, 2026
@k8s-ci-robot (Contributor) commented

Hi @miltalex. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 20, 2026
@petr-muller (Contributor) commented

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 20, 2026
@miltalex miltalex marked this pull request as ready for review February 21, 2026 13:33
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 21, 2026
@k8s-ci-robot k8s-ci-robot requested a review from matthyx February 21, 2026 13:33
Comment thread pkg/statusreconciler/controller.go Outdated
Comment on lines +186 to +189
if countPresubmits(delta.Before.PresubmitsStatic) > 0 && countPresubmits(delta.After.PresubmitsStatic) == 0 {
    return fmt.Errorf("refusing to reconcile config update because all presubmits disappeared")
}

Member

I think this is only one part of the story. Try to add the other recommendations from #540:

Status reconciler should be fixed so it never actuates unless it has a provably good config loaded. If its dynamic config loading is failing (like it was in the example above), it should cease reconciling and strongly signal, through metrics at least, that it is not actuating right now. We may consider flipping its liveness probe endpoint to false, to signal Kubernetes that the pod is not healthy and cause it to restart.

Contributor Author

I pushed one commit to set up liveness/readiness for status-reconciler and to flip liveness to false when the config agent is failing. I am looking into pushing another commit to add metrics; since there are currently no metrics, I will try to cover only the metrics related to this functionality. Happy to adjust/rework based on your feedback.

Signed-off-by: Miltiadis Alexis <alexmiltiadis@gmail.com>
@k8s-ci-robot (Contributor) commented

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: miltalex
Once this PR has been reviewed and has the lgtm label, please assign smg247 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 26, 2026
Signed-off-by: Miltiadis Alexis <alexmiltiadis@gmail.com>
@petr-muller (Contributor) commented

@Prucek I'm leaving this one on you, feel free to ping me when this needs approval

@Prucek (Member) left a comment

It is going in the right direction, but we need to improve:

  • the healthiness handling: I don't think the config agent should handle it. This is a tricky one: the error comes from the config agent ("msg":"Error loading config."), but it comes from a goroutine, so we can't handle it directly in the code. I think we could add a callback onError func(error) to func (ca *Agent) Start, then a func (o *ConfigOptions) ConfigAgentWithErrorHandling(onError func(error), reuse ...*config.Agent), and have an onError method in the controller that flips the healthiness to false (a sketch follows this list). WDYT?
  • unit tests
  • the metrics: still missing
    Also, could you please update the description to reflect what was done in this PR?
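
A minimal sketch of that plumbing, under the assumptions above (the loadConfig helper and the fixed sleep are illustrative, not the agent's actual internals):

    // The watch goroutine reports load failures through an optional
    // callback instead of only logging them.
    func (ca *Agent) StartWatchWithErrorHandler(onError func(error), prowConfig, jobConfig string) error {
        go func() {
            for {
                if err := ca.loadConfig(prowConfig, jobConfig); err != nil { // loadConfig is hypothetical
                    logrus.WithError(err).Error("Error loading config.")
                    if onError != nil {
                        onError(err) // e.g. the controller's onError flips healthy to false
                    }
                }
                time.Sleep(time.Second)
            }
        }()
        return nil
    }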

port: 8081
initialDelaySeconds: 10
periodSeconds: 3
timeoutSeconds: 600
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think 600s is too long for a health check

Comment thread pkg/pjutil/health.go Outdated
Comment on lines +35 to +37

    livenessLock   sync.RWMutex
    livenessChecks []LivenessCheck
Member

This is not needed

Comment thread pkg/pjutil/health.go Outdated
Comment on lines +51 to +55
h := &Health{healthMux: http.NewServeMux()}
h.healthMux.HandleFunc("/healthz", h.serveLive)
server := &http.Server{Addr: ":" + strconv.Itoa(port), Handler: h.healthMux}
interrupts.ListenAndServe(server, 5*time.Second)
return &Health{
    healthMux: healthMux,
}
return h
Member

I wouldn't make changes here

Comment thread pkg/pjutil/health.go Outdated
Comment on lines +60 to +94
@@ -66,3 +78,17 @@ func (h *Health) ServeReady(readinessChecks ...ReadinessCheck) {
fmt.Fprint(w, "OK")
})
}

func (h *Health) serveLive(w http.ResponseWriter, r *http.Request) {
    h.livenessLock.RLock()
    livenessChecks := append([]LivenessCheck(nil), h.livenessChecks...)
    h.livenessLock.RUnlock()
    for _, livenessCheck := range livenessChecks {
        if !livenessCheck() {
            w.WriteHeader(http.StatusServiceUnavailable)
            fmt.Fprint(w, "LivenessCheck failed")
            return
        }
    }
    fmt.Fprint(w, "OK")
}
Member

This is correct, but I would re-register the /healthz endpoint in a single ServeLive method and add the same logic as ServeReady, but with the LivenessCheck; a sketch of that consolidation follows the snippet below. Then this can be called like:

 // in a controller
  var healthy atomic.Bool
  healthy.Store(true)

  health := pjutil.NewHealth()
  health.ServeReady()
  health.ServeLive(healthy.Load)   // passes the method as the check func

  // anywhere in your code:
  healthy.Store(false)  // liveness probe now returns 503
  healthy.Store(true)   // recovers
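
Under that suggestion, a consolidated ServeLive might look like the following sketch (mirroring ServeReady; an assumed shape, not the merged code):

    // ServeLive registers /healthz once and evaluates the supplied checks
    // on every request; with no checks it always reports OK, preserving
    // the previous behavior.
    func (h *Health) ServeLive(livenessChecks ...LivenessCheck) {
        h.healthMux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
            for _, check := range livenessChecks {
                if !check() {
                    w.WriteHeader(http.StatusServiceUnavailable)
                    fmt.Fprint(w, "LivenessCheck failed")
                    return
                }
            }
            fmt.Fprint(w, "OK")
        })
    }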

Contributor Author

Thank you very much for the tip. Indeed, this is much cleaner.

Comment thread pkg/pjutil/health_test.go Outdated
Member

I think tests are not needed here.

Member

These tests are also not needed

Comment thread pkg/statusreconciler/controller_test.go Outdated
}
}

func TestControllerReconcileRefusesDroppingAllPresubmits(t *testing.T) {
Member

Check other tests in this file and look at how we write unit tests. Usually, you don't want to write just a single test.

Comment thread pkg/statusreconciler/status.go Outdated
type statusClient interface {
    Load() (chan config.Delta, error)
    Save() error
    Healthy() bool
Member

This does not belong here. If you change the code according to the comment in pkg/pjutil/health.go, everything will be simplified.

@k8s-ci-robot k8s-ci-robot added area/deck Issues or PRs related to prow's deck component area/gangway Issues or PRs related to prow's gangway component area/ghproxy Issues or PRs related to prow's ghproxy component area/hook Issues or PRs related to prow's hook component area/moonraker Issues or PRs related to prow's moonraker component labels Mar 6, 2026
@k8s-ci-robot k8s-ci-robot added the area/prowcm Issues or PRs related to prow's controller manager component label Mar 6, 2026
@miltalex miltalex force-pushed the fix/status-reconciler branch from a2e47fb to 8cd377f Compare March 6, 2026 18:16
Comment thread pkg/statusreconciler/controller.go Outdated
}

func (c *Controller) reconcile(delta config.Delta, log *logrus.Entry) error {
    if countPresubmits(delta.Before.PresubmitsStatic) > 0 && countPresubmits(delta.After.PresubmitsStatic) == 0 {
Contributor

This seems like the wrong place: we should not transmit if we cannot load, or if the load is corrupted. You would risk not catching real deletes with this logic. The go-unavailable-on-failed-load logic mentioned below is a good idea, as is root-causing why corrupted loads are sending deltas.

Contributor Author

Thank you very much @stevekuznetsov. I moved the logic into Load() instead; a sketch of the idea follows.
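
For illustration, a fail-closed Load() along those lines (statusController, loadState, and the channel plumbing are assumptions; countPresubmits is from this PR):

    func (sc *statusController) Load() (chan config.Delta, error) {
        saved, err := sc.loadState() // loadState is hypothetical
        if err != nil {
            return nil, fmt.Errorf("failed to load saved status: %w", err)
        }
        // Fail closed: an empty presubmit set in a previously saved status
        // almost certainly means corruption, so produce no deltas from it.
        if countPresubmits(saved.PresubmitsStatic) == 0 {
            return nil, errors.New("saved status contains no presubmits, refusing to reconcile")
        }
        changes := make(chan config.Delta)
        // ...subscribe to config-agent deltas and forward them on changes...
        return changes, nil
    }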

Signed-off-by: Miltiadis Alexis <alexmiltiadis@gmail.com>
@miltalex miltalex force-pushed the fix/status-reconciler branch from 8cd377f to 2828ed9 Compare March 6, 2026 18:42
…isappear

Signed-off-by: Miltiadis Alexis <alexmiltiadis@gmail.com>
@miltalex miltalex requested a review from Prucek March 10, 2026 17:49
Comment thread pkg/config/agent.go Outdated
Comment on lines +311 to +315
    return ca.StartWatchWithErrorHandler(nil, prowConfig, jobConfig, supplementalProwConfigDirs, supplementalProwConfigsFileNameSuffix, additionals...)
}

// StartWatchWithErrorHandler behaves like StartWatch and calls onError on config load failures.
func (ca *Agent) StartWatchWithErrorHandler(onError func(error), prowConfig, jobConfig string, supplementalProwConfigDirs []string, supplementalProwConfigsFileNameSuffix string, additionals ...func(*Config) error) error {
Member

Is the change to StartWatch needed? It is not used anywhere.

Comment thread pkg/statusreconciler/controller.go Outdated
    statusClient: sc,
}
controller.setHealthy(true)
controller.setActuationEnabled(false)
Member

I don't understand the purpose of this variable; it is mostly set to the same value as healthy. Could you explain?

Contributor Author

Removed it; I was trying to measure when the app was indeed not healthy. Initially, I was trying to track the actuation of each config.

Comment thread cmd/admission/main.go

pprof.Instrument(o.instrumentationOptions)
health := pjutil.NewHealthOnPort(o.instrumentationOptions.HealthPort)
health.ServeLive()
Member

If other components would like to add live health checks, they can do that on a follow-up PR. I wouldn't put it here.

Contributor Author

As discussed in Slack, this is due to the changes in the NewHealthOnPort func.

Comment thread pkg/flagutil/config/config.go Outdated
Comment on lines +100 to +105
    return o.ConfigAgentWithAdditionalsAndErrorHandling(ca, additionals, nil)
}

// ConfigAgentWithAdditionalsAndErrorHandling starts the config agent with custom additionals and optional error handling.
func (o *ConfigOptions) ConfigAgentWithAdditionalsAndErrorHandling(ca *config.Agent, additionals []func(*config.Config) error, onError func(error)) (*config.Agent, error) {
    return ca, ca.StartWithErrorHandler(onError, o.ConfigPath, o.JobConfigPath, o.SupplementalProwConfigDirs.Strings(), o.SupplementalProwConfigsFileNameSuffix, additionals...)
Member

I think we are adding too many methods here. Also, the name gets pretty long 😄
I'd suggest something like:

type ConfigAgentOption func(*configAgentOptions)

type configAgentOptions struct {
    onError     func(error)
    additionals []func(*config.Config) error
    reuse       *config.Agent
}

func WithErrorHandler(fn func(error)) ConfigAgentOption {
    return func(o *configAgentOptions) {
        o.onError = fn
    }
}

func WithAdditionals(a ...func(*config.Config) error) ConfigAgentOption {
    return func(o *configAgentOptions) {
        o.additionals = a
    }
}

func WithReuseAgent(a *config.Agent) ConfigAgentOption {
    return func(o *configAgentOptions) {
        o.reuse = a
    }
}

Then we could just do:

agent, err := opts.ConfigAgent()

agent, err := opts.ConfigAgent(
    WithErrorHandler(handleErr),
)

agent, err := opts.ConfigAgent(
    WithReuseAgent(existing),
    WithAdditionals(extraFn),
)
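
One way the variadic ConfigAgent could apply these options (a sketch using the naming above; StartWithErrorHandler is the method this PR adds to config.Agent):

    func (o *ConfigOptions) ConfigAgent(opts ...ConfigAgentOption) (*config.Agent, error) {
        options := &configAgentOptions{}
        for _, opt := range opts {
            opt(options)
        }
        ca := options.reuse
        if ca == nil {
            ca = &config.Agent{} // fresh agent unless WithReuseAgent was given
        }
        return ca, ca.StartWithErrorHandler(options.onError, o.ConfigPath, o.JobConfigPath,
            o.SupplementalProwConfigDirs.Strings(), o.SupplementalProwConfigsFileNameSuffix,
            options.additionals...)
    }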

Contributor Author

Thank you very much, I find this indeed much cleaner.

Comment thread pkg/statusreconciler/metrics.go Outdated
Comment on lines +21 to +23
var statusReconcilerMetrics = struct {
    controllerHealthy prometheus.Gauge
    actuationEnabled  prometheus.Gauge
Member

I think it does not make sense to store healthiness in metrics. With metrics, we want to measure some value over time (a sketch follows the list). This could be:

  1. status_reconciler_loaded_presubmit_count — Gauge
    Set to len(config.PresubmitsStatic) each time a Delta is received.
  2. status_reconciler_contexts_retired_total{org, repo} — Counter
    Incremented in retireRemovedContexts() for each context actually retired.
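
A sketch of those two metrics with the standard Prometheus client (registration style is an assumption; the metric names are the ones suggested above):

    var (
        loadedPresubmits = prometheus.NewGauge(prometheus.GaugeOpts{
            Name: "status_reconciler_loaded_presubmit_count",
            Help: "Number of presubmits in the most recently loaded config.",
        })
        contextsRetired = prometheus.NewCounterVec(prometheus.CounterOpts{
            Name: "status_reconciler_contexts_retired_total",
            Help: "Total number of contexts retired, by org and repo.",
        }, []string{"org", "repo"})
    )

    func init() {
        prometheus.MustRegister(loadedPresubmits, contextsRetired)
    }

Usage would then be one line at each site: loadedPresubmits.Set(float64(countPresubmits(delta.After.PresubmitsStatic))) on each received Delta, and contextsRetired.WithLabelValues(org, repo).Inc() in retireRemovedContexts().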

Contributor Author

Thanks, tbh I wasn't sure what kind of metrics we needed to expose.

…bmits disappear

Signed-off-by: Miltiadis Alexis <alexmiltiadis@gmail.com>
@Prucek (Member) commented Mar 17, 2026

/test pull-prow-integration
/lgtm
/label tide/merge-method-squash

@miltalex could you update the PR title and description? Otherwise, it looks OK.

@k8s-ci-robot k8s-ci-robot added tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Mar 17, 2026
@miltalex miltalex changed the title fix(status-reconciler): reject reconcile when all presubmits disappear fix(status-reconciler): fail closed on config load errors and add operational metrics Mar 17, 2026
@miltalex (Contributor Author) commented

@Prucek thank you very much for the review. I have updated the title and description; I kept the fix(status-reconciler) prefix since the changes mostly target that package.


Labels

  • area/deck Issues or PRs related to prow's deck component
  • area/gangway Issues or PRs related to prow's gangway component
  • area/ghproxy Issues or PRs related to prow's ghproxy component
  • area/hook Issues or PRs related to prow's hook component
  • area/moonraker Issues or PRs related to prow's moonraker component
  • area/prowcm Issues or PRs related to prow's controller manager component
  • area/status-reconciler Issues or PRs related to reconciling status when jobs change
  • cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
  • lgtm "Looks good to me", indicates that a PR is ready to be merged.
  • ok-to-test Indicates a non-member PR verified by an org member that is safe to test.
  • size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
  • tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.


Development

Successfully merging this pull request may close these issues.

status-reconciler started retiring whole world when its configuration became corrupted
