fix(status-reconciler): fail closed on config load errors and add operational metrics #628
miltalex wants to merge 6 commits into kubernetes-sigs:main from …
Conversation
Signed-off-by: Miltiadis Alexis <alexmiltiadis@gmail.com>
✅ Deploy Preview for k8s-prow ready!
Hi @miltalex. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
if countPresubmits(delta.Before.PresubmitsStatic) > 0 && countPresubmits(delta.After.PresubmitsStatic) == 0 {
	return fmt.Errorf("refusing to reconcile config update because all presubmits disappeared")
}
I think this is only one part of the story. Try to add the other recommendations from: #540
Status reconciler should be fixed so it never actuates unless it has a provably good config loaded. If its dynamic config loading is failing (like it was in the example above), it should cease reconciling and strongly signal, through metrics at least, that it is not actuating right now. We may consider flipping its liveness probe endpoint to false, to signal Kubernetes that the pod is not healthy and cause it to restart.
I pushed one commit to set up liveness/readiness probes for status-reconciler, flipping liveness to false when the config agent is failing. I am looking into pushing another commit to add metrics; since there are currently no metrics, I will try to cover only the metrics related to this functionality. Happy to adjust/rework based on your feedback.
Signed-off-by: Miltiadis Alexis <alexmiltiadis@gmail.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: miltalex
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
Signed-off-by: Miltiadis Alexis <alexmiltiadis@gmail.com>
@Prucek I'm leaving this one on you, feel free to ping me when this needs …
Prucek left a comment:
It is going in the right direction, but we need to improve:

- the healthiness handling: I don't think the config agent should handle it. This is a tricky one: the error comes from the configAgent ("msg":"Error loading config."), but it comes from a goroutine, so we can't handle it directly in the code. I think we could add a callback onError func(error) to the func (ca *Agent) Start, and then a func (o *ConfigOptions) ConfigAgentWithErrorHandling(onError func(error), reuse ...*config.Agent), and have an onError method in the controller that will flip the healthiness to false. WDYT?
- unit tests
- the metrics: still missing

Also, could you please update the description to reflect what was done in this PR?
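For illustration, a minimal sketch of the callback shape being proposed here, assuming the agent's reload loop runs in a goroutine; the type layout, the startWatch name, and the reload cadence are all assumptions, not the final API:

package config

import "time"

// Sketch of the proposed onError callback: the agent reloads config in a
// goroutine, so load errors cannot be returned to the caller directly and
// are surfaced through the callback instead.
type Agent struct {
	onError func(error) // nil means log-only, as today
}

func (ca *Agent) startWatch(reload func() error) {
	go func() {
		for range time.Tick(time.Minute) { // hypothetical reload cadence
			if err := reload(); err != nil && ca.onError != nil {
				ca.onError(err) // e.g. the controller flips its healthy flag to false
			}
		}
	}()
}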
port: 8081
initialDelaySeconds: 10
periodSeconds: 3
timeoutSeconds: 600
I think 600s is too long for a health check
livenessLock   sync.RWMutex
livenessChecks []LivenessCheck
	h := &Health{healthMux: http.NewServeMux()}
	h.healthMux.HandleFunc("/healthz", h.serveLive)
	server := &http.Server{Addr: ":" + strconv.Itoa(port), Handler: h.healthMux}
	interrupts.ListenAndServe(server, 5*time.Second)
-	return &Health{
-		healthMux: healthMux,
-	}
+	return h
@@ -66,3 +78,17 @@ func (h *Health) ServeReady(readinessChecks ...ReadinessCheck) {
		fmt.Fprint(w, "OK")
	})
}

func (h *Health) serveLive(w http.ResponseWriter, r *http.Request) {
	h.livenessLock.RLock()
	livenessChecks := append([]LivenessCheck(nil), h.livenessChecks...)
	h.livenessLock.RUnlock()
	for _, livenessCheck := range livenessChecks {
		if !livenessCheck() {
			w.WriteHeader(http.StatusServiceUnavailable)
			fmt.Fprint(w, "LivenessCheck failed")
			return
		}
	}
	fmt.Fprint(w, "OK")
}
This is correct, but I would re-register the /healthz endpoint in a single ServeLive method and add the same logic as ServeReady, but with the LivenessChecks.
Then this can be called like:
// in a controller
var healthy atomic.Bool
healthy.Store(true)
health := pjutil.NewHealth()
health.ServeReady()
health.ServeLive(healthy.Load) // passes the method as the check func
// anywhere in your code:
healthy.Store(false) // liveness probe now returns 503
healthy.Store(true) // recovers
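As a sketch of what that suggested ServeLive could look like, mirroring the existing ServeReady shape; the Health struct layout and the LivenessCheck type are assumed from the diff above, not the merged code:

package pjutil

import (
	"fmt"
	"net/http"
)

// LivenessCheck reports whether the component is currently live.
type LivenessCheck func() bool

// Health is assumed to carry the mux shown in the diff above.
type Health struct {
	healthMux *http.ServeMux
}

// ServeLive registers /healthz and evaluates the supplied checks on every
// probe, exactly like ServeReady does for readiness checks.
func (h *Health) ServeLive(livenessChecks ...LivenessCheck) {
	h.healthMux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		for _, check := range livenessChecks {
			if !check() {
				w.WriteHeader(http.StatusServiceUnavailable)
				fmt.Fprint(w, "LivenessCheck failed")
				return
			}
		}
		fmt.Fprint(w, "OK")
	})
}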
Thank you very much for the tip, indeed this is much cleaner.
I think tests are not needed here.
	}
}

func TestControllerReconcileRefusesDroppingAllPresubmits(t *testing.T) {
Check other tests in this file, and look at how we write unit tests. Usually, you don't want to write just a single test.
type statusClient interface {
	Load() (chan config.Delta, error)
	Save() error
	Healthy() bool
This does not belong here. If you change the code according to the comment in pkg/pjutil/health.go, everything will be simplified.
force-pushed from a2e47fb to 8cd377f
}

func (c *Controller) reconcile(delta config.Delta, log *logrus.Entry) error {
	if countPresubmits(delta.Before.PresubmitsStatic) > 0 && countPresubmits(delta.After.PresubmitsStatic) == 0 {
This seems like the wrong place - we should not transmit if we cannot load, or if the load is corrupted. You would risk not catching real deletes with this logic. The go-unavailable-on-failed-load logic mentioned below is a good idea, as well as root-causing why corrupted loads are sending deltas.
Thank you very much @stevekuznetsov. I moved the logic to Load() instead.
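Roughly, the moved guard could look like this in the delta-forwarding path of Load(); forwardDeltas is a hypothetical helper name and the channel plumbing is a sketch rather than the PR's exact code, while countPresubmits is the helper from the diff above:

package statusreconciler

import (
	"github.com/sirupsen/logrus"

	"sigs.k8s.io/prow/pkg/config"
)

// countPresubmits sums presubmits across all repos (as in the diff above).
func countPresubmits(presubmits map[string][]config.Presubmit) int {
	count := 0
	for _, jobs := range presubmits {
		count += len(jobs)
	}
	return count
}

// forwardDeltas filters deltas where they are produced, instead of inside
// reconcile(): a delta that drops every presubmit is treated as a corrupted
// load and never reaches the reconciler.
func forwardDeltas(in <-chan config.Delta, log *logrus.Entry) <-chan config.Delta {
	out := make(chan config.Delta)
	go func() {
		defer close(out)
		for delta := range in {
			if countPresubmits(delta.Before.PresubmitsStatic) > 0 &&
				countPresubmits(delta.After.PresubmitsStatic) == 0 {
				log.Error("refusing to forward config delta: all presubmits disappeared")
				continue
			}
			out <- delta
		}
	}()
	return out
}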
Signed-off-by: Miltiadis Alexis <alexmiltiadis@gmail.com>
force-pushed from 8cd377f to 2828ed9
…isappear

Signed-off-by: Miltiadis Alexis <alexmiltiadis@gmail.com>
	return ca.StartWatchWithErrorHandler(nil, prowConfig, jobConfig, supplementalProwConfigDirs, supplementalProwConfigsFileNameSuffix, additionals...)
}

// StartWatchWithErrorHandler behaves like StartWatch and calls onError on config load failures.
func (ca *Agent) StartWatchWithErrorHandler(onError func(error), prowConfig, jobConfig string, supplementalProwConfigDirs []string, supplementalProwConfigsFileNameSuffix string, additionals ...func(*Config) error) error {
Is the change to StartWatch needed? It is not used anywhere
	statusClient: sc,
}
controller.setHealthy(true)
controller.setActuationEnabled(false)
I don't understand the purpose of this variable; mostly, it is set to the same as healthy. Could you explain?
Removed it; I was trying to measure when the app was indeed not healthy. Initially, when I started, I was trying to calculate the actuation of each config.
pprof.Instrument(o.instrumentationOptions)
health := pjutil.NewHealthOnPort(o.instrumentationOptions.HealthPort)
health.ServeLive()
If other components would like to add live health checks, they can do that on a follow-up PR. I wouldn't put it here.
As discussed in Slack, this is due to the changes in the NewHealthOnPort func.
	return o.ConfigAgentWithAdditionalsAndErrorHandling(ca, additionals, nil)
}

// ConfigAgentWithAdditionalsAndErrorHandling starts the config agent with custom additionals and optional error handling.
func (o *ConfigOptions) ConfigAgentWithAdditionalsAndErrorHandling(ca *config.Agent, additionals []func(*config.Config) error, onError func(error)) (*config.Agent, error) {
	return ca, ca.StartWithErrorHandler(onError, o.ConfigPath, o.JobConfigPath, o.SupplementalProwConfigDirs.Strings(), o.SupplementalProwConfigsFileNameSuffix, additionals...)
I think we are adding too many methods here. Also, the name gets pretty long 😄
I'd suggest something like:
type ConfigAgentOption func(*configAgentOptions)

type configAgentOptions struct {
	onError     func(error)
	additionals []func(*config.Config) error
	reuse       *config.Agent
}

func WithErrorHandler(fn func(error)) ConfigAgentOption {
	return func(o *configAgentOptions) {
		o.onError = fn
	}
}

func WithAdditionals(a ...func(*config.Config) error) ConfigAgentOption {
	return func(o *configAgentOptions) {
		o.additionals = a
	}
}

func WithReuseAgent(a *config.Agent) ConfigAgentOption {
	return func(o *configAgentOptions) {
		o.reuse = a
	}
}
Then we could just do:
agent, err := opts.ConfigAgent()
agent, err := opts.ConfigAgent(
WithErrorHandler(handleErr),
)
agent, err := opts.ConfigAgent(
WithReuseAgent(existing),
WithAdditionals(extraFn),
)
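And a sketch of how ConfigAgent could consume those options; the defaulting and the reuse handling here are assumptions, while StartWithErrorHandler is the method from the diff above:

// ConfigAgent applies the functional options, then starts the agent.
func (o *ConfigOptions) ConfigAgent(opts ...ConfigAgentOption) (*config.Agent, error) {
	parsed := &configAgentOptions{}
	for _, opt := range opts {
		opt(parsed)
	}
	agent := parsed.reuse
	if agent == nil {
		agent = &config.Agent{}
	}
	return agent, agent.StartWithErrorHandler(parsed.onError, o.ConfigPath, o.JobConfigPath,
		o.SupplementalProwConfigDirs.Strings(), o.SupplementalProwConfigsFileNameSuffix, parsed.additionals...)
}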
Thank you very much, I find this indeed much cleaner.
var statusReconcilerMetrics = struct {
	controllerHealthy prometheus.Gauge
	actuationEnabled  prometheus.Gauge
I think it does not make sense to store healthiness in metrics. With metrics, we want to measure some value over time. This could be:

- status_reconciler_loaded_presubmit_count (Gauge): set to len(config.PresubmitsStatic) each time a Delta is received.
- status_reconciler_contexts_retired_total{org, repo} (Counter): incremented in retireRemovedContexts() for each context actually retired.
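For concreteness, a minimal sketch of how those two metrics could be defined with client_golang; the variable names, Help strings, and default-registry registration are assumptions, not the PR's actual code:

package statusreconciler

import "github.com/prometheus/client_golang/prometheus"

var (
	loadedPresubmitCount = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "status_reconciler_loaded_presubmit_count",
		Help: "Number of presubmits in the most recently loaded config.",
	})
	contextsRetired = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "status_reconciler_contexts_retired_total",
		Help: "Contexts retired by the status reconciler, by org and repo.",
	}, []string{"org", "repo"})
)

func init() {
	prometheus.MustRegister(loadedPresubmitCount, contextsRetired)
}

// Per the suggestion: set the gauge on each received Delta, e.g.
//   loadedPresubmitCount.Set(float64(countPresubmits(delta.After.PresubmitsStatic)))
// and increment the counter in retireRemovedContexts(), e.g.
//   contextsRetired.WithLabelValues(org, repo).Inc()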
Thanks, tbh I wasn't sure what kind of metrics we needed to expose.
…bmits disappear

Signed-off-by: Miltiadis Alexis <alexmiltiadis@gmail.com>
/test pull-prow-integration

@miltalex could you update the PR title and description? Otherwise, it looks OK.
@Prucek thank you very much for the review. I have updated the title and description, I kept the convo …
What this PR changes

Makes status-reconciler fail closed on bad config/state loads: the guard now lives in Load().

Adds status-reconciler metrics:
- status_reconciler_loaded_presubmit_count (Gauge)
- status_reconciler_contexts_retired_total{org,repo} (Counter)

Refactors config-agent wiring for simplicity: ConfigOptions.ConfigAgent(...) now uses options (WithErrorHandler(...), WithAdditionals(...), WithReuseAgent(...)); status-reconciler and deck are updated to use them.

Keeps health wiring explicit with ServeLive(...): status-reconciler uses ServeLive(c.Healthy) for liveness checks; other components keep the plain ServeLive() approach.

Keeps status-reconciler liveness/readiness probes in starter manifests.
Outcome
Fixes #540