Fix watcher restart race that drops mirrors for new namespaces#645
Conversation
The NamespaceWatcher and resource watchers (ConfigMap/Secret) run as independent BackgroundServices but share the same ResourceMirror singleton. When a resource watcher session closes, Handle(WatcherClosed) was clearing _autoSources, _propertiesCache, and _namespaceCache. But the NamespaceWatcher keeps running — namespace-added events arriving during the reconnection window would iterate an empty _autoSources and silently skip mirror creation. Fix: preserve _autoSources, _propertiesCache, and _namespaceCache across resource watcher restarts. Only clear relationship-tracking caches (_autoReflectionCache, _directReflectionCache, _notFoundCache) which get rebuilt from event replay. Stale entries are harmless: the replay overwrites them, and ResourceReflect always re-fetches the source from the API when sourceObj is null. Also separate namespace watcher close handling (only clears namespace cache) from resource watcher close handling, and fix a KeyNotFoundException in the namespace handler when _propertiesCache is stale. Additionally thread CancellationToken through OnResourceWithNameList, OnResourceGet, TryResourceGet, and ResourceReflect to prevent silent consumer stalls when Kubernetes API calls hang.
|
Good catch, but there's an issue: we can't keep the stale ones because it's not just reconnects, it's sometimes permissions issues (IaC or other mechanisms) and it can't immediately reconnect. During reconcilliation having the state properties results in reflections being created in stale locations (no longer allowed). We need those cleared. We need to treate the namespace watcher differently so caches are reset based on the watchers that were reset. Also the singleton kinda has to stay because we have users with A LOT of namespaces (thousands) and having a second watcher would kill k8s API. I'll have to dig into this a bit more during this or the next weekend. But if you wish to continue your work on this, it would be appreciated, just please take into consideration the notes above. |
Thank you, will do. We are actively testing this fix in our environment and would love your input as you did above. I will make the change as you said and see how that changes the PR. |
on resource watcher restart. This means label-selector checks (`CanBeReflectedToNamespaceCached`, `CanBeAutoReflectedToNamespaceCached`) work correctly during the resource replay instead of failing closed.
|
Reverted some of the initial changes and mainly left these three changes:
Does this look like simple enough change to get in first before any larger change on your side? |
|
Thanks for sticking with this — the rework looks good. Walked through both paths to make sure I had the right mental model: The namespace-watcher-close turned out to be the load-bearing bug. That path was wiping Also a nice grab on Couple of small things that are right next door to what you've already changed — would you mind tacking them on while you're in there? 1. Same
Same pattern as your line 131 fix ( 2. Wrap the consumer loop in Happy to merge once those are in. Appreciate the work either way. |
Use GetOrAdd for direct reflection cache updates and skip auto-reflection processing when source properties are missing. Also catch expected OperationCanceledException in watcher event processing during shutdown.
Thank you, made the extra changes, they make sense to me as well. I ended up wrapping the whole Appreciate your detailed reviews! |
We keep trying to fix a problem with the reflector where it seems like it stops mirroring configmaps into new namespaces.
The NamespaceWatcher and resource watchers (ConfigMap/Secret) run as independent BackgroundServices but share the same ResourceMirror singleton. When a resource watcher session closes, Handle(WatcherClosed) was clearing
_autoSources,_propertiesCache, and_namespaceCache. But the NamespaceWatcher keeps running — namespace-added events arriving during the reconnection window would iterate an empty _autoSources and silently skip mirror creation.Fix: preserve
_autoSources,_propertiesCache, and_namespaceCacheacross resource watcher restarts. Only clear relationship-tracking caches (_autoReflectionCache,_directReflectionCache,_notFoundCache) which get rebuilt from event replay. Stale entries are harmless: the replay overwrites them, and ResourceReflect always re-fetches the source from the API when sourceObj is null.Also separate namespace watcher close handling (only clears namespace cache) from resource watcher close handling, and fix a KeyNotFoundException in the namespace handler when
_propertiesCacheis stale.Additionally thread
CancellationTokenthroughOnResourceWithNameList,OnResourceGet,TryResourceGet, andResourceReflectto prevent silent consumer stalls when Kubernetes API calls hang.