
Fix watcher restart race that drops mirrors for new namespaces #645

Merged
winromulus merged 3 commits into emberstack:main from seatgeek:fix/preserve-cache-across-watcher-restarts
May 8, 2026

Conversation

@komapa (Contributor) commented May 4, 2026

We keep trying to fix a problem with the reflector where it seems like it stops mirroring configmaps into new namespaces.

The NamespaceWatcher and resource watchers (ConfigMap/Secret) run as independent BackgroundServices but share the same ResourceMirror singleton. When a resource watcher session closes, Handle(WatcherClosed) was clearing _autoSources, _propertiesCache, and _namespaceCache. But the NamespaceWatcher keeps running — namespace-added events arriving during the reconnection window would iterate an empty _autoSources and silently skip mirror creation.

Fix: preserve _autoSources, _propertiesCache, and _namespaceCache across resource watcher restarts. Only clear relationship-tracking caches (_autoReflectionCache, _directReflectionCache, _notFoundCache) which get rebuilt from event replay. Stale entries are harmless: the replay overwrites them, and ResourceReflect always re-fetches the source from the API when sourceObj is null.

Also separate namespace watcher close handling (only clears namespace cache) from resource watcher close handling, and fix a KeyNotFoundException in the namespace handler when _propertiesCache is stale.

Additionally thread CancellationToken through OnResourceWithNameList, OnResourceGet, TryResourceGet, and ResourceReflect to prevent silent consumer stalls when Kubernetes API calls hang.
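A minimal C# sketch of the cache-clearing split described above. Field and handler names follow the PR description, not the actual ResourceMirror source, and the fields are public only to keep the sketch self-contained:

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical sketch: which caches survive which watcher restart.
class ResourceMirrorSketch
{
    // State caches: preserved across resource-watcher restarts so
    // namespace-added events arriving during a reconnection window
    // still find their sources.
    public readonly ConcurrentDictionary<string, bool> _autoSources = new();
    public readonly ConcurrentDictionary<string, string> _propertiesCache = new();
    public readonly ConcurrentDictionary<string, bool> _namespaceCache = new();

    // Relationship-tracking caches: safe to clear, rebuilt from event replay.
    public readonly ConcurrentDictionary<string, string> _autoReflectionCache = new();
    public readonly ConcurrentDictionary<string, string> _directReflectionCache = new();
    public readonly ConcurrentDictionary<string, bool> _notFoundCache = new();

    // Resource (ConfigMap/Secret) watcher session closed: clear only the
    // relationship caches; the initial-list replay overwrites stale entries.
    public void OnResourceWatcherClosed()
    {
        _autoReflectionCache.Clear();
        _directReflectionCache.Clear();
        _notFoundCache.Clear();
    }

    // Namespace watcher session closed: it only owns the namespace cache.
    public void OnNamespaceWatcherClosed()
    {
        _namespaceCache.Clear();
    }
}
```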

@winromulus (Contributor) commented:
Good catch, but there's an issue: we can't keep the stale ones because it's not just reconnects; sometimes it's permissions issues (IaC or other mechanisms) and it can't immediately reconnect. During reconciliation, having the state properties results in reflections being created in stale locations (no longer allowed). We need those cleared. We need to treat the namespace watcher differently so caches are reset based on the watchers that were reset. Also, the singleton kinda has to stay because we have users with A LOT of namespaces (thousands) and having a second watcher would kill the k8s API.

I'll have to dig into this a bit more during this or the next weekend. But if you wish to continue your work on this, it would be appreciated, just please take into consideration the notes above.

@komapa (Contributor, Author) commented May 4, 2026

> Good catch, but there's an issue: we can't keep the stale ones because it's not just reconnects; sometimes it's permissions issues (IaC or other mechanisms) and it can't immediately reconnect. During reconciliation, having the state properties results in reflections being created in stale locations (no longer allowed). We need those cleared. We need to treat the namespace watcher differently so caches are reset based on the watchers that were reset. Also, the singleton kinda has to stay because we have users with A LOT of namespaces (thousands) and having a second watcher would kill the k8s API.
>
> I'll have to dig into this a bit more during this or the next weekend. But if you wish to continue your work on this, it would be appreciated, just please take into consideration the notes above.

Thank you, will do. We are actively testing this fix in our environment and would welcome more input like the above. I will make the changes you suggested and see how that reshapes the PR.

on resource watcher restart. This means label-selector checks
(`CanBeReflectedToNamespaceCached`,
`CanBeAutoReflectedToNamespaceCached`) work correctly during the
resource replay instead of failing closed.
@komapa (Contributor, Author) commented May 4, 2026

Reverted some of the initial changes and mainly left these three changes:

  1. Addition of _directReflectionCache.Clear();
  2. Conditional clearing of the namespace cache
  3. Cancellation tokens

Does this look like a simple enough change to get in first, before any larger change on your side?

@winromulus (Contributor) commented:
Thanks for sticking with this — the rework looks good. Walked through both paths to make sure I had the right mental model:

The namespace-watcher-close turned out to be the load-bearing bug. That path was wiping _autoSources/_propertiesCache (which it doesn't own), and since the resource watcher sits idle in steady state there was nothing to repopulate them until the next hourly timeout — which lines up with the "stops mirroring for an hour" symptom we've been hearing about. Resource-watcher-close already self-heals via AutoReflectionForSource on its initial-list replay, so that part was a bit of a red herring in the original description, but the fix itself is on point.

Also a nice grab on _directReflectionCache.Clear() — that one was never cleared, ever. Quiet leak across every restart.

Couple of small things that are right next door to what you've already changed — would you mind tacking them on while you're in there?

1. Same TryGetValue defense at the remaining indexer sites. Your fix at line 131 closes one occurrence of the KeyNotFoundException window, but the same shape exists at a few other spots and it'd be nice to close them in one go:

  • AutoReflectionForSource line 412: var sourceProperties = _propertiesCache[sourceNsName];
  • HandleUpsert direct-reflection branch lines 328 / 335: _directReflectionCache[sourceNsName] after TryAdd

Same pattern as your line 131 fix (TryGetValue + continue/return depending on context).
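The guarded-lookup shape being suggested, sketched in C# (names are illustrative, not the reflector's exact code):

```csharp
using System;
using System.Collections.Concurrent;

// Sketch of the TryGetValue defense: a bare indexer read throws
// KeyNotFoundException when the cache entry is stale or missing;
// TryGetValue lets the handler skip that source instead.
class PropertiesLookupSketch
{
    // Public only to keep the sketch self-contained.
    public readonly ConcurrentDictionary<string, string> _propertiesCache = new();

    public bool TryGetSourceProperties(string sourceNsName, out string? sourceProperties)
    {
        // Before: var sourceProperties = _propertiesCache[sourceNsName];  // may throw
        // After: fail soft; the caller can continue/return depending on context.
        return _propertiesCache.TryGetValue(sourceNsName, out sourceProperties);
    }
}
```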

2. Wrap the consumer loop in WatcherBackgroundService with OCE handling. Now that cancellation tokens flow through the API calls (good change), an OperationCanceledException from a cancelled call can escape up into the consumer Task.Run at WatcherBackgroundService.cs:55-68. Only fires at session shutdown, but it'll silently fault that task. A try { ... } catch (OperationCanceledException) { /* expected on session shutdown */ } around the inner while body should cover it.
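A hedged sketch of that wrapping; the loop shape is illustrative, not the exact WatcherBackgroundService code:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Sketch of review item 2: catch the OperationCanceledException that a
// cancelled Kubernetes API call can throw out of the consumer loop, so the
// consumer task completes cleanly instead of silently faulting.
static class ConsumerLoopSketch
{
    public static async Task ConsumeAsync(
        Func<CancellationToken, Task> processEvent,
        CancellationToken sessionToken)
    {
        try
        {
            while (!sessionToken.IsCancellationRequested)
            {
                await processEvent(sessionToken);
            }
        }
        catch (OperationCanceledException)
        {
            // Expected on session shutdown: the token threaded through the
            // API calls surfaced here; swallow it and let the task complete.
        }
    }
}
```

Wrapping the inner body or the whole loop gives the same effect; either way cancellation surfaces once and the task ends without faulting.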

Happy to merge once those are in. Appreciate the work either way.

Use GetOrAdd for direct reflection cache updates and skip
auto-reflection processing when source properties are missing.
Also catch expected OperationCanceledException in watcher event
processing during shutdown.
@komapa (Contributor, Author) commented May 7, 2026


Thank you, made the extra changes, they make sense to me as well. I ended up wrapping the whole while loop, I hope that is okay as well.

Appreciate your detailed reviews!

@winromulus (Contributor) left a comment

LGTM

@winromulus winromulus merged commit 97bdf5d into emberstack:main May 8, 2026
4 checks passed
