Reconnect Redis clients on network interface change #5816
When the host's active network interface changes (e.g. Ethernet -> Wi-Fi), existing TCP sockets end up half-open: the kernel accepts writes into oblivion and commands on cached clients hang indefinitely.
Fix has three parts:
- Enable TCP SO_KEEPALIVE (5s idle) on both ioredis and node-redis clients so dead sockets eventually surface an error on their own.
- Add NetworkChangeMonitor that polls os.networkInterfaces() every 2s and, when the routable interface set changes, drops every cached Redis client so the next command opens a fresh socket on the currently-active interface instead of reusing a stale one.
- Map runtime Redis client errors (EADDRNOTAVAIL, ECONNRESET, Socket closed unexpectedly, etc.) to RedisConnection* exceptions (HTTP 424) in catchAclError so the UI's connectivityErrorsInterceptor shows the "Connection Lost" banner instead of a generic 500.
Before: unplugging Ethernet and immediately clicking Refresh spun for the full HTTP request timeout before any error surfaced.
After: the monitor drops the cached client within ~2s of the interface change; the next request reconnects cleanly over Wi-Fi.
Made-with: Cursor
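As a rough illustration of the keepalive part: as far as I know, both drivers end up calling setKeepAlive on the underlying Node net.Socket, so the effect of a 5000 ms setting can be sketched directly against that API. The helper names enableKeepAlive and createKeepAliveSocket are hypothetical and are not taken from this PR.

```typescript
import * as net from "node:net";

// Hypothetical helper: turn on TCP keepalive with an initial idle delay.
// Mirrors what a `keepAlive: 5000` driver option is assumed to do internally.
export function enableKeepAlive(
  socket: net.Socket,
  initialDelayMs = 5000,
): net.Socket {
  // net.Socket#setKeepAlive(enable, initialDelay) returns the socket itself.
  return socket.setKeepAlive(true, initialDelayMs);
}

// Hypothetical factory: an unconnected socket with keepalive pre-configured.
// The setting is applied once the socket actually connects.
export function createKeepAliveSocket(initialDelayMs = 5000): net.Socket {
  return enableKeepAlive(new net.Socket(), initialDelayMs);
}
```

Keepalive only bounds how long a dead socket can go unnoticed; it does not make an in-flight command fail immediately.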
🛡️ Jit Security Scan Results: ✅ No security findings were detected in this PR
Security scan by Jit
Code Coverage - Backend unit tests
Test suite run success. 3380 tests passing in 306 suites. Report generated by 🧪jest coverage report action from 555da66
Code Coverage - Integration Tests
Code Coverage - Frontend unit tests
Test suite run success. 6651 tests passing in 781 suites. Report generated by 🧪jest coverage report action from 555da66
I'm not sure this is a proper fix, and I'm not sure we need to monitor for network interface changes at all. It looks like a workaround. I feel like the proper solution is
Good push-back @ArtemHoruzhenko :) The short answer - there's no TCP keep-alive (which this PR enables) is the closest built-in signal, but Node only exposes the initial delay, not the probe interval (default ~75 s between probes), so detection takes over a minute. So the real options are:
They're complementary: (1) is a cheap fast-path, (2) is the general safety net.
Can't user actions be a ping-like activity? We don't need to introduce another mechanism.
The base branch was changed.
Yup, I hear what you say, and in theory it makes sense. However, the problem is that in this bug, the user action is the "ping" - and that's exactly what hangs for 25 s. Walking through the failure:
So "use the user's action as the ping" is literally the current behavior - and it's what produces the 25 s hang. It only works as a detection mechanism if the request itself returns an error quickly, but on a half-open socket it doesn't - the kernel swallows the write silently. For user actions to be a viable signal, we'd need either:
I was also not 100% confident in this change, so we can postpone it and discuss more, or rework it 🙂
Closing this one in favor of a minimal change, #5848. Instead of adding this "real-time" polling, let's leave it to The trade-off is that after unplugging the cable, the user has to wait up to ~10 s before the first Refresh click succeeds (vs sub-second with the monitor approach). But I think that is acceptable for our case, because nobody realistically unplugs the cable and instantly hits Refresh. By the time they reach for the mouse, the socket has already failed and the next click opens a fresh connection.
What
Fixes a bug where the UI spun on a loading indicator after the host switched network interfaces (e.g. Ethernet → Wi-Fi). Clicking Refresh immediately after unplugging the cable would hang for the full HTTP request timeout before any error surfaced; subsequent clicks worked.
Before
Screen.Recording.2026-04-22.at.10.44.11.mov
After
Screen.Recording.2026-04-22.at.10.45.09.mov
Root cause
Cached Redis clients kept TCP sockets that became half-open the moment the OS tore down the old interface. Writes on those sockets are silently queued by the kernel — ioredis/node-redis never see an error, so queued commands sit until something else times out.
Fix
Three layers, backend-only:
- redis_clients.keepAlive: 5000 is now passed to both ioredis and node-redis connection strategies. Gives the OS a chance to surface a dead socket on its own.
- NetworkChangeMonitor (redisinsight/api/src/modules/redis/network-change.monitor.ts): polls os.networkInterfaces() every 2 s, computes a stable signature (excluding loopback + IPv6 link-local), and when the interface set changes, calls RedisClientStorage.removeAll(). The next request opens a fresh socket on the currently-active interface. Cross-platform, no native deps, tunable via RI_CLIENTS_NETWORK_WATCHER* env vars.
- catchAclError now converts runtime driver errors (EADDRNOTAVAIL, ECONNRESET, Socket closed unexpectedly, reconnect backoff exhausted, …) into RedisConnection* exceptions (HTTP 424), which the existing UI connectivityErrorsInterceptor already handles to show the "Connection Lost" banner instead of a generic 500.
Existing HTTP requestTimeout (25 s) remains the single per-request cap.
Testing
- (RI_CLIENTS_KEEP_ALIVE=0) → still works because the monitor handles it.
- (RI_CLIENTS_NETWORK_WATCHER=false) → reverts to old behavior.
Made with Cursor
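The interface-polling layer described in the Fix section can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code in network-change.monitor.ts: the names AddressInfo, interfaceSignature and InterfaceChangeWatcher are hypothetical, and the onChange callback stands in for the call to RedisClientStorage.removeAll().

```typescript
import * as os from "node:os";

// Minimal shape of one address entry; os.networkInterfaces() entries satisfy it.
interface AddressInfo {
  address: string;
  family: string | number; // "IPv4"/"IPv6"; a few Node 18 releases reported 4/6
  internal: boolean; // true for loopback addresses
}
type InterfaceMap = Record<string, AddressInfo[] | undefined>;

// fe80::/10 is the IPv6 link-local range.
function isIpv6LinkLocal(info: AddressInfo): boolean {
  const family = String(info.family);
  if (family !== "IPv6" && family !== "6") return false;
  const firstGroup = parseInt(info.address.split(":")[0] ?? "", 16);
  return (firstGroup & 0xffc0) === 0xfe80;
}

// Order-independent signature of the routable addresses.
export function interfaceSignature(interfaces: InterfaceMap): string {
  const entries: string[] = [];
  for (const [name, addresses] of Object.entries(interfaces)) {
    for (const info of addresses ?? []) {
      if (info.internal || isIpv6LinkLocal(info)) continue;
      entries.push(`${name}|${info.family}|${info.address}`);
    }
  }
  return entries.sort().join(",");
}

export class InterfaceChangeWatcher {
  private lastSignature: string;
  private timer: ReturnType<typeof setInterval> | undefined;
  private readonly onChange: () => void;
  private readonly read: () => InterfaceMap;
  private readonly intervalMs: number;

  constructor(
    onChange: () => void, // assumed to drop every cached client
    read: () => InterfaceMap = () => os.networkInterfaces(),
    intervalMs = 2000,
  ) {
    this.onChange = onChange;
    this.read = read;
    this.intervalMs = intervalMs;
    this.lastSignature = interfaceSignature(read());
  }

  // One poll: returns true and fires onChange if the signature changed.
  check(): boolean {
    const current = interfaceSignature(this.read());
    if (current === this.lastSignature) return false;
    this.lastSignature = current;
    this.onChange();
    return true;
  }

  start(): void {
    if (this.timer === undefined) {
      this.timer = setInterval(() => this.check(), this.intervalMs);
    }
  }

  stop(): void {
    if (this.timer !== undefined) clearInterval(this.timer);
    this.timer = undefined;
  }
}
```

Sorting the entries keeps the signature stable when the OS reports interfaces in a different order, so only a real change in the address set triggers the callback.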
Note
Medium Risk
Touches Redis connection setup and runtime error handling, which can affect connectivity and user-facing error behavior if misclassified; also introduces a background polling monitor that clears all cached clients on interface churn.
Overview
Prevents hangs after host network switches by enabling TCP keepalive for both ioredis and node-redis clients (new redis_clients.keepAlive, default 5000, configurable via RI_CLIENTS_KEEP_ALIVE).
Adds a NetworkChangeMonitor (configurable via RI_CLIENTS_NETWORK_WATCHER*) that polls os.networkInterfaces() and calls RedisClientStorage.removeAll() when the routable interface set changes, forcing fresh sockets on next use.
Updates error handling so runtime driver/socket disconnects are recognized and re-thrown as RedisConnection* (HTTP 424) via wrapRedisRuntimeConnectionError, improving UI handling; includes new/expanded unit tests for keepalive passthrough, removeAll, interface-change monitoring, and runtime error mapping.
Reviewed by Cursor Bugbot for commit 555da66. Bugbot is set up for automated code reviews on this repo.
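The runtime error mapping summarized above could look roughly like the following sketch. It is an assumption-laden illustration and not the PR's wrapRedisRuntimeConnectionError: the class ConnectionLostError and the functions isRuntimeConnectionError and toConnectionError are hypothetical names, and only the error codes and message quoted in this PR are matched.

```typescript
// Hypothetical stand-in for the RedisConnection* exception family (HTTP 424).
export class ConnectionLostError extends Error {
  readonly status = 424;
  readonly original: unknown;

  constructor(message: string, original: unknown) {
    super(message);
    this.name = "ConnectionLostError";
    this.original = original;
    // Keeps instanceof working even when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Socket-level codes and driver messages named in this PR.
const CONNECTION_ERROR_CODES = new Set(["EADDRNOTAVAIL", "ECONNRESET"]);
const CONNECTION_ERROR_MESSAGES = ["Socket closed unexpectedly"];

export function isRuntimeConnectionError(err: unknown): boolean {
  if (!(err instanceof Error)) return false;
  const code = (err as Error & { code?: unknown }).code;
  if (typeof code === "string" && CONNECTION_ERROR_CODES.has(code)) return true;
  const message = err.message;
  return CONNECTION_ERROR_MESSAGES.some((text) => message.includes(text));
}

// Wraps connection-level failures; any other error is returned unchanged.
export function toConnectionError(err: unknown): unknown {
  if (!isRuntimeConnectionError(err)) return err;
  return new ConnectionLostError((err as Error).message, err);
}
```

Returning unrelated errors unchanged matters here: misclassifying, say, a permission error as a lost connection would show the wrong banner in the UI.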