
Reconnect Redis clients on network interface change #5816

Closed

valkirilov wants to merge 1 commit into main from be/bugfix/reconnect-on-network-change

Conversation

@valkirilov
Member

@valkirilov valkirilov commented Apr 22, 2026

What

Fixes a bug where the UI hung on a loading indicator after the host switched network interfaces (e.g. Ethernet → Wi-Fi). Clicking Refresh immediately after unplugging the cable would hang for the full HTTP request timeout before any error surfaced; subsequent clicks worked.

Before

Screen.Recording.2026-04-22.at.10.44.11.mov

After

Screen.Recording.2026-04-22.at.10.45.09.mov

Root cause

Cached Redis clients kept TCP sockets that became half-open the moment the OS tore down the old interface. Writes on those sockets are silently queued by the kernel — ioredis/node-redis never see an error, so queued commands sit until something else times out.

Fix

Three layers, backend-only:

  1. TCP keep-alive — redis_clients.keepAlive: 5000 is now passed to both the ioredis and node-redis connection strategies, giving the OS a chance to surface a dead socket on its own.
  2. NetworkChangeMonitor (redisinsight/api/src/modules/redis/network-change.monitor.ts) — polls os.networkInterfaces() every 2 s, computes a stable signature (excluding loopback + IPv6 link-local), and, when the interface set changes, calls RedisClientStorage.removeAll(). The next request opens a fresh socket on the currently-active interface. Cross-platform, no native deps, tunable via RI_CLIENTS_NETWORK_WATCHER* env vars. (A sketch of the core loop follows below.)
  3. 424 error mapping — catchAclError now converts runtime driver errors (EADDRNOTAVAIL, ECONNRESET, Socket closed unexpectedly, reconnect backoff exhausted, …) into RedisConnection* exceptions (HTTP 424), which the existing UI connectivityErrorsInterceptor already handles to show the "Connection Lost" banner instead of a generic 500.

Existing HTTP requestTimeout (25 s) remains the single per-request cap.
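
For reference, a minimal sketch of the monitor's core loop (item 2 above) — the 2 s interval, the loopback/link-local exclusions, and removeAll() follow the description; the helper names and the injected storage handle are illustrative, not the actual code in network-change.monitor.ts:

```ts
import * as os from 'os';

const POLL_INTERVAL_MS = 2_000; // the 2 s poll described above

// Stable signature of the routable interface set: skip loopback and IPv6
// link-local addresses so they don't trigger spurious client resets.
function interfaceSignature(): string {
  const parts: string[] = [];
  for (const [name, addrs] of Object.entries(os.networkInterfaces())) {
    for (const addr of addrs ?? []) {
      if (addr.internal) continue; // loopback
      if (addr.family === 'IPv6' && addr.address.startsWith('fe80')) continue; // link-local
      parts.push(`${name}|${addr.family}|${addr.address}`);
    }
  }
  return parts.sort().join(';');
}

// `clientStorage` stands in for the injected RedisClientStorage service.
declare const clientStorage: { removeAll(): Promise<void> };

let lastSignature = interfaceSignature();

setInterval(async () => {
  const current = interfaceSignature();
  if (current !== lastSignature) {
    lastSignature = current;
    // Interface set changed: drop every cached client so the next command
    // opens a fresh socket on the currently-active interface.
    await clientStorage.removeAll();
  }
}, POLL_INTERVAL_MS).unref(); // don't keep the process alive just to poll
```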

Testing

  1. Connect to a Redis DB via wired connection, browse keys → data loads.
  2. Unplug the cable, switch to Wi-Fi (with the Redis host reachable on the new network).
  3. Immediately click Refresh → should return data without a hang (previously spun for ~25 s).
  4. Same with keep-alive disabled (RI_CLIENTS_KEEP_ALIVE=0) → still works because the monitor handles it.
  5. Disable the monitor (RI_CLIENTS_NETWORK_WATCHER=false) → reverts to old behavior.

Made with Cursor


Note

Medium Risk
Touches Redis connection setup and runtime error handling, which can affect connectivity and user-facing error behavior if misclassified; also introduces a background polling monitor that clears all cached clients on interface churn.

Overview
Prevents hangs after host network switches by enabling TCP keepalive for both ioredis and node-redis clients (new redis_clients.keepAlive, default 5000, configurable via RI_CLIENTS_KEEP_ALIVE).

Adds a NetworkChangeMonitor (configurable via RI_CLIENTS_NETWORK_WATCHER*) that polls os.networkInterfaces() and calls RedisClientStorage.removeAll() when the routable interface set changes, forcing fresh sockets on next use.

Updates error handling so runtime driver/socket disconnects are recognized and re-thrown as RedisConnection* (HTTP 424) via wrapRedisRuntimeConnectionError, improving UI handling; includes new/expanded unit tests for keepalive passthrough, removeAll, interface-change monitoring, and runtime error mapping.
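
To illustrate, a rough shape of that runtime error mapping — hedged: wrapRedisRuntimeConnectionError and the error strings come from this PR's description, but the regex patterns and the exception class below are placeholders, not the actual code:

```ts
// Placeholder for the RedisConnection* exception family mentioned above,
// which Nest ultimately maps to HTTP 424.
class RedisConnectionUnavailableException extends Error {}

// Error shapes treated as runtime connection loss (list per the PR
// description; exact matching in the real code may differ).
const RUNTIME_CONNECTION_ERRORS: RegExp[] = [
  /EADDRNOTAVAIL/,
  /ECONNRESET/,
  /Socket closed unexpectedly/,
  /max retries per request/i, // reconnect backoff exhausted (ioredis-style wording)
];

function wrapRedisRuntimeConnectionError(error: Error): Error {
  const isConnectionLoss = RUNTIME_CONNECTION_ERRORS.some((re) =>
    re.test(error.message),
  );
  // Connection-loss errors become 424s that connectivityErrorsInterceptor
  // turns into the "Connection Lost" banner; everything else passes through.
  return isConnectionLoss
    ? new RedisConnectionUnavailableException(error.message)
    : error;
}
```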

Reviewed by Cursor Bugbot for commit 555da66. Bugbot is set up for automated code reviews on this repo.

When the host's active network interface changes (e.g. Ethernet -> Wi-Fi),
existing TCP sockets end up half-open: the kernel accepts writes into
oblivion and commands on cached clients hang indefinitely.

Fix has three parts:
- Enable TCP SO_KEEPALIVE (5s idle) on both ioredis and node-redis
  clients so dead sockets eventually surface an error on their own.
- Add NetworkChangeMonitor that polls os.networkInterfaces() every 2s
  and, when the routable interface set changes, drops every cached
  Redis client so the next command opens a fresh socket on the
  currently-active interface instead of reusing a stale one.
- Map runtime Redis client errors (EADDRNOTAVAIL, ECONNRESET, Socket
  closed unexpectedly, etc.) to RedisConnection* exceptions (HTTP 424)
  in catchAclError so the UI's connectivityErrorsInterceptor shows the
  "Connection Lost" banner instead of a generic 500.

Before: unplugging Ethernet and immediately clicking Refresh spun for
the full HTTP request timeout before any error surfaced.
After: the monitor drops the cached client within ~2s of the interface
change; the next request reconnects cleanly over Wi-Fi.

Made-with: Cursor
@valkirilov valkirilov requested a review from a team as a code owner April 22, 2026 07:55
@valkirilov valkirilov changed the base branch from main to release/3.4.2 April 22, 2026 07:56
@jit-ci

jit-ci Bot commented Apr 22, 2026

🛡️ Jit Security Scan Results


✅ No security findings were detected in this PR


Security scan by Jit

@valkirilov valkirilov self-assigned this Apr 22, 2026
@github-actions
Contributor

github-actions Bot commented Apr 22, 2026

Code Coverage - Backend unit tests

Status Category Percentage Covered / Total
🟢 Statements 92.56% 15430/16671
🟡 Branches 74.54% 4870/6533
🟢 Functions 86.5% 2396/2770
🟢 Lines 92.38% 14747/15963

Test suite run success

3380 tests passing in 306 suites.

Report generated by 🧪jest coverage report action from 555da66

@github-actions
Contributor

github-actions Bot commented Apr 22, 2026

Code Coverage - Integration Tests

Status Category Percentage Covered / Total
🟡 Statements 79.74% 17428/21854
🟡 Branches 62.18% 7995/12857
🟡 Functions 68.53% 2424/3537
🟡 Lines 79.3% 16387/20662

@valkirilov valkirilov added the run-all-tests (Trigger the CI to run the front-end, back-end and integration tests) label Apr 22, 2026
@github-actions
Contributor

Code Coverage - Frontend unit tests

Status Category Percentage Covered / Total
🟢 Statements 82.54% 23942/29007
🟡 Branches 67.7% 10078/14886
🟡 Functions 77.62% 6451/8311
🟢 Lines 83% 23419/28215

Test suite run success

6651 tests passing in 781 suites.

Report generated by 🧪jest coverage report action from 555da66

@ArtemHoruzhenko
Contributor

Not sure this is a proper fix. I'm not sure we must monitor for network interface changes. It looks like a workaround. I feel like the proper solution is connection lost + retry. Meaning that we must detect earlier that the connection was lost, and then retry. I think all this could be configured in the ioredis conf. Not sure. But if it is possible to go this way, we might cover all the cases of connection lost, not only when the interface changed.

@valkirilov
Member Author

valkirilov commented Apr 22, 2026

Not sure this is a proper fix. I'm not sure we must monitor for network interface changes. It looks like a workaround. I feel like the proper solution is connection lost + retry. Meaning that we must detect earlier that the connection was lost, and then retry. I think all this could be configured in the ioredis conf. Not sure. But if it is possible to go this way, we might cover all the cases of connection lost, not only when the interface changed.

Good push-back @ArtemHoruzhenko :)

The short answer - there's no ioredis setting that detects this. When the OS drops the old interface, the TCP socket stays in ESTABLISHED state, writes succeed into the kernel buffer, and ioredis never sees an error - so there's nothing for it to time out against.

TCP keep-alive (which this PR enables) is the closest built-in signal, but Node only exposes the initial delay, not the probe interval (default ~75 s between probes), so detection takes over a minute.

So the real options are:

  1. What this PR does - watch os.networkInterfaces() (a ~0.1 ms in-memory kernel read every 2 s) and drop cached clients when the interface set changes. Fixes the reported bug exactly, but doesn't help if the server dies with no interface change.

  2. Application-level heartbeat PING on idle connections, force reconnect on failure. Covers all "connection lost" cases (what you're asking for) but adds ping traffic, tuning, and false-positive reconnects on transient server blips. It's a bigger change and deserves its own PR.

They're complementary - (1) is a cheap fast-path, (2) is the general safety net.
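
For context, this is the only keep-alive knob Node exposes on a raw socket (standard net API; the host and port below are illustrative):

```ts
import * as net from 'net';

const socket = net.connect(6379, 'redis.example.com'); // illustrative target

// setKeepAlive takes only the *initial* idle delay before the first probe.
// The interval between subsequent probes and the probe count stay at OS
// defaults (~75 s between probes on Linux), which is why keep-alive alone
// needs over a minute to flag a dead socket.
socket.setKeepAlive(true, 5_000); // SO_KEEPALIVE on, first probe after 5 s idle
```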

dantovska previously approved these changes Apr 22, 2026
@ArtemHoruzhenko
Contributor

Not sure this is a proper fix. I'm not sure we must monitor for network interface changes. It looks like a workaround. I feel like the proper solution is connection lost + retry. Meaning that we must detect earlier that the connection was lost, and then retry. I think all this could be configured in the ioredis conf. Not sure. But if it is possible to go this way, we might cover all the cases of connection lost, not only when the interface changed.

Good push-back @ArtemHoruzhenko :)

The short answer - there's no ioredis setting that detects this. When the OS drops the old interface, the TCP socket stays in ESTABLISHED state, writes succeed into the kernel buffer, and ioredis never sees an error - so there's nothing for it to time out against.

TCP keep-alive (which this PR enables) is the closest built-in signal, but Node only exposes the initial delay, not the probe interval (default ~75 s between probes), so detection takes over a minute.

So the real options are:

  1. What this PR does - watch os.networkInterfaces() (a ~0.1 ms in-memory kernel read every 2 s) and drop cached clients when the interface set changes. Fixes the reported bug exactly, but doesn't help if the server dies with no interface change.

  2. Application-level heartbeat PING on idle connections, force reconnect on failure. Covers all "connection lost" cases (what you're asking for) but adds ping traffic, tuning, and false-positive reconnects on transient server blips. It's a bigger change and deserves its own PR.

They're complementary - (1) is a cheap fast-path, (2) is the general safety net.

Can't user actions be a ping-like activity? We don't need to introduce another mechanism.

@valkirilov valkirilov changed the base branch from release/3.4.2 to main April 22, 2026 12:16
@valkirilov valkirilov dismissed stale reviews from ArtemHoruzhenko and dantovska April 22, 2026 12:16

The base branch was changed.

@valkirilov
Member Author

Can't user actions be a ping-like activity? We don't need to introduce another mechanism.

Yup, I hear what you're saying, and in theory it makes sense. However, the problem is that in this bug, the user action is the "ping" - and that's exactly what hangs for 25 s.

Walking through the failure:

  1. User unplugs Ethernet.
  2. User clicks Refresh - ioredis writes the command to the cached socket.
  3. The socket still looks ESTABLISHED to the kernel, so the write succeeds into the send buffer. No error is returned.
  4. The packet is queued for retransmission on a dead route and never acked.
  5. ioredis has no way to time this out, so it waits. The HTTP requestTimeout fires at 25 s.
  6. Now ioredis knows the socket is bad, reconnects, and the next click works.

So "use the user's action as the ping" is literally the current behavior - and it's what produces the 25 s hang. It only works as a detection mechanism if the request itself returns an error quickly, but on a half-open socket it doesn't - the kernel swallows the write silently.

For user actions to be a viable signal, we'd need either:

  • A short per-command timeout (commandTimeout ~2–3 s) so a hung write surfaces quickly - but that breaks long-running commands in CLI / Workbench / Profiler, and I don't think that's OK;
  • or something proactive that runs before the user clicks, so the stale client is already gone. That's what NetworkChangeMonitor does (via a free kernel signal) and what a heartbeat PING would do (via a Redis round-trip) - see the sketch after this list.
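
To make the heartbeat option concrete, a minimal sketch with ioredis — the interval, timeout, and reconnect handling here are illustrative tuning, not a committed design:

```ts
import Redis from 'ioredis';

const HEARTBEAT_INTERVAL_MS = 10_000; // illustrative, would need tuning
const HEARTBEAT_TIMEOUT_MS = 2_000; // a bare PING would hang on a half-open socket

// Race the PING against a timer: writes to a half-open socket succeed into
// the kernel buffer and never error on their own, so the PING itself can hang.
function pingWithTimeout(client: Redis, ms: number): Promise<void> {
  return Promise.race([
    client.ping().then(() => undefined),
    new Promise<void>((_, reject) =>
      setTimeout(() => reject(new Error('heartbeat timeout')), ms).unref(),
    ),
  ]);
}

function startHeartbeat(client: Redis): NodeJS.Timeout {
  return setInterval(async () => {
    try {
      await pingWithTimeout(client, HEARTBEAT_TIMEOUT_MS);
    } catch {
      // Dead or hung socket: drop it; the real design would then recreate
      // the client (or lean on the driver's reconnect strategy).
      client.disconnect();
    }
  }, HEARTBEAT_INTERVAL_MS);
}
```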

I was also not 100% confident about this change, so we can postpone it and discuss it more, or rework it 🙂

@valkirilov
Member Author

Closing this one in favor of a minimal change #5848.

Instead of adding this "real-time" polling, let's leave it to ioredis/node-redis to handle the lifecycle of the TCP connection.

The trade-off is that after unplugging the cable, the user has to wait up to ~10 s before the first Refresh click succeeds (vs sub-second with the monitor approach).

But I think that's acceptable for our case, because nobody realistically unplugs the cable and instantly hits Refresh. By the time they reach for the mouse, the socket has already failed and the next click opens a fresh connection.

@valkirilov valkirilov closed this Apr 30, 2026

Labels

AI-Made · run-all-tests (Trigger the CI to run the front-end, back-end and integration tests)
