
Reconnect Redis clients on network interface change #5816

Closed

valkirilov wants to merge 1 commit into main from be/bugfix/reconnect-on-network-change

Conversation

@valkirilov
Member

@valkirilov valkirilov commented Apr 22, 2026

What

Fixes a bug where the UI hung on a loading indicator after the host switched network interfaces (e.g. Ethernet → Wi-Fi). Clicking Refresh immediately after unplugging the cable would hang for the full HTTP request timeout before any error surfaced; subsequent clicks worked.

Before

Screen.Recording.2026-04-22.at.10.44.11.mov

After

Screen.Recording.2026-04-22.at.10.45.09.mov

Root cause

Cached Redis clients kept TCP sockets that became half-open the moment the OS tore down the old interface. Writes on those sockets are silently queued by the kernel — ioredis/node-redis never see an error, so queued commands sit until something else times out.

Fix

Three layers, backend-only:

  1. TCP keep-alive — redis_clients.keepAlive: 5000 is now passed to both the ioredis and node-redis connection strategies, giving the OS a chance to surface a dead socket on its own.
  2. NetworkChangeMonitor (redisinsight/api/src/modules/redis/network-change.monitor.ts) — polls os.networkInterfaces() every 2 s, computes a stable signature (excluding loopback + IPv6 link-local), and, when the interface set changes, calls RedisClientStorage.removeAll(). The next request opens a fresh socket on the currently-active interface. Cross-platform, no native deps, tunable via RI_CLIENTS_NETWORK_WATCHER* env vars. (A sketch of the core loop follows below.)
  3. 424 error mapping — catchAclError now converts runtime driver errors (EADDRNOTAVAIL, ECONNRESET, Socket closed unexpectedly, reconnect backoff exhausted, …) into RedisConnection* exceptions (HTTP 424), which the existing UI connectivityErrorsInterceptor already handles to show the "Connection Lost" banner instead of a generic 500.

Existing HTTP requestTimeout (25 s) remains the single per-request cap.
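
For reference, a minimal sketch of the monitor's core loop (item 2 above) — the 2 s interval, the loopback/link-local exclusions, and removeAll() follow the description; the helper names and the injected storage handle are illustrative, not the actual code in network-change.monitor.ts:

```ts
import * as os from 'os';

const POLL_INTERVAL_MS = 2_000; // the 2 s poll described above

// Stable signature of the routable interface set: skip loopback and IPv6
// link-local addresses so they don't trigger spurious client resets.
function interfaceSignature(): string {
  const parts: string[] = [];
  for (const [name, addrs] of Object.entries(os.networkInterfaces())) {
    for (const addr of addrs ?? []) {
      if (addr.internal) continue; // loopback
      if (addr.family === 'IPv6' && addr.address.startsWith('fe80')) continue; // link-local
      parts.push(`${name}|${addr.family}|${addr.address}`);
    }
  }
  return parts.sort().join(';');
}

// `clientStorage` stands in for the injected RedisClientStorage service.
declare const clientStorage: { removeAll(): Promise<void> };

let lastSignature = interfaceSignature();

setInterval(async () => {
  const current = interfaceSignature();
  if (current !== lastSignature) {
    lastSignature = current;
    // Interface set changed: drop every cached client so the next command
    // opens a fresh socket on the currently-active interface.
    await clientStorage.removeAll();
  }
}, POLL_INTERVAL_MS).unref(); // don't keep the process alive just to poll
```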

Testing

  1. Connect to a Redis DB via wired connection, browse keys → data loads.
  2. Unplug the cable, switch to Wi-Fi (with the Redis host reachable on the new network).
  3. Immediately click Refresh → should return data without a hang (previously spun for ~25 s).
  4. Same with keep-alive disabled (RI_CLIENTS_KEEP_ALIVE=0) → still works because the monitor handles it.
  5. Disable the monitor (RI_CLIENTS_NETWORK_WATCHER=false) → reverts to old behavior.

Made with Cursor


Note

Medium Risk
Touches Redis connection setup and runtime error handling, which can affect connectivity and user-facing error behavior if misclassified; also introduces a background polling monitor that clears all cached clients on interface churn.

Overview
Prevents hangs after host network switches by enabling TCP keepalive for both ioredis and node-redis clients (new redis_clients.keepAlive, default 5000, configurable via RI_CLIENTS_KEEP_ALIVE).

Adds a NetworkChangeMonitor (configurable via RI_CLIENTS_NETWORK_WATCHER*) that polls os.networkInterfaces() and calls RedisClientStorage.removeAll() when the routable interface set changes, forcing fresh sockets on next use.

Updates error handling so runtime driver/socket disconnects are recognized and re-thrown as RedisConnection* (HTTP 424) via wrapRedisRuntimeConnectionError, improving UI handling; includes new/expanded unit tests for keepalive passthrough, removeAll, interface-change monitoring, and runtime error mapping.
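
To illustrate, a rough shape of that runtime error mapping — hedged: wrapRedisRuntimeConnectionError and the error strings come from this PR's description, but the regex patterns and the exception class below are placeholders, not the actual code:

```ts
// Placeholder for the RedisConnection* exception family mentioned above,
// which Nest ultimately maps to HTTP 424.
class RedisConnectionUnavailableException extends Error {}

// Error shapes treated as runtime connection loss (list per the PR
// description; exact matching in the real code may differ).
const RUNTIME_CONNECTION_ERRORS: RegExp[] = [
  /EADDRNOTAVAIL/,
  /ECONNRESET/,
  /Socket closed unexpectedly/,
  /max retries per request/i, // reconnect backoff exhausted (ioredis-style wording)
];

function wrapRedisRuntimeConnectionError(error: Error): Error {
  const isConnectionLoss = RUNTIME_CONNECTION_ERRORS.some((re) =>
    re.test(error.message),
  );
  // Connection-loss errors become 424s that connectivityErrorsInterceptor
  // turns into the "Connection Lost" banner; everything else passes through.
  return isConnectionLoss
    ? new RedisConnectionUnavailableException(error.message)
    : error;
}
```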

Reviewed by Cursor Bugbot for commit 555da66. Bugbot is set up for automated code reviews on this repo.

When the host's active network interface changes (e.g. Ethernet -> Wi-Fi),
existing TCP sockets end up half-open: the kernel accepts writes into
oblivion and commands on cached clients hang indefinitely.

Fix has three parts:
- Enable TCP SO_KEEPALIVE (5s idle) on both ioredis and node-redis
  clients so dead sockets eventually surface an error on their own.
- Add NetworkChangeMonitor that polls os.networkInterfaces() every 2s
  and, when the routable interface set changes, drops every cached
  Redis client so the next command opens a fresh socket on the
  currently-active interface instead of reusing a stale one.
- Map runtime Redis client errors (EADDRNOTAVAIL, ECONNRESET, Socket
  closed unexpectedly, etc.) to RedisConnection* exceptions (HTTP 424)
  in catchAclError so the UI's connectivityErrorsInterceptor shows the
  "Connection Lost" banner instead of a generic 500.

Before: unplugging Ethernet and immediately clicking Refresh spun for
the full HTTP request timeout before any error surfaced.
After: the monitor drops the cached client within ~2s of the interface
change; the next request reconnects cleanly over Wi-Fi.

Made-with: Cursor
@valkirilov valkirilov requested a review from a team as a code owner April 22, 2026 07:55
@valkirilov valkirilov changed the base branch from main to release/3.4.2 April 22, 2026 07:56
@jit-ci

jit-ci Bot commented Apr 22, 2026

🛡️ Jit Security Scan Results


✅ No security findings were detected in this PR


Security scan by Jit

@valkirilov valkirilov self-assigned this Apr 22, 2026
@github-actions
Contributor

github-actions Bot commented Apr 22, 2026

Code Coverage - Backend unit tests

Status Category Percentage Covered / Total
🟢 Statements 92.56% 15430/16671
🟡 Branches 74.54% 4870/6533
🟢 Functions 86.5% 2396/2770
🟢 Lines 92.38% 14747/15963

Test suite run success

3380 tests passing in 306 suites.

Report generated by 🧪jest coverage report action from 555da66

@github-actions
Contributor

github-actions Bot commented Apr 22, 2026

Code Coverage - Integration Tests

Status Category Percentage Covered / Total
🟡 Statements 79.74% 17428/21854
🟡 Branches 62.18% 7995/12857
🟡 Functions 68.53% 2424/3537
🟡 Lines 79.3% 16387/20662

@valkirilov valkirilov added the run-all-tests (Trigger the CI to run the front-end, back-end and integration tests) label Apr 22, 2026
@github-actions
Contributor

Code Coverage - Frontend unit tests

Status Category Percentage Covered / Total
🟢 Statements 82.54% 23942/29007
🟡 Branches 67.7% 10078/14886
🟡 Functions 77.62% 6451/8311
🟢 Lines 83% 23419/28215

Test suite run success

6651 tests passing in 781 suites.

Report generated by 🧪jest coverage report action from 555da66

@ArtemHoruzhenko
Contributor

Not sure this is a proper fix. I'm not sure we must monitor for network interface changes. It looks like a workaround. I feel like the proper solution is connection lost + retry. Meaning that we must detect earlier that the connection was lost, and then retry. I think all this could be configured in the ioredis conf. Not sure. But if it is possible to go this way, we might cover all the cases of connection lost, not only when the interface changed.

@valkirilov
Member Author

valkirilov commented Apr 22, 2026

Not sure this is a proper fix. I'm not sure we must monitor for network interface changes. It looks like a workaround. I feel like the proper solution is connection lost + retry. Meaning that we must detect earlier that the connection was lost, and then retry. I think all this could be configured in the ioredis conf. Not sure. But if it is possible to go this way, we might cover all the cases of connection lost, not only when the interface changed.

Good push-back @ArtemHoruzhenko :)

The short answer - there's no ioredis setting that detects this. When the OS drops the old interface, the TCP socket stays in ESTABLISHED state, writes succeed into the kernel buffer, and ioredis never sees an error - so there's nothing for it to time out against.

TCP keep-alive (which this PR enables) is the closest built-in signal, but Node only exposes the initial delay, not the probe interval (default ~75 s between probes), so detection takes over a minute.

So the real options are:

  1. What this PR does - watch os.networkInterfaces() (a ~0.1 ms in-memory kernel read every 2 s) and drop cached clients when the interface set changes. Fixes the reported bug exactly, but doesn't help if the server dies with no interface change.

  2. Application-level heartbeat PING on idle connections, force reconnect on failure. Covers all "connection lost" cases (what you're asking for) but adds ping traffic, tuning, and false-positive reconnects on transient server blips. It's a bigger change and deserves its own PR.

They're complementary - (1) is a cheap fast-path, (2) is the general safety net.
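
For context, this is the only keep-alive knob Node exposes on a raw socket (standard net API; the host and port below are illustrative):

```ts
import * as net from 'net';

const socket = net.connect(6379, 'redis.example.com'); // illustrative target

// setKeepAlive takes only the *initial* idle delay before the first probe.
// The interval between subsequent probes and the probe count stay at OS
// defaults (~75 s between probes on Linux), which is why keep-alive alone
// needs over a minute to flag a dead socket.
socket.setKeepAlive(true, 5_000); // SO_KEEPALIVE on, first probe after 5 s idle
```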

dantovska previously approved these changes Apr 22, 2026
@ArtemHoruzhenko
Contributor

Not sure this is a proper fix. I'm not sure we must monitor for network interface changes. It looks like a workaround. I feel like the proper solution is connection lost + retry. Meaning that we must detect earlier that the connection was lost, and then retry. I think all this could be configured in the ioredis conf. Not sure. But if it is possible to go this way, we might cover all the cases of connection lost, not only when the interface changed.

Good push-back @ArtemHoruzhenko :)

The short answer - there's no ioredis setting that detects this. When the OS drops the old interface, the TCP socket stays in ESTABLISHED state, writes succeed into the kernel buffer, and ioredis never sees an error - so there's nothing for it to time out against.

TCP keep-alive (which this PR enables) is the closest built-in signal, but Node only exposes the initial delay, not the probe interval (default ~75 s between probes), so detection takes over a minute.

So the real options are:

  1. What this PR does - watch os.networkInterfaces() (a ~0.1 ms in-memory kernel read every 2 s) and drop cached clients when the interface set changes. Fixes the reported bug exactly, but doesn't help if the server dies with no interface change.

  2. Application-level heartbeat PING on idle connections, force reconnect on failure. Covers all "connection lost" cases (what you're asking for) but adds ping traffic, tuning, and false-positive reconnects on transient server blips. It's a bigger change and deserves its own PR.

They're complementary - (1) is a cheap fast-path, (2) is the general safety net.

Can't user actions be a ping-like activity? We don't need to introduce another mechanism.

@valkirilov valkirilov changed the base branch from release/3.4.2 to main April 22, 2026 12:16
@valkirilov valkirilov dismissed stale reviews from ArtemHoruzhenko and dantovska April 22, 2026 12:16

The base branch was changed.

@valkirilov
Member Author

Can't user actions be a ping-like activity? We don't need to introduce another mechanism.

Yup, I hear what you're saying, and in theory it makes sense. However, the problem is that in this bug, the user action is the "ping" - and that's exactly what hangs for 25 s.

Walking through the failure:

  1. User unplugs Ethernet.
  2. User clicks Refresh - ioredis writes the command to the cached socket.
  3. The socket still looks ESTABLISHED to the kernel, so the write succeeds into the send buffer. No error is returned.
  4. The packet is queued for retransmission on a dead route and never acked.
  5. ioredis has no way to time this out, so it waits. The HTTP requestTimeout fires at 25 s.
  6. Now ioredis knows the socket is bad, reconnects, and the next click works.

So "use the user's action as the ping" is literally the current behavior - and it's what produces the 25 s hang. It only works as a detection mechanism if the request itself returns an error quickly, but on a half-open socket it doesn't - the kernel swallows the write silently.

For user actions to be a viable signal, we'd need either:

  • A short per-command timeout (commandTimeout ~2–3 s) so a hung write surfaces quickly - but that breaks long-running commands in CLI / Workbench / Profiler, and I don't think that's OK;
  • or something proactive that runs before the user clicks, so the stale client is already gone. That's what NetworkChangeMonitor does (via a free kernel signal) and what a heartbeat PING would do (via a Redis round-trip) - see the sketch after this list.
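
To make the heartbeat option concrete, a minimal sketch with ioredis — the interval, timeout, and reconnect handling here are illustrative tuning, not a committed design:

```ts
import Redis from 'ioredis';

const HEARTBEAT_INTERVAL_MS = 10_000; // illustrative, would need tuning
const HEARTBEAT_TIMEOUT_MS = 2_000; // a bare PING would hang on a half-open socket

// Race the PING against a timer: writes to a half-open socket succeed into
// the kernel buffer and never error on their own, so the PING itself can hang.
function pingWithTimeout(client: Redis, ms: number): Promise<void> {
  return Promise.race([
    client.ping().then(() => undefined),
    new Promise<void>((_, reject) =>
      setTimeout(() => reject(new Error('heartbeat timeout')), ms).unref(),
    ),
  ]);
}

function startHeartbeat(client: Redis): NodeJS.Timeout {
  return setInterval(async () => {
    try {
      await pingWithTimeout(client, HEARTBEAT_TIMEOUT_MS);
    } catch {
      // Dead or hung socket: drop it; the real design would then recreate
      // the client (or lean on the driver's reconnect strategy).
      client.disconnect();
    }
  }, HEARTBEAT_INTERVAL_MS);
}
```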

I was also not 100% confident about this change, so we can postpone it and discuss it more, or rework it 🙂

@valkirilov
Member Author

Closing this one in favor of a minimal change #5848.

Instead of adding this "real-time" polling, let's leave it to ioredis/node-redis to handle the lifecycle of the TCP connection.

The trade-off is that after unplugging the cable, the user has to wait up to ~10 s before the first Refresh click succeeds (vs sub-second with the monitor approach).

But I think that's acceptable for our case, because nobody realistically unplugs the cable and instantly hits Refresh. By the time they reach for the mouse, the socket has already failed and the next click opens a fresh connection.

@valkirilov valkirilov closed this Apr 30, 2026

Labels

AI-Made · run-all-tests (Trigger the CI to run the front-end, back-end and integration tests)
