
[multistage] Reset GRPC mailbox channel backoff on server instance config changes#17892

Merged
Jackie-Jiang merged 2 commits into apache:master from dang-stripe:dang-server-grpc-connection-reset
Mar 24, 2026

Conversation

@dang-stripe
Contributor

@dang-stripe dang-stripe commented Mar 17, 2026

Problem

This addresses #17870, which is similar in shape to #17465, except that it resets gRPC connection backoff for server-to-server connections when a server is killed.

Solution

This adds a new ServerGrpcChannelBackoffResetHandler on servers that listens for instance config changes. On the broker side, the failure detector already handles this: it excludes servers from routing and adds them back when their connections become healthy.

Servers don't detect failures the way brokers do today, so the handler instead watches for instances transitioning from SHUTDOWN_IN_PROGRESS=true to SHUTDOWN_IN_PROGRESS=false. Even when a server crashes, it sets SHUTDOWN_IN_PROGRESS=true again on startup, so this transition reliably identifies servers that have just come back up healthy.
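The true-to-false transition check can be sketched as a small, Helix-free state tracker (BackoffResetTracker, its map-based state, and the plain boolean-flag input are illustrative simplifications, not the actual handler):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: track each server's SHUTDOWN_IN_PROGRESS flag across
// instance-config callbacks and report servers whose flag flipped
// true -> false, i.e. servers that just completed startup.
class BackoffResetTracker {
  private final Map<String, Boolean> _lastShutdownFlag = new HashMap<>();

  /** Returns instances that transitioned SHUTDOWN_IN_PROGRESS true -> false. */
  List<String> onConfigChange(Map<String, Boolean> currentFlags) {
    List<String> completedStartup = new ArrayList<>();
    for (Map.Entry<String, Boolean> entry : currentFlags.entrySet()) {
      Boolean previous = _lastShutdownFlag.put(entry.getKey(), entry.getValue());
      if (Boolean.TRUE.equals(previous) && !entry.getValue()) {
        completedStartup.add(entry.getKey());
      }
    }
    return completedStartup;
  }
}
```

In the real handler the flags would come from Helix InstanceConfig records, and each reported instance would get its mailbox channel backoff reset.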

Testing

Without the fix

  • Ran a consistent query load that does a distributed join
  • Forcefully killed a server qa-pinotdbstreaming--1234
  • Waited a few minutes
  • Brought the server back up
  • Confirmed query failures for ~2 minutes after it came back up (matching the gRPC backoff window)
WARN [MailboxStatusObserver] [grpc-default-executor-34:442] Sending mailbox received an error from receiving side
[2026-03-17 00:04:32.416349] io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
[2026-03-17 00:04:32.416361] 	at io.grpc.Status.asRuntimeException(Status.java:532)
[2026-03-17 00:04:32.416374] 	at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:581)
[2026-03-17 00:04:32.416385] 	at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:566)
[2026-03-17 00:04:32.416395] 	at io.grpc.internal.ClientCallImpl.access$100(ClientCallImpl.java:72)
[2026-03-17 00:04:32.416409] 	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:734)
[2026-03-17 00:04:32.416423] 	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:715)
[2026-03-17 00:04:32.416434] 	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
[2026-03-17 00:04:32.416445] 	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
[2026-03-17 00:04:32.416457] 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
[2026-03-17 00:04:32.416469] 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
[2026-03-17 00:04:32.416479] 	at java.base/java.lang.Thread.run(Thread.java:1583)
[2026-03-17 00:04:32.416497] Caused by: io.grpc.netty.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: qa-pinotdbstreaming--1234/10.20.30.40:1234
[2026-03-17 00:04:32.416514] Caused by: java.net.ConnectException: Connection refused

With the fix

  • Deployed to new servers and confirmed we see Registering ServerGrpcChannelBackoffResetHandler in logs
  • Ran a consistent query load that does a distributed join
  • Forcefully killed a server qa-pinotdbstreaming--1234
  • Waited a few minutes
  • Brought the server back up
  • Confirmed we didn't see any query errors when the server came back online
  • Confirmed we see logs for servers resetting backoff on that server channel
INFO [ServerGrpcChannelBackoffResetHandler] [ZkClient-EventThread-120:120] Server Server_st-canary__p1-2_8098 completed startup, resetting mailbox channel backoff to qa-pinotdbstreaming--1234

cc @Jackie-Jiang @yashmayya @suvodeep-pyne @jadami10

@dang-stripe dang-stripe force-pushed the dang-server-grpc-connection-reset branch 2 times, most recently from 26c2d52 to c6cec50 Compare March 17, 2026 18:19
@codecov-commenter

codecov-commenter commented Mar 17, 2026

Codecov Report

❌ Patch coverage is 65.00000% with 35 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.21%. Comparing base (be3ad3b) to head (a06e7b3).
⚠️ Report is 6 commits behind head on master.

Files with missing lines Patch % Lines
...er/helix/ServerGrpcChannelBackoffResetHandler.java 69.49% 12 Missing and 6 partials ⚠️
.../pinot/server/starter/helix/BaseServerStarter.java 0.00% 10 Missing ⚠️
.../apache/pinot/server/worker/WorkerQueryServer.java 0.00% 4 Missing ⚠️
...org/apache/pinot/query/mailbox/MailboxService.java 0.00% 1 Missing ⚠️
...va/org/apache/pinot/query/runtime/QueryRunner.java 0.00% 1 Missing ⚠️
...rg/apache/pinot/server/starter/ServerInstance.java 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             master   #17892    +/-   ##
==========================================
  Coverage     63.21%   63.21%            
  Complexity     1525     1525            
==========================================
  Files          3194     3194            
  Lines        193239   193645   +406     
  Branches      29706    29787    +81     
==========================================
+ Hits         122161   122421   +260     
- Misses        61494    61615   +121     
- Partials       9584     9609    +25     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 63.20% <65.00%> (+0.01%) ⬆️
java-21 63.18% <65.00%> (-0.02%) ⬇️
temurin 63.21% <65.00%> (+<0.01%) ⬆️
unittests 63.21% <65.00%> (+<0.01%) ⬆️
unittests1 55.52% <92.30%> (-0.01%) ⬇️
unittests2 34.18% <57.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

@xiangfu0 xiangfu0 added multi-stage Related to the multi-stage query engine GRPC Related to gRPC transport labels Mar 20, 2026

Copilot AI left a comment


Pull request overview

Adds a server-side Helix instance-config listener to proactively reset gRPC mailbox channel connection backoff for server↔server multistage mailbox connections when a server transitions from IS_SHUTDOWN_IN_PROGRESS=true to false (i.e., completes startup), addressing the post-restart query-failure window caused by exponential backoff.

Changes:

  • Introduces ServerGrpcChannelBackoffResetHandler and registers it during server startup when the multistage worker is enabled.
  • Adds a targeted ChannelManager.resetConnectBackoff() API (surfaced via MailboxService) that only resets backoff when the channel is in TRANSIENT_FAILURE.
  • Adds unit tests covering the handler behavior and channel backoff reset behavior, plus hostname extraction utility tests.
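The TRANSIENT_FAILURE gating can be illustrated dependency-free. The Channel interface below stands in for gRPC's ManagedChannel (whose real getState(boolean) and resetConnectBackoff() methods it mirrors); ChannelBackoffResetter and RecordingChannel are illustrative names, not the actual classes:

```java
// Mirrors gRPC's ConnectivityState values for this sketch.
enum ConnectivityState { IDLE, CONNECTING, READY, TRANSIENT_FAILURE, SHUTDOWN }

// Minimal stand-in for io.grpc.ManagedChannel.
interface Channel {
  ConnectivityState getState(boolean requestConnection);
  void resetConnectBackoff();
}

class ChannelBackoffResetter {
  /** Returns true if a backoff reset was issued. */
  static boolean resetIfBackingOff(Channel channel) {
    // getState(false): observe the state without triggering a new connect
    // attempt. Only channels stuck in exponential backoff are poked; healthy
    // or idle channels are left untouched.
    if (channel.getState(false) == ConnectivityState.TRANSIENT_FAILURE) {
      channel.resetConnectBackoff();
      return true;
    }
    return false;
  }
}

// Test double that records whether a reset was requested.
class RecordingChannel implements Channel {
  final ConnectivityState _state;
  int _resetCalls;
  RecordingChannel(ConnectivityState state) { _state = state; }
  public ConnectivityState getState(boolean requestConnection) { return _state; }
  public void resetConnectBackoff() { _resetCalls++; }
}
```

Gating on TRANSIENT_FAILURE keeps the reset targeted: a channel that is READY or IDLE has no backoff worth clearing.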

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated no comments.

Show a summary per file
File Description
pinot-server/src/main/java/org/apache/pinot/server/starter/helix/ServerGrpcChannelBackoffResetHandler.java New Helix InstanceConfigChangeListener to detect server startup completion and trigger mailbox channel backoff reset
pinot-server/src/main/java/org/apache/pinot/server/starter/helix/BaseServerStarter.java Registers the new handler on servers when the worker query server (multistage) is enabled
pinot-server/src/main/java/org/apache/pinot/server/worker/WorkerQueryServer.java Persists QueryRunner as a field and exposes it for handler wiring
pinot-server/src/main/java/org/apache/pinot/server/starter/ServerInstance.java Exposes WorkerQueryServer via a nullable getter
pinot-query-runtime/src/main/java/org/apache/pinot/query/mailbox/channel/ChannelManager.java Adds resetConnectBackoff(host, port) gated on TRANSIENT_FAILURE
pinot-query-runtime/src/main/java/org/apache/pinot/query/mailbox/MailboxService.java Exposes backoff reset through MailboxService
pinot-query-runtime/src/main/java/org/apache/pinot/query/runtime/QueryRunner.java Exposes MailboxService for server startup integration
pinot-core/src/main/java/org/apache/pinot/core/transport/ServerInstance.java Refactors hostname parsing into extractHostnameFromConfig() utility
pinot-server/src/test/java/org/apache/pinot/server/starter/helix/ServerGrpcChannelBackoffResetHandlerTest.java Unit tests for the new backoff reset handler logic
pinot-query-runtime/src/test/java/org/apache/pinot/query/mailbox/channel/ChannelManagerTest.java Unit tests for channel backoff reset behavior
pinot-core/src/test/java/org/apache/pinot/core/transport/ServerInstanceTest.java Unit tests for hostname extraction helper


@Jackie-Jiang Jackie-Jiang left a comment


Well done. Just minor comments

}
return hostname;
}
String instanceName = instanceConfig.getInstanceName();

Please keep the same comment on line 72

instanceName = instanceName.substring(Helix.SERVER_INSTANCE_PREFIX_LENGTH);
}
String[] parts = StringUtils.split(instanceName, HOSTNAME_PORT_DELIMITER);
return parts.length > 0 ? parts[0] : null;

I'd throw an exception to keep the same behavior. The new code can catch the exception and log a warning.
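The suggested split — a strict helper that throws (preserving the old behavior) and a lenient new call site that catches and logs — might look like the sketch below. HostnameParser and the Server_host_port naming convention are assumptions for illustration, not the actual code:

```java
class HostnameParser {
  // Strict path: throws on malformed input, matching the pre-refactor behavior.
  static String extractHostname(String instanceName) {
    // Assumes the Server_host_port instance-name convention, e.g.
    // "Server_myhost_8098" -> "myhost".
    String[] parts = instanceName.split("_");
    if (parts.length < 3) {
      throw new IllegalArgumentException("Unparseable instance name: " + instanceName);
    }
    return parts[1];
  }

  /** Lenient new call site: downgrade the failure to a warning, return null. */
  static String tryExtractHostname(String instanceName) {
    try {
      return extractHostname(instanceName);
    } catch (IllegalArgumentException e) {
      System.err.println("WARN: skipping instance: " + e.getMessage());
      return null;
    }
  }
}
```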

_hostname = hostname;
}
_hostname = extractHostnameFromConfig(instanceConfig);
String rawHostname = instanceConfig.getHostName();

Let's also refactor this part:

  • When instanceConfig.getPort() is not null, use it
  • Otherwise, get the last part of instanceName and parse it
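That fallback order could be sketched as follows (PortExtractor is a hypothetical name, and the Server_host_port instance-name layout is an assumption for illustration):

```java
class PortExtractor {
  /**
   * Prefer the explicitly configured port; otherwise fall back to parsing the
   * last "_"-delimited segment of the instance name, e.g.
   * "Server_myhost_8098" -> 8098.
   */
  static int extractPort(String configuredPort, String instanceName) {
    if (configuredPort != null && !configuredPort.isEmpty()) {
      return Integer.parseInt(configuredPort);
    }
    int idx = instanceName.lastIndexOf('_');
    if (idx < 0 || idx == instanceName.length() - 1) {
      throw new IllegalArgumentException("Cannot extract port from: " + instanceName);
    }
    return Integer.parseInt(instanceName.substring(idx + 1));
  }
}
```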

}

// Register handler to reset GRPC mailbox channel backoff when servers come online
{

(nit) Remove the redundant braces

Comment on lines +87 to +92
// Only process INIT (listener registration) and CALLBACK (ZK data/child change).
// Ignore FINALIZE (listener unregistration) and other types.
NotificationContext.Type type = context.getType();
if (type != NotificationContext.Type.INIT && type != NotificationContext.Type.CALLBACK) {
return;
}

(minor) This check is redundant

@Nullable
private HelixAdmin _helixAdmin;

public ServerGrpcChannelBackoffResetHandler(HelixManager helixManager, String selfInstanceId,

(minor) You can directly pass in HelixAdmin

@dang-stripe dang-stripe force-pushed the dang-server-grpc-connection-reset branch from c6cec50 to a06e7b3 Compare March 24, 2026 17:12
@Jackie-Jiang Jackie-Jiang added the enhancement Improvement to existing functionality label Mar 24, 2026
@Jackie-Jiang Jackie-Jiang merged commit 43c8ecc into apache:master Mar 24, 2026
16 checks passed
