[multistage] Reset GRPC mailbox channel backoff on server instance config changes#17892
Conversation
26c2d52 to
c6cec50
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #17892 +/- ##
==========================================
Coverage 63.21% 63.21%
Complexity 1525 1525
==========================================
Files 3194 3194
Lines 193239 193645 +406
Branches 29706 29787 +81
==========================================
+ Hits 122161 122421 +260
- Misses 61494 61615 +121
- Partials 9584 9609 +25
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
Adds a server-side Helix instance-config listener to proactively reset gRPC mailbox channel connection backoff for server↔server multistage mailbox connections when a server transitions from IS_SHUTDOWN_IN_PROGRESS=true → false (i.e., completes startup), addressing the post-restart query-failure window caused by exponential backoff.
Changes:
- Introduces
ServerGrpcChannelBackoffResetHandlerand registers it during server startup when the multistage worker is enabled. - Adds a targeted
ChannelManager.resetConnectBackoff()API (surfaced viaMailboxService) that only resets backoff when the channel is inTRANSIENT_FAILURE. - Adds unit tests covering the handler behavior and channel backoff reset behavior, plus hostname extraction utility tests.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| pinot-server/src/main/java/org/apache/pinot/server/starter/helix/ServerGrpcChannelBackoffResetHandler.java | New Helix InstanceConfigChangeListener to detect server startup completion and trigger mailbox channel backoff reset |
| pinot-server/src/main/java/org/apache/pinot/server/starter/helix/BaseServerStarter.java | Registers the new handler on servers when the worker query server (multistage) is enabled |
| pinot-server/src/main/java/org/apache/pinot/server/worker/WorkerQueryServer.java | Persists QueryRunner as a field and exposes it for handler wiring |
| pinot-server/src/main/java/org/apache/pinot/server/starter/ServerInstance.java | Exposes WorkerQueryServer via a nullable getter |
| pinot-query-runtime/src/main/java/org/apache/pinot/query/mailbox/channel/ChannelManager.java | Adds resetConnectBackoff(host, port) gated on TRANSIENT_FAILURE |
| pinot-query-runtime/src/main/java/org/apache/pinot/query/mailbox/MailboxService.java | Exposes backoff reset through MailboxService |
| pinot-query-runtime/src/main/java/org/apache/pinot/query/runtime/QueryRunner.java | Exposes MailboxService for server startup integration |
| pinot-core/src/main/java/org/apache/pinot/core/transport/ServerInstance.java | Refactors hostname parsing into extractHostnameFromConfig() utility |
| pinot-server/src/test/java/org/apache/pinot/server/starter/helix/ServerGrpcChannelBackoffResetHandlerTest.java | Unit tests for the new backoff reset handler logic |
| pinot-query-runtime/src/test/java/org/apache/pinot/query/mailbox/channel/ChannelManagerTest.java | Unit tests for channel backoff reset behavior |
| pinot-core/src/test/java/org/apache/pinot/core/transport/ServerInstanceTest.java | Unit tests for hostname extraction helper |
Jackie-Jiang
left a comment
There was a problem hiding this comment.
Well done. Just minor comments
| } | ||
| return hostname; | ||
| } | ||
| String instanceName = instanceConfig.getInstanceName(); |
There was a problem hiding this comment.
Please keep the same comment on line 72
| instanceName = instanceName.substring(Helix.SERVER_INSTANCE_PREFIX_LENGTH); | ||
| } | ||
| String[] parts = StringUtils.split(instanceName, HOSTNAME_PORT_DELIMITER); | ||
| return parts.length > 0 ? parts[0] : null; |
There was a problem hiding this comment.
I'd throw exception to keep the same behavior. The new code can catch the exception then log warning
| _hostname = hostname; | ||
| } | ||
| _hostname = extractHostnameFromConfig(instanceConfig); | ||
| String rawHostname = instanceConfig.getHostName(); |
There was a problem hiding this comment.
Let's also refactor this part:
- When
instanceConfig.getPort()is notnull, use it - Other wise, get the last part of
instanceNameand parse it
| } | ||
|
|
||
| // Register handler to reset GRPC mailbox channel backoff when servers come online | ||
| { |
There was a problem hiding this comment.
(nit) Remove the redundant braces
| // Only process INIT (listener registration) and CALLBACK (ZK data/child change). | ||
| // Ignore FINALIZE (listener unregistration) and other types. | ||
| NotificationContext.Type type = context.getType(); | ||
| if (type != NotificationContext.Type.INIT && type != NotificationContext.Type.CALLBACK) { | ||
| return; | ||
| } |
There was a problem hiding this comment.
(minor) This check is redundant
| @Nullable | ||
| private HelixAdmin _helixAdmin; | ||
|
|
||
| public ServerGrpcChannelBackoffResetHandler(HelixManager helixManager, String selfInstanceId, |
There was a problem hiding this comment.
(minor) You can directly pass in HelixAdmin
c6cec50 to
a06e7b3
Compare
Problem
This addresses #17870 which is similar in shape to #17465 except it handles resetting GRPC connection backoff for server <> server connections when a server is killed.
Solution
This adds a new ServerGrpcChannelBackoffResetHandler on servers that listens to instance config changes. On the broker, we have the failure detector which excludes servers from routing and adds them back in when connections go healthy.
Servers don't detect failures like the brokers do today, so I've made the handler rely on checking for instances to transition between
SHUTDOWN_IN_PROGRESS=true->SHUTDOWN_IN_PROGRESS=false. Even when a server crashes, it still setsSHUTDOWN_IN_PROGRESS=trueon startup so we can rely on this to identify servers that recently came up healthy.Testing
Without the fix
qa-pinotdbstreaming--1234With the fix
Registering ServerGrpcChannelBackoffResetHandlerin logsqa-pinotdbstreaming--1234cc @Jackie-Jiang @yashmayya @suvodeep-pyne @jadami10