fix(mcp): Harden lifecycle teardown and observability #274
cameroncooke wants to merge 4 commits into main from
Conversation
Attach MCP shutdown handlers before async startup, add broken-pipe and crash handling, and explicitly stop the Xcode watcher during teardown. This closes lifecycle gaps that could leave MCP processes running after clients disappear and adds concrete shutdown reasons for investigation. Add lifecycle Sentry metrics, richer session diagnostics, and a repro script so process age, peer counts, memory, and active runtime state are visible when this issue recurs in the wild. Fixes #273 Co-Authored-By: Codex <noreply@openai.com>
When the MCP parent process exits abruptly with in-flight requests, shutdown can block indefinitely waiting for server.close(). This keeps npm exec wrapper processes alive and allows the child node process to climb in RSS. Add a transport-disconnect fast-exit path that detaches lifecycle handlers, races server.close() against a short timeout, and then exits immediately. This preserves normal shutdown behavior for other reasons and keeps Sentry enabled. Also add focused unit tests for the bounded close helper outcomes. Fixes #273
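The bounded close described above can be sketched as a race between `server.close()` and a short timer. This is an illustrative sketch, not the PR's actual helper; the name `closeWithTimeout` and the `Closeable` shape are assumptions.

```typescript
// Hypothetical sketch of a bounded close: resolve when close() finishes
// or when the timeout elapses, whichever comes first, so a hung
// server.close() can never block the fast-exit path.
type Closeable = { close(): Promise<void> };

async function closeWithTimeout(
  server: Closeable,
  timeoutMs: number,
): Promise<"closed" | "timed-out"> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<"timed-out">((resolve) => {
    timer = setTimeout(() => resolve("timed-out"), timeoutMs);
  });
  try {
    return await Promise.race([
      server.close().then(() => "closed" as const),
      timeout,
    ]);
  } finally {
    if (timer !== undefined) clearTimeout(timer); // avoid keeping the loop alive
  }
}
```

Clearing the timer in `finally` matters: an outstanding timeout would otherwise keep the event loop alive for its full duration even after a fast close.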
```typescript
if (isTransportDisconnectReason(reason)) {
  lifecycle.detachProcessHandlers();
  await __closeServerForFastExitForTests(closeableServer);
  process.exit(exitCode);
```
Bug: For transport disconnects, the server's new fast-exit path calls process.exit() before flushAndCloseSentry(), risking the loss of diagnostic logs that have not yet been sent.
Severity: MEDIUM
Suggested Fix
To ensure diagnostic data is preserved, consider replacing the immediate process.exit() with a mechanism that waits for the Sentry transport to flush. One option is to await flushAndCloseSentry() before exiting. If the current fast-exit is a deliberate trade-off to prevent hanging processes, this change should be carefully evaluated against that risk.
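One way to reconcile the fast exit with telemetry delivery is to bound the flush itself. A minimal sketch, assuming a Sentry-style `flush(timeoutMs)` that resolves `false` on timeout; `boundedFlush` is a hypothetical helper, not code from this PR:

```typescript
// Hypothetical: wrap the telemetry flush so neither a hung transport
// nor a flush error can block the fast-exit path.
async function boundedFlush(
  flush: (timeoutMs: number) => Promise<boolean>,
  timeoutMs = 2000,
): Promise<boolean> {
  try {
    // Sentry-style flush APIs take their own timeout and resolve
    // false if the queue could not be drained in time.
    return await flush(timeoutMs);
  } catch {
    return false; // never let a flush failure delay shutdown
  }
}
```

The fast-exit path could then `await boundedFlush(...)` immediately before `process.exit(exitCode)`, keeping the exit prompt while giving queued events a bounded window to leave the process.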
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid.
Location: src/server/start-mcp-server.ts#L101-L104
Potential issue: In the `onShutdown` handler, a new "fast-exit" path was added for transport disconnect events like `stdin-end` and `stdin-close`. This path calls `process.exit()` at line 104. Consequently, the Sentry flushing logic at line 175 (`flushAndCloseSentry()`) is never reached for these common shutdown scenarios. While lifecycle data is captured in a log message with a `{ sentry: true }` flag on line 97, the process may terminate before this log is successfully transmitted to Sentry, leading to a potential loss of critical diagnostic data intended to be captured by this pull request.
Update @modelcontextprotocol/sdk from ^1.25.1 to ^1.27.1 so local and harness testing run on the latest SDK behavior. Add /repros to .gitignore to keep lifecycle repro artifacts out of source control during repeated leak investigations. Refs #273 Co-Authored-By: Codex <noreply@openai.com>
```typescript
const handleUncaughtException = (error: unknown): void => {
  void coordinator.shutdown('uncaught-exception', error);
};
```
Bug: Process event handlers like handleUncaughtException call coordinator.shutdown() without await, causing the process to exit before async cleanup and telemetry flushing completes.
Severity: HIGH
Suggested Fix
Modify the process event handlers to ensure the process does not exit until the asynchronous coordinator.shutdown() promise resolves. This may involve using a synchronous IPC channel to signal completion to a parent process or using a library that facilitates synchronous waiting on async operations in exit handlers, if possible. The goal is to block the handler's return until cleanup is finished.
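A common Node.js pattern for this, sketched below under the assumption that registering an `uncaughtException` handler already suppresses Node's default immediate exit: keep the listener synchronous, but defer `process.exit()` until the async shutdown promise settles. `makeCrashListener` and its injectable `exit` parameter are illustrative, not the PR's actual API.

```typescript
// Hypothetical: build a crash listener that triggers async shutdown
// and only exits once that shutdown promise has settled. The exit
// function is injectable so the listener can be unit-tested.
function makeCrashListener(
  shutdown: (reason: string, error?: unknown) => Promise<void>,
  exit: (code: number) => void = process.exit,
): (error: unknown) => void {
  return (error: unknown) => {
    shutdown("uncaught-exception", error)
      .catch(() => {
        // Swallow shutdown errors: the original crash is what matters.
      })
      .finally(() => exit(1)); // exit only after cleanup has settled
  };
}

// Usage: process.on("uncaughtException", makeCrashListener(coordinator.shutdown));
```

Note that this relies on the presence of the handler keeping the process alive; for `SIGTERM`/`SIGINT` the same deferral works because installing a signal listener replaces Node's default terminate-immediately behavior.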
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.
Location: src/server/mcp-lifecycle.ts#L340-L342
Potential issue: Process event handlers for `uncaughtException`, `unhandledRejection`, `SIGTERM`, and `SIGINT` are synchronous and call the `async` function `coordinator.shutdown()` in a fire-and-forget manner. According to Node.js behavior, the process will exit immediately after these synchronous handlers complete, without waiting for any asynchronous operations they initiate. This creates a race condition where the process terminates before the critical asynchronous shutdown logic, including `flushAndCloseSentry()`, can finish. As a result, crash telemetry, logs, and other cleanup operations will be lost, defeating the purpose of adding remote diagnostics.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
```typescript
  lifecycle.detachProcessHandlers();
  await __closeServerForFastExitForTests(closeableServer);
  process.exit(exitCode);
}
```
Fast-exit path skips Sentry flush, losing telemetry
Medium Severity
The transport-disconnect fast-exit path calls process.exit() without calling flushAndCloseSentry() or recordMcpLifecycleMetric(). Since process.exit() doesn't trigger async flush handlers, the lifecycle telemetry added by this PR to diagnose orphaned processes won't reach Sentry for the most common shutdown path (client closing stdin).


Harden MCP teardown and add lifecycle observability for process-leak diagnosis.
Issue #273 reports orphaned MCP processes and very high memory growth after client shutdown in some client paths. This PR makes teardown deterministic by wiring shutdown handling earlier in startup, forcing bounded shutdown on stdio disconnect, and explicitly stopping lifecycle watchers during teardown instead of relying on process exit timing.
I also added lifecycle diagnostics so recurrence can be identified remotely: process age, RSS, and peer MCP-process counts are now recorded in MCP lifecycle telemetry and exposed in session diagnostics. This gives us concrete signals in Sentry when shutdown stalls or abnormal process growth appears.
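The diagnostics described above (process age, RSS) map directly onto Node's built-in process APIs. A minimal sketch; the `LifecycleSnapshot` field names are assumptions rather than the PR's actual telemetry schema, and peer-process counting is platform-specific so it is elided here:

```typescript
// Illustrative lifecycle snapshot built from Node.js built-ins.
// Field names are hypothetical, not the PR's real schema.
interface LifecycleSnapshot {
  processAgeSeconds: number; // how long this MCP process has been alive
  rssBytes: number;          // resident set size, to spot memory growth
  pid: number;
}

function captureLifecycleSnapshot(): LifecycleSnapshot {
  return {
    processAgeSeconds: process.uptime(),
    rssBytes: process.memoryUsage().rss,
    pid: process.pid,
  };
}
```

Attaching a snapshot like this to each lifecycle event is cheap and gives Sentry enough context to distinguish a freshly spawned process from a long-lived orphan with climbing RSS.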
I considered leaving shutdown fully graceful-only, but that keeps the failure mode where teardown can stall indefinitely under client disconnect races. The bounded shutdown path is intentionally narrow and only used for disconnect/crash-style termination paths.
This branch also upgrades @modelcontextprotocol/sdk to ^1.27.1 so validation runs on the latest SDK behavior.

Fixes #273