Skip to content

fix(mcp): Harden lifecycle teardown and observability#274

Open
cameroncooke wants to merge 4 commits intomainfrom
cameroncooke/fix/mcp-lifecycle-process-leak
Open

fix(mcp): Harden lifecycle teardown and observability#274
cameroncooke wants to merge 4 commits intomainfrom
cameroncooke/fix/mcp-lifecycle-process-leak

Conversation

@cameroncooke
Copy link
Collaborator

@cameroncooke cameroncooke commented Mar 14, 2026

Harden MCP teardown and add lifecycle observability for process-leak diagnosis.

Issue #273 reports orphaned MCP processes and very high memory growth after client shutdown in some client paths. This PR makes teardown deterministic by wiring shutdown handling earlier in startup, forcing bounded shutdown on stdio disconnect, and explicitly stopping lifecycle watchers during teardown instead of relying on process exit timing.

I also added lifecycle diagnostics so recurrence can be identified remotely: process age, RSS, and peer MCP-process counts are now recorded in MCP lifecycle telemetry and exposed in session diagnostics. This gives us concrete signals in Sentry when shutdown stalls or abnormal process growth appears.

I considered leaving shutdown fully graceful-only, but that keeps the failure mode where teardown can stall indefinitely under client disconnect races. The bounded shutdown path is intentionally narrow and only used for disconnect/crash-style termination paths.

This branch also upgrades @modelcontextprotocol/sdk to ^1.27.1 so validation runs on the latest SDK behavior.

Fixes #273

Attach MCP shutdown handlers before async startup, add broken-pipe and
crash handling, and explicitly stop the Xcode watcher during teardown.
This closes lifecycle gaps that could leave MCP processes running after
clients disappear and adds concrete shutdown reasons for investigation.

Add lifecycle Sentry metrics, richer session diagnostics, and a repro
script so process age, peer counts, memory, and active runtime state are
visible when this issue recurs in the wild.

Fixes #273
Co-Authored-By: Codex <noreply@openai.com>
@pkg-pr-new
Copy link

pkg-pr-new bot commented Mar 14, 2026

Open in StackBlitz

npm i https://pkg.pr.new/xcodebuildmcp@274

commit: 28bd7c4

When the MCP parent process exits abruptly with in-flight requests, shutdown can
block indefinitely waiting for server.close(). This keeps npm exec wrapper
processes alive and allows the child node process to climb in RSS.

Add a transport-disconnect fast-exit path that detaches lifecycle handlers,
races server.close() against a short timeout, and then exits immediately.
This preserves normal shutdown behavior for other reasons and keeps Sentry
enabled.

Also add focused unit tests for the bounded close helper outcomes.

Fixes #273
Comment on lines +101 to +104
if (isTransportDisconnectReason(reason)) {
lifecycle.detachProcessHandlers();
await __closeServerForFastExitForTests(closeableServer);
process.exit(exitCode);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: For transport disconnects, the server's new fast-exit path calls process.exit() before flushAndCloseSentry(), risking the loss of diagnostic logs that have not yet been sent.
Severity: MEDIUM

Suggested Fix

To ensure diagnostic data is preserved, consider replacing the immediate process.exit() with a mechanism that waits for the Sentry transport to flush. One option is to await flushAndCloseSentry() before exiting. If the current fast-exit is a deliberate trade-off to prevent hanging processes, this change should be carefully evaluated against that risk.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: src/server/start-mcp-server.ts#L101-L104

Potential issue: In the `onShutdown` handler, a new "fast-exit" path was added for
transport disconnect events like `stdin-end` and `stdin-close`. This path calls
`process.exit()` at line 104. Consequently, the Sentry flushing logic at line 175
(`flushAndCloseSentry()`) is never reached for these common shutdown scenarios. While
lifecycle data is captured in a log message with a `{ sentry: true }` flag on line 97,
the process may terminate before this log is successfully transmitted to Sentry, leading
to a potential loss of critical diagnostic data intended to be captured by this pull
request.

Did we get this right? 👍 / 👎 to inform future reviews.

cameroncooke and others added 2 commits March 15, 2026 20:56
Update @modelcontextprotocol/sdk from ^1.25.1 to ^1.27.1 so local
and harness testing run on the latest SDK behavior.

Add /repros to .gitignore to keep lifecycle repro artifacts out of
source control during repeated leak investigations.

Refs #273
Co-Authored-By: Codex <noreply@openai.com>
@cameroncooke cameroncooke changed the title fix(mcp): Harden server lifecycle shutdown fix(mcp): Harden lifecycle teardown and observability Mar 15, 2026
Comment on lines +340 to +342
const handleUncaughtException = (error: unknown): void => {
void coordinator.shutdown('uncaught-exception', error);
};
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Process event handlers like handleUncaughtException call coordinator.shutdown() without await, causing the process to exit before async cleanup and telemetry flushing completes.
Severity: HIGH

Suggested Fix

Modify the process event handlers to ensure the process does not exit until the asynchronous coordinator.shutdown() promise resolves. This may involve using a synchronous IPC channel to signal completion to a parent process or using a library that facilitates synchronous waiting on async operations in exit handlers, if possible. The goal is to block the handler's return until cleanup is finished.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: src/server/mcp-lifecycle.ts#L340-L342

Potential issue: Process event handlers for `uncaughtException`, `unhandledRejection`,
`SIGTERM`, and `SIGINT` are synchronous and call the `async` function
`coordinator.shutdown()` in a fire-and-forget manner. According to Node.js behavior, the
process will exit immediately after these synchronous handlers complete, without waiting
for any asynchronous operations they initiate. This creates a race condition where the
process terminates before the critical asynchronous shutdown logic, including
`flushAndCloseSentry()`, can finish. As a result, crash telemetry, logs, and other
cleanup operations will be lost, defeating the purpose of adding remote diagnostics.

Copy link
Contributor

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

lifecycle.detachProcessHandlers();
await __closeServerForFastExitForTests(closeableServer);
process.exit(exitCode);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fast-exit path skips Sentry flush losing telemetry

Medium Severity

The transport-disconnect fast-exit path calls process.exit() without calling flushAndCloseSentry() or recordMcpLifecycleMetric(). Since process.exit() doesn't trigger async flush handlers, the lifecycle telemetry added by this PR to diagnose orphaned processes won't reach Sentry for the most common shutdown path (client closing stdin).

Additional Locations (2)
Fix in Cursor Fix in Web

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

MCP server processes orphaned when parent (Claude CLI) dies — 4GB memory leak

1 participant