agent: Opus 4.7 impl sessions hit exit 143 on Node tooling #86

@alex-fedotyev

Summary

Opus 4.7 impl sessions running HyperDX Node tooling (jest,
tsc, next build) intermittently die with
Agent error: Command failed with exit code 143 (SIGTERM).
Four sessions hit this on 2026-05-28 between roughly 21:30 and
21:55 UTC. The shape is: long-running Node child gets SIGTERMed,
the agent surfaces exit 143, and the session stalls mid-task.

Affected sessions on 2026-05-28

  • impl-af6f01dd (21:37:45Z; last assistant turn was the literal
    exit-143 error)
  • impl-2d3f4375 (21:39:12Z, same string; recovered manually by
    switching to a sibling worktree with surviving node_modules)
  • impl-247fdb3b (likely, by symptom)
  • impl-5f51ff89 (suspected)

Diagnostics

From inside the container right now:

/sys/fs/cgroup/memory.max     max
/sys/fs/cgroup/memory.current ~2.04 GB
/sys/fs/cgroup/memory.peak    8.31 GB
/sys/fs/cgroup/memory.events  oom_kill 0

So the in-container cgroup is not the killer: memory.max is max
(no limit) and no oom_kill events were recorded inside. The
memory.peak of 8.31 GB shows the agent has gone well past 2 GB
during this session without being killed from within.
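To make the check above repeatable across sessions, here is a minimal sketch that reads the same four cgroup v2 files and parses them. It assumes a cgroup v2 mount at /sys/fs/cgroup; the function names are illustrative, and memory.peak may be absent on older kernels, in which case it is reported as "missing".

```python
from pathlib import Path


def parse_limit(raw: str):
    """Parse a cgroup v2 byte value; the literal 'max' means no limit (None)."""
    raw = raw.strip()
    return None if raw == "max" else int(raw)


def parse_events(raw: str) -> dict:
    """memory.events is a flat list of 'key value' pairs, one per line."""
    events = {}
    for line in raw.splitlines():
        parts = line.split()
        if len(parts) == 2:
            events[parts[0]] = int(parts[1])
    return events


def read_cgroup_memory(root: str = "/sys/fs/cgroup") -> dict:
    """Collect the memory figures quoted in this issue (hypothetical helper)."""
    base = Path(root)

    def read(name: str):
        try:
            return (base / name).read_text()
        except OSError:
            return None  # file absent or unreadable

    report = {}
    for name in ("memory.max", "memory.current", "memory.peak"):
        raw = read(name)
        report[name] = "missing" if raw is None else parse_limit(raw)
    report["oom_kill"] = parse_events(read("memory.events") or "").get("oom_kill")
    return report


if __name__ == "__main__":
    print(read_cgroup_memory())
```

A memory.max of None together with oom_kill of 0 matches the reading above: nothing inside the container is enforcing a limit.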

The SIGTERM source is therefore one of:

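  1. An outer cgroup or Docker container limit applied by the
    host, killing the Node child but not the agent itself. Note
    that a kernel OOM kill delivers SIGKILL (exit 137), so for
    this to surface as 143 something on the host side would
    have to send SIGTERM in response to memory pressure.
  2. The Node process dying after hitting its V8
    --max-old-space-size ceiling (Node's default on 64-bit is
    roughly 4 GB but varies; next build and big jest runs can
    blow past it). A heap-limit crash normally aborts with
    SIGABRT (exit 134), so the 143 would presumably come from a
    parent tearing down its remaining children.
  3. An external watchdog sending SIGTERM (less likely).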
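Since the candidates differ mainly in which signal they would deliver, it may help to decode exit codes mechanically when triaging session logs. The sketch below assumes the usual shell convention of 128 + signal number, plus Python's own convention of a negative returncode for signal deaths; describe_exit is a hypothetical helper name.

```python
import signal


def describe_exit(code: int) -> str:
    """Map an exit status to a signal name where the conventions apply.

    Shells report death-by-signal as 128 + N; subprocess.run reports -N.
    A process can also call exit(143) itself, so this is a hint, not proof.
    """
    if code < 0:
        signum = -code
    elif code > 128:
        signum = code - 128
    else:
        return f"exit {code}"
    try:
        return signal.Signals(signum).name
    except ValueError:
        return f"unknown signal {signum}"


if __name__ == "__main__":
    for status in (143, 137, 134):
        print(status, describe_exit(status))
```

On Linux, 143 decodes to SIGTERM, while 137 (SIGKILL) is what a kernel OOM kill would look like and 134 (SIGABRT) is what a V8 heap-limit abort would look like.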

Impact

When this fires, the impl session can't make further progress on
its task without manual help. Affects long-running HyperDX work
in particular because make ci-unit, make dev-int, and
next build all spin up large Node processes.

Proposed routes

Pick one (or both). The first is a one-line workaround; the
second is the real fix.

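Route A: cap Node heap per repo. Export
NODE_OPTIONS=--max-old-space-size=1500 for HyperDX runs in the
validate-after-change skill, so jest/tsc voluntarily stay
under whatever the outer ceiling is. Cheap, repo-local, no infra
change. Downsides: 1500 MB is tight for next build, and some test
shards may still trip. The flag caps only V8's old-generation
heap, and every Node child (each jest worker, for example)
inherits NODE_OPTIONS and gets its own 1500 MB budget, so total
memory across processes can still exceed the outer ceiling.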
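A minimal sketch of how Route A could be applied from a wrapper without clobbering NODE_OPTIONS that a repo already sets. The helper names with_heap_cap and run_capped are hypothetical, and the whitespace split is a simplification that would mishandle quoted option values.

```python
import os
import subprocess


def with_heap_cap(env, max_old_space_mb: int) -> dict:
    """Return a copy of env whose NODE_OPTIONS carries exactly one heap cap."""
    flag = f"--max-old-space-size={max_old_space_mb}"
    merged = dict(env)
    existing = merged.get("NODE_OPTIONS", "").split()
    # Drop any earlier cap (V8 also accepts the underscore spelling).
    kept = [
        opt for opt in existing
        if not opt.replace("_", "-").startswith("--max-old-space-size")
    ]
    merged["NODE_OPTIONS"] = " ".join(kept + [flag])
    return merged


def run_capped(cmd, max_old_space_mb: int = 1500):
    """Run cmd with the capped environment; child Node processes inherit it."""
    return subprocess.run(cmd, env=with_heap_cap(os.environ, max_old_space_mb))
```

For example, run_capped(["make", "ci-unit"]) would run the unit suite with the 1500 MB cap applied to every Node process it spawns.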

Route B: bump the outer container memory ceiling. Identify
where the suspected ~2 GB outer ceiling comes from (Docker
daemon or host cgroup) and raise it to 6 to 8 GB. Real fix.
Needs whoever owns the agent host to adjust the container spec.

Suggested next step

Confirm where the outer limit lives: running
docker inspect <agent-container> on the host should show
HostConfig.Memory in bytes, where 0 means no limit is set. If the
limit is in our control, Route B. If it's host policy, Route A as
a stopgap while we negotiate the host policy.
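As a sketch of that check, the function below pulls HostConfig.Memory out of docker inspect's JSON output. It assumes the output has been captured to a string on the host. The sample container name and limit in the usage line are made up for illustration, and memory_limits is a hypothetical helper name.

```python
import json


def memory_limits(inspect_json: str) -> dict:
    """Map each inspected container to its memory limit in bytes.

    docker inspect prints a JSON array of container objects;
    HostConfig.Memory is 0 when no limit is configured (reported as None).
    """
    limits = {}
    for container in json.loads(inspect_json):
        limit = container.get("HostConfig", {}).get("Memory", 0)
        name = container.get("Name") or container.get("Id", "?")
        limits[name] = None if limit == 0 else limit
    return limits


if __name__ == "__main__":
    # Hypothetical sample, not real output from the agent host.
    sample = '[{"Id": "abc", "Name": "/agent", "HostConfig": {"Memory": 2147483648}}]'
    print(memory_limits(sample))
```

If this reports None for the agent container, the ceiling is not a Docker memory limit, and the host cgroup hierarchy or a supervising process is the next place to look.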
