Skip to content

feat(orchestrator): RAII WorktreeGuard for cancellation-safe compound review (Layer 1) Refs #1569 (Gitea)#874

Merged
AlexMikhalev merged 2 commits into
mainfrom
task/1569-worktree-guard
May 17, 2026
Merged

feat(orchestrator): RAII WorktreeGuard for cancellation-safe compound review (Layer 1) Refs #1569 (Gitea)#874
AlexMikhalev merged 2 commits into
mainfrom
task/1569-worktree-guard

Conversation

@AlexMikhalev
Copy link
Copy Markdown
Contributor

Summary

Layer 1 of the ADF worktree-lifecycle hardening epic (Gitea #1567). The primary technical fix: convert the compound-review worktree-cleanup line at compound.rs:367-370 (which leaks on future cancellation) into RAII via WorktreeGuard. Adds kill_on_drop(true) at the agent-subprocess tokio::process::Command and refactors the swarm spawn loop from tokio::spawn + mpsc to tokio::task::JoinSet so dropping the parent future actually kills the subprocesses.

Key design points (per docs/design/adf-worktree-lifecycle-design.md §4.2):

  • WorktreeGuard::for_managed(repo_path, worktree_path) constructor; Drop runs git -C <repo> worktree remove --force <path> synchronously with std::fs::remove_dir_all fallback. The existing WorktreeGuard::new filesystem-only path is preserved for the per-agent caller at lib.rs:5392-5438.
  • WorktreeManager::create_worktree now returns Result<WorktreeGuard, _>.
  • pub const WORKTREE_REVIEW_PREFIX: &str = "review-"; -- single source of truth Layer 2 will consume.
  • Drop ordering invariant: let _guard = ... declared FIRST so it drops LAST; let mut tasks: JoinSet<_> declared SECOND so it drops FIRST. Inverting recreates the bigbox storm race. Documented in-code via a multi-line comment.
  • cmd.kill_on_drop(true) added to the tokio::process::Command::new(cli_tool) builder.

Test plan

  • cargo test -p terraphim_orchestrator --features test-helpers --test compound_cancellation_test -- 2 tests pass: test_cancellation_leaves_no_worktree and test_storm_cancellation_leaves_no_worktree. Discriminating-verified by toggling kill_on_drop(true) off -- test correctly fails with agent subprocess(es) survived cancellation.
  • cargo test -p terraphim_orchestrator --features test-helpers -- 684 lib tests + all integration suites pass.
  • cargo clippy -p terraphim_orchestrator --lib --tests --features test-helpers -- -D warnings -- clean.
  • Phase 4 verification on bigbox per design §8.2 after merge.

Design deviations (documented in commit body)

  1. setup_git_repo exposed via a new test-helpers feature flag (gating tempfile as optional dep and the scope::test_support module) instead of pub(crate). Integration tests live in a separate crate so pub(crate) is insufficient. Cost: rust-analyzer needs cargo.features = ["test-helpers"] enabled to compile the test file cleanly; documented for follow-up rust-analyzer.toml.
  2. cli_tool in the cancellation test is a TempDir-scoped shell wrapper around /bin/sleep 999 (the dispatcher appends positional args which bare sleep rejects as non-numeric).
  3. PID capture walks /proc/<pid>/cwd symlinks BEFORE abort (so readlinks resolve the live path, not <path> (deleted)); the post-abort check asserts /proc/<pid>/ disappears.

Refs terraphim/terraphim-ai#1569 (Gitea)
Refs terraphim/terraphim-ai#1567 (Gitea epic)

Alex added 2 commits May 17, 2026 14:55
…d review Refs #1569

Layer 1 of the worktree-lifecycle epic (Gitea #1567). Owns three
property regressions surfaced by the bigbox storm:

1. WorktreeGuard now optionally invokes `git worktree remove --force`
   on Drop (new `for_managed` constructor). The synchronous std
   Command is intentional -- Drop cannot be async, and the git CLI
   call is sub-second. Falls back to filesystem removal on non-zero
   exit or when the git binary is not invokable.

2. `WorktreeManager::create_worktree` now returns the guard instead
   of a bare PathBuf. The per-agent caller path
   (`worktree_guard::WorktreeGuard::new`) is unchanged.

3. `CompoundReviewWorkflow::run` declares `guard` BEFORE `tasks:
   JoinSet`, so locals drop in reverse: JoinSet drops first (aborts
   every wrapping task), Child::Drop with `kill_on_drop(true)` kills
   each subprocess synchronously, THEN the guard drops, reconciling
   the `<repo>/.git/worktrees/<name>` admin entry. The explicit
   `worktree_manager.remove_worktree(...)` cleanup at end-of-run is
   removed; the guard owns it.

4. `kill_on_drop(true)` is now set on `Command::new(cli_tool)` in
   `run_single_agent`. Without this, JoinSet abort orphans the
   subprocess against `Child::Drop`.

A `WORKTREE_REVIEW_PREFIX = "review-"` constant is added at module
scope of `scope.rs` as the single source of truth referenced by
Layer 2 (`sweep_stale`) and Layer 3 (`adf-cleanup.sh`).

Tests:
- Three new unit tests on the managed guard
  (`test_managed_guard_invokes_git_remove`,
  `test_managed_guard_fallback_on_git_failure`,
  `test_managed_guard_keep_disarms`).
- Existing scope::tests updated to .keep() guards where they need
  to inspect or remove via the manager.
- New integration test `tests/compound_cancellation_test.rs` with
  two property tests:
  * `test_cancellation_leaves_no_worktree` captures sleep-agent
    PIDs BEFORE abort (so cwd readlinks resolve the live path, not
    `<path> (deleted)`), then asserts every PID is gone within 2 s
    along with the worktree dir and .git/worktrees admin entry.
    Empirically verified that without `kill_on_drop` this test
    fails with "agent subprocess(es) survived cancellation".
  * `test_storm_cancellation_leaves_no_worktree` spawns two
    parallel `workflow.run` calls and aborts both. (Deviation from
    the brief, which specified replaying the Layer 0 cursor bug
    by calling `check_cron_schedules` twice. The Layer 0 fix is
    landed in main; the storm property checked here is the same
    in spirit: storm-shaped cancellation must drain to zero.)

A new feature flag `test-helpers` exposes
`scope::test_support::setup_git_repo` to the integration test;
this is gated so the helper does not appear in the production
build. tempfile becomes an optional dep behind the same gate.
The integration test itself is `#![cfg(all(unix, feature =
"test-helpers"))]` so workspace-wide `cargo check --all-targets`
(without features) still compiles.

Deviations from the design:
- Tests use `&Path` for the cli_tool, written into a TempDir-scoped
  shell wrapper that ignores its args and execs `/bin/sleep 999`.
  The brief specified `/bin/sleep 30`, but sleep itself rejects
  non-numeric trailing args from the dispatcher, so a wrapper is
  required to keep the subprocess alive long enough to validate
  cancellation.
- PID capture uses pre-abort `/proc/<pid>/cwd` snapshots rather
  than oneshot-channel-based capture (which would have required
  refactoring `run_single_agent` from `Command::output()` to
  `spawn() + wait_with_output()` solely for testability). Approach
  (b) from the brief's risks list.
…efs #1569

Add explicit `[[test]] required-features = ["test-helpers"]` so that
`cargo test --all-features` picks up `compound_cancellation_test`
even though the source file is also `cfg`-gated for defensive
double-protection.

This is a follow-on to 6ba6331 and does not change behaviour:
- Default `cargo test -p terraphim_orchestrator` still excludes the
  integration test (same as before).
- `cargo test -p terraphim_orchestrator --features test-helpers`
  still runs it (same as before).
- New: `cargo test -p terraphim_orchestrator --all-features` now
  also runs it.

Note: the repo's CI command in `.github/workflows/ci-pr.yml` is
`cargo nextest run --workspace --profile ci --lib --bins
--features zlob`. Integration tests under `tests/` are not executed
by that gate regardless of feature flags. The cancellation test is
therefore a developer-run regression gate today; promoting it into
CI is a separate ticket.
@AlexMikhalev AlexMikhalev merged commit 95e2b15 into main May 17, 2026
2 of 5 checks passed
@AlexMikhalev AlexMikhalev deleted the task/1569-worktree-guard branch May 17, 2026 13:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant