Skip to content

feat(orchestrator): startup worktree sweep for SIGKILL residue (Layer 2) Refs #1570 (Gitea)#875

Merged
AlexMikhalev merged 1 commit into
mainfrom
task/1570-sweep-stale
May 17, 2026
Merged

feat(orchestrator): startup worktree sweep for SIGKILL residue (Layer 2) Refs #1570 (Gitea)#875
AlexMikhalev merged 1 commit into
mainfrom
task/1570-sweep-stale

Conversation

@AlexMikhalev
Copy link
Copy Markdown
Contributor

Summary

Layer 2 (final) of the ADF worktree-lifecycle hardening epic (Gitea #1567). Closes the gap between Layer 1's WorktreeGuard::Drop (which runs on success / error / cancellation but NOT on SIGKILL / OOM / panic-across-runtime) and Layer 3's root-privileged ExecStartPre (defence-in-depth). At orchestrator startup, sweep stale worktree residue before any agent dispatch loop starts.

Key changes (per docs/design/adf-worktree-lifecycle-design.md §4.3):

  • WorktreeManager::sweep_stale(&self, extra_roots: &[PathBuf]) -> SweepReport -- sync, walks self.worktree_base/review-* and all children of each extra_root, runs git worktree remove --force with std::fs::remove_dir_all fallback, then git worktree prune --verbose.
  • SweepReport: swept_count, failed_count, root_owned_skipped, prune_succeeded, duration_ms. PermissionDenied increments root_owned_skipped without marking the sweep failed -- Layer 3 catches the residue.
  • Wired into AgentOrchestrator::new (lib.rs:661) immediately after CompoundReviewWorkflow construction. extra_roots = vec![PathBuf::from("/tmp/adf-worktrees")] (matches the per-agent worktree root at lib.rs:5393).
  • New CompoundReviewWorkflow::worktree_manager() getter (one-liner).
  • Single info! line per startup with all SweepReport fields + backlog_count = swept_count + root_owned_skipped -- the storm-detection metric for Quickwit adf-logs.

Test plan

  • cargo test -p terraphim_orchestrator --features test-helpers --test sweep_on_startup_test -- 3 variants pass: sweep_removes_review_prefixed_residue_on_startup, sweep_preserves_non_review_siblings_on_startup, sweep_removes_extra_root_children_regardless_of_prefix.
  • cargo test -p terraphim_orchestrator --features test-helpers -- 691 lib tests + integration suites pass.
  • cargo clippy -p terraphim_orchestrator --lib --tests --features test-helpers -- -D warnings -- clean.
  • Phase 4 verification on bigbox per design §8.3 after merge.

Design deviations (documented in commit body)

  1. The Orchestrator::new symbol in the brief is actually AgentOrchestrator::new (verified by grep).
  2. The sweep call is wrapped in #[cfg(not(test))] to avoid racing against the in-lib test_config() (which points repo_path at the live terraphim-ai checkout). Production wiring is exercised by tests/sweep_on_startup_test.rs (separate crate, cfg(test) does not propagate). Follow-up: migrate test_config() to a TempDir so the gate can be removed.
  3. The Linux/root-only contract test uses a raw extern "C" getuid() FFI to avoid pulling nix or libc solely for one gated test.

Refs terraphim/terraphim-ai#1570 (Gitea)
Refs terraphim/terraphim-ai#1567 (Gitea epic)

…1570 (Gitea)

Layer 2 of the worktree lifecycle defence in depth from epic
terraphim/terraphim-ai#1567. Between Layer 1 (WorktreeGuard::Drop
on the happy/cancelled path) and Layer 3 (root-privileged
ExecStartPre), this sweep reconciles residue left behind when the
previous orchestrator process died via SIGKILL, OOM, or a
panic-across-runtime where Drop never ran.

Changes
-------
* scope.rs:
  - New `pub fn WorktreeManager::sweep_stale(&self, extra_roots:
    &[PathBuf]) -> SweepReport`. Synchronous on purpose so it can
    run inline in `AgentOrchestrator::new` before any tick thread
    is spawned.
  - Walks `worktree_base` for `WORKTREE_REVIEW_PREFIX`-prefixed
    children, walks every direct child of each `extra_roots` entry
    (no prefix filter -- the per-agent root convention has no
    prefix), tries `git worktree remove --force` and falls back to
    `std::fs::remove_dir_all`. EACCES/EPERM increments
    `root_owned_skipped` rather than `failed_count` (Layer 3
    territory). Then runs `git worktree prune --verbose` and emits
    a single structured `info!` line with every report field plus
    a synthetic `backlog_count = swept_count + root_owned_skipped`
    for Quickwit alerting.
  - New `pub struct SweepReport` carrying `swept_count`,
    `failed_count`, `root_owned_skipped`, `prune_succeeded`,
    `duration_ms`.
  - Seven new unit tests covering empty/missing base, review
    prefix removal, non-review preservation, prune execution,
    extra-roots no-prefix-filter, and the Linux-only root-owned
    contract test (runtime gated by `getuid() == 0`; documents the
    contract even when it cannot run).

* compound.rs:
  - New `pub fn CompoundReviewWorkflow::worktree_manager(&self)
    -> &WorktreeManager` trivial accessor for the field needed by
    the Layer 2 wiring.

* lib.rs:
  - `AgentOrchestrator::new` now calls
    `compound_workflow.worktree_manager().sweep_stale(&[
    PathBuf::from("/tmp/adf-worktrees") ])` immediately after the
    compound workflow is constructed, before `Ok(Self { ... })`.
    The /tmp/adf-worktrees literal mirrors lib.rs:5393 (the
    per-agent worktree root from `create_agent_worktree`).
  - Logs a `warn!` when the residue backlog at startup exceeds 10,
    which the research doc flagged as the prior-crash-storm
    threshold.

* tests/sweep_on_startup_test.rs (NEW):
  - Three integration tests around an isolated TempDir git repo
    via `scope::test_support::setup_git_repo`. Verifies that
    `AgentOrchestrator::new` does sweep `review-*` residue,
    preserves non-review siblings, and that extra_roots children
    are removed regardless of prefix.

* Cargo.toml:
  - Declares the new integration test with
    `required-features = ["test-helpers"]`, mirroring the Layer 1
    `compound_cancellation_test` entry.

Deviations from the brief
-------------------------
1. Struct name: the brief says `Orchestrator::new`; the actual
   symbol is `AgentOrchestrator::new` (lib.rs:661). Insertion site
   is between `compound_workflow` construction at :687 and the
   `Ok(Self { ... })` block at :786.

2. `#[cfg(not(test))]` gate on the sweep call. The in-lib
   `test_config()` (lib.rs:7926) points `repo_path` at the live
   terraphim-ai checkout. Without the gate, the sweep's
   `git worktree prune --verbose` ran against that real repo and
   raced against `test_orchestrator_compound_review_manual`'s
   concurrent `git worktree add` on the same shared
   `.git/worktrees/` admin registry. Production wiring is fully
   exercised by `tests/sweep_on_startup_test.rs` (a separate
   crate, so `cfg(test)` is not set on the lib's own code path
   when it links the integration test). The gate is documented
   inline with a pointer to the integration test.

3. SweepReport is placed at module scope in scope.rs (as the
   brief preferred), not wrapped in `WorktreeManager`.

4. The Linux/root-only contract test is included as the brief
   requested. It uses a raw `extern "C" getuid()` FFI to avoid
   pulling in `nix` or `libc` solely for a single gated test.

Verification
------------
* `cargo test -p terraphim_orchestrator --features test-helpers
   --test sweep_on_startup_test -- --nocapture`: 3 passed.
* `cargo test -p terraphim_orchestrator --features test-helpers`:
   691 lib tests + all integration tests pass, including Layer 1
   `compound_cancellation_test`.
* `cargo clippy -p terraphim_orchestrator --lib --tests
   --features test-helpers -- -D warnings`: clean.
* `cargo fmt`: clean.

Branched from task/1569-worktree-guard (Layer 1) which provides
the `WORKTREE_REVIEW_PREFIX` constant referenced by the sweep.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@AlexMikhalev AlexMikhalev merged commit 050d496 into main May 17, 2026
2 of 5 checks passed
@AlexMikhalev AlexMikhalev deleted the task/1570-sweep-stale branch May 17, 2026 13:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant