feat(orchestrator): startup worktree sweep for SIGKILL residue (Layer 2) Refs #1570 (Gitea)#875
Merged
Merged
Conversation
…1570 (Gitea)
Layer 2 of the worktree lifecycle defence in depth from epic
terraphim/terraphim-ai#1567. Between Layer 1 (WorktreeGuard::Drop
on the happy/cancelled path) and Layer 3 (root-privileged
ExecStartPre), this sweep reconciles residue left behind when the
previous orchestrator process died via SIGKILL, OOM, or a
panic-across-runtime where Drop never ran.
Changes
-------
* scope.rs:
- New `pub fn WorktreeManager::sweep_stale(&self, extra_roots:
&[PathBuf]) -> SweepReport`. Synchronous on purpose so it can
run inline in `AgentOrchestrator::new` before any tick thread
is spawned.
- Walks `worktree_base` for `WORKTREE_REVIEW_PREFIX`-prefixed
children, walks every direct child of each `extra_roots` entry
(no prefix filter -- the per-agent root convention has no
prefix), tries `git worktree remove --force` and falls back to
`std::fs::remove_dir_all`. EACCES/EPERM increments
`root_owned_skipped` rather than `failed_count` (Layer 3
territory). Then runs `git worktree prune --verbose` and emits
a single structured `info!` line with every report field plus
a synthetic `backlog_count = swept_count + root_owned_skipped`
for Quickwit alerting.
- New `pub struct SweepReport` carrying `swept_count`,
`failed_count`, `root_owned_skipped`, `prune_succeeded`,
`duration_ms`.
- Seven new unit tests covering empty/missing base, review
prefix removal, non-review preservation, prune execution,
extra-roots no-prefix-filter, and the Linux-only root-owned
contract test (runtime gated by `getuid() == 0`; documents the
contract even when it cannot run).
* compound.rs:
- New `pub fn CompoundReviewWorkflow::worktree_manager(&self)
-> &WorktreeManager` trivial accessor for the field needed by
the Layer 2 wiring.
* lib.rs:
- `AgentOrchestrator::new` now calls
`compound_workflow.worktree_manager().sweep_stale(&[
PathBuf::from("/tmp/adf-worktrees") ])` immediately after the
compound workflow is constructed, before `Ok(Self { ... })`.
The /tmp/adf-worktrees literal mirrors lib.rs:5393 (the
per-agent worktree root from `create_agent_worktree`).
- Logs a `warn!` when the residue backlog at startup exceeds 10,
which the research doc flagged as the prior-crash-storm
threshold.
* tests/sweep_on_startup_test.rs (NEW):
- Three integration tests around an isolated TempDir git repo
via `scope::test_support::setup_git_repo`. Verifies that
`AgentOrchestrator::new` does sweep `review-*` residue,
preserves non-review siblings, and that extra_roots children
are removed regardless of prefix.
* Cargo.toml:
- Declares the new integration test with
`required-features = ["test-helpers"]`, mirroring the Layer 1
`compound_cancellation_test` entry.
Deviations from the brief
-------------------------
1. Struct name: the brief says `Orchestrator::new`; the actual
symbol is `AgentOrchestrator::new` (lib.rs:661). Insertion site
is between `compound_workflow` construction at :687 and the
`Ok(Self { ... })` block at :786.
2. `#[cfg(not(test))]` gate on the sweep call. The in-lib
`test_config()` (lib.rs:7926) points `repo_path` at the live
terraphim-ai checkout. Without the gate, the sweep's
`git worktree prune --verbose` ran against that real repo and
raced against `test_orchestrator_compound_review_manual`'s
concurrent `git worktree add` on the same shared
`.git/worktrees/` admin registry. Production wiring is fully
exercised by `tests/sweep_on_startup_test.rs` (a separate
crate, so `cfg(test)` is not set on the lib's own code path
when it links the integration test). The gate is documented
inline with a pointer to the integration test.
3. SweepReport is placed at module scope in scope.rs (as the
brief preferred), not wrapped in `WorktreeManager`.
4. The Linux/root-only contract test is included as the brief
requested. It uses a raw `extern "C" getuid()` FFI to avoid
pulling in `nix` or `libc` solely for a single gated test.
Verification
------------
* `cargo test -p terraphim_orchestrator --features test-helpers
--test sweep_on_startup_test -- --nocapture`: 3 passed.
* `cargo test -p terraphim_orchestrator --features test-helpers`:
691 lib tests + all integration tests pass, including Layer 1
`compound_cancellation_test`.
* `cargo clippy -p terraphim_orchestrator --lib --tests
--features test-helpers -- -D warnings`: clean.
* `cargo fmt`: clean.
Branched from task/1569-worktree-guard (Layer 1) which provides
the `WORKTREE_REVIEW_PREFIX` constant referenced by the sweep.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Layer 2 (final) of the ADF worktree-lifecycle hardening epic (Gitea #1567). Closes the gap between Layer 1's
WorktreeGuard::Drop(which runs on success / error / cancellation but NOT on SIGKILL / OOM / panic-across-runtime) and Layer 3's root-privilegedExecStartPre(defence-in-depth). At orchestrator startup, sweep stale worktree residue before any agent dispatch loop starts.Key changes (per
docs/design/adf-worktree-lifecycle-design.md§4.3):WorktreeManager::sweep_stale(&self, extra_roots: &[PathBuf]) -> SweepReport-- sync, walksself.worktree_base/review-*and all children of eachextra_root, runsgit worktree remove --forcewithstd::fs::remove_dir_allfallback, thengit worktree prune --verbose.SweepReport:swept_count,failed_count,root_owned_skipped,prune_succeeded,duration_ms.PermissionDeniedincrementsroot_owned_skippedwithout marking the sweep failed -- Layer 3 catches the residue.AgentOrchestrator::new(lib.rs:661) immediately afterCompoundReviewWorkflowconstruction.extra_roots = vec.CompoundReviewWorkflow::worktree_manager()getter (one-liner).info!line per startup with allSweepReportfields +backlog_count = swept_count + root_owned_skipped-- the storm-detection metric for Quickwitadf-logs.Test plan
cargo test -p terraphim_orchestrator --features test-helpers --test sweep_on_startup_test-- 3 variants pass:sweep_removes_review_prefixed_residue_on_startup,sweep_preserves_non_review_siblings_on_startup,sweep_removes_extra_root_children_regardless_of_prefix.cargo test -p terraphim_orchestrator --features test-helpers-- 691 lib tests + integration suites pass.cargo clippy -p terraphim_orchestrator --lib --tests --features test-helpers -- -D warnings-- clean.Design deviations (documented in commit body)
Orchestrator::newsymbol in the brief is actuallyAgentOrchestrator::new(verified by grep).#[cfg(not(test))]to avoid racing against the in-libtest_config()(which pointsrepo_pathat the live terraphim-ai checkout). Production wiring is exercised bytests/sweep_on_startup_test.rs(separate crate,cfg(test)does not propagate). Follow-up: migratetest_config()to aTempDirso the gate can be removed.extern "C" getuid()FFI to avoid pullingnixorlibcsolely for one gated test.Refs terraphim/terraphim-ai#1570 (Gitea)
Refs terraphim/terraphim-ai#1567 (Gitea epic)