From 10e45b0aba8141f7bf03d045f5a50495ff2d4041 Mon Sep 17 00:00:00 2001
From: Alex
Date: Sun, 17 May 2026 13:44:07 +0100
Subject: [PATCH] feat(adf-setup): add root-privileged worktree sweep for systemd ExecStartPre
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Layer 3 of the four-layer ADF worktree lifecycle plan (epic #1567).
This commit introduces the hot-shippable defence: a POSIX `sh` script
that systemd invokes as root via `ExecStartPre=` before the
orchestrator starts. Layers 1 and 2 prevent new residue from
accumulating but cannot reclaim root-owned trees written by
container-build agents; Layer 3 is the only layer that can.

Files:

* `scripts/adf-setup/adf-cleanup.sh` -- POSIX `sh` (verified with
  `dash -n`), `set -eu`. Sweeps `${ADF_WORKTREE_ROOT}/review-*` and
  `${ADF_AGENT_TMP_ROOT}/*` via `git worktree remove --force` with a
  `/bin/rm -rf` fallback, finishes with `git worktree prune --verbose`
  and a single summary line `adf-cleanup: swept=N failed=M repo=PATH`.
  Three environment variables overridable.
* `scripts/adf-setup/tests/test_adf_cleanup.sh` -- POSIX shell test
  driver. Seeds a real git repo under `mktemp -d`, creates three
  `review-test-*` worktrees plus a `keep-me/` directory, asserts the
  review entries are removed and `keep-me/` is preserved, and a second
  run is a no-op. Exits 0 on PASS, 1 on FAIL.
* `docs/operations/adf-orchestrator-systemd.md` -- operator-facing doc
  with the systemd drop-in, the `install` commands (mode 0750, owner
  root:root), verification recipe (cross-referencing §8.4 of the
  design), and rollback steps.

Deviations from the design verbatim (small portability hardenings):

* Test driver passes `-c init.defaultBranch=main` to `git init` to
  silence the default-branch hint, and `-c user.email/-c user.name` to
  `git commit --allow-empty` so the test survives on hosts without a
  global git identity.

Verification:

* `./scripts/adf-setup/tests/test_adf_cleanup.sh` -- PASS.
  Output: `adf-cleanup: swept=3 failed=0 ...` then
  `adf-cleanup: swept=0 failed=0 ...` then `PASS: test_adf_cleanup`.
* `dash -n` on both scripts -- POSIX syntax clean.
* `shellcheck` not installed locally; skipped.
* `git diff --check` -- clean.

No Rust files touched; `cargo fmt --check` does not fire.

Refs #1571 (Gitea)
---
 docs/operations/adf-orchestrator-systemd.md | 115 ++++++++++++++++++++
 scripts/adf-setup/adf-cleanup.sh            |  62 +++++++++++
 scripts/adf-setup/tests/test_adf_cleanup.sh |  52 +++++++++
 3 files changed, 229 insertions(+)
 create mode 100644 docs/operations/adf-orchestrator-systemd.md
 create mode 100755 scripts/adf-setup/adf-cleanup.sh
 create mode 100755 scripts/adf-setup/tests/test_adf_cleanup.sh

diff --git a/docs/operations/adf-orchestrator-systemd.md b/docs/operations/adf-orchestrator-systemd.md
new file mode 100644
index 000000000..f156d9168
--- /dev/null
+++ b/docs/operations/adf-orchestrator-systemd.md
@@ -0,0 +1,115 @@
+# ADF Orchestrator systemd Pre-Start Sweep
+
+## Purpose
+
+The ADF orchestrator runs as a long-lived service and accumulates
+stale worktrees when sub-process agents (notably the compound-review
+agent and container-build agents) crash or are terminated before
+their cleanup hooks run. Some of these residues are owned by `root`
+because they include build artefacts written by container processes
+that escalated privileges, so the orchestrator running as a service
+user cannot reclaim them.
+
+This document describes the root-privileged `ExecStartPre` hook that
+sweeps the residue before the orchestrator starts. It is the Layer 3
+defence from the four-layer worktree lifecycle plan (epic
+`terraphim/terraphim-ai#1567`, issue `#1571`); Layers 1 and 2 prevent
+new residue from accumulating at the source, and Layer 4 adds a
+periodic safety net. Layer 3 is the only layer that can run as root,
+so it is the only layer that can reliably reclaim root-owned trees.
+
+The sweep script lives in-tree at
+`scripts/adf-setup/adf-cleanup.sh`. It is POSIX `sh`, idempotent, and
+emits a single summary line to stdout in the form
+`adf-cleanup: swept=N failed=M repo=PATH`.
+
+## Drop-in snippet
+
+Add the following systemd drop-in at
+`/etc/systemd/system/adf-orchestrator.service.d/cleanup.conf`:
+
+```ini
+[Service]
+Environment=ADF_REPO_PATH=/data/projects/terraphim/terraphim-ai
+Environment=ADF_WORKTREE_ROOT=/data/projects/terraphim/terraphim-ai/.worktrees
+ExecStartPre=/opt/ai-dark-factory/bin/adf-cleanup.sh
+```
+
+The three environment variables `ADF_REPO_PATH`, `ADF_WORKTREE_ROOT`,
+and `ADF_AGENT_TMP_ROOT` are all overridable. `ADF_AGENT_TMP_ROOT`
+defaults to `/tmp/adf-worktrees` and rarely needs an override on the
+bigbox host.
+
+`ExecStartPre` runs synchronously before `ExecStart`, inherits the
+service's environment, and runs with the unit's privileges. Because
+the `adf-orchestrator.service` unit on bigbox runs as `root` for the
+duration of the pre-start hook before dropping privileges for the
+main process, the sweep can reclaim root-owned residue.
+
+## Manual install
+
+There is currently no in-tree installer. Until one lands, deploy the
+script manually from a checkout of `main`:
+
+```bash
+sudo install -m 750 -o root -g root \
+  scripts/adf-setup/adf-cleanup.sh \
+  /opt/ai-dark-factory/bin/adf-cleanup.sh
+sudo systemctl daemon-reload
+sudo systemctl restart adf-orchestrator
+```
+
+The `install` invocation sets ownership to `root:root` and mode
+`0750` so the script is readable and executable by root and the
+`root` group only; this matches the principle of least privilege for
+a script that runs as root.
+
+## Verification
+
+The full verification recipe is in §8.4 of
+`docs/design/adf-worktree-lifecycle-design.md`. The short version:
+
+```bash
+# 1. Pre-seed a root-owned residue.
+ssh bigbox 'sudo mkdir -p \
+  /data/projects/terraphim/terraphim-ai/.worktrees/review-rootowned/target && \
+  sudo chown -R root:root \
+  /data/projects/terraphim/terraphim-ai/.worktrees/review-rootowned'
+
+# 2. Restart the service.
+ssh bigbox 'sudo systemctl restart adf-orchestrator'
+
+# 3. Confirm the sweep line in the journal.
+ssh bigbox 'journalctl -u adf-orchestrator -n 50 | grep adf-cleanup'
+# expected: "adf-cleanup: swept=1 failed=0 repo=/data/..."
+
+# 4. Confirm the residue is gone.
+ssh bigbox 'ls /data/projects/terraphim/terraphim-ai/.worktrees/review-rootowned 2>/dev/null'
+# expected: empty
+```
+
+The in-tree shell test at
+`scripts/adf-setup/tests/test_adf_cleanup.sh` exercises the same
+control flow against a disposable git repo under `mktemp -d` and is
+safe to run in CI without privileged residue. Run it directly:
+
+```bash
+./scripts/adf-setup/tests/test_adf_cleanup.sh
+```
+
+It exits 0 on PASS and 1 on FAIL.
+
+## Rollback
+
+To disable the pre-start sweep:
+
+```bash
+sudo rm /etc/systemd/system/adf-orchestrator.service.d/cleanup.conf
+sudo rm /opt/ai-dark-factory/bin/adf-cleanup.sh
+sudo systemctl daemon-reload
+sudo systemctl restart adf-orchestrator
+```
+
+The orchestrator will restart without the sweep step; stale residue
+will then accumulate again until either Layers 1 and 2 prevent it
+upstream or the sweep is re-enabled.
diff --git a/scripts/adf-setup/adf-cleanup.sh b/scripts/adf-setup/adf-cleanup.sh
new file mode 100755
index 000000000..9daee1fe3
--- /dev/null
+++ b/scripts/adf-setup/adf-cleanup.sh
@@ -0,0 +1,62 @@
+#!/bin/sh
+# adf-cleanup.sh -- pre-start sweep of stale ADF worktrees.
+#
+# Invoked by systemd as `ExecStartPre=` for adf-orchestrator.service.
+# Runs as root so it can reclaim worktree contents owned by
+# sub-process container builds and other elevated agents.
+#
+# Cross-reference: WORKTREE_REVIEW_PREFIX in
+# crates/terraphim_orchestrator/src/scope.rs. The literal "review-"
+# below must stay in sync with that constant.
+
+set -eu
+umask 022
+
+ADF_REPO_PATH="${ADF_REPO_PATH:-/data/projects/terraphim/terraphim-ai}"
+ADF_WORKTREE_ROOT="${ADF_WORKTREE_ROOT:-${ADF_REPO_PATH}/.worktrees}"
+ADF_AGENT_TMP_ROOT="${ADF_AGENT_TMP_ROOT:-/tmp/adf-worktrees}"
+
+swept=0
+failed=0
+
+sweep_one() {
+    target="$1"
+    if [ ! -e "$target" ]; then
+        return 0
+    fi
+    if git -C "$ADF_REPO_PATH" worktree remove --force "$target" >/dev/null 2>&1; then
+        swept=$((swept + 1))
+        return 0
+    fi
+    # Fallback: recursive removal of the worktree directory tree.
+    if /bin/rm -rf -- "$target"; then
+        swept=$((swept + 1))
+        return 0
+    fi
+    failed=$((failed + 1))
+    return 0
+}
+
+# 1. Compound review residue under ${ADF_WORKTREE_ROOT}/review-*.
+if [ -d "$ADF_WORKTREE_ROOT" ]; then
+    for entry in "$ADF_WORKTREE_ROOT"/review-*; do
+        [ -e "$entry" ] || continue
+        sweep_one "$entry"
+    done
+fi
+
+# 2. Per-agent residue under /tmp/adf-worktrees/*.
+if [ -d "$ADF_AGENT_TMP_ROOT" ]; then
+    for entry in "$ADF_AGENT_TMP_ROOT"/*; do
+        [ -e "$entry" ] || continue
+        sweep_one "$entry"
+    done
+fi
+
+# 3. Reconcile git's admin registry. Failure here is not fatal.
+git -C "$ADF_REPO_PATH" worktree prune --verbose 2>&1 || true
+
+printf 'adf-cleanup: swept=%d failed=%d repo=%s\n' \
+    "$swept" "$failed" "$ADF_REPO_PATH"
+
+exit 0
diff --git a/scripts/adf-setup/tests/test_adf_cleanup.sh b/scripts/adf-setup/tests/test_adf_cleanup.sh
new file mode 100755
index 000000000..4031fbd48
--- /dev/null
+++ b/scripts/adf-setup/tests/test_adf_cleanup.sh
@@ -0,0 +1,52 @@
+#!/bin/sh
+# test_adf_cleanup.sh -- POSIX shell driver for adf-cleanup.sh.
+#
+# Seeds three review-* worktrees plus a keep-me/ directory in a
+# disposable git repo, runs the sweep script, and asserts that
+# review-* entries are removed while keep-me/ is preserved. A
+# second run verifies idempotency.
+
+set -eu
+
+THIS_DIR="$(cd "$(dirname "$0")" && pwd)"
+CLEANUP_SH="${THIS_DIR}/../adf-cleanup.sh"
+
+TMP="$(mktemp -d)"
+trap '/bin/rm -rf "$TMP"' EXIT
+
+REPO="${TMP}/repo"
+WT_ROOT="${REPO}/.worktrees"
+mkdir -p "$REPO"
+git -C "$REPO" -c init.defaultBranch=main init -q
+git -C "$REPO" -c user.email=test@example.com -c user.name=Test \
+    commit --allow-empty -m "seed" -q
+
+mkdir -p "$WT_ROOT/keep-me"
+
+for i in 1 2 3; do
+    git -C "$REPO" worktree add -q "${WT_ROOT}/review-test-${i}" HEAD
+done
+
+[ -d "${WT_ROOT}/review-test-1" ] || { echo "setup failed"; exit 1; }
+
+ADF_REPO_PATH="$REPO" \
+ADF_WORKTREE_ROOT="$WT_ROOT" \
+ADF_AGENT_TMP_ROOT="${TMP}/agent-tmp-absent" \
+    "$CLEANUP_SH"
+
+for i in 1 2 3; do
+    if [ -e "${WT_ROOT}/review-test-${i}" ]; then
+        echo "FAIL: review-test-${i} still present"
+        exit 1
+    fi
+done
+
+[ -d "${WT_ROOT}/keep-me" ] || { echo "FAIL: keep-me removed"; exit 1; }
+
+# Idempotency: second run.
+ADF_REPO_PATH="$REPO" \
+ADF_WORKTREE_ROOT="$WT_ROOT" \
+ADF_AGENT_TMP_ROOT="${TMP}/agent-tmp-absent" \
+    "$CLEANUP_SH"
+
+echo "PASS: test_adf_cleanup"
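
The test driver in this patch runs the sweep a second time but only checks that it exits cleanly; it does not inspect the second summary line. A minimal sketch of how the `swept=0 failed=0` expectation could be asserted, assuming the summary format `adf-cleanup: swept=N failed=M repo=PATH` stays as documented. The helper name `summary_field` and the sample line are hypothetical and not part of this patch:

```shell
#!/bin/sh
# Hypothetical helper, not part of the patch: pull one numeric field
# (swept or failed) out of an adf-cleanup summary line.
set -eu

summary_field() {
    # $1 = field name, $2 = summary line
    printf '%s\n' "$2" | sed -n "s/^adf-cleanup: .*$1=\([0-9][0-9]*\).*/\1/p"
}

# Sample line in the documented format. In the test driver this would
# instead be the captured stdout of the second "$CLEANUP_SH" run.
line='adf-cleanup: swept=0 failed=0 repo=/tmp/example/repo'

swept="$(summary_field swept "$line")"
failed="$(summary_field failed "$line")"

if [ "$swept" = 0 ] && [ "$failed" = 0 ]; then
    echo "PASS: second run was a no-op"
else
    echo "FAIL: expected swept=0 failed=0, got: $line"
    exit 1
fi
```

Parsing the summary line rather than re-listing the worktree directory keeps the check tied to what operators will actually see in the journal.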