Skip to content

fix(daemon): wrap sync search file-output in block_in_place#307

Merged
githubrobbi merged 1 commit into
mainfrom
fix/phase-10f-block-in-place
May 20, 2026
Merged

fix(daemon): wrap sync search file-output in block_in_place#307
githubrobbi merged 1 commit into
mainfrom
fix/phase-10f-block-in-place

Conversation

@githubrobbi
Copy link
Copy Markdown
Collaborator

What

Phase 10f blocking-IO-in-async audit surfaced exactly 1 real prod hazard: async fn search (crates/uffs-daemon/src/index/search.rs:351 pre-fix) calls Self::write_rows_to_file(&filtered_rows, output_path, &output_config) synchronously after the spawn_blocking search join, doing std::fs::File::create + BufWriter::write_all + std::fs::rename on the tokio runtime worker thread.

Hazard scale

For --out=<path> exports with 10⁵+ rows × ~200 bytes ≈ tens of MB (and worst-case multi-million-row exports at hundreds of MB), the write blocks the worker for 100-500 ms. During that window, the IPC accept loop, stats heartbeat, per-shard journal loops, idle-demote controller, and memory-pressure subscriber all stall on the same worker.

Fix

Wrap write_rows_to_file in tokio::task::block_in_place.

Why block_in_place and not spawn_blocking: the Err arm of write_result falls through to the IPC payload-building path and reuses filtered_rowsspawn_blocking would force either an expensive clone of the (potentially millions-of-rows) Vec<DisplayRow> or an invasive Arc<Vec<DisplayRow>> refactor. block_in_place keeps the existing borrow shape, just tells the multi-threaded runtime to migrate other tasks off this worker for the duration.

Runtime safety: the daemon uses #[tokio::main] (default multi-threaded runtime; see crates/uffs-daemon/src/main.rs:119). block_in_place requires multi-threaded and panics on current_thread. The two new_current_thread instances in crates/uffs-daemon/src/ipc/windows_unix_bridge.rs:183,237 are separate helper threads driving bridge I/O — not the main runtime hosting async fn search.

// Phase 10f: `write_rows_to_file` does sync `File::create` +
// buffered `write_all` + `rename` on the tokio runtime
// thread.  For large result sets (10⁵+ rows × ~200 bytes ≈
// tens of MB), the write blocks for tens-to-hundreds of ms;
// `block_in_place` tells the multi-threaded runtime to move
// other tasks off this worker for the duration so the IPC
// accept loop, stats heartbeat, and per-shard journal loops
// keep making progress.  Cheaper than `spawn_blocking` here
// because the `Err` arm falls through to the IPC path and
// reuses `filtered_rows` — `spawn_blocking` would force an
// expensive clone or an `Arc<Vec<DisplayRow>>` refactor.
let write_result = tokio::task::block_in_place(|| {
    Self::write_rows_to_file(&filtered_rows, output_path, &output_config)
});

Hand-audit summary

14 candidate files flagged by scripts/dev/concurrency_audit.sh §8. Hand-audit classification:

  • 13 sites justified as (b) sync helper / startup-once / Drop / CLI one-shot — PID-file ops in lifecycle.rs; stale-socket cleanup in ipc.rs / windows_unix_bridge.rs; tracing dir setup; CLI command handlers in uffs-mft/src/commands/windows/*.rs and uffs-mcp/src/process.rs; std::thread::sleep in a dedicated std::thread::spawn watchdog (not async).
  • 1 real prod hazard — fixed here.

Soft stylistic notes (not fixed; deferred):

  • windows_unix_bridge.rs:43 — could move stale-socket cleanup into a sync helper for consistency with the Unix path.
  • connect.rs:736 — could swap std::fs::read_to_stringtokio::fs::read_to_string(...).await on the client shutdown path.
  • mcp/process.rs:229,255 — could swap std::thread::sleeptokio::time::sleep(...).await in async fn mcp_reload.

Per-site inventory in docs/dev/baseline/2026-05-19/phase_10_blocking_io_in_async_audit.md (local; docs/dev/baseline/ is gitignored).

Rule-1 adherence

  • Zero #[allow(...)] introductions.
  • No suppression hacks, no disabled lints, no skipped tests.
  • Observable behavior preserved — return value, file contents, error semantics identical (the wrapper only changes runtime isolation, not the I/O operation itself).
  • Surgical — single-call-site fix + inline rationale comment.

Verification

Local:

  • cargo fmt -p uffs-daemon — clean.
  • cargo clippy -p uffs-daemon --all-targets -- -D warnings — 0 warnings.
  • cargo test -p uffs-daemon298 lib + 0 integration (4 ignored, require pre-built daemon binary) + 0 doctest passed.
  • Pre-commit lint-fast — all 7 stages passed.
  • Pre-push lint-pre-push — all 17 stages passed in 88 s.

Acceptance against playbook §1142-1146

No async fn performs unbounded sync I/O without spawn_blocking / block_in_place.

CLOSED for Phase 10f. The single prod hazard is fixed with block_in_place. All other candidate sites are classified as sync helpers, startup-once, Drop, or CLI one-shot context.

Will be folded into concurrency_policy.md §"Blocking-I/O rule" in Phase 10g.

Tracking

Refs #302 (Phase 10 umbrella). Sub-phases remaining:

  • 10g — concurrency_policy.md + per-crate # Concurrency rustdoc.
  • 10h — CONTRIBUTING cross-link + final report (folded into 10g).

Phase 10f blocking-IO-in-async audit surfaced one real prod hazard:
`async fn search` calls `Self::write_rows_to_file(...)` synchronously
after the spawn_blocking search join, doing `File::create` + buffered
`write_all` + `rename` on the tokio runtime worker.

For `--out=<path>` exports with 10⁵+ rows × ~200 bytes ≈ tens of MB,
the write blocks the worker for 100-500 ms.  During that window the
IPC accept loop, stats heartbeat, per-shard journal loops, idle-demote
controller, and memory-pressure subscriber all stall on the same
worker.

## Fix

Wrap `write_rows_to_file` in `tokio::task::block_in_place`.

Chose `block_in_place` over `spawn_blocking` because the `Err` arm of
`write_result` falls through to the IPC payload-building path and
reuses `filtered_rows` — `spawn_blocking` would force either an
expensive clone of the (potentially millions-of-rows)
`Vec<DisplayRow>` or an invasive `Arc<Vec<DisplayRow>>` refactor.
`block_in_place` keeps the existing borrow shape, just tells the
multi-threaded runtime to migrate other tasks off this worker for
the duration.

Runtime safety: the daemon uses `#[tokio::main]` (default
multi-threaded runtime; see `crates/uffs-daemon/src/main.rs:119`).
`block_in_place` requires multi-threaded and panics on
`current_thread`.  The two `new_current_thread` instances in
`crates/uffs-daemon/src/ipc/windows_unix_bridge.rs:183,237` are
separate helper threads driving bridge I/O — not the main runtime
hosting `async fn search`.

## Hand-audit summary (Phase 10f)

14 candidate files flagged by `scripts/dev/concurrency_audit.sh §8`.
Hand-audit classification:

  * 13 sites justified as (b) sync helper / startup-once / `Drop` /
    CLI one-shot (PID-file ops in `lifecycle.rs`; stale-socket
    cleanup in `ipc.rs`/`windows_unix_bridge.rs`; tracing dir setup
    in `lib.rs`/`uffs-mcp/src/lib.rs`; CLI command handlers in
    `uffs-mft/src/commands/windows/*.rs` and `uffs-mcp/src/process.rs`;
    `std::thread::sleep` in dedicated `std::thread::spawn` watchdog).
  * **1 real prod hazard** — fixed here.

Soft stylistic notes (not fixed; deferred to a future pass):

  * `windows_unix_bridge.rs:43` — could move stale-socket cleanup
    into a sync helper for consistency with the Unix path.
  * `connect.rs:736` — could swap `std::fs::read_to_string` →
    `tokio::fs::read_to_string(...).await` on the client `shutdown`
    path.
  * `mcp/process.rs:229,255` — could swap `std::thread::sleep` →
    `tokio::time::sleep(...).await` in `async fn mcp_reload`.

Per-site inventory in
`docs/dev/baseline/2026-05-19/phase_10_blocking_io_in_async_audit.md`
(local).

## Rule-1 adherence

  * Zero `#[allow(...)]` introductions.
  * No suppression hacks, no skipped tests.
  * Observable behavior preserved — return value, file contents,
    error semantics identical (the wrapper only changes runtime
    isolation, not the I/O operation itself).
  * `cargo fmt -p uffs-daemon` — clean.
  * `cargo clippy -p uffs-daemon --all-targets -- -D warnings` — 0
    warnings.
  * `cargo test -p uffs-daemon` — **298 lib + 0 integration (4
    ignored, require pre-built daemon binary) + 0 doctest passed**.

Refs #302.
@githubrobbi githubrobbi merged commit bc35816 into main May 20, 2026
21 checks passed
@githubrobbi githubrobbi deleted the fix/phase-10f-block-in-place branch May 20, 2026 13:26
githubrobbi added a commit that referenced this pull request May 20, 2026
…s (Phase 10g) (#308)

* docs(concurrency): add concurrency_policy.md + per-crate # Concurrency rustdoc

Phase 10g closes the Phase 10 audit chain with a workspace concurrency
contract doc and per-crate rustdoc sections summarizing each crate's
runtime model.

## What lands

* `docs/architecture/code-quality/concurrency_policy.md` — 7th
  companion to the existing 6 code-quality policy docs (panic,
  allocation, trait, dependency, build/codegen, lint-posture).
  Codifies the five concurrency dimensions:

    - Task ownership (T1/T2/T3 with required four-facet annotation).
    - Lock discipline (L1-L5 patterns; L6 lock-across-await
      forbidden; enforced by three `await_holding_*` clippy lints at
      deny).
    - Channel discipline (C1-C5; C6 undocumented-unbounded forbidden).
    - Timeout policy (W1-W4; W5 unbounded cross-process await
      forbidden).
    - Blocking-IO rule (B1-B4; B5 unbounded sync I/O on async runtime
      worker forbidden).

  Plus shutdown coordination, required annotation templates,
  per-crate posture matrix, verification steps, anti-patterns, and
  the Phase 10 audit trail cross-linking PRs #303-#307 and the local
  per-dimension audit docs in `docs/dev/baseline/2026-05-19/`.

* `crates/uffs-daemon/src/lib.rs` `# Concurrency` rustdoc — runtime
  model + the six named `spawn_*` constructors that form the
  daemon's startup graph + cross-link to the policy doc.

* `crates/uffs-mcp/src/lib.rs` `# Concurrency` rustdoc — stdio
  dispatcher / streamable-http axum server / daemon-bridge
  reader-loop ownership + reload-pipeline-is-CLI-one-shot note.

* `crates/uffs-client/src/lib.rs` `# Concurrency` rustdoc — hybrid
  async + sync runtime model + the deliberate 300 s async vs 60 s
  env-overridable sync timeout asymmetry documented in
  `phase_10_timeout_coverage_audit.md`.

* `crates/uffs-mft/src/lib.rs` `# Concurrency` rustdoc — predominantly
  sync library + CLI binary; tokio only for daemon-embedded
  `spawn_blocking` MFT reads.

* `CONTRIBUTING.md` `## Concurrency policy` section — quotes the
  one-line rule, lists the five dimensions with their taxonomies,
  and cross-links the policy doc.

## Rule-1 adherence

* Zero `#[allow(...)]` introductions.
* No suppression hacks, no skipped tests.
* Doc-only PR — no behavior change.
* `cargo clippy --workspace --all-targets -- -D warnings` — clean.

## Follow-up

A separate commit on this branch decomposes
`crates/uffs-daemon/src/lib.rs` (currently 1066 LOC, over the 800-LOC
file-size policy ceiling) into `tracing.rs`, `startup.rs`, and
`shutdown.rs` sibling modules, keeping the spawn_* cluster cohesive
per the existing rationale comment.

Refs #302 (Phase 10 umbrella).

* refactor(daemon): split lib.rs into log_init / startup / shutdown sibling modules

`crates/uffs-daemon/src/lib.rs` was 1066 LOC (266 over the 800-LOC
file-size policy ceiling).  The existing rationale comment argued
against splitting the `spawn_*` cluster because doing so would
fragment the parent-task lifetime relationships — and that argument
still holds.  But the orchestrator's *setup* and *terminal* phases
have no such coupling and can move out cleanly.

## Extractions

* `crates/uffs-daemon/src/log_init.rs` (105 LOC, new) — `init_tracing`
  + `default_log_file`.  No collision with the `tracing` crate
  because the module is named `log_init`.  `pub use
  log_init::init_tracing;` re-export preserves the public API
  exactly (`uffs_daemon::init_tracing` still works for the daemon
  binary + the embedded `uffs daemon run` subcommand).

* `crates/uffs-daemon/src/startup.rs` (238 LOC, new) — pre-spawn
  helpers:

    - `install_catastrophe_panic_hook`
    - `log_daemon_starting` + `emit_daemon_starting_event`
    - `load_daemon_config`
    - `bootstrap_lifecycle_manager`
    - `gather_mft_files` + `drive_letter_matches` + `resolve_drive_list`
      (windows + non-windows variants)
    - `validate_data_sources`

  All `pub(crate)`.  `drive_letter_matches` is `pub(crate)` (was
  private) because the existing regression-pin test in
  `crates/uffs-daemon/src/tests.rs` exercises it directly — the
  contract pin was preserved by updating the import to
  `super::startup::drive_letter_matches;`.

* `crates/uffs-daemon/src/shutdown.rs` (84 LOC, new) —
  `await_shutdown_then_force_exit` + `force_exit_with_watchdog`.
  All `pub(crate)`.

## lib.rs after

* **689 LOC** (was 1066) — under the 800-LOC ceiling without an
  exception.
* Keeps the `spawn_*` cluster cohesive:
  `spawn_load_task` / `spawn_ipc_servers` /
  `spawn_stats_heartbeat` / `spawn_idle_demote_controller` /
  `spawn_journal_loops_for_warm_shards` /
  `spawn_pressure_subscriber`.
* Keeps `DaemonConfig` (public struct) and `run_daemon` (public
  orchestrator) for binary + embedded use.
* Removed the obsolete `Exception: file_size_policy allows this
  file to exceed 800 LOC` rationale comment from the crate-root
  rustdoc.

## file_size_exceptions.txt

Removed the now-obsolete `crates/uffs-daemon/src/lib.rs|PERMANENT:
…` entry.  Local `bash scripts/ci/check_file_size_policy.sh` passes
clean.

## Rule-1 adherence

* Zero `#[allow(...)]` introductions.
* No suppression hacks, no skipped tests.
* No public API change — `uffs_daemon::{init_tracing, run_daemon,
  DaemonConfig}` resolve to the same items as before via the
  `pub use log_init::init_tracing;` re-export.
* `cargo fmt --all` — clean.
* `cargo clippy --workspace --all-targets -- -D warnings` — clean.
* `cargo test -p uffs-daemon --lib` — **298 passed / 0 failed**.
* `bash scripts/ci/check_file_size_policy.sh` — clean.

Refs #302 (Phase 10 umbrella).

* docs(concurrency): close Phase 10 gaps — §0 model + §6 registry + named clippy entries + audit-script speedup

Closes the 4 gaps surfaced by the Phase 10 fidelity audit against the
playbook §1142-1146 pass criteria + plan §2 acceptance criteria:

## AC3 — `§0 The model at a glance` (was missing)

Adds the playbook's headline pass criterion ("concurrency model can
be explained on one page") to `concurrency_policy.md`:

  * **Daemon task graph** — Mermaid diagram with 11 nodes covering
    the 6 top-level `spawn_*` constructors + 2 subsystem-internal
    spawns (per-shard journal loop + `RegistryPatchSink`) +
    per-connection `handle_connection` + writer/notification
    sub-tasks.
  * **Shard-state lifecycle table** — 6 states (`Unknown`, `Cold`,
    `Parked`, `Warm`, `Hot`, `Evicting`) with bloom/trie/body
    presence + entry triggers; legal transitions cross-link
    `ShardState::can_transition_to`.  Two demote drivers explained:
    idle-TTL (`spawn_idle_demote_controller`, 30 s cadence) +
    memory-pressure cascade (`spawn_pressure_subscriber`, no-op on
    Mac/Linux).
  * **IPC-request lifecycle** — 4 numbered steps: accept →
    per-connection task → reader/writer/notifier → per-RPC
    timeout (search 30 s, drive-load
    `IndexManager::DRIVE_LOAD_TIMEOUT`, refresh fire-and-forget).
  * **Shutdown sequence** — 5 numbered steps: signal source → IPC
    drain → load drain (3 s timeout) → PID + socket cleanup →
    force-exit watchdog (5 s) + `process::exit(0)`.

Word count: ~520 prose words + 1 Mermaid + 1 sub-table — under the
600-word budget per plan §3 risk register.

## AC2b + AC6 — `§6 Spawn-site registry` (was missing)

Adds the workspace-tracked enumeration of all 18 prod
`tokio::spawn(` sites with the four required facets (owner /
shutdown / errors / cancel) — closes plan AC#2's "full inventory in
`concurrency_policy.md §"Spawn-site registry"`" + AC#6's "all 7
required sections".

The table is keyed by group (A = top-level, B = subsystem long-lived,
C = per-connection, D = sub-task, E = one-shot, F = runtime-cleanup
/ external) and references the source-of-truth hand-audit at
`docs/dev/baseline/2026-05-19/phase_10_task_ownership_inventory.md`
(local-only) for per-site nuance.  Adding a new prod spawn site now
requires a matching row in the same PR — the audit-script `§2`
count is the gate; new spawn with no registry row fails review.

Renumbering: §6 (was Verification) → §7, §7 → §8, §8 → §9 to
preserve narrative flow (policy → posture → registry → verification
→ anti-patterns → audit trail).

## G1 — Named `await_holding_*` clippy entries

Pins all 3 lints from clippy's `suspicious` group at deny in the
workspace `Cargo.toml`:

  ```toml
  await_holding_lock         = "deny"
  await_holding_refcell_ref  = "deny"
  await_holding_invalid_type = "deny"
  ```

Effectively no behavior change — they were at warn-by-default,
promoted to error via the workspace's CI `-D warnings` flag.  The
named entries make the policy doc's enforcement claim literal and
survive any future tightening of the `--deny warnings` shape.  An
inline comment block at the entries documents the rationale +
cross-links `concurrency_policy.md §1`.

Also fixes a leaked absolute path in the policy doc (line 36 was
`@/Users/rnio/Private/Github/UltraFastFileSearch/Cargo.toml` — now
just `Cargo.toml`).

## AC5 — Audit script `< 5 s` (was 6.2 s)

Adds `bulk_per_crate` helper to `scripts/dev/concurrency_audit.sh`
that runs **one** `rg` invocation per pattern across all of
`crates/` and buckets per-file counts into per-crate totals via
`awk` + a bash nameref (`local -n`).  Replaces the prior N-crates ×
M-patterns nested loop (~126 rg invocations for §1 alone, ~1.5 s of
process-spawn overhead).

Wall-time impact: **6.2 s → 4.5 s** (verified locally).  Output is
byte-identical (modulo the `Captured:` ISO timestamp).

Requires bash 4.3+ for nameref — script shebang is `#!/usr/bin/env
bash`; macOS `brew bash` ships 5.x; CI runners are Ubuntu 22.04+
with bash 5.x by default.

## Audit trail (§9) update

Updates `concurrency_policy.md §9` to mention 10g's bonus
deliverables (daemon `lib.rs` decomposition, §0 model, §6 registry,
named clippy entries) + adds 10h row pointing at the local-only
final report.

## Final report (G3, local-only)

Adds `docs/dev/baseline/2026-05-20/phase_10_final_report.md` —
mirrors `phase_9_final_report.md` pattern.  ~290 LOC covering:

  * §0 Executive summary
  * §1 Per-dimension outcome table (5 dimensions + cross-cutting
    cancellation infra + Arc<Mutex<…>> nesting depth)
  * §2 Concrete change summary (10 numbered items across the 6 PRs)
  * §3 Acceptance scorecard (8 of 8 ACs + 3 of 3 playbook P-criteria
    green)
  * §4 Test coverage audit (G2 verdict — no real gap; existing
    298 unit + 4 integration + 13 journal-loop tests cover every
    spawn-site cancellation path; soak/stress deferred to Phase 12
    per plan §0.3)
  * §5 Risk-register outcomes (6 of 6 mitigations consumed without
    incident)
  * §6 Cross-references
  * §7 Verification at closing SHA
  * §8 Decisions log (10 entries 10a-10h)
  * §9 Phase 10 closure with #302 close note

Local-only (gitignored under `docs/dev/baseline/`); mirrors the
Phase 0-9 final-report cadence.

## Rule-1 adherence

  * Zero `#[allow(...)]` introductions.
  * No suppression hacks, no skipped tests, no lint disables — the
    3 new clippy entries are *deny*, not allow.
  * No public API change.
  * `cargo fmt --all` — clean.
  * `cargo clippy --workspace --all-targets -- -D warnings` —
    clean (now also enforcing the 3 named `await_holding_*` lints).
  * `cargo test -p uffs-daemon --lib` — 298 passed / 0 failed.
  * `RUSTDOCFLAGS="-D warnings -D rustdoc::broken-intra-doc-links"
    cargo doc --workspace --no-deps` — clean.
  * `bash scripts/dev/concurrency_audit.sh` — runs in 4.5 s,
    output byte-identical to pre-optimization.

Refs #302 (Phase 10 umbrella).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant