
Production-readiness audit: P0/P1/P2/P3 fixes (1.2.0) #10

Merged
ceejbot merged 19 commits into latest from ceej/update-it
May 7, 2026
Conversation


ceejbot (Owner) commented May 7, 2026

Summary

A focused production-readiness pass on the priority queue, retry, and DLQ
subsystems. Driven by a top-to-bottom code review that surfaced three P0
silent-failure bugs, six P1 correctness/UX bugs, thirteen P2 cleanups, and
several P3 polish items. Every P0 and P1 fix has a regression test that
fails on the prior code; the test suite goes 107 → 115.

This PR is intentionally scoped to bug fixes and doc honesty. Two
follow-up PRs are planned for cleanup work and additional test coverage.

Critical fixes

  • DLQ no longer loses jobs on worker restart. Startup cleanup used to
    delete failed jobs before the DLQ scanner had a chance to capture them.
  • WorkerError::is_retryable() is wired up. Previously dead code; a
    ValidationFailed would burn through ~6h of pointless retries before
    reaching DLQ. Now it short-circuits to DLQ immediately.
  • process_failed_jobs is atomic. UPSERT+DELETE collapsed into one
    CTE; no more "row in both tables" failure window.

Breaking changes (gated behind 1.2.0 minor bump)

  • WorkerRunner::process_available_jobs() -> Result<(), _> (was a fake
    Result<usize, _> always returning Ok(0)).
  • WorkerRunner::worker_count() returns 1 truthfully (was the
    configured-but-not-spawned count).

Other fixes

Filtered-pagination total, label-cardinality bounds in metrics, atomic
DLQ requeue, exponential backoff in the DLQ processor, NOT EXISTS instead
of NOT IN, touch-count semantics on failure_count, parameterized cleanup
SQL, louder log when cleanup deletes without DLQ capture, plus various
doc honesty fixes (truthful RetryPolicy docs, EnqueueOutcome footgun
warning, removed dead queries.rs stub, deleted empty .sqlx/sqlx-data.json).

Verification

  • 115 tests passing (was 107). All new regression tests verified to fail
    on the prior code via git stash round-trips.
  • cargo clippy --all-targets -F axum: clean.
  • cargo test --doc -F axum: 11/11 (1 intentionally ignored — buggy
    pattern example in EnqueueOutcome rustdoc).

See CHANGELOG.md for the full per-finding breakdown.

Test plan

  • CI passes (clippy + nextest + doctests + security audit)
  • Spot-check the new regression tests in tests/dlq_tests.rs,
    tests/plugin_tests.rs, and tests/integration_tests_clean.rs
  • Review the per-commit diffs — the branch is structured as one
    logical change per commit for easy bisecting

ceejbot added 19 commits May 6, 2026 17:51

`WorkerRunner::run_until_cancelled` called `startup_cleanup_with_timeouts`
synchronously at startup, whose third step `cleanup_permanently_failed_jobs`
DELETEs from `_private_jobs` where `attempts >= max_attempts AND locked_at
IS NULL`. The DLQ processor uses the same predicate to find candidates, but
runs as a periodic background task that hadn't ticked yet. Net effect: any
job that hit max_attempts during the previous run (or in the gap between
its final failure and the next DLQ tick) was silently deleted on every
worker restart instead of being captured to the DLQ.

Fix: when DLQ is enabled, run `process_failed_jobs()` synchronously before
startup cleanup. Also document the ordering invariant on
`cleanup_permanently_failed_jobs` and `startup_cleanup_with_timeouts` for
direct callers (ad-hoc maintenance scripts).

Adds a regression test that stages a permanently-failed row in
`_private_jobs`, runs the actual `WorkerRunner` startup path, and asserts
the row reaches the DLQ. Verified to fail on the previous code.

`WorkerError::is_retryable()` and `classify_from_message()` advertised that
non-retryable error variants (`ValidationFailed`, `Unauthorized`,
`InvalidInput`, etc.) would skip retries — but no production code path ever
consulted them. graphile_worker treats every `Err` identically and
reschedules on its `exp(min(attempts, 10))` second formula until
max_attempts is reached. A `ValidationFailed` from a malformed payload
would chew through ~6 hours of wasted retries before reaching the DLQ.
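The retry schedule quoted above can be sketched in a few lines. This is an illustration of the `exp(min(attempts, 10))`-second formula the commit describes, not graphile_worker's actual code; the function name is made up for the sketch:

```rust
// Sketch of graphile_worker's fixed retry schedule as described above:
// each failed attempt n is rescheduled exp(min(n, 10)) seconds into the
// future. Function name is illustrative, not the library's API.
fn retry_delay_secs(attempt: u32) -> f64 {
    (attempt.min(10) as f64).exp()
}

fn main() {
    // The per-retry delay grows exponentially, then plateaus at
    // e^10 ≈ 22026 s (~6.1 h) from attempt 10 onward.
    for attempt in [1, 5, 10, 15] {
        println!("attempt {attempt}: next retry in {:.0}s", retry_delay_secs(attempt));
    }
}
```

This is why a non-retryable `ValidationFailed` is so costly without classification: every retry is wasted, and the later ones are hours apart.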

Adds `PermanentFailurePlugin`, auto-registered alongside `DlqCleanupPlugin`
when the DLQ is enabled. On `JobFail` it classifies `ctx.error` via
`WorkerError::classify_from_message`. If the result is non-retryable, it
runs `UPDATE _private_jobs SET attempts = max_attempts WHERE id = $1 AND
locked_by = $worker_id` so the next `get_job()` skips the row and the next
DLQ-processor tick captures it.

Implementation note: graphile_worker's own `permanently_fail_jobs` SQL
function would be the natural fit, but its `WHERE locked_at IS NULL OR
locked_at < NOW() - 4h` guard is a no-op inside a `JobFail` hook — the
worker (us) still owns the lock at that point. The custom UPDATE filters
by `locked_by = ctx.worker_id` instead, which is a tighter safety property
(only the lock holder can override its own row). graphile_worker clears
locked_at immediately after the hook returns; the `attempts =
max_attempts` we wrote sticks because `fail_job` only mutates last_error,
run_at, and the lock fields.
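In SQL terms, the plugin's short-circuit described above looks roughly like the following (placeholder numbering is illustrative):

```sql
-- Runs inside the JobFail hook, while this worker still holds the lock.
-- Setting attempts = max_attempts makes the next get_job() skip the row
-- and lets the next DLQ-processor tick capture it.
UPDATE _private_jobs
SET attempts = max_attempts
WHERE id = $1
  AND locked_by = $2;  -- only the lock holder may override its own row
```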

Two regression tests in plugin_tests.rs:
- `test_non_retryable_error_short_circuits_retries`: ValidationFailed
  reaches max_attempts after a single execution and lands in DLQ.
- `test_retryable_error_does_not_short_circuit`: TemporaryUnavailable is
  left alone (attempts == 1 after one execution).

The first test was verified to fail without the plugin.

`RetryPolicy.{initial_delay, max_delay, backoff_multiplier, jitter_factor}`
were stored on the struct but never reached graphile_worker — only
`max_attempts` is forwarded by `From<JobSpec> for GraphileJobSpec`.
graphile_worker uses a hard-coded `exp(min(attempts, 10))` second SQL
formula for every retry. So `RetryPolicy::fast()` and
`RetryPolicy::conservative()` produced identical retry timing in practice
even though the docs promised "100ms-30s delays" vs. "1 min - 8 hour
delays".

This commit makes the fact match the promise:

- Marks the unused math helpers (`RetryPolicy::new`, `with_jitter`,
  `calculate_delay`, `calculate_retry_time`, and `JobSpec::calculate_retry_time`)
  as `#[deprecated(since = "1.2.0")]` with notes pointing users at
  `RetryPolicy { max_attempts: n, ..Default::default() }` or the presets.
- Rewrites the rustdoc on `RetryPolicy`, on each preset, on the
  `with_*_retries` builders, and on the `enqueue_*_with_retries`
  convenience helpers to describe what actually happens (only
  `max_attempts` differs across presets, fixed exp-backoff timing).
- Updates the lib.rs module-level rustdoc.
- Migrates `examples/enqueue_jobs.rs` to the recommended pattern so it
  doesn't trigger the new deprecation warnings.
- Updates README.md to drop the false delay-range claims and replace
  the wrong "Pre-configured Fast/Bulk queues" / "Custom(name)" listing
  with the actual `Queue::Parallel` / `Queue::Serial(name)` enum.
- Updates docs/02-dlq.md to mark the post-#9 "queue_name shows as
  default" warning as resolved (the v1.1.1 change fixed it).

The struct fields themselves stay public for source-compatibility with
existing struct-literal construction. Per-job backoff customization needs
upstream graphile_worker support and is deferred.

The list query was filtering DLQ rows correctly via QueryBuilder, but the
companion COUNT query was an unconditional `SELECT COUNT(*) FROM
backfill_dlq` with no WHERE clause. Net effect: any paginated admin UI
that filtered by task / queue / time range computed wrong page counts
because the returned `total` reflected the entire table, not the filtered
subset.

Apply the same WHERE clauses to the count query.

Adds a regression test that stages 5 jobs for task_a and 3 for task_b,
then asserts that filtering by task returns the right per-task total
(not the unfiltered 8). Verified to fail on the previous code.

`process_available_jobs` was declared `Result<usize, BackfillError>` but
unconditionally returned `Ok(0)`. The doc comment admitted this. The
existing test even asserted `assert_eq!(processed, 0)` with a comment
noting that "job counting isn't implemented yet" — i.e., the test was
asserting the bug.

Change the signature to `Result<(), BackfillError>`. Truthful, smaller,
forces the matching `?`-discarding pattern that nearly every existing
caller already uses. Update the doc to point users at lifecycle hook
plugins (`JobComplete` / `JobFail`) for actual job-count instrumentation.

BREAKING CHANGE: callers that captured the return value as a usize must
remove the binding. None of the existing tests or examples relied on the
value beyond asserting it was zero (the bug).

Two atomicity holes in the DLQ paths:

P1-2 (process_failed_jobs):
The reaper did UPSERT into backfill_dlq, then DELETE from _private_jobs as
two separate statements. A crash between them (or a connection blip during
the DELETE) would leave the job in BOTH tables. The code even acknowledged
this with "Consider this a partial failure - job is in DLQ but also still
in main table." Fixed by collapsing both into a single statement using a
writable CTE:

    WITH deleted AS (
        DELETE FROM _private_jobs WHERE id = $1 RETURNING 1
    )
    INSERT INTO backfill_dlq (...)
    SELECT $1::bigint, $2::text, ... FROM deleted
    ON CONFLICT (job_key) WHERE job_key IS NOT NULL DO UPDATE SET ...

If DELETE finds no row (another worker beat us to it), `deleted` is empty,
the SELECT yields zero rows, and the INSERT runs zero times — natural
no-op. If DELETE finds the row, the INSERT either creates a new DLQ row or
ON-CONFLICT-updates an existing one (the requeue-then-fail-again case).
The whole thing commits atomically. `rows_affected() > 0` distinguishes
"actually moved" from "raced and skipped" for the moved-count metric.

Also adds explicit type casts on the SELECT placeholders since they no
longer have the INSERT VALUES position to disambiguate types from.

P1-3 (requeue_dlq_job):
Operations are enqueue (via WorkerUtils::add_raw_job, which hard-codes
the pool) followed by UPDATE backfill_dlq. Can't be wrapped in a single
transaction without forking add_raw_job. Smaller honest fix: demote the
post-enqueue UPDATE error from `?`-propagation to a WARN log and document
the residual non-atomicity in the function docstring.

Rationale: when the enqueue succeeds, the user's stated intent — "put this
job back in the queue" — has been achieved. Failing the operation because
the bookkeeping UPDATE blipped would mislead the user into thinking the
requeue didn't happen, and possibly trigger a double-requeue (which then
errors with AlreadyInProgress). Better to return Ok with a log than to
return Err and confuse the caller.

No new tests: the atomicity guarantees are not directly observable without
a chaos hook. The existing DLQ tests continue to verify happy-path
correctness.

The library was using `Queue::as_str()` (which returns the actual queue
name for serial queues) as the `queue=` Prometheus label across
`backfill_jobs_enqueued`, `backfill_jobs_already_in_progress`,
`backfill_dlq_jobs_added`, and `backfill_dlq_jobs_requeued`. The library's
own docs recommend `Queue::serial_for("user", id)` for per-entity ordering
— which produces a different queue name per user, which produces a
different label value per user, which produces unbounded Prometheus time
series. Industry best practice for cardinality-sensitive systems
(Prometheus, Datadog, etc.) is to keep label cardinality below ~100; per-
user labels easily exceed this by orders of magnitude and can OOM the
metrics backend or render dashboards unusable.

DLQ-side metrics had a related bug: parallel-origin jobs are stored with
queue_name = "" (empty string), which is invalid as a Prometheus label
in some backends.

Adds `Queue::metric_label() -> &'static str` returning "parallel" or
"serial" — bounded, two-valued. Built-in metric emission now uses this.
On the DLQ side where we have a stored queue_name string (not a Queue
enum), `metrics::queue_metric_label_from_name(&str)` does the same
two-valued mapping.

`Queue::as_str()` is preserved for logging (where unbounded values are
fine) and gets a doc comment warning against using it for metrics.

If users want per-queue stats for a known small set of named queues, they
can emit them themselves via a plugin — that's the correct place to make
the cardinality-vs-detail tradeoff per their own setup.

Adds a unit test asserting that `serial_for("user", 12345)` and
`serial_for("user", 99999)` produce the same label.
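A minimal sketch of the two-valued mapping described above. The function name matches the one the commit introduces, but the body here is a stand-in written from the description, not the crate's code:

```rust
// Map a stored queue_name string to a bounded, two-valued metric label.
// Parallel-origin jobs are stored with an empty queue_name; anything
// else is a named serial queue. (Sketch reconstructed from the commit
// description, not the crate's source.)
fn queue_metric_label_from_name(queue_name: &str) -> &'static str {
    if queue_name.is_empty() { "parallel" } else { "serial" }
}

fn main() {
    // Per-entity serial queues collapse to a single bounded label value,
    // so Prometheus cardinality stays at two regardless of user count.
    assert_eq!(queue_metric_label_from_name("user:12345"), "serial");
    assert_eq!(queue_metric_label_from_name("user:99999"), "serial");
    assert_eq!(queue_metric_label_from_name(""), "parallel");
    println!("label cardinality is bounded at 2");
}
```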

The QueueConfig multi-queue API has been misleading since it was
introduced:

- WorkerConfig::with_queues(Vec<QueueConfig>) accepted multiple configs,
  but only the FIRST one's `concurrency` was ever used at runtime
  (worker.rs WorkerOptionsBuilder::new). The rest were silently dropped
  with a single WARN log.
- QueueConfig::priority_queue accepted a (min, max) priority_range that
  was never read by any code path.
- QueueConfig::named_queue accepted a queue name that was logged but never
  applied to graphile_worker's WorkerOptions (which doesn't expose
  per-worker queue filtering).
- worker_count() returned `queue_configs.len()`, lying when callers
  passed N configs and exactly 1 worker actually ran.

This commit makes the surface match reality without removing it (which
would be a bigger break):

- Adds `WorkerConfig::with_concurrency(usize)` — the clean direct
  replacement.
- Marks `WorkerConfig::with_queues`, `QueueConfig::named_queue`, and
  `QueueConfig::priority_queue` as `#[deprecated(since = "1.2.0")]` with
  notes pointing at the working alternatives. The `name` and
  `priority_range` fields get doc comments saying they aren't honored.
- Fixes `worker_count()` to return 1 (the truth) instead of
  `queue_configs.len()`.
- Renames the existing test `test_worker_runner_with_multiple_queues` to
  `test_worker_runner_with_multiple_queues_only_first_honored` and pins
  the actual behaviour (`worker_count() == 1`). The original test's
  `assert_eq!(worker.worker_count(), 3)` was asserting the bug.
- Migrates `examples/basic_worker.rs` from the deprecated multi-queue
  shape to `with_concurrency`. The example previously claimed to
  configure separate fast/bulk/dead_letter workers but only the first
  was ever spawned — now it spawns one worker sized for the largest
  configured concurrency, with a comment explaining the constraint.

Users who actually want multiple specialized workers can spawn multiple
WorkerRunner instances themselves. Per-job queue routing still works
correctly via Queue::serial(name) at enqueue time — that's unchanged.

Three minor improvements bundled together since each is a few lines:

P2-9 (lib.rs enqueue_emergency): drop the redundant `run_at = Some(Utc::now())`.
graphile_worker resolves a NULL run_at to NOW() at the SQL layer, so
setting it explicitly was just an extra Rust-side clock read with no
behavioural difference. Removing it also avoids a tiny clock-skew window
where the Rust-computed timestamp could end up a hair in the past.

P2-12 (client/enqueue.rs enqueue): the duration histogram was emitted on
the success and already-in-progress branches but not the plain-error
branch, biasing `backfill_db_operation_duration_seconds` toward fast
successes. Move the duration recording to fire once after the call
completes so all three outcomes contribute to the histogram.

P2-13 (client/dlq.rs delete_dlq_job): replace the "fetch first, then
delete" pattern with a single `DELETE … RETURNING task_identifier`. Halves
DB round-trips on a hot admin path and removes the small race where the
row could vanish between the SELECT and DELETE.
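The single-statement shape described above is roughly (the `id` predicate is illustrative; the table and returned column are the ones named in this PR):

```sql
-- One round-trip replaces the fetch-then-delete pair. Zero rows returned
-- means the job was already gone: no race window between a separate
-- SELECT and DELETE.
DELETE FROM backfill_dlq
WHERE id = $1
RETURNING task_identifier;
```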

src/client/queries.rs was never `mod`-declared in src/client/mod.rs (or
anywhere else), so the Rust compiler ignored it entirely. Its
rustdoc-as-documentation pattern can't even render via `cargo doc`
because it's not part of the module tree. The defined `JobInfo` /
`DlqJobInfo` structs were never instantiated.

Its purpose was to demonstrate the sqlx::query!() macro pattern for
users wanting compile-time-checked queries — but `docs/01-database-setup.md`
already covers SQLx setup (with `cargo sqlx prepare` for offline mode),
so the duplicated guidance has no unique value.

Delete the file. If we want to ship a real query!() recipe, we'll do it
in proper user-facing docs and back it with actual code that exercises
the macros.

The "find failed jobs not yet in DLQ" predicate used `id NOT IN (SELECT
COALESCE(original_job_id, -1) FROM backfill_dlq)`. The COALESCE was a
correct workaround for NOT IN's NULL-poisons-everything semantics, but:

- NOT EXISTS doesn't have the NULL trap (correlated check, not a list
  membership test), so the COALESCE workaround disappears.
- NOT EXISTS can be planned as an anti-join in PostgreSQL; NOT IN against
  a subquery typically can't (especially when the inner column is
  nullable, even with COALESCE), and degrades into per-row hashing/lookups
  on large DLQs.

Behaviour is identical for the current data (any DLQ row with NULL
original_job_id was created via add_to_dlq for a job we don't have an id
for, which can never match a `_private_jobs.id` anyway). At scale this
should plan and run noticeably faster.
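Side by side, the rewrite looks roughly like this (the outer predicate columns are the ones quoted earlier in this PR; the exact query shape in the crate may differ):

```sql
-- Before: list-membership test, NULL-safe only via the COALESCE workaround.
-- ... WHERE id NOT IN (SELECT COALESCE(original_job_id, -1) FROM backfill_dlq)

-- After: correlated anti-join. No NULL trap, and PostgreSQL can plan it
-- as a true anti-join instead of per-row hashing against the subquery.
SELECT j.*
FROM _private_jobs j
WHERE j.attempts >= j.max_attempts
  AND j.locked_at IS NULL
  AND NOT EXISTS (
      SELECT 1 FROM backfill_dlq d
      WHERE d.original_job_id = j.id
  );
```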

The DLQ processor's background task ticked at a fixed interval regardless
of error history. If process_failed_jobs() persistently errored (DB blip,
auth issue, schema drift) the loop would log an ERROR every `interval`
seconds — spammy and contributing nothing useful. The original code
acknowledged this gap with a TODO-shaped comment.

Replace the tokio::interval-based loop with a manual sleep loop that
applies exponential backoff on consecutive errors:
- interval, 2x, 4x, 8x, 16x, 32x — capped at 32x (so default 60s interval
  caps at ~32min between attempts under sustained failure).
- Any success resets the counter back to a 1x interval.

Behaviour on the happy path is unchanged: first scan still runs
immediately on spawn, subsequent scans run every `interval`. Cancellation
remains responsive (the `biased` select! makes the cancel branch win
ties so a slow-firing cancel doesn't get queued behind a long sleep).

The previous UPSERT computed `failure_count = old + EXCLUDED.failure_count`
where EXCLUDED.failure_count was the failing job's `attempts` value at the
moment of capture. Net semantics were "cumulative handler-failure
invocations summed across all retry cycles" — defensible but ambiguous,
and easily confused with `attempts` on the original job, with `requeued_count`
(admin requeue counter), or with the per-cycle attempt count.

Switch to a clean touch-count semantics: `failure_count = N` means "this
logical job (by job_key) has reached the DLQ N times, regardless of how
many handler invocations each cycle involved." Initial INSERT sets it to
1; the ON CONFLICT branch increments by 1. For jobs without a job_key the
UPSERT path doesn't apply, so each failure creates a fresh row with
failure_count=1.

Updated:
- The bind list in `add_to_dlq` (loses the `attempts` bind for that field;
  failure_count is now hard-coded to 1 in the SQL VALUES).
- The SELECT in `process_failed_jobs`'s atomic CTE (same change).
- `DlqJob::failure_count` rustdoc to explain the new contract clearly.
- Existing test `test_dlq_add_job_and_retrieve` (was asserting
  failure_count == job.attempts(), which was 0 for a freshly-enqueued
  job — coincidentally always-passing under the old semantics).

Adds `test_dlq_failure_count_is_touch_count`: verifies that two
add_to_dlq calls with the same job_key produce a single DLQ row with
failure_count progressing 1 → 2.
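The touch-count upsert described above amounts to something like the following (columns abridged; the conflict target matches the one shown in the earlier CTE):

```sql
-- Initial capture hard-codes failure_count = 1; each subsequent capture
-- of the same job_key increments by exactly 1, independent of how many
-- handler attempts that retry cycle involved.
INSERT INTO backfill_dlq (job_key, task_identifier, failure_count, ...)
VALUES ($1, $2, 1, ...)
ON CONFLICT (job_key) WHERE job_key IS NOT NULL
DO UPDATE SET failure_count = backfill_dlq.failure_count + 1;
```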

P2-5 (Cargo.toml): flag the graphile_worker version-pin coupling. The default
^0.11.1 constraint already excludes 0.12.x, but a patch release within
0.11.x can change the internal `_private_*` schema that backfill queries
directly (cleanup, DLQ scanner, admin endpoints). Add a comment near the
dep flagging this coupling so anyone running `cargo update` knows to
re-run integration tests against any new graphile_worker version. Not
tightening the version range (e.g., to `=0.11.4`) because that would
prevent picking up upstream bug fixes; the test suite is the right
backstop here, not a stricter pin.

P2-8 (lib.rs EnqueueOutcome rustdoc): the `AlreadyInProgress` variant is
a real footgun for callers who chain `.unwrap()` on `enqueue` results.
Rewrite the enum's docstring to spell out exactly when it fires, that
the new payload is *dropped* (not retried), and that `.unwrap()` /
`.expect()` panic on this variant by design. Show the buggy
auto-unwrap pattern explicitly so users recognize it in their own code.

P2-11 (.sqlx + README + docs/01-database-setup.md): remove the empty
`.sqlx/sqlx-data.json` stub and rewrite the SQLx documentation to be
truthful. backfill itself uses runtime sqlx::query() — the compile-time
sqlx::query!() macros don't fit because backfill targets dynamic schema
names (the user's chosen schema). So:
- Delete .sqlx/sqlx-data.json (was always `{}` — never populated, never
  served as an offline-mode cache).
- Update README's "SQLx" section to clarify backfill's actual usage
  (runtime queries, no DATABASE_URL needed at compile time).
- Add a callout at the top of docs/01-database-setup.md's SQLx section
  that the offline-mode setup it describes is for *user* queries against
  backfill's tables, not for backfill itself.

P3-1 (cleanup.rs): the stale-lock SQL was format-interpolating the
timeout in seconds directly into the query string. Safe (it's a u64) but
inconsistent with the parameterized norm everywhere else. Replace
`INTERVAL '{timeout_secs} seconds'` with a bound parameter:
`($1::bigint * interval '1 second')`. Both `release_stale_queue_locks`
and `release_stale_job_locks` updated. Behaviour unchanged; existing
tests cover this.

P3-2 (cleanup.rs cleanup_permanently_failed_jobs): when the DLQ is
disabled this function is the only mechanism that removes failed jobs
from `_private_jobs` — and those jobs are gone forever, since there's no
DLQ to capture them. Previously logged at INFO regardless. Now the
function checks `to_regclass('{schema}.backfill_dlq')` and:
- INFO if the DLQ table exists ("just GC, jobs were captured earlier")
- WARN with explicit "they cannot be recovered" wording if it doesn't,
  pointing the operator at `dlq_processor_interval` if they didn't mean
  to opt out of failure inspection.

P3-4 (lib.rs): add a doc note above the `pub use graphile_worker::{...}`
block flagging that backfill's public surface transitively includes
those types — graphile_worker version bumps that touch any of them
constitute breaking changes for backfill even when graphile_worker
itself doesn't tag them that way (graphile_worker is pre-1.0). Includes
a checklist for what to audit when upgrading.

Closes two of the test-coverage gaps from section 6 of the review.

test_retry_to_exhaustion_then_dlq_via_worker (plugin_tests.rs):
Existing DLQ tests SQL-fake exhaustion via UPDATE _private_jobs SET
attempts = max_attempts, which skips the entire worker code path. This
new test runs the actual handler multiple times via the real worker:
each call increments attempts via get_job's attempts+1, the handler
returns Err, fail_job's SQL fires, run_at gets pushed into the future.
Between iterations the test fast-forwards run_at = NOW() so we don't
sleep through graphile_worker's exp(attempts) second backoff.

After max_attempts iterations the job naturally hits permanently-failed
state, then process_failed_jobs() moves it to DLQ via the atomic CTE
(P1-2). Asserts: the job lands in DLQ with the right task_identifier
and job_key, failure_count is 1 (touch counter, P2-7), and _private_jobs
is empty afterward (P1-2's atomic move worked).

This exercises the seam between every layer added in the P0/P1 work
(P0-1 cleanup ordering, P0-3 plugin classification, P1-2 atomic move,
P2-7 touch counter) along with graphile_worker's own retry semantics.

test_concurrent_enqueue_under_load (integration_tests_clean.rs):
Spawns 10 tokio tasks each enqueueing 50 jobs in parallel against the
same pool. Verifies that all 500 enqueues succeed and that all 500 rows
appear in _private_jobs. Catches the kind of regression that would
appear if a future refactor introduced row-level lock contention or
deadlock paths in the enqueue SQL.
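The shape of that load test, with the database pool swapped for an atomic counter so the sketch is self-contained (the real test uses tokio tasks and the actual enqueue path; everything here is a stand-in):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

// N workers each record `per_worker` enqueues in parallel against shared
// state; every enqueue must be accounted for, with none lost to contention.
fn run_load(workers: usize, per_worker: usize) -> usize {
    let enqueued = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let counter = Arc::clone(&enqueued);
            thread::spawn(move || {
                for _ in 0..per_worker {
                    // Stand-in for client.enqueue(...) against the shared pool.
                    counter.fetch_add(1, Ordering::SeqCst);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    enqueued.load(Ordering::SeqCst)
}

fn main() {
    // Mirrors the 10 tasks x 50 jobs fan-out in the real test.
    assert_eq!(run_load(10, 50), 500);
    println!("all 500 simulated enqueues accounted for");
}
```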

CI runs cargo +nightly fmt --check; my edits were formatted under stable
fmt which has slightly different defaults around method-chain wrapping.
Apply the nightly formatter so CI's fmt step passes. No behavioural
change.
ceejbot merged commit ed910e5 into latest May 7, 2026
1 check passed
ceejbot deleted the ceej/update-it branch May 7, 2026 03:13