
Production-readiness audit: P0/P1/P2/P3 fixes (1.2.0) #10

Merged
ceejbot merged 19 commits into latest from ceej/update-it
May 7, 2026
Conversation


ceejbot (Owner) commented May 7, 2026

Summary

A focused production-readiness pass on the priority queue, retry, and DLQ
subsystems. Driven by a top-to-bottom code review that surfaced three P0
silent-failure bugs, six P1 correctness/UX bugs, thirteen P2 cleanups, and
several P3 polish items. Every P0 and P1 fix has a regression test that
fails on the prior code; the test suite goes 107 → 115.

This PR is intentionally scoped to bug fixes and doc honesty. Two
follow-up PRs are planned for cleanup work and additional test coverage.

Critical fixes

  • DLQ no longer loses jobs on worker restart. Startup cleanup used to
    delete failed jobs before the DLQ scanner had a chance to capture them.
  • WorkerError::is_retryable() is wired up. Previously dead code; a
    ValidationFailed would burn through ~6h of pointless retries before
    reaching DLQ. Now it short-circuits to DLQ immediately.
  • process_failed_jobs is atomic. UPSERT+DELETE collapsed into one
    CTE; no more "row in both tables" failure window.

Breaking changes (gated behind 1.2.0 minor bump)

  • WorkerRunner::process_available_jobs() -> Result<(), _> (was a fake
    Result<usize, _> always returning Ok(0)).
  • WorkerRunner::worker_count() returns 1 truthfully (was the
    configured-but-not-spawned count).

Other fixes

Filtered-pagination total, label-cardinality bounds in metrics, atomic
DLQ requeue, exponential backoff in the DLQ processor, NOT EXISTS instead
of NOT IN, touch-count semantics on failure_count, parameterized cleanup
SQL, louder log when cleanup deletes without DLQ capture, plus various
doc honesty fixes (truthful RetryPolicy docs, EnqueueOutcome footgun
warning, removed dead queries.rs stub, deleted empty .sqlx/sqlx-data.json).

Verification

  • 115 tests passing (was 107). All new regression tests verified to fail
    on the prior code via git stash round-trips.
  • cargo clippy --all-targets -F axum: clean.
  • cargo test --doc -F axum: 11/11 (1 intentionally ignored — buggy
    pattern example in EnqueueOutcome rustdoc).

See CHANGELOG.md for the full per-finding breakdown.

Test plan

  • CI passes (clippy + nextest + doctests + security audit)
  • Spot-check the new regression tests in tests/dlq_tests.rs,
    tests/plugin_tests.rs, and tests/integration_tests_clean.rs
  • Review the per-commit diffs — the branch is structured as one
    logical change per commit for easy bisecting

ceejbot added 19 commits May 6, 2026 17:51

`WorkerRunner::run_until_cancelled` called `startup_cleanup_with_timeouts`
synchronously at startup, whose third step `cleanup_permanently_failed_jobs`
DELETEs from `_private_jobs` where `attempts >= max_attempts AND locked_at
IS NULL`. The DLQ processor uses the same predicate to find candidates, but
runs as a periodic background task that hadn't ticked yet. Net effect: any
job that hit max_attempts during the previous run (or in the gap between
its final failure and the next DLQ tick) was silently deleted on every
worker restart instead of being captured to the DLQ.

Fix: when DLQ is enabled, run `process_failed_jobs()` synchronously before
startup cleanup. Also document the ordering invariant on
`cleanup_permanently_failed_jobs` and `startup_cleanup_with_timeouts` for
direct callers (ad-hoc maintenance scripts).

Adds a regression test that stages a permanently-failed row in
`_private_jobs`, runs the actual `WorkerRunner` startup path, and asserts
the row reaches the DLQ. Verified to fail on the previous code.

`WorkerError::is_retryable()` and `classify_from_message()` advertised that
non-retryable error variants (`ValidationFailed`, `Unauthorized`,
`InvalidInput`, etc.) would skip retries — but no production code path ever
consulted them. graphile_worker treats every `Err` identically and
reschedules on its `exp(min(attempts, 10))` second formula until
max_attempts is reached. A `ValidationFailed` from a malformed payload
would chew through ~6 hours of wasted retries before reaching the DLQ.
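The retry schedule quoted above can be sketched in a few lines. This is an illustration of the `exp(min(attempts, 10))`-second formula the commit describes, not graphile_worker's actual code; the function name is made up for the sketch:

```rust
// Sketch of graphile_worker's fixed retry schedule as described above:
// each failed attempt n is rescheduled exp(min(n, 10)) seconds into the
// future. Function name is illustrative, not the library's API.
fn retry_delay_secs(attempt: u32) -> f64 {
    (attempt.min(10) as f64).exp()
}

fn main() {
    // The per-retry delay grows exponentially, then plateaus at
    // e^10 ≈ 22026 s (~6.1 h) from attempt 10 onward.
    for attempt in [1, 5, 10, 15] {
        println!("attempt {attempt}: next retry in {:.0}s", retry_delay_secs(attempt));
    }
}
```

This is why a non-retryable `ValidationFailed` is so costly without classification: every retry is wasted, and the later ones are hours apart.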

Adds `PermanentFailurePlugin`, auto-registered alongside `DlqCleanupPlugin`
when the DLQ is enabled. On `JobFail` it classifies `ctx.error` via
`WorkerError::classify_from_message`. If the result is non-retryable, it
runs `UPDATE _private_jobs SET attempts = max_attempts WHERE id = $1 AND
locked_by = $worker_id` so the next `get_job()` skips the row and the next
DLQ-processor tick captures it.

Implementation note: graphile_worker's own `permanently_fail_jobs` SQL
function would be the natural fit, but its `WHERE locked_at IS NULL OR
locked_at < NOW() - 4h` guard is a no-op inside a `JobFail` hook — the
worker (us) still owns the lock at that point. The custom UPDATE filters
by `locked_by = ctx.worker_id` instead, which is a tighter safety property
(only the lock holder can override its own row). graphile_worker clears
locked_at immediately after the hook returns; the `attempts =
max_attempts` we wrote sticks because `fail_job` only mutates last_error,
run_at, and the lock fields.
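In SQL terms, the plugin's short-circuit described above looks roughly like the following (placeholder numbering is illustrative):

```sql
-- Runs inside the JobFail hook, while this worker still holds the lock.
-- Setting attempts = max_attempts makes the next get_job() skip the row
-- and lets the next DLQ-processor tick capture it.
UPDATE _private_jobs
SET attempts = max_attempts
WHERE id = $1
  AND locked_by = $2;  -- only the lock holder may override its own row
```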

Two regression tests in plugin_tests.rs:
- `test_non_retryable_error_short_circuits_retries`: ValidationFailed
  reaches max_attempts after a single execution and lands in DLQ.
- `test_retryable_error_does_not_short_circuit`: TemporaryUnavailable is
  left alone (attempts == 1 after one execution).

The first test was verified to fail without the plugin.

`RetryPolicy.{initial_delay, max_delay, backoff_multiplier, jitter_factor}`
were stored on the struct but never reached graphile_worker — only
`max_attempts` is forwarded by `From<JobSpec> for GraphileJobSpec`.
graphile_worker uses a hard-coded `exp(min(attempts, 10))` second SQL
formula for every retry. So `RetryPolicy::fast()` and
`RetryPolicy::conservative()` produced identical retry timing in practice
even though the docs promised "100ms-30s delays" vs. "1 min - 8 hour
delays".

This commit makes the fact match the promise:

- Marks the unused math helpers (`RetryPolicy::new`, `with_jitter`,
  `calculate_delay`, `calculate_retry_time`, and `JobSpec::calculate_retry_time`)
  as `#[deprecated(since = "1.2.0")]` with notes pointing users at
  `RetryPolicy { max_attempts: n, ..Default::default() }` or the presets.
- Rewrites the rustdoc on `RetryPolicy`, on each preset, on the
  `with_*_retries` builders, and on the `enqueue_*_with_retries`
  convenience helpers to describe what actually happens (only
  `max_attempts` differs across presets, fixed exp-backoff timing).
- Updates the lib.rs module-level rustdoc.
- Migrates `examples/enqueue_jobs.rs` to the recommended pattern so it
  doesn't trigger the new deprecation warnings.
- Updates README.md to drop the false delay-range claims and replace
  the wrong "Pre-configured Fast/Bulk queues" / "Custom(name)" listing
  with the actual `Queue::Parallel` / `Queue::Serial(name)` enum.
- Updates docs/02-dlq.md to mark the post-#9 "queue_name shows as
  default" warning as resolved (the v1.1.1 change fixed it).

The struct fields themselves stay public for source-compatibility with
existing struct-literal construction. Per-job backoff customization needs
upstream graphile_worker support and is deferred.

The list query was filtering DLQ rows correctly via QueryBuilder, but the
companion COUNT query was an unconditional `SELECT COUNT(*) FROM
backfill_dlq` with no WHERE clause. Net effect: any paginated admin UI
that filtered by task / queue / time range computed wrong page counts
because the returned `total` reflected the entire table, not the filtered
subset.

Apply the same WHERE clauses to the count query.

Adds a regression test that stages 5 jobs for task_a and 3 for task_b,
then asserts that filtering by task returns the right per-task total
(not the unfiltered 8). Verified to fail on the previous code.

`process_available_jobs` was declared `Result<usize, BackfillError>` but
unconditionally returned `Ok(0)`. The doc comment admitted this. The
existing test even asserted `assert_eq!(processed, 0)` with a comment
noting that "job counting isn't implemented yet" — i.e., the test was
asserting the bug.

Change the signature to `Result<(), BackfillError>`. Truthful, smaller,
forces the matching `?`-discarding pattern that nearly every existing
caller already uses. Update the doc to point users at lifecycle hook
plugins (`JobComplete` / `JobFail`) for actual job-count instrumentation.

BREAKING CHANGE: callers that captured the return value as a usize must
remove the binding. None of the existing tests or examples relied on the
value beyond asserting it was zero (the bug).

Two atomicity holes in the DLQ paths:

P1-2 (process_failed_jobs):
The reaper did UPSERT into backfill_dlq, then DELETE from _private_jobs as
two separate statements. A crash between them (or a connection blip during
the DELETE) would leave the job in BOTH tables. The code even acknowledged
this with "Consider this a partial failure - job is in DLQ but also still
in main table." Fixed by collapsing both into a single statement using a
writable CTE:

    WITH deleted AS (
        DELETE FROM _private_jobs WHERE id = $1 RETURNING 1
    )
    INSERT INTO backfill_dlq (...)
    SELECT $1::bigint, $2::text, ... FROM deleted
    ON CONFLICT (job_key) WHERE job_key IS NOT NULL DO UPDATE SET ...

If DELETE finds no row (another worker beat us to it), `deleted` is empty,
the SELECT yields zero rows, and the INSERT runs zero times — natural
no-op. If DELETE finds the row, the INSERT either creates a new DLQ row or
ON-CONFLICT-updates an existing one (the requeue-then-fail-again case).
The whole thing commits atomically. `rows_affected() > 0` distinguishes
"actually moved" from "raced and skipped" for the moved-count metric.

Also adds explicit type casts on the SELECT placeholders since they no
longer have the INSERT VALUES position to disambiguate types from.

P1-3 (requeue_dlq_job):
Operations are enqueue (via WorkerUtils::add_raw_job, which hard-codes
the pool) followed by UPDATE backfill_dlq. Can't be wrapped in a single
transaction without forking add_raw_job. Smaller honest fix: demote the
post-enqueue UPDATE error from `?`-propagation to a WARN log and document
the residual non-atomicity in the function docstring.

Rationale: when the enqueue succeeds, the user's stated intent — "put this
job back in the queue" — has been achieved. Failing the operation because
the bookkeeping UPDATE blipped would mislead the user into thinking the
requeue didn't happen, and possibly trigger a double-requeue (which then
errors with AlreadyInProgress). Better to return Ok with a log than to
return Err and confuse the caller.

No new tests: the atomicity guarantees are not directly observable without
a chaos hook. The existing DLQ tests continue to verify happy-path
correctness.

The library was using `Queue::as_str()` (which returns the actual queue
name for serial queues) as the `queue=` Prometheus label across
`backfill_jobs_enqueued`, `backfill_jobs_already_in_progress`,
`backfill_dlq_jobs_added`, and `backfill_dlq_jobs_requeued`. The library's
own docs recommend `Queue::serial_for("user", id)` for per-entity ordering
— which produces a different queue name per user, which produces a
different label value per user, which produces unbounded Prometheus time
series. Industry best practice for cardinality-sensitive systems
(Prometheus, Datadog, etc.) is to keep label cardinality below ~100; per-
user labels easily exceed this by orders of magnitude and can OOM the
metrics backend or render dashboards unusable.

DLQ-side metrics had a related bug: parallel-origin jobs are stored with
queue_name = "" (empty string), which is invalid as a Prometheus label
in some backends.

Adds `Queue::metric_label() -> &'static str` returning "parallel" or
"serial" — bounded, two-valued. Built-in metric emission now uses this.
On the DLQ side where we have a stored queue_name string (not a Queue
enum), `metrics::queue_metric_label_from_name(&str)` does the same
two-valued mapping.

`Queue::as_str()` is preserved for logging (where unbounded values are
fine) and gets a doc comment warning against using it for metrics.

If users want per-queue stats for a known small set of named queues, they
can emit them themselves via a plugin — that's the correct place to make
the cardinality-vs-detail tradeoff per their own setup.

Adds a unit test asserting that `serial_for("user", 12345)` and
`serial_for("user", 99999)` produce the same label.
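A minimal sketch of the two-valued mapping described above. The function name matches the one the commit introduces, but the body here is a stand-in written from the description, not the crate's code:

```rust
// Map a stored queue_name string to a bounded, two-valued metric label.
// Parallel-origin jobs are stored with an empty queue_name; anything
// else is a named serial queue. (Sketch reconstructed from the commit
// description, not the crate's source.)
fn queue_metric_label_from_name(queue_name: &str) -> &'static str {
    if queue_name.is_empty() { "parallel" } else { "serial" }
}

fn main() {
    // Per-entity serial queues collapse to a single bounded label value,
    // so Prometheus cardinality stays at two regardless of user count.
    assert_eq!(queue_metric_label_from_name("user:12345"), "serial");
    assert_eq!(queue_metric_label_from_name("user:99999"), "serial");
    assert_eq!(queue_metric_label_from_name(""), "parallel");
    println!("label cardinality is bounded at 2");
}
```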

The QueueConfig multi-queue API has been misleading since it was
introduced:

- WorkerConfig::with_queues(Vec<QueueConfig>) accepted multiple configs,
  but only the FIRST one's `concurrency` was ever used at runtime
  (worker.rs WorkerOptionsBuilder::new). The rest were silently dropped
  with a single WARN log.
- QueueConfig::priority_queue accepted a (min, max) priority_range that
  was never read by any code path.
- QueueConfig::named_queue accepted a queue name that was logged but never
  applied to graphile_worker's WorkerOptions (which doesn't expose
  per-worker queue filtering).
- worker_count() returned `queue_configs.len()`, lying when callers
  passed N configs and exactly 1 worker actually ran.

This commit makes the surface match reality without removing it (which
would be a bigger break):

- Adds `WorkerConfig::with_concurrency(usize)` — the clean direct
  replacement.
- Marks `WorkerConfig::with_queues`, `QueueConfig::named_queue`, and
  `QueueConfig::priority_queue` as `#[deprecated(since = "1.2.0")]` with
  notes pointing at the working alternatives. The `name` and
  `priority_range` fields get doc comments saying they aren't honored.
- Fixes `worker_count()` to return 1 (the truth) instead of
  `queue_configs.len()`.
- Renames the existing test `test_worker_runner_with_multiple_queues` to
  `test_worker_runner_with_multiple_queues_only_first_honored` and pins
  the actual behaviour (`worker_count() == 1`). The original test's
  `assert_eq!(worker.worker_count(), 3)` was asserting the bug.
- Migrates `examples/basic_worker.rs` from the deprecated multi-queue
  shape to `with_concurrency`. The example previously claimed to
  configure separate fast/bulk/dead_letter workers but only the first
  was ever spawned — now it spawns one worker sized for the largest
  configured concurrency, with a comment explaining the constraint.

Users who actually want multiple specialized workers can spawn multiple
WorkerRunner instances themselves. Per-job queue routing still works
correctly via Queue::serial(name) at enqueue time — that's unchanged.

Three minor improvements bundled together since each is a few lines:

P2-9 (lib.rs enqueue_emergency): drop the redundant `run_at = Some(Utc::now())`.
graphile_worker resolves a NULL run_at to NOW() at the SQL layer, so
setting it explicitly was just an extra Rust-side clock read with no
behavioural difference. Removing it also avoids a tiny clock-skew window
where the Rust-computed timestamp could end up a hair in the past.

P2-12 (client/enqueue.rs enqueue): the duration histogram was emitted on
the success and already-in-progress branches but not the plain-error
branch, biasing `backfill_db_operation_duration_seconds` toward fast
successes. Move the duration recording to fire once after the call
completes so all three outcomes contribute to the histogram.

P2-13 (client/dlq.rs delete_dlq_job): replace the "fetch first, then
delete" pattern with a single `DELETE … RETURNING task_identifier`. Halves
DB round-trips on a hot admin path and removes the small race where the
row could vanish between the SELECT and DELETE.
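The single-statement shape described above is roughly (the `id` predicate is illustrative; the table and returned column are the ones named in this PR):

```sql
-- One round-trip replaces the fetch-then-delete pair. Zero rows returned
-- means the job was already gone: no race window between a separate
-- SELECT and DELETE.
DELETE FROM backfill_dlq
WHERE id = $1
RETURNING task_identifier;
```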

src/client/queries.rs was never `mod`-declared in src/client/mod.rs (or
anywhere else), so the Rust compiler ignored it entirely. Its
rustdoc-as-documentation pattern can't even render via `cargo doc`
because it's not part of the module tree. The defined `JobInfo` /
`DlqJobInfo` structs were never instantiated.

Its purpose was to demonstrate the sqlx::query!() macro pattern for
users wanting compile-time-checked queries — but `docs/01-database-setup.md`
already covers SQLx setup (with `cargo sqlx prepare` for offline mode),
so the duplicated guidance has no unique value.

Delete the file. If we want to ship a real query!() recipe, we'll do it
in proper user-facing docs and back it with actual code that exercises
the macros.

The "find failed jobs not yet in DLQ" predicate used `id NOT IN (SELECT
COALESCE(original_job_id, -1) FROM backfill_dlq)`. The COALESCE was a
correct workaround for NOT IN's NULL-poisons-everything semantics, but:

- NOT EXISTS doesn't have the NULL trap (correlated check, not a list
  membership test), so the COALESCE workaround disappears.
- NOT EXISTS can be planned as an anti-join in PostgreSQL; NOT IN against
  a subquery typically can't (especially when the inner column is
  nullable, even with COALESCE), and degrades into per-row hashing/lookups
  on large DLQs.

Behaviour is identical for the current data (any DLQ row with NULL
original_job_id was created via add_to_dlq for a job we don't have an id
for, which can never match a `_private_jobs.id` anyway). At scale this
should plan and run noticeably faster.
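Side by side, the rewrite looks roughly like this (the outer predicate columns are the ones quoted earlier in this PR; the exact query shape in the crate may differ):

```sql
-- Before: list-membership test, NULL-safe only via the COALESCE workaround.
-- ... WHERE id NOT IN (SELECT COALESCE(original_job_id, -1) FROM backfill_dlq)

-- After: correlated anti-join. No NULL trap, and PostgreSQL can plan it
-- as a true anti-join instead of per-row hashing against the subquery.
SELECT j.*
FROM _private_jobs j
WHERE j.attempts >= j.max_attempts
  AND j.locked_at IS NULL
  AND NOT EXISTS (
      SELECT 1 FROM backfill_dlq d
      WHERE d.original_job_id = j.id
  );
```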

The DLQ processor's background task ticked at a fixed interval regardless
of error history. If process_failed_jobs() persistently errored (DB blip,
auth issue, schema drift) the loop would log an ERROR every `interval`
seconds — spammy and contributing nothing useful. The original code
acknowledged this gap with a TODO-shaped comment.

Replace the tokio::interval-based loop with a manual sleep loop that
applies exponential backoff on consecutive errors:
- interval, 2x, 4x, 8x, 16x, 32x — capped at 32x (so default 60s interval
  caps at ~32min between attempts under sustained failure).
- Any success resets the counter back to a 1x interval.

Behaviour on the happy path is unchanged: first scan still runs
immediately on spawn, subsequent scans run every `interval`. Cancellation
remains responsive (the `biased` select! makes the cancel branch win
ties so a slow-firing cancel doesn't get queued behind a long sleep).

The previous UPSERT computed `failure_count = old + EXCLUDED.failure_count`
where EXCLUDED.failure_count was the failing job's `attempts` value at the
moment of capture. Net semantics were "cumulative handler-failure
invocations summed across all retry cycles" — defensible but ambiguous,
and easily confused with `attempts` on the original job, with `requeued_count`
(admin requeue counter), or with the per-cycle attempt count.

Switch to a clean touch-count semantics: `failure_count = N` means "this
logical job (by job_key) has reached the DLQ N times, regardless of how
many handler invocations each cycle involved." Initial INSERT sets it to
1; the ON CONFLICT branch increments by 1. For jobs without a job_key the
UPSERT path doesn't apply, so each failure creates a fresh row with
failure_count=1.

Updated:
- The bind list in `add_to_dlq` (loses the `attempts` bind for that field;
  failure_count is now hard-coded to 1 in the SQL VALUES).
- The SELECT in `process_failed_jobs`'s atomic CTE (same change).
- `DlqJob::failure_count` rustdoc to explain the new contract clearly.
- Existing test `test_dlq_add_job_and_retrieve` (was asserting
  failure_count == job.attempts(), which was 0 for a freshly-enqueued
  job — coincidentally always-passing under the old semantics).

Adds `test_dlq_failure_count_is_touch_count`: verifies that two
add_to_dlq calls with the same job_key produce a single DLQ row with
failure_count progressing 1 → 2.
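The touch-count upsert described above amounts to something like the following (columns abridged; the conflict target matches the one shown in the earlier CTE):

```sql
-- Initial capture hard-codes failure_count = 1; each subsequent capture
-- of the same job_key increments by exactly 1, independent of how many
-- handler attempts that retry cycle involved.
INSERT INTO backfill_dlq (job_key, task_identifier, failure_count, ...)
VALUES ($1, $2, 1, ...)
ON CONFLICT (job_key) WHERE job_key IS NOT NULL
DO UPDATE SET failure_count = backfill_dlq.failure_count + 1;
```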

P2-5 (Cargo.toml): flag the graphile_worker version-pin coupling. The default
^0.11.1 constraint already excludes 0.12.x, but a patch release within
0.11.x can change the internal `_private_*` schema that backfill queries
directly (cleanup, DLQ scanner, admin endpoints). Add a comment near the
dep flagging this coupling so anyone running `cargo update` knows to
re-run integration tests against any new graphile_worker version. Not
tightening the version range (e.g., to `=0.11.4`) because that would
prevent picking up upstream bug fixes; the test suite is the right
backstop here, not a stricter pin.

P2-8 (lib.rs EnqueueOutcome rustdoc): the `AlreadyInProgress` variant is
a real footgun for callers who chain `.unwrap()` on `enqueue` results.
Rewrite the enum's docstring to spell out exactly when it fires, that
the new payload is *dropped* (not retried), and that `.unwrap()` /
`.expect()` panic on this variant by design. Show the buggy
auto-unwrap pattern explicitly so users recognize it in their own code.

P2-11 (.sqlx + README + docs/01-database-setup.md): remove the empty
`.sqlx/sqlx-data.json` stub and rewrite the SQLx documentation to be
truthful. backfill itself uses runtime sqlx::query() — the compile-time
sqlx::query!() macros don't fit because backfill targets dynamic schema
names (the user's chosen schema). So:
- Delete .sqlx/sqlx-data.json (was always `{}` — never populated, never
  served as an offline-mode cache).
- Update README's "SQLx" section to clarify backfill's actual usage
  (runtime queries, no DATABASE_URL needed at compile time).
- Add a callout at the top of docs/01-database-setup.md's SQLx section
  that the offline-mode setup it describes is for *user* queries against
  backfill's tables, not for backfill itself.

P3-1 (cleanup.rs): the stale-lock SQL was format-interpolating the
timeout in seconds directly into the query string. Safe (it's a u64) but
inconsistent with the parameterized norm everywhere else. Replace
`INTERVAL '{timeout_secs} seconds'` with a bound parameter:
`($1::bigint * interval '1 second')`. Both `release_stale_queue_locks`
and `release_stale_job_locks` updated. Behaviour unchanged; existing
tests cover this.

P3-2 (cleanup.rs cleanup_permanently_failed_jobs): when the DLQ is
disabled this function is the only mechanism that removes failed jobs
from `_private_jobs` — and those jobs are gone forever, since there's no
DLQ to capture them. Previously logged at INFO regardless. Now the
function checks `to_regclass('{schema}.backfill_dlq')` and:
- INFO if the DLQ table exists ("just GC, jobs were captured earlier")
- WARN with explicit "they cannot be recovered" wording if it doesn't,
  pointing the operator at `dlq_processor_interval` if they didn't mean
  to opt out of failure inspection.

P3-4 (lib.rs): add a doc note above the `pub use graphile_worker::{...}`
block flagging that backfill's public surface transitively includes
those types — graphile_worker version bumps that touch any of them
constitute breaking changes for backfill even when graphile_worker
itself doesn't tag them that way (graphile_worker is pre-1.0). Includes
a checklist for what to audit when upgrading.

Closes two of the test-coverage gaps from section 6 of the review.

test_retry_to_exhaustion_then_dlq_via_worker (plugin_tests.rs):
Existing DLQ tests SQL-fake exhaustion via UPDATE _private_jobs SET
attempts = max_attempts, which skips the entire worker code path. This
new test runs the actual handler multiple times via the real worker:
each call increments attempts via get_job's attempts+1, the handler
returns Err, fail_job's SQL fires, run_at gets pushed into the future.
Between iterations the test fast-forwards run_at = NOW() so we don't
sleep through graphile_worker's exp(attempts) second backoff.

After max_attempts iterations the job naturally hits permanently-failed
state, then process_failed_jobs() moves it to DLQ via the atomic CTE
(P1-2). Asserts: the job lands in DLQ with the right task_identifier
and job_key, failure_count is 1 (touch counter, P2-7), and _private_jobs
is empty afterward (P1-2's atomic move worked).

This exercises the seam between every layer added in the P0/P1 work
(P0-1 cleanup ordering, P0-3 plugin classification, P1-2 atomic move,
P2-7 touch counter) along with graphile_worker's own retry semantics.

test_concurrent_enqueue_under_load (integration_tests_clean.rs):
Spawns 10 tokio tasks each enqueueing 50 jobs in parallel against the
same pool. Verifies that all 500 enqueues succeed and that all 500 rows
appear in _private_jobs. Catches the kind of regression that would
appear if a future refactor introduced row-level lock contention or
deadlock paths in the enqueue SQL.
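The shape of that load test, with the database pool swapped for an atomic counter so the sketch is self-contained (the real test uses tokio tasks and the actual enqueue path; everything here is a stand-in):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

// N workers each record `per_worker` enqueues in parallel against shared
// state; every enqueue must be accounted for, with none lost to contention.
fn run_load(workers: usize, per_worker: usize) -> usize {
    let enqueued = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let counter = Arc::clone(&enqueued);
            thread::spawn(move || {
                for _ in 0..per_worker {
                    // Stand-in for client.enqueue(...) against the shared pool.
                    counter.fetch_add(1, Ordering::SeqCst);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    enqueued.load(Ordering::SeqCst)
}

fn main() {
    // Mirrors the 10 tasks x 50 jobs fan-out in the real test.
    assert_eq!(run_load(10, 50), 500);
    println!("all 500 simulated enqueues accounted for");
}
```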

CI runs cargo +nightly fmt --check; my edits were formatted under stable
fmt which has slightly different defaults around method-chain wrapping.
Apply the nightly formatter so CI's fmt step passes. No behavioural
change.
ceejbot merged commit ed910e5 into latest May 7, 2026
1 check passed
ceejbot deleted the ceej/update-it branch May 7, 2026 03:13