Production-readiness audit: P0/P1/P2/P3 fixes (1.2.0)#10
Merged
Conversation
`WorkerRunner::run_until_cancelled` called `startup_cleanup_with_timeouts` synchronously at startup, whose third step `cleanup_permanently_failed_jobs` DELETEs from `_private_jobs` where `attempts >= max_attempts AND locked_at IS NULL`. The DLQ processor uses the same predicate to find candidates, but runs as a periodic background task that hadn't ticked yet. Net effect: any job that hit max_attempts during the previous run (or in the gap between its final failure and the next DLQ tick) was silently deleted on every worker restart instead of being captured to the DLQ. Fix: when DLQ is enabled, run `process_failed_jobs()` synchronously before startup cleanup. Also document the ordering invariant on `cleanup_permanently_failed_jobs` and `startup_cleanup_with_timeouts` for direct callers (ad-hoc maintenance scripts). Adds a regression test that stages a permanently-failed row in `_private_jobs`, runs the actual `WorkerRunner` startup path, and asserts the row reaches the DLQ. Verified to fail on the previous code.
`WorkerError::is_retryable()` and `classify_from_message()` advertised that non-retryable error variants (`ValidationFailed`, `Unauthorized`, `InvalidInput`, etc.) would skip retries — but no production code path ever consulted them. graphile_worker treats every `Err` identically and reschedules on its `exp(min(attempts, 10))` second formula until max_attempts is reached. A `ValidationFailed` from a malformed payload would chew through ~6 hours of wasted retries before reaching the DLQ. Adds `PermanentFailurePlugin`, auto-registered alongside `DlqCleanupPlugin` when the DLQ is enabled. On `JobFail` it classifies `ctx.error` via `WorkerError::classify_from_message`. If the result is non-retryable, it runs `UPDATE _private_jobs SET attempts = max_attempts WHERE id = $1 AND locked_by = $worker_id` so the next `get_job()` skips the row and the next DLQ-processor tick captures it. Implementation note: graphile_worker's own `permanently_fail_jobs` SQL function would be the natural fit, but its `WHERE locked_at IS NULL OR locked_at < NOW() - 4h` guard is a no-op inside a `JobFail` hook — the worker (us) still owns the lock at that point. The custom UPDATE filters by `locked_by = ctx.worker_id` instead, which is a tighter safety property (only the lock holder can override its own row). graphile_worker clears locked_at immediately after the hook returns; the `attempts = max_attempts` we wrote sticks because `fail_job` only mutates last_error, run_at, and the lock fields. Two regression tests in plugin_tests.rs: - `test_non_retryable_error_short_circuits_retries`: ValidationFailed reaches max_attempts after a single execution and lands in DLQ. - `test_retryable_error_does_not_short_circuit`: TemporaryUnavailable is left alone (attempts == 1 after one execution). The first test was verified to fail without the plugin.
`RetryPolicy.{initial_delay, max_delay, backoff_multiplier, jitter_factor}`
were stored on the struct but never reached graphile_worker — only
`max_attempts` is forwarded by `From<JobSpec> for GraphileJobSpec`.
graphile_worker uses a hard-coded `exp(min(attempts, 10))` second SQL
formula for every retry. So `RetryPolicy::fast()` and
`RetryPolicy::conservative()` produced identical retry timing in practice
even though the docs promised "100ms-30s delays" vs. "1 min - 8 hour
delays".
This commit makes the fact match the promise:
- Marks the unused math helpers (`RetryPolicy::new`, `with_jitter`,
`calculate_delay`, `calculate_retry_time`, and `JobSpec::calculate_retry_time`)
as `#[deprecated(since = "1.2.0")]` with notes pointing users at
`RetryPolicy { max_attempts: n, ..Default::default() }` or the presets.
- Rewrites the rustdoc on `RetryPolicy`, on each preset, on the
`with_*_retries` builders, and on the `enqueue_*_with_retries`
convenience helpers to describe what actually happens (only
`max_attempts` differs across presets, fixed exp-backoff timing).
- Updates the lib.rs module-level rustdoc.
- Migrates `examples/enqueue_jobs.rs` to the recommended pattern so it
doesn't trigger the new deprecation warnings.
- Updates README.md to drop the false delay-range claims and replace
the wrong "Pre-configured Fast/Bulk queues" / "Custom(name)" listing
with the actual `Queue::Parallel` / `Queue::Serial(name)` enum.
- Updates docs/02-dlq.md to mark the post-#9 "queue_name shows as
default" warning as resolved (it was the change in v1.1.1 that fixed
this).
The struct fields themselves stay public for source-compatibility with
existing struct-literal construction. Per-job backoff customization needs
upstream graphile_worker support and is deferred.
The list query was filtering DLQ rows correctly via QueryBuilder, but the companion COUNT query was an unconditional `SELECT COUNT(*) FROM backfill_dlq` with no WHERE clause. Net effect: any paginated admin UI that filtered by task / queue / time range computed wrong page counts because the returned `total` reflected the entire table, not the filtered subset. Apply the same WHERE clauses to the count query. Adds a regression test that stages 5 jobs for task_a and 3 for task_b, then asserts that filtering by task returns the right per-task total (not the unfiltered 8). Verified to fail on the previous code.
`process_available_jobs` was declared `Result<usize, BackfillError>` but unconditionally returned `Ok(0)`. The doc comment admitted this. The existing test even asserted `assert_eq!(processed, 0)` with a comment noting that "job counting isn't implemented yet" — i.e., the test was asserting the bug. Change the signature to `Result<(), BackfillError>`. Truthful, smaller, forces the matching `?`-discarding pattern that nearly every existing caller already uses. Update the doc to point users at lifecycle hook plugins (`JobComplete` / `JobFail`) for actual job-count instrumentation. BREAKING CHANGE: callers that captured the return value as a usize must remove the binding. None of the existing tests or examples relied on the value beyond asserting it was zero (the bug).
Two atomicity holes in the DLQ paths:
P1-2 (process_failed_jobs):
The reaper did UPSERT into backfill_dlq, then DELETE from _private_jobs as
two separate statements. A crash between them (or a connection blip during
the DELETE) would leave the job in BOTH tables. The code even acknowledged
this with "Consider this a partial failure - job is in DLQ but also still
in main table." Fixed by collapsing both into a single statement using a
writable CTE:
WITH deleted AS (
DELETE FROM _private_jobs WHERE id = $1 RETURNING 1
)
INSERT INTO backfill_dlq (...)
SELECT $1::bigint, $2::text, ... FROM deleted
ON CONFLICT (job_key) WHERE job_key IS NOT NULL DO UPDATE SET ...
If DELETE finds no row (another worker beat us to it), `deleted` is empty,
the SELECT yields zero rows, and the INSERT runs zero times — natural
no-op. If DELETE finds the row, the INSERT either creates a new DLQ row or
ON-CONFLICT-updates an existing one (the requeue-then-fail-again case).
The whole thing commits atomically. `rows_affected() > 0` distinguishes
"actually moved" from "raced and skipped" for the moved-count metric.
Also adds explicit type casts on the SELECT placeholders since they no
longer have the INSERT VALUES position to disambiguate types from.
P1-3 (requeue_dlq_job):
Operations are enqueue (via WorkerUtils::add_raw_job, which hard-codes
the pool) followed by UPDATE backfill_dlq. Can't be wrapped in a single
transaction without forking add_raw_job. Smaller honest fix: demote the
post-enqueue UPDATE error from `?`-propagation to a WARN log and document
the residual non-atomicity in the function docstring.
Rationale: when the enqueue succeeds, the user's stated intent — "put this
job back in the queue" — has been achieved. Failing the operation because
the bookkeeping UPDATE blipped would mislead the user into thinking the
requeue didn't happen, and possibly trigger a double-requeue (which then
errors with AlreadyInProgress). Better to return Ok with a log than to
return Err and confuse the caller.
No new tests: the atomicity guarantees are not directly observable without
a chaos hook. The existing DLQ tests continue to verify happy-path
correctness.
The library was using `Queue::as_str()` (which returns the actual queue
name for serial queues) as the `queue=` Prometheus label across
`backfill_jobs_enqueued`, `backfill_jobs_already_in_progress`,
`backfill_dlq_jobs_added`, and `backfill_dlq_jobs_requeued`. The library's
own docs recommend `Queue::serial_for("user", id)` for per-entity ordering
— which produces a different queue name per user, which produces a
different label value per user, which produces unbounded Prometheus time
series. Industry best practice for cardinality-sensitive systems
(Prometheus, Datadog, etc.) is to keep label cardinality below ~100; per-
user labels easily exceed this by orders of magnitude and can OOM the
metrics backend or render dashboards unusable.
DLQ-side metrics had a related bug: parallel-origin jobs are stored with
queue_name = "" (empty string), which is invalid as a Prometheus label
in some backends.
Adds `Queue::metric_label() -> &'static str` returning "parallel" or
"serial" — bounded, two-valued. Built-in metric emission now uses this.
On the DLQ side where we have a stored queue_name string (not a Queue
enum), `metrics::queue_metric_label_from_name(&str)` does the same
two-valued mapping.
`Queue::as_str()` is preserved for logging (where unbounded values are
fine) and gets a doc comment warning against using it for metrics.
If users want per-queue stats for a known small set of named queues, they
can emit them themselves via a plugin — that's the correct place to make
the cardinality-vs-detail tradeoff per their own setup.
Adds a unit test asserting that `serial_for("user", 12345)` and
`serial_for("user", 99999)` produce the same label.
The QueueConfig multi-queue API has been misleading since it was introduced: - WorkerConfig::with_queues(Vec<QueueConfig>) accepted multiple configs, but only the FIRST one's `concurrency` was ever used at runtime (worker.rs WorkerOptionsBuilder::new). The rest were silently dropped with a single WARN log. - QueueConfig::priority_queue accepted a (min, max) priority_range that was never read by any code path. - QueueConfig::named_queue accepted a queue name that was logged but never applied to graphile_worker's WorkerOptions (which doesn't expose per-worker queue filtering). - worker_count() returned `queue_configs.len()`, lying when callers passed N configs and exactly 1 worker actually ran. This commit makes the surface match reality without removing it (which would be a bigger break): - Adds `WorkerConfig::with_concurrency(usize)` — the clean direct replacement. - Marks `WorkerConfig::with_queues`, `QueueConfig::named_queue`, and `QueueConfig::priority_queue` as `#[deprecated(since = "1.2.0")]` with notes pointing at the working alternatives. The `name` and `priority_range` fields get doc comments saying they aren't honored. - Fixes `worker_count()` to return 1 (the truth) instead of `queue_configs.len()`. - Renames the existing test `test_worker_runner_with_multiple_queues` to `test_worker_runner_with_multiple_queues_only_first_honored` and pins the actual behaviour (`worker_count() == 1`). The original test's `assert_eq!(worker.worker_count(), 3)` was asserting the bug. - Migrates `examples/basic_worker.rs` from the deprecated multi-queue shape to `with_concurrency`. The example previously claimed to configure separate fast/bulk/dead_letter workers but only the first was ever spawned — now it spawns one worker sized for the largest configured concurrency, with a comment explaining the constraint. Users who actually want multiple specialized workers can spawn multiple WorkerRunner instances themselves. Per-job queue routing still works correctly via Queue::serial(name) at enqueue time — that's unchanged.
Three minor improvements bundled together since each is a few lines: P2-9 (lib.rs enqueue_emergency): drop the redundant `run_at = Some(Utc::now())`. graphile_worker resolves a NULL run_at to NOW() at the SQL layer, so setting it explicitly was just an extra Rust-side clock read with no behavioural difference. Removing it also avoids a tiny clock-skew window where the Rust-computed timestamp could end up a hair in the past. P2-12 (client/enqueue.rs enqueue): the duration histogram was emitted on the success and already-in-progress branches but not the plain-error branch, biasing `backfill_db_operation_duration_seconds` toward fast successes. Move the duration recording to fire once after the call completes so all three outcomes contribute to the histogram. P2-13 (client/dlq.rs delete_dlq_job): replace the "fetch first, then delete" pattern with a single `DELETE … RETURNING task_identifier`. Halves DB round-trips on a hot admin path and removes the small race where the row could vanish between the SELECT and DELETE.
src/client/queries.rs was never `mod`-declared in src/client/mod.rs (or anywhere else), so the Rust compiler ignored it entirely. Its rustdoc-as-documentation pattern can't even render via `cargo doc` because it's not part of the module tree. The defined `JobInfo` / `DlqJobInfo` structs were never instantiated. Its purpose was to demonstrate the sqlx::query!() macro pattern for users wanting compile-time-checked queries — but `docs/01-database-setup.md` already covers SQLx setup (with `cargo sqlx prepare` for offline mode), so the duplicated guidance has no unique value. Delete the file. If we want to ship a real query!() recipe, we'll do it in proper user-facing docs and back it with actual code that exercises the macros.
The "find failed jobs not yet in DLQ" predicate used `id NOT IN (SELECT COALESCE(original_job_id, -1) FROM backfill_dlq)`. The COALESCE was a correct workaround for NOT IN's NULL-poisons-everything semantics, but: - NOT EXISTS doesn't have the NULL trap (correlated check, not a list membership test), so the COALESCE workaround disappears. - NOT EXISTS can be planned as an anti-join in PostgreSQL; NOT IN against a subquery typically can't (especially when the inner column is nullable, even with COALESCE), and degrades into per-row hashing/lookups on large DLQs. Behaviour is identical for the current data (any DLQ row with NULL original_job_id was created via add_to_dlq for a job we don't have an id for, which can never match a `_private_jobs.id` anyway). At scale this should plan and run noticeably faster.
The DLQ processor's background task ticked at a fixed interval regardless of error history. If process_failed_jobs() persistently errored (DB blip, auth issue, schema drift) the loop would log an ERROR every `interval` seconds — spammy and contributing nothing useful. The original code acknowledged this gap with a TODO-shaped comment. Replace the tokio::interval-based loop with a manual sleep loop that applies exponential backoff on consecutive errors: - interval, 2x, 4x, 8x, 16x, 32x — capped at 32x (so default 60s interval caps at ~32min between attempts under sustained failure). - Any success resets the counter back to a 1x interval. Behaviour on the happy path is unchanged: first scan still runs immediately on spawn, subsequent scans run every `interval`. Cancellation remains responsive (the `biased` select! makes the cancel branch win ties so a slow-firing cancel doesn't get queued behind a long sleep).
The previous UPSERT computed `failure_count = old + EXCLUDED.failure_count` where EXCLUDED.failure_count was the failing job's `attempts` value at the moment of capture. Net semantics were "cumulative handler-failure invocations summed across all retry cycles" — defensible but ambiguous, and easily confused with `attempts` on the original job, with `requeued_count` (admin requeue counter), or with the per-cycle attempt count. Switch to a clean touch-count semantics: `failure_count = N` means "this logical job (by job_key) has reached the DLQ N times, regardless of how many handler invocations each cycle involved." Initial INSERT sets it to 1; the ON CONFLICT branch increments by 1. For jobs without a job_key the UPSERT path doesn't apply, so each failure creates a fresh row with failure_count=1. Updated: - The bind list in `add_to_dlq` (loses the `attempts` bind for that field; failure_count is now hard-coded to 1 in the SQL VALUES). - The SELECT in `process_failed_jobs`'s atomic CTE (same change). - `DlqJob::failure_count` rustdoc to explain the new contract clearly. - Existing test `test_dlq_add_job_and_retrieve` (was asserting failure_count == job.attempts(), which was 0 for a freshly-enqueued job — coincidentally always-passing under the old semantics). Adds `test_dlq_failure_count_is_touch_count`: verifies that two add_to_dlq calls with the same job_key produce a single DLQ row with failure_count progressing 1 → 2.
P2-5 (Cargo.toml): graphile_worker version pin warning. The default
^0.11.1 constraint already excludes 0.12.x, but a patch release within
0.11.x can change the internal `_private_*` schema that backfill queries
directly (cleanup, DLQ scanner, admin endpoints). Add a comment near the
dep flagging this coupling so anyone running `cargo update` knows to
re-run integration tests against any new graphile_worker version. Not
tightening the version range (e.g., to `=0.11.4`) because that would
prevent picking up upstream bug fixes; the test suite is the right
backstop here, not a stricter pin.
P2-8 (lib.rs EnqueueOutcome rustdoc): the `AlreadyInProgress` variant is
a real footgun for callers who chain `.unwrap()` on `enqueue` results.
Rewrite the enum's docstring to spell out exactly when it fires, that
the new payload is *dropped* (not retried), and that `.unwrap()` /
`.expect()` panic on this variant by design. Show the buggy
auto-unwrap pattern explicitly so users recognize it in their own code.
P2-11 (.sqlx + README + docs/01-database-setup.md): remove the empty
`.sqlx/sqlx-data.json` stub and rewrite the SQLx documentation to be
truthful. backfill itself uses runtime sqlx::query() — the compile-time
sqlx::query!() macros don't fit because backfill targets dynamic schema
names (the user's chosen schema). So:
- Delete .sqlx/sqlx-data.json (was always `{}` — never populated, never
served as an offline-mode cache).
- Update README's "SQLx" section to clarify backfill's actual usage
(runtime queries, no DATABASE_URL needed at compile time).
- Add a callout at the top of docs/01-database-setup.md's SQLx section
that the offline-mode setup it describes is for *user* queries against
backfill's tables, not for backfill itself.
P3-1 (cleanup.rs): the stale-lock SQL was format-interpolating the
timeout in seconds directly into the query string. Safe (it's a u64) but
inconsistent with the parameterized norm everywhere else. Replace
`INTERVAL '{timeout_secs} seconds'` with a bound parameter:
`($1::bigint * interval '1 second')`. Both `release_stale_queue_locks`
and `release_stale_job_locks` updated. Behaviour unchanged; existing
tests cover this.
P3-2 (cleanup.rs cleanup_permanently_failed_jobs): when the DLQ is
disabled this function is the only mechanism that removes failed jobs
from `_private_jobs` — and those jobs are gone forever, since there's no
DLQ to capture them. Previously logged at INFO regardless. Now the
function checks `to_regclass('{schema}.backfill_dlq')` and:
- INFO if the DLQ table exists ("just GC, jobs were captured earlier")
- WARN with explicit "they cannot be recovered" wording if it doesn't,
pointing the operator at `dlq_processor_interval` if they didn't mean
to opt out of failure inspection.
P3-4 (lib.rs): add a doc note above the `pub use graphile_worker::{...}`
block flagging that backfill's public surface transitively includes
those types — graphile_worker version bumps that touch any of them
constitute breaking changes for backfill even when graphile_worker
itself doesn't tag them that way (graphile_worker is pre-1.0). Includes
a checklist for what to audit when upgrading.
Closes two of the test-coverage gaps from section 6 of the review. test_retry_to_exhaustion_then_dlq_via_worker (plugin_tests.rs): Existing DLQ tests SQL-fake exhaustion via UPDATE _private_jobs SET attempts = max_attempts, which skips the entire worker code path. This new test runs the actual handler multiple times via the real worker -- each call increments attempts via get_job's attempts+1, the handler returns Err, fail_job's SQL fires, run_at gets pushed into the future. Between iterations the test fast-forwards run_at = NOW() so we don't sleep through graphile_worker's exp(attempts) second backoff. After max_attempts iterations the job naturally hits permanently-failed state, then process_failed_jobs() moves it to DLQ via the atomic CTE (P1-2). Asserts: the job lands in DLQ with the right task_identifier and job_key, failure_count is 1 (touch counter, P2-7), and _private_jobs is empty afterward (P1-2's atomic move worked). This exercises the seam between every layer added in the P0/P1 work (P0-1 cleanup ordering, P0-3 plugin classification, P1-2 atomic move, P2-7 touch counter) along with graphile_worker's own retry semantics. test_concurrent_enqueue_under_load (integration_tests_clean.rs): Spawns 10 tokio tasks each enqueueing 50 jobs in parallel against the same pool. Verifies that all 500 enqueues succeed and that all 500 rows appear in _private_jobs. Catches the kind of regression that would appear if a future refactor introduced row-level lock contention or deadlock paths in the enqueue SQL.
CI runs cargo +nightly fmt --check; my edits were formatted under stable fmt which has slightly different defaults around method-chain wrapping. Apply the nightly formatter so CI's fmt step passes. No behavioural change.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
A focused production-readiness pass on the priority queue, retry, and DLQ
subsystems. Driven by a top-to-bottom code review that surfaced three P0
silent-failure bugs, six P1 correctness/UX bugs, thirteen P2 cleanups, and
several P3 polish items. Every P0 and P1 fix has a regression test that
fails on the prior code; the test suite goes 107 → 115.
This PR is intentionally scoped to bug fixes and doc honesty. Two
follow-up PRs are planned for cleanup work and additional test coverage.
Critical fixes
delete failed jobs before the DLQ scanner had a chance to capture them.
WorkerError::is_retryable()is wired up. Previously dead code; aValidationFailedwould burn through ~6h of pointless retries beforereaching DLQ. Now it short-circuits to DLQ immediately.
process_failed_jobsis atomic. UPSERT+DELETE collapsed into oneCTE; no more "row in both tables" failure window.
Breaking changes (gated behind 1.2.0 minor bump)
WorkerRunner::process_available_jobs() -> Result<(), _>(was a fakeResult<usize, _>always returningOk(0)).WorkerRunner::worker_count()returns1truthfully (was theconfigured-but-not-spawned count).
Other fixes
Filtered-pagination total, label-cardinality bounds in metrics, atomic
DLQ requeue, exponential backoff in the DLQ processor, NOT EXISTS instead
of NOT IN, touch-count semantics on failure_count, parameterized cleanup
SQL, louder log when cleanup deletes without DLQ capture, plus various
doc honesty fixes (truthful RetryPolicy docs, EnqueueOutcome footgun
warning, removed dead queries.rs stub, deleted empty .sqlx/sqlx-data.json).
Verification
on the prior code via
git stashround-trips.cargo clippy --all-targets -F axum: clean.cargo test --doc -F axum: 11/11 (1 intentionally ignored — buggypattern example in EnqueueOutcome rustdoc).
See CHANGELOG.md for the full per-finding breakdown.
Test plan
tests/dlq_tests.rs,tests/plugin_tests.rs, andtests/integration_tests_clean.rslogical change per commit for easy bisecting