Conversation
Symmetric counterpart to test_retry_to_exhaustion_then_dlq_via_worker. Adds an EventuallySucceedJob handler that fails on its first N-1 attempts and succeeds on attempt N, then runs it through the actual worker (with the same fast-forward-run_at trick to skip exp-backoff sleeps) and asserts that the job ran to completion: row deleted from _private_jobs, and nothing in the DLQ. This validates the retry loop's exit-via-success path. Until now, every worker-level retry test landed the job in DLQ; nothing exercised the much-more-common case of "transient failure recovers naturally." The DLQ assertion uses init_dlq() so it works even though the test doesn't need DLQ otherwise — it's a belt-and-suspenders check that successful jobs don't leak into the DLQ via any path.
Existing stale-lock tests exercise the SQL helpers (release_stale_*) directly. This test instead validates the full crash-recovery loop through the WorkerRunner startup path: stage a job with an old locked_at + fake locked_by (simulating a worker that crashed mid-execution), spawn a fresh WorkerRunner, then poll for the row to disappear. Validates four things in one shot: - WorkerRunner.run_until_cancelled actually runs startup_cleanup_with_timeouts - startup_cleanup releases stale job locks - The newly-unlocked row is fetched by the worker's first poll - The handler runs to completion and the row is deleted A regression in any of those breaks the test. Bounded 5s wait + diagnostic final-row-state assertion (rather than a fixed sleep) so the failure mode is informative when something does break. Adds a trivial OkHandler at the file level for tests that just need a job that runs without doing anything. The test enqueues via enqueue_task to ensure the payload shape matches the handler — an earlier draft of this test enqueued a TestJob payload against the OkHandler identifier, which deserialized wrong inside the worker and silently failed instead of running, masquerading as a crash-recovery failure.
`test_dlq_requeue_job` (in dlq_tests.rs) verifies the requeue *operation* — that requeued_count is incremented, last_requeued_at is set. It does not verify that the requeued job actually runs to completion or that the DLQ entry gets cleaned up afterward. This test closes that gap. Lifecycle covered end-to-end: 1. Enqueue with a job_key (so DlqCleanupPlugin has something to match on). 2. Force the row to permanently-failed state via SQL. 3. process_failed_jobs() moves it to DLQ. 4. requeue_dlq_job() creates a fresh _private_jobs row. 5. Worker runs the requeued job — handler succeeds. 6. Assert: row is gone from _private_jobs (success). 7. Assert: DLQ is empty too — DlqCleanupPlugin auto-registered when DLQ is enabled, and its JobComplete hook deletes DLQ entries by job_key. If DlqCleanupPlugin regresses or stops being auto-registered, the DLQ entry would persist after the requeued job succeeded, and an admin could be tricked into re-requeueing the same now-fixed job indefinitely. Also folds in a clippy fix for the diagnostic tuple type in the worker crash-recovery test (extracted as a local type alias to silence type_complexity).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes the three remaining test-coverage gaps from §6 of the
production-readiness review — the bug-fix work landed in 1.2.0 and 2.0.0,
this PR adds the regression coverage that should have been there all along.
No production code changes. Test count goes 108 → 111.
What's covered
test_retry_then_eventual_success(plugin_tests.rs)Symmetric to
test_retry_to_exhaustion_then_dlq_via_worker. A handlerthat fails on its first N-1 attempts and succeeds on attempt N runs to
completion through the actual worker — no DLQ leakage, no leftover row.
Until now every worker-level retry test landed the job in DLQ; nothing
exercised the much-more-common transient-failure-recovers-naturally case.
test_worker_crash_recovery_through_startup(integration_tests_clean.rs)Existing stale-lock tests exercise the SQL helpers directly. This test
validates the full crash-recovery loop through
WorkerRunnerstartup:stage a job with
locked_atfar in the past + fakelocked_by, spawn afresh runner, poll for the row to disappear. Validates that startup
cleanup runs, releases the lock, the worker fetches the now-unlocked
row, runs the handler, and deletes the row — all in one bounded poll
loop with diagnostic state pulled on failure.
test_dlq_requeue_runs_to_completion_and_cleans_dlq(plugin_tests.rs)test_dlq_requeue_job(in dlq_tests.rs) verifies the requeue operation— that requeued_count is incremented, last_requeued_at is set. This
verifies the full lifecycle: enqueue with a job_key → force-fail → DLQ
→ requeue → worker runs → handler succeeds → row gone from main → DLQ
entry also gone (cleaned up by the auto-registered DlqCleanupPlugin
on JobComplete via job_key match). If the cleanup plugin regressed,
this test would catch it.
Verification
cargo clippy --all-targets -F axum: clean.cargo +nightly fmt --check: clean.cargo test --doc -F axum: 11/11 (1 intentionally ignored).No version bump — test-only PR.
Test plan