
Test-coverage gaps from review §6 #12

Merged
ceejbot merged 3 commits into latest from ceej/test-coverage-gaps on May 7, 2026

Conversation

ceejbot commented on May 7, 2026

Summary

Closes the three remaining test-coverage gaps from §6 of the
production-readiness review. The bug-fix work landed in 1.2.0 and 2.0.0;
this PR adds the regression coverage that should have been there all along.

No production code changes. Test count goes 108 → 111.

What's covered

test_retry_then_eventual_success (plugin_tests.rs)
Symmetric to test_retry_to_exhaustion_then_dlq_via_worker. A handler
that fails on its first N-1 attempts and succeeds on attempt N runs to
completion through the actual worker — no DLQ leakage, no leftover row.
Until now every worker-level retry test landed the job in DLQ; nothing
exercised the much-more-common transient-failure-recovers-naturally case.
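
Roughly what such a handler looks like, as a standalone sketch with hypothetical names (EventuallySucceed, succeed_on) rather than the crate's actual handler trait: a shared counter tracks the attempt number, and the handler keeps returning an error until the final attempt.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Stand-in for the EventuallySucceedJob handler: fails on the first N-1
/// attempts, succeeds on attempt N. The real test wires this into the crate's
/// handler trait; this sketch only shows the attempt-counting shape.
struct EventuallySucceed {
    attempts: AtomicU32,
    succeed_on: u32,
}

impl EventuallySucceed {
    fn run(&self) -> Result<(), String> {
        let attempt = self.attempts.fetch_add(1, Ordering::SeqCst) + 1;
        if attempt < self.succeed_on {
            Err(format!("transient failure on attempt {attempt}"))
        } else {
            Ok(())
        }
    }
}

fn main() {
    let job = EventuallySucceed { attempts: AtomicU32::new(0), succeed_on: 3 };
    assert!(job.run().is_err()); // attempt 1: transient failure, worker retries
    assert!(job.run().is_err()); // attempt 2: transient failure, worker retries
    assert!(job.run().is_ok());  // attempt 3: success, worker deletes the row
}
```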

test_worker_crash_recovery_through_startup (integration_tests_clean.rs)
Existing stale-lock tests exercise the SQL helpers directly. This test
validates the full crash-recovery loop through WorkerRunner startup:
stage a job with locked_at far in the past + fake locked_by, spawn a
fresh runner, poll for the row to disappear. Validates that startup
cleanup runs, releases the lock, the worker fetches the now-unlocked
row, runs the handler, and deletes the row — all in one bounded poll
loop with diagnostic state pulled on failure.
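
The staging step, sketched under the assumption of an sqlx Postgres pool and the column names quoted above (locked_at, locked_by on _private_jobs); the real test goes through the crate's own fixtures, so treat the exact SQL and the id type as illustrative.

```rust
use sqlx::PgPool;

/// Simulate a worker that died mid-execution: the lock is old enough that
/// startup cleanup should treat it as stale, and the holder no longer exists.
/// Table, column names, and i64 id are assumptions taken from the description.
async fn stage_stale_lock(pool: &PgPool, job_id: i64) -> sqlx::Result<()> {
    sqlx::query(
        "UPDATE _private_jobs
         SET locked_at = now() - interval '1 hour',
             locked_by = 'worker-that-crashed'
         WHERE id = $1",
    )
    .bind(job_id)
    .execute(pool)
    .await?;
    Ok(())
}
```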

test_dlq_requeue_runs_to_completion_and_cleans_dlq (plugin_tests.rs)
test_dlq_requeue_job (in dlq_tests.rs) verifies the requeue operation
itself: that requeued_count is incremented and last_requeued_at is set.
This new test verifies the full lifecycle: enqueue with a job_key → force-fail → DLQ
→ requeue → worker runs → handler succeeds → row gone from main → DLQ
entry also gone (cleaned up by the auto-registered DlqCleanupPlugin
on JobComplete via job_key match). If the cleanup plugin regressed,
this test would catch it.

Verification

  • 111 tests passing (was 108).
  • cargo clippy --all-targets -F axum: clean.
  • cargo +nightly fmt --check: clean.
  • cargo test --doc -F axum: 11/11 (1 intentionally ignored).

No version bump — test-only PR.

Test plan

  • CI passes (clippy + nextest + doctests + security audit + nightly fmt)
  • Bisect-friendly: each commit adds one self-contained test

ceejbot added 3 commits May 6, 2026 21:44
Symmetric counterpart to test_retry_to_exhaustion_then_dlq_via_worker.
Adds an EventuallySucceedJob handler that fails on its first N-1 attempts
and succeeds on attempt N, then runs it through the actual worker (with
the same fast-forward-run_at trick to skip exp-backoff sleeps) and asserts
that the job ran to completion: row deleted from _private_jobs, and
nothing in the DLQ.

This validates the retry loop's exit-via-success path. Until now, every
worker-level retry test landed the job in DLQ; nothing exercised the
much-more-common case of "transient failure recovers naturally."

The DLQ assertion uses init_dlq() so it works even though the test doesn't
need DLQ otherwise — it's a belt-and-suspenders check that successful
jobs don't leak into the DLQ via any path.
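
The fast-forward-run_at trick mentioned above, sketched under the same sqlx/Postgres assumption: after each failed attempt the retry backoff pushes run_at into the future, so the test rewinds it to now() so the worker's next poll picks the job up instead of sleeping out the backoff window. The helper name is illustrative.

```rust
use sqlx::PgPool;

/// Rewind any future run_at values so retries become immediately eligible.
/// Returns how many rows were fast-forwarded.
async fn fast_forward_run_at(pool: &PgPool) -> sqlx::Result<u64> {
    let result =
        sqlx::query("UPDATE _private_jobs SET run_at = now() WHERE run_at > now()")
            .execute(pool)
            .await?;
    Ok(result.rows_affected())
}
```
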
Existing stale-lock tests exercise the SQL helpers (release_stale_*)
directly. This test instead validates the full crash-recovery loop
through the WorkerRunner startup path: stage a job with an old locked_at
+ fake locked_by (simulating a worker that crashed mid-execution), spawn
a fresh WorkerRunner, then poll for the row to disappear.

Validates four things in one shot:
- WorkerRunner.run_until_cancelled actually runs startup_cleanup_with_timeouts
- startup_cleanup releases stale job locks
- The newly-unlocked row is fetched by the worker's first poll
- The handler runs to completion and the row is deleted

A regression in any of those breaks the test. Bounded 5s wait + diagnostic
final-row-state assertion (rather than a fixed sleep) so the failure mode
is informative when something does break.
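
Shape of that bounded wait, again assuming an sqlx Postgres pool; the helper name and the row_to_json diagnostic are illustrative, and the real test pulls its own diagnostic state on failure.

```rust
use std::time::Duration;
use sqlx::PgPool;

/// Poll for the staged row to disappear within a 5s budget; on timeout, dump
/// the row's final state so the assertion failure says what actually happened.
async fn wait_for_row_to_disappear(pool: &PgPool, job_id: i64) {
    let deadline = tokio::time::Instant::now() + Duration::from_secs(5);
    loop {
        let remaining: i64 =
            sqlx::query_scalar("SELECT count(*) FROM _private_jobs WHERE id = $1")
                .bind(job_id)
                .fetch_one(pool)
                .await
                .expect("count query failed");
        if remaining == 0 {
            return; // cleanup ran, lock released, handler succeeded, row deleted
        }
        if tokio::time::Instant::now() >= deadline {
            // Diagnostic: serialize whatever is left of the row before failing.
            let state: Option<String> = sqlx::query_scalar(
                "SELECT row_to_json(j)::text FROM _private_jobs j WHERE id = $1",
            )
            .bind(job_id)
            .fetch_optional(pool)
            .await
            .expect("diagnostic query failed");
            panic!("job row still present after 5s: {state:?}");
        }
        tokio::time::sleep(Duration::from_millis(100)).await;
    }
}
```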

Adds a trivial OkHandler at the file level for tests that just need a
job that runs without doing anything. The test enqueues via enqueue_task
to ensure the payload shape matches the handler — an earlier draft of
this test enqueued a TestJob payload against the OkHandler identifier,
which deserialized wrong inside the worker and silently failed instead
of running, masquerading as a crash-recovery failure.
`test_dlq_requeue_job` (in dlq_tests.rs) verifies the requeue *operation*
— that requeued_count is incremented, last_requeued_at is set. It does
not verify that the requeued job actually runs to completion or that the
DLQ entry gets cleaned up afterward. This test closes that gap.

Lifecycle covered end-to-end:
1. Enqueue with a job_key (so DlqCleanupPlugin has something to match on).
2. Force the row to permanently-failed state via SQL.
3. process_failed_jobs() moves it to DLQ.
4. requeue_dlq_job() creates a fresh _private_jobs row.
5. Worker runs the requeued job — handler succeeds.
6. Assert: row is gone from _private_jobs (success).
7. Assert: DLQ is empty too — DlqCleanupPlugin auto-registered when DLQ
   is enabled, and its JobComplete hook deletes DLQ entries by job_key.

If DlqCleanupPlugin regresses or stops being auto-registered, the DLQ
entry would persist after the requeued job succeeded, and an admin
could be tricked into re-requeueing the same now-fixed job indefinitely.
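
An outline of the final two assertions in that lifecycle, assuming sqlx and a dedicated DLQ table (the name _private_jobs_dlq is a guess; the crate's actual schema may differ). The force-fail, process_failed_jobs(), and requeue_dlq_job() steps are elided here because their signatures live in the crate.

```rust
use sqlx::PgPool;

/// After the requeued job runs: the main row should be gone (success path)
/// and the DLQ entry should be gone too, via DlqCleanupPlugin's JobComplete
/// hook matching on job_key. Table names are assumptions for illustration.
async fn assert_fully_cleaned_up(pool: &PgPool) -> sqlx::Result<()> {
    let main: i64 = sqlx::query_scalar("SELECT count(*) FROM _private_jobs")
        .fetch_one(pool)
        .await?;
    assert_eq!(main, 0, "requeued job should have run to completion");

    let dlq: i64 = sqlx::query_scalar("SELECT count(*) FROM _private_jobs_dlq")
        .fetch_one(pool)
        .await?;
    assert_eq!(dlq, 0, "DlqCleanupPlugin should have removed the DLQ entry");
    Ok(())
}
```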

Also folds in a clippy fix for the diagnostic tuple type in the worker
crash-recovery test (extracted as a local type alias to silence
type_complexity).
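
The type_complexity fix in its generic form: hoist the tuple into a local type alias so the signature stays readable and clippy stops flagging it. The fields shown here are illustrative, not the test's actual diagnostic tuple.

```rust
// Without the alias, a signature taking &(Option<String>, Option<String>, i64)
// trips clippy::type_complexity; naming the tuple is the conventional fix.
type RowDiagnostics = (Option<String>, Option<String>, i64);

fn describe(diag: &RowDiagnostics) -> String {
    let (locked_by, last_error, attempts) = diag;
    format!("locked_by={locked_by:?} last_error={last_error:?} attempts={attempts}")
}

fn main() {
    let diag: RowDiagnostics = (Some("worker-1".into()), None, 2);
    println!("{}", describe(&diag));
}
```
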
ceejbot merged commit 81add59 into latest May 7, 2026
1 check passed
ceejbot deleted the ceej/test-coverage-gaps branch May 7, 2026 05:09