
Test-coverage gaps from review §6 #12

Merged
ceejbot merged 3 commits into latest from ceej/test-coverage-gaps on May 7, 2026

Conversation

ceejbot commented on May 7, 2026

Summary

Closes the three remaining test-coverage gaps from §6 of the
production-readiness review. The bug-fix work landed in 1.2.0 and 2.0.0;
this PR adds the regression coverage that should have been there all along.

No production code changes. Test count goes 108 → 111.

What's covered

test_retry_then_eventual_success (plugin_tests.rs)
Symmetric to test_retry_to_exhaustion_then_dlq_via_worker. A handler
that fails on its first N-1 attempts and succeeds on attempt N runs to
completion through the actual worker — no DLQ leakage, no leftover row.
Until now every worker-level retry test landed the job in DLQ; nothing
exercised the much-more-common transient-failure-recovers-naturally case.
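
Roughly what such a handler looks like, as a standalone sketch with hypothetical names (EventuallySucceed, succeed_on) rather than the crate's actual handler trait: a shared counter tracks the attempt number, and the handler keeps returning an error until the final attempt.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Stand-in for the EventuallySucceedJob handler: fails on the first N-1
/// attempts, succeeds on attempt N. The real test wires this into the crate's
/// handler trait; this sketch only shows the attempt-counting shape.
struct EventuallySucceed {
    attempts: AtomicU32,
    succeed_on: u32,
}

impl EventuallySucceed {
    fn run(&self) -> Result<(), String> {
        let attempt = self.attempts.fetch_add(1, Ordering::SeqCst) + 1;
        if attempt < self.succeed_on {
            Err(format!("transient failure on attempt {attempt}"))
        } else {
            Ok(())
        }
    }
}

fn main() {
    let job = EventuallySucceed { attempts: AtomicU32::new(0), succeed_on: 3 };
    assert!(job.run().is_err()); // attempt 1: transient failure, worker retries
    assert!(job.run().is_err()); // attempt 2: transient failure, worker retries
    assert!(job.run().is_ok());  // attempt 3: success, worker deletes the row
}
```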

test_worker_crash_recovery_through_startup (integration_tests_clean.rs)
Existing stale-lock tests exercise the SQL helpers directly. This test
validates the full crash-recovery loop through WorkerRunner startup:
stage a job with locked_at far in the past + fake locked_by, spawn a
fresh runner, poll for the row to disappear. Validates that startup
cleanup runs, releases the lock, the worker fetches the now-unlocked
row, runs the handler, and deletes the row — all in one bounded poll
loop with diagnostic state pulled on failure.
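
The staging step, sketched under the assumption of an sqlx Postgres pool and the column names quoted above (locked_at, locked_by on _private_jobs); the real test goes through the crate's own fixtures, so treat the exact SQL and the id type as illustrative.

```rust
use sqlx::PgPool;

/// Simulate a worker that died mid-execution: the lock is old enough that
/// startup cleanup should treat it as stale, and the holder no longer exists.
/// Table, column names, and i64 id are assumptions taken from the description.
async fn stage_stale_lock(pool: &PgPool, job_id: i64) -> sqlx::Result<()> {
    sqlx::query(
        "UPDATE _private_jobs
         SET locked_at = now() - interval '1 hour',
             locked_by = 'worker-that-crashed'
         WHERE id = $1",
    )
    .bind(job_id)
    .execute(pool)
    .await?;
    Ok(())
}
```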

test_dlq_requeue_runs_to_completion_and_cleans_dlq (plugin_tests.rs)
test_dlq_requeue_job (in dlq_tests.rs) verifies the requeue operation
itself: that requeued_count is incremented and last_requeued_at is set.
This new test verifies the full lifecycle: enqueue with a job_key → force-fail → DLQ
→ requeue → worker runs → handler succeeds → row gone from main → DLQ
entry also gone (cleaned up by the auto-registered DlqCleanupPlugin
on JobComplete via job_key match). If the cleanup plugin regressed,
this test would catch it.

Verification

  • 111 tests passing (was 108).
  • cargo clippy --all-targets -F axum: clean.
  • cargo +nightly fmt --check: clean.
  • cargo test --doc -F axum: 11/11 (1 intentionally ignored).

No version bump — test-only PR.

Test plan

  • CI passes (clippy + nextest + doctests + security audit + nightly fmt)
  • Bisect-friendly: each commit adds one self-contained test

ceejbot added 3 commits May 6, 2026 21:44
Symmetric counterpart to test_retry_to_exhaustion_then_dlq_via_worker.
Adds an EventuallySucceedJob handler that fails on its first N-1 attempts
and succeeds on attempt N, then runs it through the actual worker (with
the same fast-forward-run_at trick to skip exp-backoff sleeps) and asserts
that the job ran to completion: row deleted from _private_jobs, and
nothing in the DLQ.

This validates the retry loop's exit-via-success path. Until now, every
worker-level retry test landed the job in DLQ; nothing exercised the
much-more-common case of "transient failure recovers naturally."

The DLQ assertion uses init_dlq() so it works even though the test doesn't
need DLQ otherwise — it's a belt-and-suspenders check that successful
jobs don't leak into the DLQ via any path.
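
The fast-forward-run_at trick mentioned above, sketched under the same sqlx/Postgres assumption: after each failed attempt the retry backoff pushes run_at into the future, so the test rewinds it to now() so the worker's next poll picks the job up instead of sleeping out the backoff window. The helper name is illustrative.

```rust
use sqlx::PgPool;

/// Rewind any future run_at values so retries become immediately eligible.
/// Returns how many rows were fast-forwarded.
async fn fast_forward_run_at(pool: &PgPool) -> sqlx::Result<u64> {
    let result =
        sqlx::query("UPDATE _private_jobs SET run_at = now() WHERE run_at > now()")
            .execute(pool)
            .await?;
    Ok(result.rows_affected())
}
```
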
Existing stale-lock tests exercise the SQL helpers (release_stale_*)
directly. This test instead validates the full crash-recovery loop
through the WorkerRunner startup path: stage a job with an old locked_at
+ fake locked_by (simulating a worker that crashed mid-execution), spawn
a fresh WorkerRunner, then poll for the row to disappear.

Validates four things in one shot:
- WorkerRunner.run_until_cancelled actually runs startup_cleanup_with_timeouts
- startup_cleanup releases stale job locks
- The newly-unlocked row is fetched by the worker's first poll
- The handler runs to completion and the row is deleted

A regression in any of those breaks the test. Bounded 5s wait + diagnostic
final-row-state assertion (rather than a fixed sleep) so the failure mode
is informative when something does break.
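
Shape of that bounded wait, again assuming an sqlx Postgres pool; the helper name and the row_to_json diagnostic are illustrative, and the real test pulls its own diagnostic state on failure.

```rust
use std::time::Duration;
use sqlx::PgPool;

/// Poll for the staged row to disappear within a 5s budget; on timeout, dump
/// the row's final state so the assertion failure says what actually happened.
async fn wait_for_row_to_disappear(pool: &PgPool, job_id: i64) {
    let deadline = tokio::time::Instant::now() + Duration::from_secs(5);
    loop {
        let remaining: i64 =
            sqlx::query_scalar("SELECT count(*) FROM _private_jobs WHERE id = $1")
                .bind(job_id)
                .fetch_one(pool)
                .await
                .expect("count query failed");
        if remaining == 0 {
            return; // cleanup ran, lock released, handler succeeded, row deleted
        }
        if tokio::time::Instant::now() >= deadline {
            // Diagnostic: serialize whatever is left of the row before failing.
            let state: Option<String> = sqlx::query_scalar(
                "SELECT row_to_json(j)::text FROM _private_jobs j WHERE id = $1",
            )
            .bind(job_id)
            .fetch_optional(pool)
            .await
            .expect("diagnostic query failed");
            panic!("job row still present after 5s: {state:?}");
        }
        tokio::time::sleep(Duration::from_millis(100)).await;
    }
}
```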

Adds a trivial OkHandler at the file level for tests that just need a
job that runs without doing anything. The test enqueues via enqueue_task
to ensure the payload shape matches the handler — an earlier draft of
this test enqueued a TestJob payload against the OkHandler identifier,
which deserialized wrong inside the worker and silently failed instead
of running, masquerading as a crash-recovery failure.
`test_dlq_requeue_job` (in dlq_tests.rs) verifies the requeue *operation*
— that requeued_count is incremented, last_requeued_at is set. It does
not verify that the requeued job actually runs to completion or that the
DLQ entry gets cleaned up afterward. This test closes that gap.

Lifecycle covered end-to-end:
1. Enqueue with a job_key (so DlqCleanupPlugin has something to match on).
2. Force the row to permanently-failed state via SQL.
3. process_failed_jobs() moves it to DLQ.
4. requeue_dlq_job() creates a fresh _private_jobs row.
5. Worker runs the requeued job — handler succeeds.
6. Assert: row is gone from _private_jobs (success).
7. Assert: DLQ is empty too — DlqCleanupPlugin auto-registered when DLQ
   is enabled, and its JobComplete hook deletes DLQ entries by job_key.

If DlqCleanupPlugin regresses or stops being auto-registered, the DLQ
entry would persist after the requeued job succeeded, and an admin
could be tricked into re-requeueing the same now-fixed job indefinitely.
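
An outline of the final two assertions in that lifecycle, assuming sqlx and a dedicated DLQ table (the name _private_jobs_dlq is a guess; the crate's actual schema may differ). The force-fail, process_failed_jobs(), and requeue_dlq_job() steps are elided here because their signatures live in the crate.

```rust
use sqlx::PgPool;

/// After the requeued job runs: the main row should be gone (success path)
/// and the DLQ entry should be gone too, via DlqCleanupPlugin's JobComplete
/// hook matching on job_key. Table names are assumptions for illustration.
async fn assert_fully_cleaned_up(pool: &PgPool) -> sqlx::Result<()> {
    let main: i64 = sqlx::query_scalar("SELECT count(*) FROM _private_jobs")
        .fetch_one(pool)
        .await?;
    assert_eq!(main, 0, "requeued job should have run to completion");

    let dlq: i64 = sqlx::query_scalar("SELECT count(*) FROM _private_jobs_dlq")
        .fetch_one(pool)
        .await?;
    assert_eq!(dlq, 0, "DlqCleanupPlugin should have removed the DLQ entry");
    Ok(())
}
```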

Also folds in a clippy fix for the diagnostic tuple type in the worker
crash-recovery test (extracted as a local type alias to silence
type_complexity).
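
The type_complexity fix in its generic form: hoist the tuple into a local type alias so the signature stays readable and clippy stops flagging it. The fields shown here are illustrative, not the test's actual diagnostic tuple.

```rust
// Without the alias, a signature taking &(Option<String>, Option<String>, i64)
// trips clippy::type_complexity; naming the tuple is the conventional fix.
type RowDiagnostics = (Option<String>, Option<String>, i64);

fn describe(diag: &RowDiagnostics) -> String {
    let (locked_by, last_error, attempts) = diag;
    format!("locked_by={locked_by:?} last_error={last_error:?} attempts={attempts}")
}

fn main() {
    let diag: RowDiagnostics = (Some("worker-1".into()), None, 2);
    println!("{}", describe(&diag));
}
```
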
ceejbot merged commit 81add59 into latest May 7, 2026
1 check passed
ceejbot deleted the ceej/test-coverage-gaps branch May 7, 2026 05:09