
fix work queue race condition causing silent data loss (#50) #51

Merged
willr3 merged 1 commit into Hyperfoil:main from stalep:fix/work-queue-race
Apr 16, 2026

Conversation

@stalep (Member) commented Apr 13, 2026

The Race Condition — Why This Fix Matters

The problem: workers start before the DB row is visible

Here's the timeline that causes silent data loss:

Thread A (upload)                    Thread B (worker)
─────────────────                    ─────────────────
@Transactional upload() begins
  ├─ em.merge(work) + em.flush()
  │    → SQL INSERT emitted
  │    → row in DB but NOT YET COMMITTED
  ├─ workQueue.addWorks(work)
  │    → work is now in the in-memory queue
  │                                  workQueue.take() → picks up work
  │                                  @Transactional execute() begins
  │                                    em.merge(w) or em.find(w.id)
  │                                    → StaleObjectStateException!
  │                                    → row not visible (TX A hasn't committed)
  │                                    → RETRY_LIMIT=0, work is dropped
  │                                    → no error shown to user
  ├─ upload() returns
  └─ TX A commits → row now visible
                                     (too late — work already dropped)

FolderService.upload() and WorkService.create() share the same @Transactional(REQUIRED) transaction. The Work row is flushed to the DB inside that transaction, then immediately added to the in-memory work queue. But the transaction hasn't committed yet — the row isn't visible to other transactions. When a worker thread picks up the work and starts its own transaction, the row doesn't exist from its perspective.

With RETRY_LIMIT = 0, the worker prints to stderr and drops the work permanently. The upload returns successfully, the user sees no error, but computed values are missing.

How bad is it?

Issue #50 documented a benchmark run: 10 uploads with a 24-node topology produced 83 values instead of ~250 — 67% data loss. 2,386 Work items failed in that single run, 96.6% from StaleObjectStateException. The cascade effect makes it worse: one failed first-tier node prevents all its downstream nodes from running.

What this PR changes

1. Defer queue insertion until after commit (WorkService.create())

The core fix. Instead of workQueue.addWorks(newWorks) inside the transaction, we register a JTA Synchronization.afterCompletion() callback. Work items only enter the in-memory queue after STATUS_COMMITTED — guaranteeing the DB row is visible to all transactions.
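The mechanics of the deferral can be sketched with a minimal stand-in for the JTA machinery. `FakeTransaction` and `WorkCreation` are illustrative names, not the project's actual classes; the real code registers a `jakarta.transaction.Synchronization` whose `afterCompletion(int status)` runs once the transaction outcome is known.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.IntConsumer;

// Stand-in for jakarta.transaction.Synchronization: callbacks registered
// during the transaction run only at completion, receiving the final
// transaction status.
class FakeTransaction {
    static final int STATUS_COMMITTED = 3;   // values mirror jakarta.transaction.Status
    static final int STATUS_ROLLEDBACK = 4;

    private final List<IntConsumer> afterCompletion = new ArrayList<>();

    void registerAfterCompletion(IntConsumer callback) {
        afterCompletion.add(callback);
    }

    void commit()   { afterCompletion.forEach(cb -> cb.accept(STATUS_COMMITTED)); }
    void rollback() { afterCompletion.forEach(cb -> cb.accept(STATUS_ROLLEDBACK)); }
}

class WorkCreation {
    final Queue<String> workQueue = new ArrayDeque<>();

    // Instead of workQueue.add(work) inside the transaction, defer the
    // enqueue until the commit has made the row visible to other
    // transactions.
    void create(FakeTransaction tx, String work) {
        // ... em.merge(work) + em.flush() would happen here ...
        tx.registerAfterCompletion(status -> {
            if (status == FakeTransaction.STATUS_COMMITTED) {
                workQueue.add(work);
            }
        });
    }
}
```

A worker draining `workQueue` can now never observe an item whose row is still invisible: the enqueue and the commit are ordered by construction, not by timing.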

2. Same pattern for retry re-queue (WorkService.execute())

The retry path had the same race: workQueue.add(w) re-queued failed work before the retryCount update committed. Another worker could pick it up, see the old retryCount, and re-execute concurrently. Now retries also use afterCompletion. On rollback, we still re-queue (since the DB state wasn't advanced — the work would otherwise be stuck until restart).
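The one twist versus the create path is the rollback branch, which can be sketched like this (class and method names are illustrative, not the project's actual code):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.IntConsumer;

// Unlike the create path, the retry callback re-queues on BOTH outcomes:
// after commit the bumped retryCount is visible to the next worker; after
// rollback the DB never advanced, so the work must still go back on the
// queue or it would sit lost until a restart.
class RetryRequeue {
    static final int STATUS_ROLLEDBACK = 4;  // mirrors jakarta.transaction.Status

    final Queue<String> workQueue = new ArrayDeque<>();
    private final List<IntConsumer> afterCompletion = new ArrayList<>();

    void onExecutionFailed(String work) {
        // ... retryCount would be incremented on the managed entity here ...
        afterCompletion.add(status -> workQueue.add(work)); // status ignored on purpose
    }

    void completeTransaction(int status) {
        afterCompletion.forEach(cb -> cb.accept(status));
        afterCompletion.clear();
    }
}
```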

3. em.find() instead of em.merge() (WorkService.execute())

em.merge() triggers dirty-checking and cascade operations on detached entities, which caused StaleObjectStateException via CascadeType.MERGE on the @ManyToMany relations. em.find() does a simple SELECT and returns null if the row doesn't exist — much safer. CascadeType.MERGE was also removed from Work's sourceValues and sourceNodes.
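The behavioral difference can be illustrated with a toy model (the `Map` stands in for the database; all names here are made up for the sketch):

```java
import java.util.HashMap;
import java.util.Map;

// find() semantics: a plain SELECT that returns null when the row is
// absent, so the worker can handle "not visible / already gone" as an
// explicit branch instead of an exception bubbling out of merge().
class FindVsMerge {
    final Map<Long, String> db = new HashMap<>();

    String find(Long id) { return db.get(id); }

    String execute(Long id) {
        String w = find(id);
        if (w == null) {
            return "skipped"; // row missing: no StaleObjectStateException path
        }
        return "executed " + w;
    }
}
```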

4. @Version on Work entity

Adds optimistic locking so concurrent modifications are detected rather than silently overwriting each other.
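What `@Version` buys can be modeled with a compare-and-swap on a version counter (illustrative only; Hibernate implements this as `UPDATE ... WHERE id = ? AND version = ?` and raises an optimistic-lock failure when no row matches):

```java
import java.util.concurrent.atomic.AtomicLong;

// An update only succeeds if the version it originally read is still
// current; a concurrent writer's commit bumps the version and makes the
// stale update fail instead of silently overwriting.
class VersionedRow {
    private final AtomicLong version = new AtomicLong(0);
    private volatile String payload = "";

    boolean update(long expectedVersion, String newPayload) {
        if (!version.compareAndSet(expectedVersion, expectedVersion + 1)) {
            return false; // stale: somebody committed in between
        }
        payload = newPayload;
        return true;
    }

    long currentVersion() { return version.get(); }
    String payload() { return payload; }
}
```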

5. RETRY_LIMIT = 0 → 3

Transient failures (deadlocks, brief connection starvation) now get retried instead of being permanently dropped.

6. delete() uses em.find() + em.remove() instead of deleteById()

deleteById() doesn't respect @Version — it does a direct DELETE by primary key, bypassing optimistic lock checks. em.find() + em.remove() loads the entity (including its version) and checks it during flush.

7. Eager initialization of lazy proxies before afterCompletion

Since afterCompletion runs outside the Hibernate session, any lazy proxies accessed in the callback would throw LazyInitializationException. The fix initializes the fields that WorkQueue.sort() and Work.dependsOn() traverse while the session is still active.
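The hazard can be modeled with a field that is only loadable while the "session" is open (names are illustrative; the real fix materializes the Hibernate collections before registering the callback):

```java
import java.util.List;
import java.util.function.Supplier;

// A lazy field backed by a loader that is only valid while the session
// is open. Initializing it eagerly, before the afterCompletion callback
// can run, makes later access safe.
class LazyWork {
    private final Supplier<List<String>> loader;
    private List<String> dependsOn;        // null until initialized

    LazyWork(Supplier<List<String>> loader) { this.loader = loader; }

    void initializeWhileSessionOpen() { dependsOn = loader.get(); }

    List<String> dependsOn() {
        if (dependsOn == null) {
            // models Hibernate's LazyInitializationException
            throw new IllegalStateException("no session: proxy not initialized");
        }
        return dependsOn;
    }
}
```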

8. Observability

  • System.err.println → structured Log.warnf/Log.errorf with work ID and node ID
  • Micrometer counters: h5m.work.completed, h5m.work.failed
  • retryCount persisted to DB so retry state survives restarts

9. Infrastructure

  • WorkQueueExecutor producer changed from @Dependent to @Singleton (was creating separate queue instances per injection point)
  • Startup validation warns if worker thread count exceeds DB connection pool
  • Test JDBC pool increased from 2 to 10 to prevent connection starvation

How the tests prove it

WorkQueueRaceTest has two variants:

  • rapidUploadsShouldProduceAllValues — uploads 5 files back-to-back without draining, maximizing contention
  • sequentialUploadsShouldProduceAllValues — uploads with queue drain between each

Both assert three things that would fail if the race condition existed:

  1. Zero permanently failed work items (counter check)
  2. Zero orphaned Work rows in DB
  3. At least 25 computed values produced

The awaitWorkQueue helper requires both isIdle() AND Work.count() == 0 for 5 consecutive checks — catching the brief gap between a parent work completing and its cascade children being enqueued via afterCompletion.
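The consecutive-checks rule can be sketched as a polling loop (the helper's signature here is an assumption; only the "both conditions, 5 consecutive checks" rule comes from the PR):

```java
import java.util.function.BooleanSupplier;

// Requiring N consecutive passing polls guards against the window where
// the queue looks idle just before an afterCompletion callback enqueues
// cascade children: any single failed check resets the streak.
class QueueAwaiter {
    static boolean await(BooleanSupplier isIdle, BooleanSupplier noWorkRows,
                         int requiredConsecutive, int maxPolls) {
        int consecutive = 0;
        for (int i = 0; i < maxPolls; i++) {
            if (isIdle.getAsBoolean() && noWorkRows.getAsBoolean()) {
                if (++consecutive >= requiredConsecutive) {
                    return true;
                }
            } else {
                consecutive = 0; // streak broken, start over
            }
            // a real helper would sleep briefly between polls
        }
        return false;
    }
}
```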


@stalep force-pushed the fix/work-queue-race branch from e0c33fe to 91815c3 on April 15, 2026 at 09:54
@stalep requested a review from Copilot on April 15, 2026 at 09:55

@stalep force-pushed the fix/work-queue-race branch 2 times, most recently from 531d4ce to 0ad0f4d on April 15, 2026 at 14:48
@stalep requested a review from Copilot on April 15, 2026 at 14:48


@stalep force-pushed the fix/work-queue-race branch from 0ad0f4d to df19c1f on April 15, 2026 at 15:25


Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.


Work items were queued in-memory before the creating transaction
committed, so worker threads hit StaleObjectStateException on the
not-yet-visible DB rows and silently dropped work.

Fixes:
- Defer queue insertion via JTA Synchronization.afterCompletion() so
  workers only see committed DB rows
- Eagerly initialize lazy proxies before afterCompletion since the
  Hibernate session is closed when addWorks runs
- Use em.find() instead of em.merge() in execute() to load fresh
  managed state rather than dirty-marking detached entities
- Increase test connection pool (2 → 10) and transaction timeout
  to support the new concurrency regression test

Follow-up issues for deferred scope:
- Hyperfoil#53 Optimistic locking (@Version) on Work
- Hyperfoil#54 Retry logic and structured error handling
- Hyperfoil#55 Micrometer metrics for work processing
@stalep force-pushed the fix/work-queue-race branch from 387ebf6 to 9fcf300 on April 16, 2026 at 16:53
@willr3 merged commit c04a4ce into Hyperfoil:main on Apr 16, 2026
1 check passed
