
fix work queue race condition causing silent data loss (#50) #51

Merged
willr3 merged 1 commit into Hyperfoil:main from stalep:fix/work-queue-race
Apr 16, 2026

Conversation

@stalep (Member) commented Apr 13, 2026

The Race Condition — Why This Fix Matters

The problem: workers start before the DB row is visible

Here's the timeline that causes silent data loss:

Thread A (upload)                    Thread B (worker)
─────────────────                    ─────────────────
@Transactional upload() begins
  ├─ em.merge(work) + em.flush()
  │    → SQL INSERT emitted
  │    → row in DB but NOT YET COMMITTED
  ├─ workQueue.addWorks(work)
  │    → work is now in the in-memory queue
  │                                  workQueue.take() → picks up work
  │                                  @Transactional execute() begins
  │                                    em.merge(w) or em.find(w.id)
  │                                    → StaleObjectStateException!
  │                                    → row not visible (TX A hasn't committed)
  │                                    → RETRY_LIMIT=0, work is dropped
  │                                    → no error shown to user
  ├─ upload() returns
  └─ TX A commits → row now visible
                                     (too late — work already dropped)

FolderService.upload() and WorkService.create() share the same @Transactional(REQUIRED) transaction. The Work row is flushed to the DB inside that transaction, then immediately added to the in-memory work queue. But the transaction hasn't committed yet — the row isn't visible to other transactions. When a worker thread picks up the work and starts its own transaction, the row doesn't exist from its perspective.

With RETRY_LIMIT = 0, the worker prints to stderr and drops the work permanently. The upload returns successfully, the user sees no error, but computed values are missing.

How bad is it?

Issue #50 documented a benchmark run: 10 uploads with a 24-node topology produced 83 values instead of ~250 — 67% data loss. 2,386 Work items failed in that single run, 96.6% from StaleObjectStateException. The cascade effect makes it worse: one failed first-tier node prevents all its downstream nodes from running.

What this PR changes

1. Defer queue insertion until after commit (WorkService.create())

The core fix. Instead of workQueue.addWorks(newWorks) inside the transaction, we register a JTA Synchronization.afterCompletion() callback. Work items only enter the in-memory queue after STATUS_COMMITTED — guaranteeing the DB row is visible to all transactions.
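The mechanics of the deferral can be sketched with a minimal stand-in for the JTA machinery. `FakeTransaction` and `WorkCreation` are illustrative names, not the project's actual classes; the real code registers a `jakarta.transaction.Synchronization` whose `afterCompletion(int status)` runs once the transaction outcome is known.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.IntConsumer;

// Stand-in for jakarta.transaction.Synchronization: callbacks registered
// during the transaction run only at completion, receiving the final
// transaction status.
class FakeTransaction {
    static final int STATUS_COMMITTED = 3;   // values mirror jakarta.transaction.Status
    static final int STATUS_ROLLEDBACK = 4;

    private final List<IntConsumer> afterCompletion = new ArrayList<>();

    void registerAfterCompletion(IntConsumer callback) {
        afterCompletion.add(callback);
    }

    void commit()   { afterCompletion.forEach(cb -> cb.accept(STATUS_COMMITTED)); }
    void rollback() { afterCompletion.forEach(cb -> cb.accept(STATUS_ROLLEDBACK)); }
}

class WorkCreation {
    final Queue<String> workQueue = new ArrayDeque<>();

    // Instead of workQueue.add(work) inside the transaction, defer the
    // enqueue until the commit has made the row visible to other
    // transactions.
    void create(FakeTransaction tx, String work) {
        // ... em.merge(work) + em.flush() would happen here ...
        tx.registerAfterCompletion(status -> {
            if (status == FakeTransaction.STATUS_COMMITTED) {
                workQueue.add(work);
            }
        });
    }
}
```

A worker draining `workQueue` can now never observe an item whose row is still invisible: the enqueue and the commit are ordered by construction, not by timing.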

2. Same pattern for retry re-queue (WorkService.execute())

The retry path had the same race: workQueue.add(w) re-queued failed work before the retryCount update committed. Another worker could pick it up, see the old retryCount, and re-execute concurrently. Now retries also use afterCompletion. On rollback, we still re-queue (since the DB state wasn't advanced — the work would otherwise be stuck until restart).
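The one twist versus the create path is the rollback branch, which can be sketched like this (class and method names are illustrative, not the project's actual code):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.IntConsumer;

// Unlike the create path, the retry callback re-queues on BOTH outcomes:
// after commit the bumped retryCount is visible to the next worker; after
// rollback the DB never advanced, so the work must still go back on the
// queue or it would sit lost until a restart.
class RetryRequeue {
    static final int STATUS_ROLLEDBACK = 4;  // mirrors jakarta.transaction.Status

    final Queue<String> workQueue = new ArrayDeque<>();
    private final List<IntConsumer> afterCompletion = new ArrayList<>();

    void onExecutionFailed(String work) {
        // ... retryCount would be incremented on the managed entity here ...
        afterCompletion.add(status -> workQueue.add(work)); // status ignored on purpose
    }

    void completeTransaction(int status) {
        afterCompletion.forEach(cb -> cb.accept(status));
        afterCompletion.clear();
    }
}
```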

3. em.find() instead of em.merge() (WorkService.execute())

em.merge() triggers dirty-checking and cascade operations on detached entities, which caused StaleObjectStateException via CascadeType.MERGE on the @ManyToMany relations. em.find() does a simple SELECT and returns null if the row doesn't exist — much safer. CascadeType.MERGE was also removed from Work's sourceValues and sourceNodes.
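The behavioral difference can be illustrated with a toy model (the `Map` stands in for the database; all names here are made up for the sketch):

```java
import java.util.HashMap;
import java.util.Map;

// find() semantics: a plain SELECT that returns null when the row is
// absent, so the worker can handle "not visible / already gone" as an
// explicit branch instead of an exception bubbling out of merge().
class FindVsMerge {
    final Map<Long, String> db = new HashMap<>();

    String find(Long id) { return db.get(id); }

    String execute(Long id) {
        String w = find(id);
        if (w == null) {
            return "skipped"; // row missing: no StaleObjectStateException path
        }
        return "executed " + w;
    }
}
```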

4. @Version on Work entity

Adds optimistic locking so concurrent modifications are detected rather than silently overwriting each other.
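What `@Version` buys can be modeled with a compare-and-swap on a version counter (illustrative only; Hibernate implements this as `UPDATE ... WHERE id = ? AND version = ?` and raises an optimistic-lock failure when no row matches):

```java
import java.util.concurrent.atomic.AtomicLong;

// An update only succeeds if the version it originally read is still
// current; a concurrent writer's commit bumps the version and makes the
// stale update fail instead of silently overwriting.
class VersionedRow {
    private final AtomicLong version = new AtomicLong(0);
    private volatile String payload = "";

    boolean update(long expectedVersion, String newPayload) {
        if (!version.compareAndSet(expectedVersion, expectedVersion + 1)) {
            return false; // stale: somebody committed in between
        }
        payload = newPayload;
        return true;
    }

    long currentVersion() { return version.get(); }
    String payload() { return payload; }
}
```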

5. RETRY_LIMIT = 0 → 3

Transient failures (deadlocks, brief connection starvation) now get retried instead of being permanently dropped.

6. delete() uses em.find() + em.remove() instead of deleteById()

deleteById() doesn't respect @Version — it does a direct DELETE by primary key, bypassing optimistic lock checks. em.find() + em.remove() loads the entity (including its version) and checks it during flush.

7. Eager initialization of lazy proxies before afterCompletion

Since afterCompletion runs outside the Hibernate session, any lazy proxies accessed in the callback would throw LazyInitializationException. The fix initializes the fields that WorkQueue.sort() and Work.dependsOn() traverse while the session is still active.
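The hazard can be modeled with a field that is only loadable while the "session" is open (names are illustrative; the real fix materializes the Hibernate collections before registering the callback):

```java
import java.util.List;
import java.util.function.Supplier;

// A lazy field backed by a loader that is only valid while the session
// is open. Initializing it eagerly, before the afterCompletion callback
// can run, makes later access safe.
class LazyWork {
    private final Supplier<List<String>> loader;
    private List<String> dependsOn;        // null until initialized

    LazyWork(Supplier<List<String>> loader) { this.loader = loader; }

    void initializeWhileSessionOpen() { dependsOn = loader.get(); }

    List<String> dependsOn() {
        if (dependsOn == null) {
            // models Hibernate's LazyInitializationException
            throw new IllegalStateException("no session: proxy not initialized");
        }
        return dependsOn;
    }
}
```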

8. Observability

  • System.err.println → structured Log.warnf/Log.errorf with work ID and node ID
  • Micrometer counters: h5m.work.completed, h5m.work.failed
  • retryCount persisted to DB so retry state survives restarts

9. Infrastructure

  • WorkQueueExecutor producer changed from @Dependent to @Singleton (was creating separate queue instances per injection point)
  • Startup validation warns if worker thread count exceeds DB connection pool
  • Test JDBC pool increased from 2 to 10 to prevent connection starvation

How the tests prove it

WorkQueueRaceTest has two variants:

  • rapidUploadsShouldProduceAllValues — uploads 5 files back-to-back without draining, maximizing contention
  • sequentialUploadsShouldProduceAllValues — uploads with queue drain between each

Both assert three things that would fail if the race condition existed:

  1. Zero permanently failed work items (counter check)
  2. Zero orphaned Work rows in DB
  3. At least 25 computed values produced

The awaitWorkQueue helper requires both isIdle() AND Work.count() == 0 for 5 consecutive checks — catching the brief gap between a parent work completing and its cascade children being enqueued via afterCompletion.
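The consecutive-checks rule can be sketched as a polling loop (the helper's signature here is an assumption; only the "both conditions, 5 consecutive checks" rule comes from the PR):

```java
import java.util.function.BooleanSupplier;

// Requiring N consecutive passing polls guards against the window where
// the queue looks idle just before an afterCompletion callback enqueues
// cascade children: any single failed check resets the streak.
class QueueAwaiter {
    static boolean await(BooleanSupplier isIdle, BooleanSupplier noWorkRows,
                         int requiredConsecutive, int maxPolls) {
        int consecutive = 0;
        for (int i = 0; i < maxPolls; i++) {
            if (isIdle.getAsBoolean() && noWorkRows.getAsBoolean()) {
                if (++consecutive >= requiredConsecutive) {
                    return true;
                }
            } else {
                consecutive = 0; // streak broken, start over
            }
            // a real helper would sleep briefly between polls
        }
        return false;
    }
}
```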


@stalep force-pushed the fix/work-queue-race branch from e0c33fe to 91815c3 on April 15, 2026 at 09:54
@stalep requested a review from Copilot on April 15, 2026 at 09:55

@stalep force-pushed the fix/work-queue-race branch 2 times, most recently from 531d4ce to 0ad0f4d on April 15, 2026 at 14:48
@stalep requested a review from Copilot on April 15, 2026 at 14:48


@stalep force-pushed the fix/work-queue-race branch from 0ad0f4d to df19c1f on April 15, 2026 at 15:25


Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.


Work items were queued in-memory before the creating transaction
committed, so worker threads hit StaleObjectStateException on the
not-yet-visible DB rows and silently dropped work.

Fixes:
- Defer queue insertion via JTA Synchronization.afterCompletion() so
  workers only see committed DB rows
- Eagerly initialize lazy proxies before afterCompletion since the
  Hibernate session is closed when addWorks runs
- Use em.find() instead of em.merge() in execute() to load fresh
  managed state rather than dirty-marking detached entities
- Increase test connection pool (2 → 10) and transaction timeout
  to support the new concurrency regression test

Follow-up issues for deferred scope:
- Hyperfoil#53 Optimistic locking (@Version) on Work
- Hyperfoil#54 Retry logic and structured error handling
- Hyperfoil#55 Micrometer metrics for work processing
@stalep force-pushed the fix/work-queue-race branch from 387ebf6 to 9fcf300 on April 16, 2026 at 16:53
@willr3 merged commit c04a4ce into Hyperfoil:main on Apr 16, 2026
1 check passed
