fix work queue race condition causing silent data loss (#50) #51
Merged
willr3 merged 1 commit into Hyperfoil:main on Apr 16, 2026
Conversation
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.
Work items were queued in-memory before the creating transaction committed, so worker threads hit StaleObjectStateException on the not-yet-visible DB rows and silently dropped work.

Fixes:
- Defer queue insertion via JTA Synchronization.afterCompletion() so workers only see committed DB rows
- Eagerly initialize lazy proxies before afterCompletion, since the Hibernate session is closed when addWorks runs
- Use em.find() instead of em.merge() in execute() to load fresh managed state rather than dirty-marking detached entities
- Increase the test connection pool (2 → 10) and transaction timeout to support the new concurrency regression test

Follow-up issues for deferred scope:
- Hyperfoil#53 Optimistic locking (@Version) on Work
- Hyperfoil#54 Retry logic and structured error handling
- Hyperfoil#55 Micrometer metrics for work processing
## The Race Condition — Why This Fix Matters

### The problem: workers start before the DB row is visible

Here's the timeline that causes silent data loss:

`FolderService.upload()` and `WorkService.create()` share the same `@Transactional(REQUIRED)` transaction. The Work row is flushed to the DB inside that transaction, then immediately added to the in-memory work queue. But the transaction hasn't committed yet — the row isn't visible to other transactions. When a worker thread picks up the work and starts its own transaction, the row doesn't exist from its perspective.

With `RETRY_LIMIT = 0`, the worker prints to stderr and drops the work permanently. The upload returns successfully, the user sees no error, but computed values are missing.

### How bad is it?
Issue #50 documented a benchmark run: 10 uploads with a 24-node topology produced 83 values instead of ~250 — 67% data loss. 2,386 Work items failed in that single run, 96.6% of them from `StaleObjectStateException`. The cascade effect makes it worse: one failed first-tier node prevents all of its downstream nodes from running.

## What this PR changes
### 1. Defer queue insertion until after commit (`WorkService.create()`)

The core fix. Instead of calling `workQueue.addWorks(newWorks)` inside the transaction, we register a JTA `Synchronization.afterCompletion()` callback. Work items only enter the in-memory queue after `STATUS_COMMITTED` — guaranteeing the DB row is visible to all transactions.

### 2. Same pattern for retry re-queue (`WorkService.execute()`)

The retry path had the same race: `workQueue.add(w)` re-queued failed work before the `retryCount` update committed. Another worker could pick it up, see the old `retryCount`, and re-execute concurrently. Now retries also use `afterCompletion`. On rollback, we still re-queue (since the DB state wasn't advanced, the work would otherwise be stuck until restart).

### 3. `em.find()` instead of `em.merge()` (`WorkService.execute()`)

`em.merge()` triggers dirty-checking and cascade operations on detached entities, which caused `StaleObjectStateException` via `CascadeType.MERGE` on the `@ManyToMany` relations. `em.find()` does a simple SELECT and returns null if the row doesn't exist — much safer. `CascadeType.MERGE` was also removed from Work's `sourceValues` and `sourceNodes`.

### 4. `@Version` on the Work entity

Adds optimistic locking so concurrent modifications are detected rather than silently overwriting each other.
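The deferred-enqueue pattern from items 1 and 2 can be modeled in a few lines of plain Java. This is a minimal sketch, not the actual `WorkService` code: `MiniTxn`, `DeferredEnqueueSketch`, and the `create` signature are invented stand-ins for JTA's `Transaction.registerSynchronization` and the `jakarta.transaction.Status` constants.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.IntConsumer;

// Simplified model of the fix: work enters the in-memory queue only from an
// afterCompletion-style callback, never from inside the open transaction.
public class DeferredEnqueueSketch {
    static final int STATUS_COMMITTED = 3;  // mirrors jakarta.transaction.Status
    static final int STATUS_ROLLEDBACK = 4;

    // Stand-in for a JTA transaction that runs afterCompletion callbacks.
    static class MiniTxn {
        private final List<IntConsumer> afterCompletion = new ArrayList<>();
        void registerSynchronization(IntConsumer cb) { afterCompletion.add(cb); }
        void end(int status) { afterCompletion.forEach(cb -> cb.accept(status)); }
    }

    static final Queue<String> workQueue = new ConcurrentLinkedQueue<>();

    // Analogue of WorkService.create(): defer queue insertion until commit.
    static void create(MiniTxn txn, List<String> newWorks) {
        txn.registerSynchronization(status -> {
            if (status == STATUS_COMMITTED) {
                workQueue.addAll(newWorks); // DB rows are now visible to workers
            }
        });
        // NOTE: no workQueue.addAll(...) here — that was the race.
    }

    public static void main(String[] args) {
        MiniTxn txn = new MiniTxn();
        create(txn, List.of("work-1", "work-2"));
        System.out.println("before commit: " + workQueue.size()); // 0 — nothing queued yet
        txn.end(STATUS_COMMITTED);
        System.out.println("after commit: " + workQueue.size());  // 2 — queued post-commit
    }
}
```

The retry path in item 2 differs only in the status check: per the PR, it re-queues on rollback as well, since the DB state wasn't advanced and the work would otherwise be stuck.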
### 5. `RETRY_LIMIT = 0` → `3`

Transient failures (deadlocks, brief connection starvation) now get retried instead of being permanently dropped.
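To illustrate what a retry budget of 3 means in practice, here is a hedged sketch: the real retry path goes back through the work queue via `afterCompletion`, not a tight loop, and `RetrySketch`/`withRetries` are invented names.

```java
import java.util.function.Supplier;

public class RetrySketch {
    static final int RETRY_LIMIT = 3; // was 0: any failure dropped the work permanently

    // Run an action, retrying after failures up to RETRY_LIMIT extra attempts.
    static <T> T withRetries(Supplier<T> action) {
        RuntimeException last = null;
        for (int attempt = 0; attempt <= RETRY_LIMIT; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e; // e.g. a deadlock or brief connection starvation
            }
        }
        throw last; // budget exhausted: surface the failure instead of silence
    }

    public static void main(String[] args) {
        int[] failures = {2}; // fail twice, then succeed on the third attempt
        String result = withRetries(() -> {
            if (failures[0]-- > 0) throw new IllegalStateException("transient");
            return "ok";
        });
        System.out.println(result); // ok
    }
}
```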
### 6. `delete()` uses `em.find()` + `em.remove()` instead of `deleteById()`

`deleteById()` doesn't respect `@Version` — it does a direct DELETE by primary key, bypassing optimistic lock checks. `em.find()` + `em.remove()` loads the entity (including its version) and checks it during flush.

### 7. Eager initialization of lazy proxies before `afterCompletion`

Since `afterCompletion` runs outside the Hibernate session, any lazy proxies accessed in the callback would throw `LazyInitializationException`. The fix initializes the fields that `WorkQueue.sort()` → `Work.dependsOn()` traverses while the session is still active.

### 8. Observability
- `System.err.println` → structured `Log.warnf`/`Log.errorf` with work ID and node ID
- `h5m.work.completed`, `h5m.work.failed`
- `retryCount` persisted to DB so retry state survives restarts

### 9. Infrastructure

The `WorkQueueExecutor` producer changed from `@Dependent` to `@Singleton` (it was creating separate queue instances per injection point).

## How the tests prove it
`WorkQueueRaceTest` has two variants:

- `rapidUploadsShouldProduceAllValues` — uploads 5 files back-to-back without draining, maximizing contention
- `sequentialUploadsShouldProduceAllValues` — uploads with queue drain between each

Both assert three things that would fail if the race condition existed.

The `awaitWorkQueue` helper requires both `isIdle()` AND `Work.count() == 0` for 5 consecutive checks — catching the brief gap between a parent work completing and its cascade children being enqueued via `afterCompletion`.
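The consecutive-check idea can be sketched in plain Java. This is an invented model of the helper's logic, not the test's actual code: the suppliers stand in for the real `isIdle()` and `Work.count()`, and a real implementation would sleep between polls. Requiring the streak closes the window where a parent completes (the queue momentarily looks idle) before its cascade children are enqueued.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;

public class AwaitSketch {
    static final int REQUIRED_CONSECUTIVE = 5;

    // Returns true once isIdle() && count() == 0 has held for 5 consecutive polls.
    static boolean awaitWorkQueue(BooleanSupplier isIdle, LongSupplier count, int maxPolls) {
        int consecutive = 0;
        for (int i = 0; i < maxPolls; i++) {
            if (isIdle.getAsBoolean() && count.getAsLong() == 0) {
                if (++consecutive >= REQUIRED_CONSECUTIVE) return true;
            } else {
                consecutive = 0; // a single non-idle poll resets the streak
            }
        }
        return false; // never settled within the poll budget
    }

    public static void main(String[] args) {
        // Simulate a queue that looks idle twice, then enqueues a cascade child,
        // then stays idle — a single isIdle() check would have passed too early.
        long[] pending = {0, 0, 1, 0, 0, 0, 0, 0, 0, 0};
        int[] poll = {0};
        boolean settled = awaitWorkQueue(
                () -> true,
                () -> pending[Math.min(poll[0]++, pending.length - 1)],
                20);
        System.out.println(settled); // true, but only after the streak restarts
    }
}
```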