Make scale-up worker additions transactional #853
Open
Andyz26 wants to merge 5 commits into
598f8f2 to 020559b
- testJobSubmitInitalizationFails: actor now stops on submit-time init failure, so assert termination instead of expecting a follow-up GetJobDetailsResponse.
- testWorkerAcceptedToStartedMsRecordedOnWorkerStatus: SpectatorRegistryFactory.setRegistry is a one-shot CAS, so when JobScaleUpDownTests installs a registry first this test was reading a different DefaultRegistry instance than the one actually wired up. Read the installed registry via getRegistry() instead.
hellolittlej
reviewed
May 6, 2026
        store.updateStage(this);
    } catch (Exception e) {
        this.numWorkers = previousNumWorkers;
        throw e;
Collaborator
do we log the exception and track the error here?
Collaborator
Author
I think the logging happens in the upper layer (the caller).
hellolittlej
reviewed
May 6, 2026
            describeRollbackWorker(workerByNumber)));
    }

    if (!workerByIndexMetadataSet.remove(workerIndex, workerByIndex)) {
Collaborator
In which case would this check fail? And when would the check below fail after the first check has passed?
james-lubin
approved these changes
May 6, 2026
Make scale-up worker additions transactional
Summary
`JobActor.scaleStage` previously allowed in-memory job metadata to drift from
durable storage on `MantisJobStore.storeNewWorker` failures: the new
`JobWorker` was added to the stage's index/number maps but no record reached
the store, leaving sparse worker indexes that broke subsequent scale-down,
heartbeat resubmits, and index-based lookups.
This PR makes each scale-up addition transactional via a two-phase
prepare/commit/rollback flow. Failed adds are reverted in-memory, the stage
target is shrunk by one, and the loop continues with the failed slot reused —
keeping indexes dense. Partial scales are visible to the caller through a richer
response and a new metric.
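The prepare/commit/rollback loop described above can be sketched roughly as follows. This is a simplified illustration, not the PR's code: `ScaleUpSketch`, `Stage`, `WorkerStore`, and `Result` are hypothetical stand-ins (only the actual / requested / failedAdditions triple mirrors `ScaleStageResult`), and the commit step is a no-op because the prepare step already did the in-memory mutation.

```java
import java.util.TreeMap;

public class ScaleUpSketch {
    /** Hypothetical stand-in for the durable store. */
    interface WorkerStore {
        void storeNewWorker(int index, int number) throws Exception;
    }

    /** Hypothetical result shape: actual / requested / failedAdditions. */
    static final class Result {
        final int actual;
        final int requested;
        final int failedAdditions;

        Result(int actual, int requested, int failedAdditions) {
            this.actual = actual;
            this.requested = requested;
            this.failedAdditions = failedAdditions;
        }
    }

    static final class Stage {
        final TreeMap<Integer, Integer> workerByIndex = new TreeMap<>(); // index -> worker number
        int target;
        int nextWorkerNumber = 1;

        Result scaleTo(int requested, WorkerStore store) {
            target = requested;
            int failed = 0;
            int attempts = requested - workerByIndex.size();
            for (int i = 0; i < attempts; i++) {
                int index = workerByIndex.size();   // next dense slot
                int number = nextWorkerNumber++;
                workerByIndex.put(index, number);   // prepare: in-memory mutation
                try {
                    store.storeNewWorker(index, number); // durable write
                    // commit: nothing further to do in this sketch
                } catch (Exception e) {
                    workerByIndex.remove(index, number); // rollback the in-memory add
                    target--;                            // shrink the stage target by one
                    failed++;                            // the freed slot is reused next iteration
                }
            }
            return new Result(workerByIndex.size(), requested, failed);
        }
    }

    public static void main(String[] args) {
        Stage stage = new Stage();
        stage.scaleTo(1, (i, n) -> { });
        // Fail the store write for worker number 3 only.
        Result r = stage.scaleTo(4, (i, n) -> {
            if (n == 3) throw new Exception("store down");
        });
        System.out.println("actual=" + r.actual + " requested=" + r.requested
                + " failed=" + r.failedAdditions + " indexes=" + stage.workerByIndex.keySet());
        // prints: actual=3 requested=4 failed=1 indexes=[0, 1, 2]
    }
}
```

In this sketch the failed worker number is skipped rather than reused, while the index slot is reused, which is what keeps the indexes dense.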
Changes
Core (`JobActor`, `MantisJobMetadataImpl`, `MantisStageMetadataImpl`)
- `PendingWorkerAddition` wraps the in-memory mutation done by `prepareWorker`. Caller commits after a successful `storeNewWorker` or rolls back on failure.
- `IWorkerManager.scaleStage` now returns `ScaleStageResult` (actual / requested / failedAdditions) instead of a bare int. Response messages distinguish full success from partial: "Partially scaled stage 1 to 3 workers (requested 4; 1 add failures)".
- `MantisJobMetadataImpl.unsafeRemoveWorkerMetadata` and `MantisStageMetadataImpl.removeWorkerIndex` use a check-first-then-CAS pattern (`remove(K, V)`) and fully revert if the second remove fails.
- `MantisStageMetadataImpl.unsafeSetNumWorkers` reverts `this.numWorkers` if `updateStage` throws, keeping in-memory and durable counts coherent.
- `JobActor.onJobInitialize` stops the actor on submit-time init failure (`isSubmit=true` only, to preserve init-retry semantics for store reloads).
Failure-mode escalation
- `PartialScaleStageFailureException` carrying the realized partial count; queued workers are flushed first; caller gets `SERVER_ERROR` with `actualNumWorkers` = realized count. Original add-failure preserved as cause; shrink failure attached as suppressed.
- `JobClusterProto.KillJobRequest` to the parent so the job is failed and archived. Workers stored before the corruption are cleaned up via the job archive lifecycle.
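For the check-first-then-CAS removal listed under Core, a minimal sketch of the pattern over two maps might look like the following. The class, field, and method names here are hypothetical and the worker is reduced to a `String`; the only real API used is `ConcurrentMap.remove(key, value)`, which removes the entry only if the key currently maps to that value.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class TwoMapRemovalSketch {
    // Hypothetical stand-ins for the stage's index map and number map.
    final ConcurrentMap<Integer, String> byIndex = new ConcurrentHashMap<>();
    final ConcurrentMap<Integer, String> byNumber = new ConcurrentHashMap<>();

    boolean removeWorker(int index, int number, String worker) {
        // Check first: both maps must still hold exactly this worker.
        if (!worker.equals(byIndex.get(index)) || !worker.equals(byNumber.get(number))) {
            return false; // nothing mutated
        }
        // Conditional remove: succeeds only if the key still maps to this value.
        if (!byIndex.remove(index, worker)) {
            return false; // still nothing mutated
        }
        if (!byNumber.remove(number, worker)) {
            // The second map changed between the check and the remove,
            // so put the first entry back to keep both maps consistent.
            byIndex.putIfAbsent(index, worker);
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        TwoMapRemovalSketch s = new TwoMapRemovalSketch();
        s.byIndex.put(0, "worker-1");
        s.byNumber.put(1, "worker-1");
        System.out.println(s.removeWorker(0, 1, "worker-2")); // prints false
        System.out.println(s.removeWorker(0, 1, "worker-1")); // prints true
    }
}
```

In this sketch the second `remove` can only fail if another thread modifies `byNumber` after the initial check has passed, which is why the revert branch exists even though a single-threaded run never reaches it.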
Metrics
- `numWorkerStoreFailures` counter increments per rollback-and-shrink.
- `numScaleStage` now also increments on partial-success scales (still skipped on plain failures that durably stored nothing).
Misc
- `mantis.worker.resubmission.interval.secs` default lengthened from `5:10:20` → `5:10:30:60:120:600`. (Tangential; flag for a split if reviewers prefer.)
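
To make the effect of the longer default concrete, here is a small sketch of how a colon-separated interval list could be read. It assumes, without confirmation from this PR, that the Nth consecutive resubmit waits for the Nth entry and that the last entry is reused afterwards; `ResubmitIntervalSketch`, `parse`, and `delayFor` are hypothetical names.

```java
public class ResubmitIntervalSketch {
    /** Parses a colon-separated list of seconds, e.g. "5:10:30:60:120:600". */
    static long[] parse(String spec) {
        String[] parts = spec.split(":");
        long[] secs = new long[parts.length];
        for (int i = 0; i < parts.length; i++) {
            secs[i] = Long.parseLong(parts[i].trim());
        }
        return secs;
    }

    /** Assumed policy: 0-based attempt picks its entry; later attempts reuse the last one. */
    static long delayFor(long[] secs, int attempt) {
        return secs[Math.min(attempt, secs.length - 1)];
    }

    public static void main(String[] args) {
        long[] oldDefault = parse("5:10:20");
        long[] newDefault = parse("5:10:30:60:120:600");
        System.out.println(delayFor(oldDefault, 5)); // prints 20
        System.out.println(delayFor(newDefault, 5)); // prints 600
    }
}
```

Under that assumption, a worker that keeps failing would settle at a 600-second wait instead of 20 seconds.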