Skip to content

Make scale-up worker additions transactional#853

Open
Andyz26 wants to merge 5 commits into
masterfrom
andyz/workerAddRollbackSupport
Open

Make scale-up worker additions transactional#853
Andyz26 wants to merge 5 commits into
masterfrom
andyz/workerAddRollbackSupport

Conversation

@Andyz26
Copy link
Copy Markdown
Collaborator

@Andyz26 Andyz26 commented May 5, 2026

Make scale-up worker additions transactional

Summary

JobActor.scaleStage previously allowed in-memory job metadata to drift from
durable storage on MantisJobStore.storeNewWorker failures: the new JobWorker
was added to the stage's index/number maps but no record reached the store,
leaving sparse worker indexes that broke subsequent scale-down, heartbeat
resubmits, and index-based lookups.

This PR makes each scale-up addition transactional via a two-phase
prepare/commit/rollback flow. Failed adds are reverted in-memory, the stage
target is shrunk by one, and the loop continues with the failed slot reused —
keeping indexes dense. Partial scales are visible to the caller through a richer
response and a new metric.

Changes

Core (JobActor, MantisJobMetadataImpl, MantisStageMetadataImpl)

  • New PendingWorkerAddition wraps the in-memory mutation done by
    prepareWorker. Caller commits after a successful storeNewWorker or rolls
    back on failure.
  • IWorkerManager.scaleStage now returns ScaleStageResult (actual / requested
    / failedAdditions) instead of a bare int. Response messages distinguish full
    success from partial: "Partially scaled stage 1 to 3 workers (requested 4; 1 add failures)".
  • New rollback primitives MantisJobMetadataImpl.unsafeRemoveWorkerMetadata and
    MantisStageMetadataImpl.removeWorkerIndex use a check-first-then-CAS pattern
    (remove(K, V)) and fully revert if the second remove fails.
  • MantisStageMetadataImpl.unsafeSetNumWorkers reverts this.numWorkers if
    updateStage throws, keeping in-memory and durable counts coherent.
  • JobActor.onJobInitialize stops the actor on submit-time init failure
    (isSubmit=true only, to preserve init-retry semantics for store reloads).

Failure-mode escalation

  • Shrink-failure after partial success: throw
    PartialScaleStageFailureException carrying the realized partial count; queued
    workers are flushed first; caller gets SERVER_ERROR with actualNumWorkers =
    realized count. Original add-failure preserved as cause; shrink failure attached
    as suppressed.
  • Rollback corruption (stage maps inconsistent): fatal — sends
    JobClusterProto.KillJobRequest to the parent so the job is failed and
    archived. Workers stored before the corruption are cleaned up via the job
    archive lifecycle.

Metrics

  • New numWorkerStoreFailures counter increments per rollback-and-shrink.
  • numScaleStage now also increments on partial-success scales (still skipped
    on plain failures that durably stored nothing).

Misc

  • mantis.worker.resubmission.interval.secs default lengthened from 5:10:20
    5:10:30:60:120:600. (Tangential — flag for split if reviewers prefer.)

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 5, 2026

Test Results

164 files  + 4  164 suites  +4   10m 7s ⏱️ - 2m 9s
795 tests +97  784 ✅ +96  11 💤 +4  0 ❌  - 3 
795 runs  +95  784 ✅ +94  11 💤 +4  0 ❌  - 3 

Results for commit b4202af. ± Comparison against base commit 5c81ae9.

♻️ This comment has been updated with latest results.

@Andyz26 Andyz26 force-pushed the andyz/workerAddRollbackSupport branch from 598f8f2 to 020559b Compare May 5, 2026 23:22
- testJobSubmitInitalizationFails: actor now stops on submit-time
  init failure, so assert termination instead of expecting a follow-up
  GetJobDetailsResponse.

- testWorkerAcceptedToStartedMsRecordedOnWorkerStatus: SpectatorRegistryFactory.setRegistry
  is a one-shot CAS, so when JobScaleUpDownTests installs a registry first
  this test was reading a different DefaultRegistry instance than the one
  actually wired up. Read the installed registry via getRegistry() instead.
store.updateStage(this);
} catch (Exception e) {
this.numWorkers = previousNumWorkers;
throw e;
Copy link
Copy Markdown
Collaborator

@hellolittlej hellolittlej May 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we log the exception and track the error here?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the log is at upper layer/caller.

describeRollbackWorker(workerByNumber)));
}

if (!workerByIndexMetadataSet.remove(workerIndex, workerByIndex)) {
Copy link
Copy Markdown
Collaborator

@hellolittlej hellolittlej May 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in which case we will fail this check and below one would fail when the first check passed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants