Adds retry support to the Amazon.Lambda.DurableExecution #2363

Draft
GarrettBeatty wants to merge 5 commits into GarrettBeatty/stack/2 from GarrettBeatty/stack/3

GarrettBeatty commented May 12, 2026

Stacked PRs:


#2216

What

Adds retry support to the Amazon.Lambda.DurableExecution SDK on top of the foundation in #2360. After this PR a step that throws can be retried with configurable backoff and jitter; durable executions resume after the retry timer elapses without billing Lambda compute during the wait.

Public API introduced:

| Type | Purpose |
| --- | --- |
| IRetryStrategy | Decides whether a failed step should retry, and with what delay. |
| RetryDecision | Output of IRetryStrategy.ShouldRetry: a ShouldRetry flag plus a Delay. |
| RetryStrategy | Static factory: Default, Transient, None, Exponential(...), FromDelegate(...). |
| JitterStrategy | None / Half / Full for exponential backoff. |
| StepSemantics | AtLeastOncePerRetry (default) / AtMostOncePerRetry. |
| StepConfig.RetryStrategy, StepConfig.Semantics | Per-step retry configuration. |

Why

Real workflows fail. A step that calls a flaky downstream service or hits a transient throttle needs to retry without restarting the whole workflow. Durable execution makes service-mediated retries possible: the SDK checkpoints a RETRY operation with a NextAttemptDelaySeconds, suspends the Lambda, and the service re-invokes us when the timer fires. The user's compute isn't billed during the wait.

AtMostOncePerRetry semantics handle non-idempotent steps (e.g. charging a card): a START checkpoint is durably persisted before user code runs, so a Lambda crash mid-execution can be detected on replay and routed through the retry strategy rather than re-executing.

How

Retry control flow. When a step throws, StepOperation.HandleStepFailureAsync consults the configured IRetryStrategy.ShouldRetry(ex, attemptNumber). If the decision says retry, the SDK enqueues a RETRY checkpoint carrying NextAttemptDelaySeconds, then suspends via TerminationManager.SuspendAndAwait so RunAsync returns Pending to the service. On the next invocation, StepOperation.ReplayAsync sees Status == PENDING and either re-suspends (timer not yet elapsed) or re-executes (timer fired) with the carried-forward attempt counter.
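The two decisions in this flow can be sketched roughly as follows. This is a minimal illustration, not the SDK's code: `RetryFlowSketch`, `FailureAction`, `ReplayAction`, and both method names are hypothetical, and rounding the delay up to whole seconds for NextAttemptDelaySeconds is an assumption.

```csharp
using System;

// Hypothetical names; the real logic lives in the internal StepOperation class.
public enum FailureAction { CheckpointRetryAndSuspend, CheckpointFail }
public enum ReplayAction { Suspend, Execute }

public static class RetryFlowSketch
{
    // Failure path: the strategy's decision selects a RETRY or a FAIL checkpoint.
    // Assumption: the delay is rounded up to whole seconds for NextAttemptDelaySeconds.
    public static (FailureAction Action, int NextAttemptDelaySeconds) OnStepFailure(
        bool shouldRetry, TimeSpan delay)
        => shouldRetry
            ? (FailureAction.CheckpointRetryAndSuspend, (int)Math.Ceiling(delay.TotalSeconds))
            : (FailureAction.CheckpointFail, 0);

    // PENDING replay arm: suspend again while the retry timer is still running,
    // otherwise re-execute the step.
    public static ReplayAction OnPendingReplay(DateTimeOffset nextAttemptAt, DateTimeOffset now)
        => now < nextAttemptAt ? ReplayAction.Suspend : ReplayAction.Execute;
}
```

The second method mirrors the two cases named by the PendingWithFutureTimestamp_Suspends and PendingWithPastTimestamp_ReExecutes tests.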

At-most-once semantics. For non-idempotent steps, Semantics = AtMostOncePerRetry writes a START checkpoint and blocks until the batcher flushes it before user code runs. If Lambda crashes between user code and the SUCCEED flush, replay sees STARTED with no terminal record and routes through HandleStepFailureAsync as a failed attempt instead of re-executing — the side effect runs at most once per attempt.
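A rough sketch of how the replay branch might depend on the checkpoint status and the configured semantics. Only the STARTED plus AtMostOncePerRetry case is described above; the other arms (returning a cached result, rethrowing a cached error, re-executing under AtLeastOncePerRetry) are assumptions, and `CheckpointStatus`, `StartedReplayAction`, and `AtMostOnceSketch` are hypothetical names.

```csharp
// Mirrors the PR's StepSemantics members; declared here so the sketch is self-contained.
public enum StepSemantics { AtLeastOncePerRetry, AtMostOncePerRetry }

// Hypothetical simplification of the checkpoint statuses seen on replay.
public enum CheckpointStatus { None, Started, Succeeded, Failed }

public enum StartedReplayAction { Execute, TreatAsFailedAttempt, ReturnCachedResult, RethrowCachedError }

public static class AtMostOnceSketch
{
    public static StartedReplayAction Decide(CheckpointStatus status, StepSemantics semantics)
        => status switch
        {
            CheckpointStatus.Succeeded => StartedReplayAction.ReturnCachedResult,
            CheckpointStatus.Failed => StartedReplayAction.RethrowCachedError,
            // STARTED with no terminal record: the previous invocation died mid-step.
            // Under at-most-once the side effect may already have happened, so the
            // attempt is counted as failed and handed to the retry strategy.
            CheckpointStatus.Started when semantics == StepSemantics.AtMostOncePerRetry
                => StartedReplayAction.TreatAsFailedAttempt,
            // Fresh step, or STARTED under at-least-once (assumed): run the user code.
            _ => StartedReplayAction.Execute,
        };
}
```

The trade-off is that an at-most-once step can be reported as failed even though its side effect completed, which is why the decision is left to the retry strategy.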

Retry strategy contract. IRetryStrategy.ShouldRetry(Exception, int attemptNumber) returns a RetryDecision. ExponentialRetryStrategy supports configurable max attempts, initial/max delay, backoff rate, jitter (None/Half/Full), and exception filtering by type or message regex. Built-in factories: RetryStrategy.Default (6 attempts, 5s initial / 60s max delay, 2× backoff, full jitter), Transient (3 attempts, 1s initial / 5s max delay, half jitter), and None. RetryStrategy.FromDelegate(...) covers arbitrary policies.
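The delay calculation behind an exponential strategy could look roughly like the sketch below. The PR description does not give the exact formula, so this assumes the common scheme: the delay grows as initial × rate^(attempt − 1), is capped at the max delay, and jitter then picks a value in [0, delay) for Full or [delay/2, delay) for Half. `BackoffSketch` and `ComputeDelay` are hypothetical names.

```csharp
using System;

// Same member names as the PR's JitterStrategy; declared here so the sketch compiles alone.
public enum JitterStrategy { None, Half, Full }

public static class BackoffSketch
{
    // Assumed formula: capped exponential growth, with jitter applied to the capped value.
    // attemptNumber is 1-based, so attempt 1 waits about initialDelay.
    public static TimeSpan ComputeDelay(
        int attemptNumber,
        TimeSpan initialDelay,
        TimeSpan maxDelay,
        double backoffRate,
        JitterStrategy jitter,
        Random random)
    {
        if (attemptNumber < 1) throw new ArgumentOutOfRangeException(nameof(attemptNumber));

        double uncapped = initialDelay.TotalSeconds * Math.Pow(backoffRate, attemptNumber - 1);
        double capped = Math.Min(uncapped, maxDelay.TotalSeconds);

        double seconds = jitter switch
        {
            JitterStrategy.Full => random.NextDouble() * capped,                  // [0, capped)
            JitterStrategy.Half => capped / 2 + random.NextDouble() * capped / 2, // [capped/2, capped)
            _ => capped,                                                          // no jitter
        };
        return TimeSpan.FromSeconds(seconds);
    }
}
```

Under these assumptions, a 5s initial delay, 60s cap, and 2× rate with jitter disabled yield 5s, 10s, 20s, 40s, and 60s for attempts 1 through 5.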

Key files:

  • Config/IRetryStrategy.cs — strategy interface + RetryDecision value type
  • Config/RetryStrategy.cs — built-in strategies, ExponentialRetryStrategy, JitterStrategy, StepSemantics, DelegateRetryStrategy
  • Config/StepConfig.cs — adds RetryStrategy and Semantics properties
  • Internal/StepOperation.cs — adds PENDING (retry timer) and STARTED (AtMostOnce crash recovery) replay arms; HandleStepFailureAsync decision tree
  • Internal/TerminationManager.cs — adds RetryScheduled reason

Testing

21 new unit tests in Amazon.Lambda.DurableExecution.Tests (130 total, up from 109 in #2360):

  • RetryStrategyTests (14 tests) — exponential backoff math, jitter strategies, max-attempt exhaustion, exception-type and message-pattern filtering, delegate strategies
  • DurableContextTests retry block (7 tests) — FailsWithRetryStrategy_CheckpointsRetryAndSuspends, FailsNoRetryStrategy_CheckpointsFail, RetryExhausted_CheckpointsFail, PendingWithFutureTimestamp_Suspends, PendingWithPastTimestamp_ReExecutes, AtMostOnce_FlushesStartBeforeExecution, AtMostOnce_StartedReplay_TriggersRetryHandler

Integration tests (Amazon.Lambda.DurableExecution.IntegrationTests) — RetrySucceeds and RetryExhausts end-to-end against the real durable-execution service.

Out of scope (follow-up PRs)

  • MapAsync / ParallelAsync / RunInChildContextAsync / WaitForConditionAsync
  • CallbackAsync, InvokeAsync
  • DefaultJsonCheckpointSerializer
  • DurableLogger replay-suppression (currently NullLogger)
  • Annotations source-generator integration / [DurableExecution] attribute
  • DurableTestRunner / Amazon.Lambda.DurableExecution.Testing package
  • dotnet new lambda.DurableFunction blueprint

stack-info: PR: #2361, branch: GarrettBeatty/stack/3
GarrettBeatty added a commit that referenced this pull request May 12, 2026
stack-info: PR: #2363, branch: GarrettBeatty/stack/3
GarrettBeatty force-pushed the GarrettBeatty/stack/3 branch from 711bf82 to 4f05fa9 May 12, 2026 16:20
GarrettBeatty changed the base branch from GarrettBeatty/stack/2 to feature/durablefunction May 12, 2026 16:31
GarrettBeatty force-pushed the GarrettBeatty/stack/3 branch from 4f05fa9 to 54d18f9 May 12, 2026 16:31
GarrettBeatty changed the base branch from feature/durablefunction to GarrettBeatty/stack/2 May 12, 2026 16:31
GarrettBeatty changed the base branch from GarrettBeatty/stack/2 to feature/durablefunction May 12, 2026 18:16
GarrettBeatty force-pushed the GarrettBeatty/stack/3 branch from 54d18f9 to 599445f May 12, 2026 18:16
GarrettBeatty changed the base branch from feature/durablefunction to GarrettBeatty/stack/2 May 12, 2026 18:16
GarrettBeatty changed the base branch from GarrettBeatty/stack/2 to feature/durablefunction May 12, 2026 21:30
GarrettBeatty force-pushed the GarrettBeatty/stack/3 branch from 599445f to e7a85e4 May 12, 2026 21:30
GarrettBeatty changed the base branch from feature/durablefunction to GarrettBeatty/stack/2 May 12, 2026 21:30
GarrettBeatty changed the base branch from GarrettBeatty/stack/2 to feature/durablefunction May 12, 2026 21:34
GarrettBeatty force-pushed the GarrettBeatty/stack/3 branch from e7a85e4 to 8f23ebb May 12, 2026 21:34
GarrettBeatty changed the base branch from feature/durablefunction to GarrettBeatty/stack/2 May 12, 2026 21:34
GarrettBeatty changed the base branch from GarrettBeatty/stack/2 to feature/durablefunction May 13, 2026 16:04
GarrettBeatty force-pushed the GarrettBeatty/stack/3 branch from 8f23ebb to e39e68e May 13, 2026 16:04
GarrettBeatty changed the base branch from feature/durablefunction to GarrettBeatty/stack/2 May 13, 2026 16:04
GarrettBeatty changed the base branch from GarrettBeatty/stack/2 to feature/durablefunction May 13, 2026 16:21
GarrettBeatty force-pushed the GarrettBeatty/stack/3 branch from e39e68e to 52055d3 May 13, 2026 16:21
GarrettBeatty changed the base branch from feature/durablefunction to GarrettBeatty/stack/2 May 13, 2026 16:21
GarrettBeatty force-pushed the GarrettBeatty/stack/3 branch from 81b9144 to 531cbbe May 13, 2026 21:24
GarrettBeatty changed the base branch from feature/durablefunction to GarrettBeatty/stack/2 May 13, 2026 21:24
GarrettBeatty changed the base branch from GarrettBeatty/stack/2 to feature/durablefunction May 13, 2026 21:49
GarrettBeatty force-pushed the GarrettBeatty/stack/3 branch from 531cbbe to 31ea7e8 May 13, 2026 21:49
GarrettBeatty changed the base branch from feature/durablefunction to GarrettBeatty/stack/2 May 13, 2026 21:49
GarrettBeatty changed the base branch from GarrettBeatty/stack/2 to feature/durablefunction May 13, 2026 22:20
GarrettBeatty force-pushed the GarrettBeatty/stack/3 branch from 31ea7e8 to ef44439 May 13, 2026 22:20
GarrettBeatty changed the base branch from feature/durablefunction to GarrettBeatty/stack/2 May 13, 2026 22:20
GarrettBeatty changed the base branch from GarrettBeatty/stack/2 to feature/durablefunction May 13, 2026 22:31
GarrettBeatty force-pushed the GarrettBeatty/stack/3 branch from ef44439 to 6bc97f2 May 13, 2026 22:31
GarrettBeatty changed the base branch from feature/durablefunction to GarrettBeatty/stack/2 May 13, 2026 22:31
GarrettBeatty changed the base branch from GarrettBeatty/stack/2 to feature/durablefunction May 13, 2026 22:35
GarrettBeatty force-pushed the GarrettBeatty/stack/3 branch from 6bc97f2 to 85eae3e May 13, 2026 22:35
GarrettBeatty changed the base branch from feature/durablefunction to GarrettBeatty/stack/2 May 13, 2026 22:35
GarrettBeatty changed the base branch from GarrettBeatty/stack/2 to feature/durablefunction May 14, 2026 01:24
GarrettBeatty force-pushed the GarrettBeatty/stack/3 branch from 85eae3e to 0a32c0d May 14, 2026 01:24
GarrettBeatty changed the base branch from feature/durablefunction to GarrettBeatty/stack/2 May 14, 2026 01:25
Copilot AI left a comment

Pull request overview

Builds on PR #2360 to add retry support to the Amazon.Lambda.DurableExecution SDK. Failed steps can now be retried with configurable backoff and jitter via service-mediated retries (the SDK checkpoints a RETRY operation and suspends the Lambda so the user is not billed during backoff). Adds at-most-once semantics for non-idempotent steps via a synchronously-flushed START checkpoint that allows crash detection on replay.

Changes:

  • New public retry API: IRetryStrategy, RetryDecision, RetryStrategy factories (Default/Transient/None/Exponential/FromDelegate), JitterStrategy, StepSemantics, and StepConfig.RetryStrategy/StepConfig.Semantics.
  • StepOperation adds PENDING (retry-timer) and STARTED (AtMostOnce crash-recovery) replay arms, a HandleStepFailureAsync decision tree, and START-checkpoint emission (sync for AtMostOnce, fire-and-forget for AtLeastOnce).
  • 21 new unit tests plus integration-test updates asserting StepStarted events and richer history logging.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| Config/IRetryStrategy.cs | New strategy interface + RetryDecision struct |
| Config/RetryStrategy.cs | ExponentialRetryStrategy, DelegateRetryStrategy, JitterStrategy, StepSemantics, factories |
| Config/StepConfig.cs | Adds RetryStrategy and Semantics properties |
| Internal/StepOperation.cs | PENDING/STARTED replay arms, retry decision tree, START-checkpoint emission |
| Internal/TerminationManager.cs | Adds RetryScheduled termination reason |
| Internal/CheckpointBatcher.cs | Doc-only update describing fire-and-forget semantics |
| Tests/RetryStrategyTests.cs | 14 unit tests for exponential math/jitter/filters/delegate |
| Tests/DurableContextTests.cs | 6 retry/AtMostOnce/Pending replay tests |
| Tests/DurableFunctionTests.cs | Updated to assert START + SUCCEED + WAIT-START flat sequence |
| IntegrationTests/*.cs | Add StepStarted-event assertions; richer history dump in DurableFunctionDeployment |

Implements the minimum viable slice of the Amazon.Lambda.DurableExecution
SDK: a workflow can run StepAsync and WaitAsync against a real Lambda,
with replay-aware checkpointing wired through to the AWS service.

Public API surface introduced:
- DurableFunction.WrapAsync — entry point that handles the durable
  execution envelope (input hydration, output construction, status mapping)
- IDurableContext.StepAsync / WaitAsync (4 Step overloads, 1 Wait)
- StepConfig with serializer hook (retry deferred to follow-up PR)
- ICheckpointSerializer interface
- [DurableExecution] attribute (recognized by future source generator)
- DurableExecutionException base + StepException

Internals:
- DurableExecutionHandler — Task.WhenAny race between user code and
  the suspension signal, returning Succeeded/Failed/Pending
- ExecutionState — replay-aware operation lookup and pending checkpoint
  buffer
- OperationIdGenerator — deterministic, replay-stable IDs
- TerminationManager — TaskCompletionSource-based suspension trigger
- LambdaDurableServiceClient — wraps AWSSDK.Lambda's checkpoint and
  state APIs
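
The handler race described above can be sketched as follows. This is an illustration under assumptions, not the SDK's DurableExecutionHandler: `HandlerRaceSketch`, `RunStatus`, and the method signature are hypothetical, and the suspension signal is modeled as a plain Task that completes when the workflow should be parked.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical result type mirroring the Succeeded/Failed/Pending outcomes.
public enum RunStatus { Succeeded, Failed, Pending }

public static class HandlerRaceSketch
{
    // Races the user's workflow against a suspension signal.
    public static async Task<RunStatus> RunAsync(Func<Task> userCode, Task suspensionSignal)
    {
        Task userTask = userCode();
        Task winner = await Task.WhenAny(userTask, suspensionSignal);

        // The signal won: the workflow is parked on a wait or retry timer, and the
        // still-incomplete user task is abandoned until the next invocation replays it.
        if (winner == suspensionSignal) return RunStatus.Pending;

        return userTask.IsCompletedSuccessfully ? RunStatus.Succeeded : RunStatus.Failed;
    }
}
```

In this model a suspending operation would complete the signal (for example through a TaskCompletionSource) and then await a task that never finishes, so the user code cannot run past the suspension point.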

Tests:
- 86 unit tests covering enums, exceptions, models, configs,
  ID generation, termination, execution state, the handler race,
  the context (Step + Wait paths), and the WrapAsync entry point
- 8 end-to-end integration tests deploying real Lambdas via Docker on
  the provided.al2023 runtime: StepWaitStep, MultipleSteps, WaitOnly,
  LongerWait, ReplayDeterminism, RetrySucceeds, RetryExhausts, StepFails

Out of scope (follow-up PRs):
- IRetryStrategy, ExponentialRetryStrategy, retry decision factories
- DefaultJsonCheckpointSerializer
- DurableLogger replay-suppression (currently returns NullLogger)
- Callbacks, InvokeAsync, ParallelAsync, MapAsync, RunInChildContextAsync,
  WaitForConditionAsync — interface intentionally does not declare them
- Annotations source-generator integration
- DurableTestRunner / Amazon.Lambda.DurableExecution.Testing package
- dotnet new lambda.DurableFunction blueprint

stack-info: PR: #2360, branch: GarrettBeatty/stack/2

remove

update

update

update

update
  var history = await deployment.WaitForHistoryAsync(
      arn!,
-     h => (h.Events?.Count(e => e.StepSucceededDetails != null) ?? 0) >= 2
+     h => (h.Events?.Count(e => e.EventType == EventType.StepStarted) ?? 0) >= 2
GarrettBeatty (Contributor, Author) commented:

Now that we are emitting START steps (which are needed for retries), we assert them in the integration tests.


COPY bin/publish/ ${LAMBDA_TASK_ROOT}

ENTRYPOINT ["/var/task/bootstrap"]
- /// Replay semantics — example: <c>await ctx.StepAsync(ChargeCard, "charge")</c>
+ /// Replay branches — example: <c>await ctx.StepAsync(ChargeCard, "charge")</c>
  /// <list type="bullet">
  /// <item>Fresh: no prior state → run func → emit SUCCEED → return result.</item>
GarrettBeatty (Contributor, Author) commented:

In the previous PR only SUCCEEDED or FAILED mattered. For replays we now also need to track how many times the function was executed, which we do by counting STARTED steps.

GarrettBeatty and others added 2 commits May 14, 2026 13:41
Match the Python / Java / JavaScript reference SDKs' replay-mode model:
the workflow is "replaying" iff it has not yet revisited every
checkpointed completed user-replayable operation. A single global flag
flipped on the first fresh op (the prior model) misclassified workflow-
body code that runs before the first step and would not generalize to
Map/Parallel/Callback later.

ExecutionState changes:
- Replace `Mode`/`ExecutionMode`/`EnterExecutionMode()` with `IsReplaying`
  + `TrackReplay(operationId)`.
- Initial replay decision: any non-EXECUTION op present means we're
  replaying. The service always sends an EXECUTION-type op carrying the
  input payload — that's bookkeeping, not user history, so it does not
  count toward replay (matches Python execution.py:258, Java
  ExecutionManager:81, JS execution-context.ts:62).
- TrackReplay flips IsReplaying false once every checkpointed terminal-
  status non-EXECUTION op has been visited. Terminal set matches
  Python's: SUCCEEDED, FAILED, CANCELLED, STOPPED.

Operation changes:
- DurableOperation.ExecuteAsync calls TrackReplay(OperationId) at the
  top, so every operation participates in visit accounting without each
  subclass needing to remember.
- StepOperation/WaitOperation drop their manual EnterExecutionMode calls.

Tests:
- ExecutionStateTests rewritten around IsReplaying/TrackReplay, including
  pinning regressions: only-EXECUTION-op ⇒ NotReplaying, all-visited ⇒
  flips out of replay, PENDING ops do not block transition, idempotency.
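
The visit-accounting model above can be sketched like this. It is a simplified illustration, not the SDK's ExecutionState: `ReplayTrackerSketch` and `CheckpointedOp` are hypothetical, and operation types and statuses are modeled as plain strings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified view of a checkpointed operation.
public sealed record CheckpointedOp(string Id, string Type, string Status);

public sealed class ReplayTrackerSketch
{
    // Terminal set listed in the commit message.
    private static readonly HashSet<string> TerminalStatuses =
        new() { "SUCCEEDED", "FAILED", "CANCELLED", "STOPPED" };

    private readonly HashSet<string> _unvisitedTerminalOps;

    public bool IsReplaying { get; private set; }

    public ReplayTrackerSketch(IEnumerable<CheckpointedOp> checkpointed)
    {
        // The EXECUTION op only carries the input payload, so it is not user history.
        var userOps = checkpointed.Where(op => op.Type != "EXECUTION").ToList();
        IsReplaying = userOps.Count > 0;
        _unvisitedTerminalOps = userOps
            .Where(op => TerminalStatuses.Contains(op.Status))
            .Select(op => op.Id)
            .ToHashSet();
    }

    // Called at the top of every operation; repeated calls with the same id are harmless.
    public void TrackReplay(string operationId)
    {
        if (!IsReplaying) return;
        _unvisitedTerminalOps.Remove(operationId);
        if (_unvisitedTerminalOps.Count == 0) IsReplaying = false;
    }
}
```

Because only terminal-status operations are tracked, a PENDING operation never keeps the tracker in replay mode, which matches the pinned regression described in the Tests section.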

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stack-info: PR: #2363, branch: GarrettBeatty/stack/3
Labels

Release Not Needed Add this label if a PR does not need to be released.

3 participants