fix(logs): add durable execution diagnostics foundation #3564

Open
PlaneInABottle wants to merge 8 commits into simstudioai:staging from PlaneInABottle:upstream/execution-diagnostics-foundation
Conversation

@PlaneInABottle

Summary

  • persist durable execution diagnostics in workflow_execution_logs, including lastStartedBlock, lastCompletedBlock, trace metadata, and finalizationPath
  • centralize terminal execution finalization so completed, failed, cancelled, and paused runs keep consistent diagnostics without letting callback failures change execution outcomes
  • add focused regression coverage for diagnostics derivation, logging-session durability, and executor finalization ordering
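
As a rough illustration of the persisted fields listed above, the sketch below models them in TypeScript. The top-level field names come from this PR; the nested `blockId`/`at` properties, the values in `FINALIZATION_PATHS`, and the helper name `isFinalizationPath` are assumptions made for illustration and may not match the actual types in the codebase.

```typescript
// Assumed path values, based on the four terminal states named in the summary.
const FINALIZATION_PATHS = ["completed", "failed", "cancelled", "paused"] as const;
type FinalizationPath = (typeof FINALIZATION_PATHS)[number];

// Hypothetical marker shape: which block, and when it started or completed.
interface BlockMarker {
  blockId: string;
  at: string; // ISO-8601 timestamp
}

// Hypothetical view of the diagnostics stored alongside an execution log row.
interface ExecutionDiagnostics {
  lastStartedBlock?: BlockMarker;
  lastCompletedBlock?: BlockMarker;
  hasTraceSpans: boolean;
  traceSpanCount: number;
  finalizationPath?: FinalizationPath;
}

// Runtime guard for untyped values read back from a JSON column.
function isFinalizationPath(value: unknown): value is FinalizationPath {
  return (
    typeof value === "string" &&
    (FINALIZATION_PATHS as readonly string[]).indexOf(value) !== -1
  );
}

// Example: a run that started block "b2" but never completed it.
const stuckRun: ExecutionDiagnostics = {
  lastStartedBlock: { blockId: "b2", at: "2026-03-13T11:00:01.000Z" },
  lastCompletedBlock: { blockId: "b1", at: "2026-03-13T11:00:00.500Z" },
  hasTraceSpans: false,
  traceSpanCount: 0,
};
```

A row shaped like `stuckRun`, with a started marker ahead of the completed marker and no `finalizationPath`, is the kind of evidence that would explain where a hung run stopped.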

Why

Later jobs and log read-surface fixes depend on a trustworthy execution diagnostics foundation. This PR stores the minimum durable data needed to explain where a run got to and how it ended without pulling in broader API or jobs-surface changes.

Scope

  • included: logging-session persistence, execution log shape updates, executor finalization ordering, and focused tests
  • excluded: jobs route changes, logs API read-surface changes, paused status normalization across APIs, cleanup routes, and webhook/async handoff UI work

Validation

  • bun --cwd apps/sim vitest run lib/logs/execution/diagnostics.test.ts lib/logs/execution/logger.test.ts lib/logs/execution/logging-session.test.ts lib/workflows/executor/execution-core.test.ts
  • bunx @biomejs/biome check apps/sim/lib/workflows/executor/execution-core.ts apps/sim/lib/workflows/executor/execution-core.test.ts apps/sim/lib/logs/execution/logging-session.ts apps/sim/lib/logs/execution/logging-session.test.ts apps/sim/lib/logs/execution/logger.ts apps/sim/executor/orchestrators/loop.ts apps/sim/executor/orchestrators/parallel.ts apps/sim/executor/orchestrators/node.ts apps/sim/executor/utils/subflow-utils.ts apps/sim/executor/execution/block-executor.ts
  • verified latest local execution rows directly in Postgres include finalizationPath, lastStartedBlock, and lastCompletedBlock

Follow-ups

  • centralize execution status contract for read surfaces
  • normalize paused execution status across read APIs
  • reconcile jobs async status with execution truth and expose handoff state

@cursor

cursor bot commented Mar 13, 2026

PR Summary

Medium Risk
Touches execution lifecycle callbacks and logging persistence paths; bugs here could change block-event timing or leave runs with incomplete/incorrect diagnostics, though failures are mostly swallowed to preserve execution outcomes.

Overview
Adds durable execution diagnostics to workflow_execution_logs. Execution data now records lastStartedBlock, lastCompletedBlock, hasTraceSpans/traceSpanCount, and a finalizationPath/completionFailure to explain how a run ended even when full trace persistence fails.

Hardens executor lifecycle + finalization sequencing. Block/subflow lifecycle hooks (onBlockStart/onBlockComplete) are now awaited in the executor/orchestrators where correctness matters, wrapped to persist markers via LoggingSession, and user callbacks are fire-and-forget with errors swallowed so callback failures don’t change execution outcomes; empty loop/parallel paths also await emitted lifecycle events.
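
A minimal sketch of the wrapping pattern described above, under stated assumptions: the helper name `wrapOnBlockStart` and the hook signature are hypothetical, and swallowing persistence errors is an assumption based on the note that failures are mostly swallowed. It only illustrates "await the persistence, then fire the user callback without awaiting it".

```typescript
type BlockStartHook = (blockId: string) => Promise<void> | void;

// Hypothetical helper: awaits the persistence hook, then fires the user
// callback without awaiting it, logging (not rethrowing) any failure.
function wrapOnBlockStart(persist: BlockStartHook, userCallback?: BlockStartHook) {
  return async (blockId: string): Promise<void> => {
    try {
      // The marker write is awaited so it is durable before the block runs.
      await persist(blockId);
    } catch (error) {
      console.warn("failed to persist block-start marker", error);
    }
    if (userCallback) {
      // Promise.resolve().then(...) also captures synchronous throws.
      void Promise.resolve()
        .then(() => userCallback(blockId))
        .catch((error) => console.warn("onBlockStart callback failed", error));
    }
  };
}
```

With this shape, a user callback that throws or rejects produces only a warning; the wrapped hook still resolves and the executor continues.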

Tests updated/added for ordering and failure modes. Adds coverage for empty-parallel lifecycle awaiting, logging-session progress write draining/monotonic marker updates, and ensures success results stay successful even if post-run finalization throws.

Written by Cursor Bugbot for commit c3f5d77.

@vercel

vercel bot commented Mar 13, 2026

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped Mar 13, 2026 1:15pm

@greptile-apps
Contributor

greptile-apps bot commented Mar 13, 2026

Greptile Summary

This PR establishes a durable execution diagnostics foundation by persisting lastStartedBlock, lastCompletedBlock, finalizationPath, and completionFailure into workflow_execution_logs, and by converting the previously fire-and-forget post-execution logging in execution-core.ts into an awaited, centralized finalization step.

Key changes:

  • logging-session.ts: Adds onBlockStart / onBlockComplete lifecycle hooks that issue monotonic JSONB jsonb_set writes per block, tracked in a pendingProgressWrites set that is drained before any terminal completion call.
  • execution-core.ts: Replaces fire-and-forget void (async () => {})() finalization with await finalizeExecutionOutcome(...) / await finalizeExecutionError(...), ensuring DB writes complete before the function returns; introduces wasExecutionFinalizedByCore to prevent double-logging by outer callers.
  • All orchestrators (block-executor, loop, parallel, node) and subflow-utils are updated to make their onBlockStart / onBlockComplete callbacks async and wrapped in try/catch so callback failures never break execution.
  • diagnostics.ts (new): Provides buildExecutionDiagnostics for the upcoming read-surface — currently covered by tests but not yet wired to any route.
  • types.ts: Adds ExecutionFinalizationPath, ExecutionLastStartedBlock, and ExecutionLastCompletedBlock types with a runtime type guard.
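
The pending-write tracking in the first bullet can be sketched roughly as follows. This is not the PR's code: the class name `ProgressWriteTracker` and its methods are hypothetical, and the loop in `drain` is one possible way to also wait for writes registered while a drain is already in progress.

```typescript
// Hypothetical tracker for in-flight progress writes.
class ProgressWriteTracker {
  private pending = new Set<Promise<void>>();

  // Register a write; failures are logged so they never reject the drain.
  track(write: Promise<void>): Promise<void> {
    const tracked: Promise<void> = write
      .catch((error) => console.warn("progress write failed", error))
      .then(() => {
        this.pending.delete(tracked);
      });
    this.pending.add(tracked);
    return tracked;
  }

  // Wait until no writes remain, re-checking after each batch so that
  // writes registered during the wait are not missed.
  async drain(): Promise<void> {
    while (this.pending.size > 0) {
      await Promise.all(Array.from(this.pending));
    }
  }

  get size(): number {
    return this.pending.size;
  }
}
```

A terminal completion call would await `drain()` first, so the final write cannot race with a late block marker.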

Notable behavior change: The response from executeWorkflowCore is now delayed until DB finalization writes complete, trading a small latency increase for guaranteed diagnostic durability.
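
To make the awaited-finalization tradeoff concrete, here is a hedged sketch. The names `runCore`, `safeFinalize`, `markFinalizedByCore`, and `wasFinalizedByCore` are hypothetical stand-ins; the real signatures of `finalizeExecutionOutcome`, `finalizeExecutionError`, and `wasExecutionFinalizedByCore` are not shown in this thread.

```typescript
type Finalize = (path: "completed" | "failed") => Promise<void>;

// Hypothetical registry of errors whose runs were already finalized.
const finalizedErrors = new WeakSet<object>();

function markFinalizedByCore(error: unknown): void {
  if (typeof error === "object" && error !== null) finalizedErrors.add(error);
}

function wasFinalizedByCore(error: unknown): boolean {
  return typeof error === "object" && error !== null && finalizedErrors.has(error);
}

// Finalization is awaited, but its own failure must not change the outcome.
async function safeFinalize(finalize: Finalize, path: "completed" | "failed"): Promise<void> {
  try {
    await finalize(path);
  } catch (error) {
    console.warn("finalization failed", error);
  }
}

async function runCore(execute: () => Promise<string>, finalize: Finalize): Promise<string> {
  let result: string;
  try {
    result = await execute();
  } catch (error) {
    await safeFinalize(finalize, "failed");
    markFinalizedByCore(error); // lets outer callers skip a second finalization
    throw error;
  }
  await safeFinalize(finalize, "completed"); // awaited before returning
  return result;
}
```

An outer caller that catches the rethrown error can check `wasFinalizedByCore(error)` and skip its own logging, which is the double-logging guard the review describes.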

Confidence Score: 4/5

  • Safe to merge — no critical logic bugs found; changes are well-tested and isolated to the logging/diagnostics layer.
  • The refactoring is thorough, backed by focused regression tests covering ordering invariants, retry semantics, and callback isolation. The main behavioral change (awaiting DB writes before returning) is intentional and correct. The noted issues are minor: countTraceSpans duplication, a snapshot-based drain that could theoretically miss late writes (practically safe given the execution model), a changed import style in logger.test.ts, and fire-and-forget cost flushes that are not tracked in pendingProgressWrites (safe because the terminal write uses in-memory cost). No runtime errors or data-loss scenarios were identified.
  • Pay close attention to apps/sim/lib/logs/execution/logging-session.ts (drain semantics and cost flush fire-and-forget) and apps/sim/lib/workflows/executor/execution-core.ts (new awaited finalization path).

Important Files Changed

Filename Overview
  • apps/sim/lib/workflows/executor/execution-core.ts: Major refactor: replaces fire-and-forget post-execution logging with awaited finalizeExecutionOutcome/finalizeExecutionError helpers; adds wrappedOnBlockStart that awaits persistence before firing user callbacks as void; exports wasExecutionFinalizedByCore for double-finalization prevention. Behavior change: response latency increases slightly as DB writes are now awaited before returning, which is the correct tradeoff for reliability.
  • apps/sim/lib/logs/execution/logging-session.ts: Substantial additions: onBlockStart/pendingProgressWrites tracking, monotonic JSONB update queries for lastStartedBlock/lastCompletedBlock, drainPendingProgressWrites before terminal finalization, centralized completeExecutionWithFinalization, and completionPromise clearing on failure to allow error-path retry. Minor concern: snapshot-based drain could theoretically miss late-registered writes; flushAccumulatedCost is now fire-and-forget without being tracked in pendingProgressWrites.
  • apps/sim/lib/logs/execution/logger.ts: Added buildCompletedExecutionData helper that merges lastStartedBlock, lastCompletedBlock, finalizationPath, completionFailure, and trace metadata from existing DB data and new params. Introduces a duplicate countTraceSpans (also in diagnostics.ts). The completeWorkflowExecution signature is expanded with finalizationPath and completionFailure params.
  • apps/sim/lib/logs/execution/diagnostics.ts: New utility for deriving execution diagnostics from existing executionData (DB read path). Handles untyped data safely with runtime checks; validates finalizationPath using the new type guard. Currently unused in the main codebase (only in tests); serves as foundation for upcoming read-surface changes.
  • apps/sim/lib/logs/types.ts: Adds ExecutionFinalizationPath const enum with type guard, ExecutionLastStartedBlock/ExecutionLastCompletedBlock interfaces, and extends WorkflowExecutionLog['executionData'] with the new diagnostic fields. Clean additions with no breaking changes to existing callers.
  • apps/sim/executor/execution/block-executor.ts: Made callOnBlockStart and callOnBlockComplete async with try/catch wrappers so callback failures are logged but never bubble up to break block execution. Straightforward and safe change.
  • apps/sim/executor/utils/subflow-utils.ts: Updated addSubflowErrorLog and emitEmptySubflowEvents to use void promise.catch() pattern for onBlockStart/onBlockComplete since these are synchronous utility functions. Writes are still registered in pendingProgressWrites synchronously before first suspension, so drain semantics are preserved.

Sequence Diagram

sequenceDiagram
    participant EC as executeWorkflowCore
    participant LS as LoggingSession
    participant DB as Database
    participant EX as Executor

    EC->>LS: safeStart()
    EC->>EX: execute() with wrappedOnBlockStart/Complete

    loop For each block
        EX->>EC: wrappedOnBlockStart(blockId, ...)
        EC->>LS: onBlockStart(blockId, startedAt)
        LS->>DB: jsonb_set lastStartedBlock (monotonic)
        DB-->>LS: ack (tracked in pendingProgressWrites)
        LS-->>EC: resolved
        EC-->>EX: void userCallback fired separately

        EX->>EC: wrappedOnBlockComplete(blockId, output)
        EC->>LS: onBlockComplete(blockId, output)
        LS->>DB: jsonb_set lastCompletedBlock (monotonic)
        LS->>DB: void flushAccumulatedCost (fire-and-forget)
        LS-->>EC: resolved
        EC-->>EX: void userCallback fired separately
    end

    EX-->>EC: ExecutionResult

    EC->>EC: finalizeExecutionOutcome()
    EC->>LS: safeComplete / safeCompleteWithCancellation / safeCompleteWithPause
    LS->>LS: drainPendingProgressWrites()
    LS->>DB: completeWorkflowExecution (finalizationPath, lastStarted/CompletedBlock, traceSpans)
    DB-->>LS: ack
    LS-->>EC: resolved

    EC->>DB: clearExecutionCancellation
    EC->>DB: updateWorkflowRunCounts
    EC-->>Caller: ExecutionResult

Last reviewed commit: 9db5e87

@PlaneInABottle
Author

@icecrasher321 I think this is an important one. One scenario I have experienced is a stuck workflow: I couldn't find any logs, and it just stayed in the running state. I am planning to introduce a few more PRs after this one too.

test added 3 commits March 13, 2026 14:17
Store last-started and last-completed block markers with finalization metadata so later read surfaces can explain how a run ended without reconstructing executor state.
Await only the persistence needed to keep diagnostics durable before terminal completion while keeping callback failures from changing execution behavior.
Keep successful fallback output and accumulated cost intact while tightening progress-write draining and deduplicating trace span counting for diagnostics helpers.
@PlaneInABottle PlaneInABottle force-pushed the upstream/execution-diagnostics-foundation branch from 9db5e87 to c6d9195 Compare March 13, 2026 11:44
Add the missing AuthType export to the hybrid auth mock so the async execution route test exercises the 202 queueing path instead of crashing with a 500 in CI.
Allow same-millisecond marker writes to replace prior markers and drop the unused diagnostics read helper so this PR stays focused on persistence rather than unread foundation code.
Drop the unused  helper so this PR only ships the persistence-side status types it actually uses.
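
The same-millisecond rule mentioned in the commit notes above can be illustrated with an in-memory analogue. This is only a sketch: the PR applies the rule in SQL through jsonb_set, and the `BlockMarker` shape and `nextMarker` helper here are hypothetical.

```typescript
// Hypothetical marker shape: which block, and an ISO-8601 timestamp.
interface BlockMarker {
  blockId: string;
  at: string;
}

// Returns the marker that should be stored after an incoming write.
function nextMarker(current: BlockMarker | undefined, incoming: BlockMarker): BlockMarker {
  if (!current) return incoming;
  // ">=" lets a marker with an identical timestamp replace the previous one,
  // while an older marker that arrives late is ignored.
  return Date.parse(incoming.at) >= Date.parse(current.at) ? incoming : current;
}

const first: BlockMarker = { blockId: "a", at: "2026-03-13T11:00:00.000Z" };
const sameMs: BlockMarker = { blockId: "b", at: "2026-03-13T11:00:00.000Z" };
const older: BlockMarker = { blockId: "c", at: "2026-03-13T10:59:59.999Z" };
```

Using a strict ">" instead would leave `first` in place when `sameMs` arrives, which is the case the commit set out to fix.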
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Ensure empty-subflow and subflow-error lifecycle callbacks participate in progress-write draining before terminal finalization while still swallowing callback failures.