
design: HardFork deployment simplification via bootstrap #67

Open
bdchatham wants to merge 1 commit into main from design/hardfork-bootstrap-simplification

@bdchatham
Collaborator

Summary

Low-level design (LLD) for simplifying the HardFork deployment plan from 6 tasks to 4 by leveraging the existing bootstrap Job mechanism.

Key Change

Instead of the controller orchestrating halt-height coordination between incumbent and entrant nodes via sidecar tasks (SubmitHaltSignal + AwaitNodesAtHeight), the entrant node's bootstrap spec handles it:

  • Bootstrap image = incumbent binary (e.g., sei:v6.3.0)
  • Target height = haltHeight - 1
  • Production image = new binary (e.g., sei:v6.4.0)

The bootstrap Job syncs to haltHeight - 1 (one block before the upgrade height) and exits cleanly. The StatefulSet then starts with the new binary, which processes the upgrade height.
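
As a rough illustration of the three values listed above, the derivation can be sketched in Go. This is a minimal sketch, not the controller's code: `EntrantPlan`, `PlanEntrant`, and the halt height of 1000 are hypothetical names and values chosen for the example, and the real CRD fields may look different.

```go
package main

import (
	"errors"
	"fmt"
)

// EntrantPlan is a hypothetical container for the three values the PR
// description lists; it is not the operator's actual CRD type.
type EntrantPlan struct {
	BootstrapImage  string // incumbent binary, run by the bootstrap Job
	TargetHeight    int64  // last block the bootstrap Job syncs: haltHeight - 1
	ProductionImage string // new binary, run by the StatefulSet
}

// PlanEntrant derives the entrant configuration from the incumbent image,
// the new image, and the upgrade (halt) height.
func PlanEntrant(incumbentImage, newImage string, haltHeight int64) (EntrantPlan, error) {
	if incumbentImage == "" || newImage == "" {
		return EntrantPlan{}, errors.New("incumbent and new images must both be set")
	}
	if haltHeight < 2 {
		// haltHeight - 1 must still be a valid block height (>= 1).
		return EntrantPlan{}, fmt.Errorf("haltHeight %d leaves no block to sync to", haltHeight)
	}
	return EntrantPlan{
		BootstrapImage:  incumbentImage,
		TargetHeight:    haltHeight - 1,
		ProductionImage: newImage,
	}, nil
}

func main() {
	// The halt height here is an assumed value for illustration only.
	plan, err := PlanEntrant("sei:v6.3.0", "sei:v6.4.0", 1000)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", plan)
}
```

The point of the sketch is that the old binary never sees the upgrade height: it stops at haltHeight - 1, and only the production image processes the upgrade block.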

Plan Reduction

Before (6 tasks): Create → AwaitRunning → SubmitHaltSignal → AwaitHeight → Switch → Teardown
After  (4 tasks): Create → AwaitRunning → Switch → Teardown
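
The reduction above can be sketched as a filter over the old task list. This is only an illustrative sketch: the `Task` type and the function names are hypothetical, and the task names reuse the shorthand from the listing rather than the planner's real identifiers.

```go
package main

import (
	"fmt"
	"strings"
)

// Task is a hypothetical stand-in for the planner's task type. The names
// mirror the shorthand used in the plan listing.
type Task string

const (
	Create           Task = "Create"
	AwaitRunning     Task = "AwaitRunning"
	SubmitHaltSignal Task = "SubmitHaltSignal"
	AwaitHeight      Task = "AwaitHeight"
	Switch           Task = "Switch"
	Teardown         Task = "Teardown"
)

// LegacyPlan returns the six-task plan used before this change.
func LegacyPlan() []Task {
	return []Task{Create, AwaitRunning, SubmitHaltSignal, AwaitHeight, Switch, Teardown}
}

// BootstrapPlan drops the two halt-coordination tasks, since the entrant's
// bootstrap Job now covers that work.
func BootstrapPlan() []Task {
	plan := make([]Task, 0, 4)
	for _, t := range LegacyPlan() {
		if t == SubmitHaltSignal || t == AwaitHeight {
			continue
		}
		plan = append(plan, t)
	}
	return plan
}

// render joins task names with the arrow used in the listing.
func render(plan []Task) string {
	names := make([]string, len(plan))
	for i, t := range plan {
		names[i] = string(t)
	}
	return strings.Join(names, " → ")
}

func main() {
	fmt.Printf("Before (%d tasks): %s\n", len(LegacyPlan()), render(LegacyPlan()))
	fmt.Printf("After  (%d tasks): %s\n", len(BootstrapPlan()), render(BootstrapPlan()))
}
```

The relative order of the remaining four tasks is unchanged, so the traffic switch still waits for the entrant to be running.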

Design Covers

  • CRD changes (IncumbentImage on DeploymentStatus)
  • Planner simplification (remove 2 tasks)
  • CreateEntrantNodes bootstrap injection
  • Entrant lifecycle trace (bootstrap → upgrade handler → Running)
  • Failure modes and recovery
  • BlueGreen convergence analysis
  • Traffic switch timing
  • File-by-file changes and implementation order

📄 Design doc: .tide/designs/hardfork-bootstrap-simplification.md

🤖 Generated with Claude Code

LLD for reducing the HardFork deployment plan from 6 tasks to 4 by
leveraging the existing bootstrap Job system. The entrant node's
bootstrap image (incumbent binary) syncs to haltHeight-1, then the
production StatefulSet (new binary) handles the upgrade height.

Eliminates SubmitHaltSignal and AwaitNodesAtHeight tasks — the
bootstrap mechanism handles halt-height coordination internally.
Incumbents keep running until traffic switch, removing the risk of
premature SIGTERM.

Covers: CRD changes, planner simplification, CreateEntrantNodes
bootstrap injection, failure modes, BlueGreen convergence analysis,
traffic switch timing, and implementation order.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>