Enable transitioning of node to using MEL when previously ran with Inbox Reader-Tracker#4597
Enable transitioning of node to using MEL when previously ran with Inbox Reader-Tracker#4597rauljordan merged 20 commits intomasterfrom
Conversation
…box Reader-Tracker
…de-transition-to-mel
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #4597 +/- ##
==========================================
- Coverage 33.95% 33.61% -0.34%
==========================================
Files 498 499 +1
Lines 59889 60120 +231
==========================================
- Hits 20333 20211 -122
- Misses 36014 36365 +351
- Partials 3542 3544 +2 |
❌ 15 Tests Failed:
View the top 3 failed tests by shortest run time
📣 Thoughts on this report? Let Codecov know! | Powered by Codecov |
| // | ||
| // Batch count, msg count, and delayed read are derived from legacy batch metadata. | ||
| // The MEL inbox accumulator is reconstructed for any unread delayed messages. | ||
| func CreateInitialMELStateFromLegacyDB( |
There was a problem hiding this comment.
This is a really nice function! Great job on putting it together
…de-transition-to-mel
…de-transition-to-mel
…de-transition-to-mel
…de-transition-to-mel
The base branch was changed.
eljobe
left a comment
There was a problem hiding this comment.
Overall this is a well-structured migration — the legacy DB read layer is solid, the CreateInitialMELStateFromLegacyDB state construction is correct, the test coverage (especially legacy_db_reads_test.go and TestMELMigrationFromLegacyNode) is thorough, and the IsBatchGasFieldsMissing optimization is a nice touch.
Two issues I'd like to see addressed before merge:
-
SaveInitialMelStatesets in-memory cache beforedbBatch.Write()— if the batch write fails, theDatabasestruct is left in an inconsistent state where it routes reads to legacy keys despite nothing being persisted. (See inline comment.) -
legacyGetDelayedMessageAndParentChainBlockNumberswallows non-not-found errors from the RLP prefix — if the "e" prefix key exists but is corrupted, the error is silently swallowed and the code falls back to the "d" prefix instead of surfacing the corruption. Notably, the same function does check!rawdb.IsDbErrNotFound(err)for the parent chain block number fallback a few lines later, so this asymmetry looks like an oversight. (See inline comment.)
| d.hasInitialState = true | ||
| d.initialDelayedCount = initialState.DelayedMessagesSeen | ||
| d.initialBatchCount = initialState.BatchCount | ||
| return dbBatch.Write() |
There was a problem hiding this comment.
The in-memory fields are set before dbBatch.Write(). If the batch write fails (disk full, I/O error, etc.), the Database struct will have hasInitialState = true with the boundary values cached, but the DB will not actually contain the initial MEL state or the InitialMelStateBlockNumKey. This means fetchBatchMetadata and FetchDelayedMessage will dispatch to legacy keys based on stale thresholds for the remainder of the process lifetime.
In practice the caller returns an error and the node fails to start, so this is unlikely to cause data corruption — but it's a correctness bug in the function's contract and straightforward to fix:
if err := dbBatch.Write(); err != nil {
return err
}
d.hasInitialState = true
d.initialDelayedCount = initialState.DelayedMessagesSeen
d.initialBatchCount = initialState.BatchCount
return nilThere was a problem hiding this comment.
addressed it
| msg, _, err := legacyGetDelayedMessageFromRlpPrefix(db, index) | ||
| if err != nil { | ||
| // Fall back to legacy "d" prefix | ||
| msg, _, err = legacyGetDelayedMessageFromLegacyPrefix(db, index) |
There was a problem hiding this comment.
When legacyGetDelayedMessageFromRlpPrefix fails for a reason other than "not found" (e.g., corrupted data causing an RLP decode error, truncated entry), the code unconditionally falls through to the legacy "d" prefix instead of surfacing the corruption error. This could silently mask data corruption.
Notably, the same function already checks !rawdb.IsDbErrNotFound(err) for the parent chain block number fallback on line 61, so the asymmetry here looks like an oversight.
Suggested fix — only fall back on not-found:
msg, _, err := legacyGetDelayedMessageFromRlpPrefix(db, index)
if err != nil {
if !rawdb.IsDbErrNotFound(err) {
return nil, 0, fmt.Errorf("error reading RLP delayed message at index %d: %%w", index, err)
}
// Key genuinely absent — fall back to legacy "d" prefix
msg, _, err = legacyGetDelayedMessageFromLegacyPrefix(db, index)
...
}There was a problem hiding this comment.
Good catch, fixed it
This PR enables a nitro node to transition to using message extraction code in place of the current inbox reader and tracker code. MEL is considered to be consensus active if the state's version > 0, hence a node can turn MEL
ONwith version = 0 and use it just for processing without having to use it for consensus, validation etc...When MEL is first enabled on the node, we first detect an appropriate finalized parent chain block from whom a batch has been READ, we then appropriately fill in the MEL state fields and create an Initial MEL state. From here on, the node will use MEL for processing parent chain blocks and
the node will not be allowed to restart the old way i.e using inbox reader and tracker code.Testing Done
I successfully ran an arb-sepolia node from scratch upto ~300k blocks using inbox reader and tracker code - shut it down - and successfully restarted with MEL turned on and saw the node continue catchup in an expected way.
Resolves NIT-4209