
Enable transitioning of node to using MEL when previously run with Inbox Reader-Tracker #4597

Merged
rauljordan merged 20 commits into master from node-transition-to-mel
Apr 14, 2026
Conversation

@ganeshvanahalli
Contributor

@ganeshvanahalli ganeshvanahalli commented Apr 2, 2026

This PR enables a nitro node to transition to using the message extraction (MEL) code in place of the current inbox reader and tracker code. MEL is considered consensus-active only when the state's version > 0, so a node can turn MEL on with version = 0 and use it purely for processing, without relying on it for consensus, validation, etc.

When MEL is first enabled on the node, we detect an appropriate finalized parent chain block from which a batch has been read, then fill in the MEL state fields accordingly and create an initial MEL state. From there on, the node uses MEL for processing parent chain blocks and is no longer allowed to restart the old way, i.e. using the inbox reader and tracker code.
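The transition step described above can be sketched roughly as follows. All names here (`melState`, `legacyDB`, `createInitialMELState`) are hypothetical simplifications for illustration, not the actual nitro types; the real logic lives in the legacy-DB read layer and `CreateInitialMELStateFromLegacyDB`.

```go
package main

import (
	"errors"
	"fmt"
)

// melState is a simplified stand-in for the node's MEL state.
type melState struct {
	ParentChainBlock uint64
	BatchCount       uint64
	DelayedSeen      uint64
	Version          uint64 // 0 = processing only; >0 = consensus-active
}

// legacyDB is a simplified stand-in for the legacy inbox reader/tracker metadata.
type legacyDB struct {
	finalizedBlock uint64 // latest finalized parent chain block
	batchCount     uint64
	delayedSeen    uint64
	lastBatchBlock uint64 // parent chain block of the last batch that was read
}

// createInitialMELState sketches the transition: pick a finalized parent
// chain block from which a batch has been read, then seed the initial MEL
// state from the legacy batch metadata.
func createInitialMELState(db *legacyDB) (*melState, error) {
	if db.lastBatchBlock > db.finalizedBlock {
		return nil, errors.New("no finalized parent chain block with a read batch yet")
	}
	return &melState{
		ParentChainBlock: db.lastBatchBlock,
		BatchCount:       db.batchCount,
		DelayedSeen:      db.delayedSeen,
		Version:          0, // start in processing-only mode; consensus use requires Version > 0
	}, nil
}

func main() {
	db := &legacyDB{finalizedBlock: 300000, batchCount: 1200, delayedSeen: 3400, lastBatchBlock: 299950}
	st, err := createInitialMELState(db)
	if err != nil {
		panic(err)
	}
	fmt.Println(st.ParentChainBlock, st.BatchCount, st.Version) // 299950 1200 0
}
```

Once this initial state is persisted, the node treats it as the boundary: reads below the recorded batch/delayed counts dispatch to legacy keys, everything after is produced by MEL.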

Testing Done

I successfully ran an arb-sepolia node from scratch up to ~300k blocks using the inbox reader and tracker code, shut it down, and then restarted it with MEL turned on; the node continued catching up as expected.

Resolves NIT-4209

@ganeshvanahalli ganeshvanahalli marked this pull request as ready for review April 2, 2026 21:41
@codecov

codecov bot commented Apr 2, 2026

Codecov Report

❌ Patch coverage is 8.80282% with 259 lines in your changes missing coverage. Please review.
✅ Project coverage is 33.61%. Comparing base (7e71bec) to head (a4430e7).
⚠️ Report is 21 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4597      +/-   ##
==========================================
- Coverage   33.95%   33.61%   -0.34%     
==========================================
  Files         498      499       +1     
  Lines       59889    60120     +231     
==========================================
- Hits        20333    20211     -122     
- Misses      36014    36365     +351     
- Partials     3542     3544       +2     

@github-actions
Contributor

github-actions bot commented Apr 2, 2026

❌ 15 Tests Failed:

Tests completed | Failed | Passed | Skipped
4924 | 15 | 4909 | 0
View the top 3 failed tests by shortest run time
TestEndToEnd_ManyEvilValidators
Stack Traces | -0.000s run time
... [CONTENT TRUNCATED: Keeping last 20 lines]
created by github.com/offchainlabs/nitro/bold/containers/events.(*Producer[...]).Broadcast in goroutine 203082
	/home/runner/work/nitro/nitro/bold/containers/events/producer.go:106 +0x10a

goroutine 5534630 [select]:
github.com/offchainlabs/nitro/bold/containers/events.(*Producer[...]).Broadcast.func1()
	/home/runner/work/nitro/nitro/bold/containers/events/producer.go:107 +0xd9
created by github.com/offchainlabs/nitro/bold/containers/events.(*Producer[...]).Broadcast in goroutine 203079
	/home/runner/work/nitro/nitro/bold/containers/events/producer.go:106 +0x10a

goroutine 5534639 [select]:
github.com/offchainlabs/nitro/bold/containers/events.(*Producer[...]).Broadcast.func1()
	/home/runner/work/nitro/nitro/bold/containers/events/producer.go:107 +0xd9
created by github.com/offchainlabs/nitro/bold/containers/events.(*Producer[...]).Broadcast in goroutine 203079
	/home/runner/work/nitro/nitro/bold/containers/events/producer.go:106 +0x10a

goroutine 5534629 [select]:
github.com/offchainlabs/nitro/bold/containers/events.(*Producer[...]).Broadcast.func1()
	/home/runner/work/nitro/nitro/bold/containers/events/producer.go:107 +0xd9
created by github.com/offchainlabs/nitro/bold/containers/events.(*Producer[...]).Broadcast in goroutine 203073
	/home/runner/work/nitro/nitro/bold/containers/events/producer.go:106 +0x10a
TestEndToEnd_ManyEvilValidators/honest_essential_edges_confirmed_by_challenge_win
Stack Traces | -0.000s run time
... [CONTENT TRUNCATED: Keeping last 20 lines]
created by github.com/offchainlabs/nitro/bold/containers/events.(*Producer[...]).Broadcast in goroutine 203082
	/home/runner/work/nitro/nitro/bold/containers/events/producer.go:106 +0x10a

goroutine 5534630 [select]:
github.com/offchainlabs/nitro/bold/containers/events.(*Producer[...]).Broadcast.func1()
	/home/runner/work/nitro/nitro/bold/containers/events/producer.go:107 +0xd9
created by github.com/offchainlabs/nitro/bold/containers/events.(*Producer[...]).Broadcast in goroutine 203079
	/home/runner/work/nitro/nitro/bold/containers/events/producer.go:106 +0x10a

goroutine 5534639 [select]:
github.com/offchainlabs/nitro/bold/containers/events.(*Producer[...]).Broadcast.func1()
	/home/runner/work/nitro/nitro/bold/containers/events/producer.go:107 +0xd9
created by github.com/offchainlabs/nitro/bold/containers/events.(*Producer[...]).Broadcast in goroutine 203079
	/home/runner/work/nitro/nitro/bold/containers/events/producer.go:106 +0x10a

goroutine 5534629 [select]:
github.com/offchainlabs/nitro/bold/containers/events.(*Producer[...]).Broadcast.func1()
	/home/runner/work/nitro/nitro/bold/containers/events/producer.go:107 +0xd9
created by github.com/offchainlabs/nitro/bold/containers/events.(*Producer[...]).Broadcast in goroutine 203073
	/home/runner/work/nitro/nitro/bold/containers/events/producer.go:106 +0x10a
TestRedisProduceComplex/one_producer,_all_consumers_are_active
Stack Traces | 1.350s run time
... [CONTENT TRUNCATED: Keeping last 20 lines]
DEBUG[04-14|18:28:20.880] consumer: xack                           cid=81b71789-589a-496c-b2ea-a94dfdcf0785 messageId=1776191299766-7
DEBUG[04-14|18:28:20.880] consumer: xack                           cid=6637f0ff-c089-4efa-b76f-dfec056c2777 messageId=1776191299766-10
DEBUG[04-14|18:28:20.880] Redis stream consuming                   consumer_id=1e6cc4c5-cc4e-434b-b027-26fb9eaf0a3c message_id=1776191299767-0
DEBUG[04-14|18:28:20.880] consumer: xdel                           cid=7e849589-cc36-491a-a2a4-8e60dd09f86e messageId=1776191299766-5
DEBUG[04-14|18:28:20.880] consumer: setting result                 cid=1e6cc4c5-cc4e-434b-b027-26fb9eaf0a3c msgIdInStream=1776191299767-0 resultKeyInRedis=result-key:stream:c54d3fc3-e015-4898-9ba0-3c2d61d83314.1776191299767-0
DEBUG[04-14|18:28:20.880] consumer: xdel                           cid=219be019-5ee7-454c-ae27-8fcda0472e1b messageId=1776191299766-6
DEBUG[04-14|18:28:20.880] consumer: xdel                           cid=c6119c94-e802-484b-89fa-abb56aeb28aa messageId=1776191299766-9
DEBUG[04-14|18:28:20.880] consumer: xdel                           cid=6637f0ff-c089-4efa-b76f-dfec056c2777 messageId=1776191299766-10
DEBUG[04-14|18:28:20.880] consumer: xdel                           cid=81b71789-589a-496c-b2ea-a94dfdcf0785 messageId=1776191299766-7
DEBUG[04-14|18:28:20.880] consumer: xack                           cid=1e6cc4c5-cc4e-434b-b027-26fb9eaf0a3c messageId=1776191299767-0
DEBUG[04-14|18:28:20.880] consumer: xdel                           cid=1e6cc4c5-cc4e-434b-b027-26fb9eaf0a3c messageId=1776191299767-0
DEBUG[04-14|18:28:20.881] consumer: xdel                           cid=08ec4646-b38e-4d29-adcb-956b060b8869 messageId=1776191299764-1
DEBUG[04-14|18:28:20.888] consumer: xack                           cid=8addd35e-d804-4a75-a70a-0b25a867d521 messageId=1776191299766-8
DEBUG[04-14|18:28:20.888] consumer: xdel                           cid=8addd35e-d804-4a75-a70a-0b25a867d521 messageId=1776191299766-8
DEBUG[04-14|18:28:21.001] checkResponses                           responded=92 errored=0 checked=100
DEBUG[04-14|18:28:21.026] redis producer: check responses starting
DEBUG[04-14|18:28:21.035] checkResponses                           responded=8 errored=0 checked=8
ERROR[04-14|18:28:21.035] Error from XpendingExt in getting PEL for auto claim err="context canceled" pendingLen=0
DEBUG[04-14|18:28:21.087] Error destroying a stream group          error="dial tcp 127.0.0.1:36057: connect: connection refused"
--- FAIL: TestRedisProduceComplex/one_producer,_all_consumers_are_active (1.35s)


//
// Batch count, msg count, and delayed read are derived from legacy batch metadata.
// The MEL inbox accumulator is reconstructed for any unread delayed messages.
func CreateInitialMELStateFromLegacyDB(
Contributor


This is a really nice function! Great job on putting it together

Contributor Author


thanks!

rauljordan
rauljordan previously approved these changes Apr 3, 2026
Base automatically changed from replace-readerandtracker-with-messageextractor to master April 10, 2026 11:13
@ganeshvanahalli ganeshvanahalli dismissed rauljordan’s stale review April 10, 2026 11:13

The base branch was changed.

@ganeshvanahalli ganeshvanahalli assigned eljobe and unassigned rauljordan Apr 13, 2026
Member

@eljobe eljobe left a comment


Overall this is a well-structured migration — the legacy DB read layer is solid, the CreateInitialMELStateFromLegacyDB state construction is correct, the test coverage (especially legacy_db_reads_test.go and TestMELMigrationFromLegacyNode) is thorough, and the IsBatchGasFieldsMissing optimization is a nice touch.

Two issues I'd like to see addressed before merge:

  1. SaveInitialMelState sets in-memory cache before dbBatch.Write() — if the batch write fails, the Database struct is left in an inconsistent state where it routes reads to legacy keys despite nothing being persisted. (See inline comment.)

  2. legacyGetDelayedMessageAndParentChainBlockNumber swallows non-not-found errors from the RLP prefix — if the "e" prefix key exists but is corrupted, the error is silently swallowed and the code falls back to the "d" prefix instead of surfacing the corruption. Notably, the same function does check !rawdb.IsDbErrNotFound(err) for the parent chain block number fallback a few lines later, so this asymmetry looks like an oversight. (See inline comment.)

Comment thread: arbnode/mel/runner/database.go (Outdated)
Comment on lines +77 to +80
d.hasInitialState = true
d.initialDelayedCount = initialState.DelayedMessagesSeen
d.initialBatchCount = initialState.BatchCount
return dbBatch.Write()
Member


The in-memory fields are set before dbBatch.Write(). If the batch write fails (disk full, I/O error, etc.), the Database struct will have hasInitialState = true with the boundary values cached, but the DB will not actually contain the initial MEL state or the InitialMelStateBlockNumKey. This means fetchBatchMetadata and FetchDelayedMessage will dispatch to legacy keys based on stale thresholds for the remainder of the process lifetime.

In practice the caller returns an error and the node fails to start, so this is unlikely to cause data corruption — but it's a correctness bug in the function's contract and straightforward to fix:

if err := dbBatch.Write(); err != nil {
    return err
}
d.hasInitialState = true
d.initialDelayedCount = initialState.DelayedMessagesSeen
d.initialBatchCount = initialState.BatchCount
return nil

Contributor Author


addressed it

Comment on lines +49 to +52
msg, _, err := legacyGetDelayedMessageFromRlpPrefix(db, index)
if err != nil {
// Fall back to legacy "d" prefix
msg, _, err = legacyGetDelayedMessageFromLegacyPrefix(db, index)
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When legacyGetDelayedMessageFromRlpPrefix fails for a reason other than "not found" (e.g., corrupted data causing an RLP decode error, truncated entry), the code unconditionally falls through to the legacy "d" prefix instead of surfacing the corruption error. This could silently mask data corruption.

Notably, the same function already checks !rawdb.IsDbErrNotFound(err) for the parent chain block number fallback on line 61, so the asymmetry here looks like an oversight.

Suggested fix — only fall back on not-found:

msg, _, err := legacyGetDelayedMessageFromRlpPrefix(db, index)
if err != nil {
    if !rawdb.IsDbErrNotFound(err) {
        return nil, 0, fmt.Errorf("error reading RLP delayed message at index %d: %w", index, err)
    }
    // Key genuinely absent — fall back to legacy "d" prefix
    msg, _, err = legacyGetDelayedMessageFromLegacyPrefix(db, index)
    ...
}

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, fixed it

@eljobe eljobe assigned ganeshvanahalli and unassigned eljobe Apr 14, 2026
@ganeshvanahalli ganeshvanahalli removed their assignment Apr 14, 2026
@ganeshvanahalli ganeshvanahalli requested a review from eljobe April 14, 2026 09:34
eljobe
eljobe previously approved these changes Apr 14, 2026
@eljobe eljobe enabled auto-merge April 14, 2026 09:37
@eljobe eljobe added this pull request to the merge queue Apr 14, 2026
@ganeshvanahalli ganeshvanahalli removed this pull request from the merge queue due to a manual request Apr 14, 2026
eljobe
eljobe previously approved these changes Apr 14, 2026
@ganeshvanahalli ganeshvanahalli added this pull request to the merge queue Apr 14, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 14, 2026
@ganeshvanahalli ganeshvanahalli added this pull request to the merge queue Apr 14, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 14, 2026
@ganeshvanahalli ganeshvanahalli added this pull request to the merge queue Apr 14, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 14, 2026
@rauljordan rauljordan added this pull request to the merge queue Apr 14, 2026
Merged via the queue into master with commit c28c984 Apr 14, 2026
23 of 24 checks passed
@rauljordan rauljordan deleted the node-transition-to-mel branch April 14, 2026 19:50

3 participants