Simplex reconfiguration framework - Part III (MSM implementation) #365
Conversation
yacovm commented on Apr 16, 2026
- Add block building to msm.go
- Add verification.go which contains logic for block verification
- Add tests that mimic Simplex flow (fake_node_test.go)
Force-pushed from 62fef9a to 7430e36
Addresses review comments: finalized blocks were a prefix of notarized blocks, so keeping two separate slices was redundant and error prone (e.g. tryFinalizeNextBlock panicked when the two went out of sync). Replaces both with a single blocks []blockState slice where each entry carries a finalized flag. Also removes the duplicate lookup in the GetBlock test fixture that would return finalized blocks as non-finalized. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yacov Manevich <yacov.manevich@avalabs.org>
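The refactor described in the commit message could look roughly like this. This is a sketch, not the PR's actual code: only `blocks`, `blockState`, the finalized flag, and `tryFinalizeNextBlock` are named in the description; everything else here is an assumption.

```go
package main

import "fmt"

// blockState is a sketch of the unified entry: one slice holds all
// notarized blocks, and a flag marks which of them are finalized,
// instead of two parallel slices that could drift out of sync.
type blockState struct {
	seq       uint64
	finalized bool
}

type chainState struct {
	blocks []blockState // every notarized block, in sequence order
}

// tryFinalizeNextBlock marks the first non-finalized block as finalized.
// With a single slice there is no second structure to fall out of sync
// with, so the panic described in the commit message cannot occur.
func (c *chainState) tryFinalizeNextBlock() bool {
	for i := range c.blocks {
		if !c.blocks[i].finalized {
			c.blocks[i].finalized = true
			return true
		}
	}
	return false
}

func main() {
	c := &chainState{blocks: []blockState{{seq: 0}, {seq: 1}}}
	fmt.Println(c.tryFinalizeNextBlock()) // finalizes seq 0
	fmt.Println(c.blocks[0].finalized, c.blocks[1].finalized)
}
```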
```go
bs := make(blockStore)
bs[0] = &outerBlock{block: genesisBlock}

var testConfig testConfig
testConfig.blockStore = bs
testConfig.validatorSetRetriever.result = NodeBLSMappings{
```
Maybe we should have constructors for these helpers? It seems like that would make them more reusable and maintainable in the future.
Maybe we can do that in bulk after I re-introduce verification.
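A possible shape for such a constructor, following the review suggestion. The types below are simplified stand-ins for the fixture types in the diff (the real `blockStore`, `outerBlock`, `testConfig`, and `NodeBLSMappings` definitions live in the PR), and `newTestConfig` itself is hypothetical.

```go
package main

import "fmt"

// Simplified stand-ins for the fixture types shown in the diff above.
type outerBlock struct{ id string }
type blockStore map[uint64]*outerBlock
type NodeBLSMappings map[string][]byte

type testConfig struct {
	blockStore blockStore
	mappings   NodeBLSMappings
}

// newTestConfig is a hypothetical constructor per the review comment:
// it centralizes fixture wiring (genesis block into a fresh store) so
// individual tests don't repeat the setup.
func newTestConfig(genesis *outerBlock, mappings NodeBLSMappings) testConfig {
	bs := make(blockStore)
	bs[0] = genesis
	return testConfig{blockStore: bs, mappings: mappings}
}

func main() {
	cfg := newTestConfig(&outerBlock{id: "genesis"}, NodeBLSMappings{})
	fmt.Println(cfg.blockStore[0].id)
}
```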
```go
type SignatureAggregator interface {
	AggregateSignatures(signatures ...[]byte) ([]byte, error)

	IsQuorum(approverWeights []uint64, totalWeight uint64) bool
```
I don't think we should redefine this interface in the MSM. Would it be better to use the interface defined in api.go?
I think that's better for two reasons. First, we would currently need to implement these SignatureAggregator interfaces with two different structs, because function names cannot be overloaded. More importantly, I think we should have just one implementation of IsQuorum. It's reasonable to assume the SignatureAggregator would have the weight of each node in a PoS setting, so we can just use the IsQuorum method defined in api.go:SignatureAggregator.
The problem is that the signature aggregator cannot know which P-chain block we're referring to, so when it gets an invocation from an epoch or from an MSM, it has no clue which P-chain height to look up against.
I agree that it's best for IsQuorum to be a single, unified API.
What do you think about doing the complete opposite: have the Epoch use IsQuorum(approverWeights []uint64, totalWeight uint64) bool and also add weights to the membership of the communication? This would simplify the implementation of IsQuorum on the avalanchego side.
To recap our quick sync offline: let's initialize SignatureAggregators with a set of node <-> weight mappings, and then use the newly initialized signature aggregator when calling IsQuorum in the MSM approval logic. This allows us to not worry about passing in weights and keeps the abstraction in AvalancheGo as is.
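The agreed direction could be sketched like this. All names here are assumptions, and the exact quorum rule (shown as a >2/3-stake threshold, a common BFT-style choice) is illustrative, not necessarily what the PR implements; the point is that the aggregator is constructed with the weights, so IsQuorum needs no P-chain context per call.

```go
package main

import "fmt"

type NodeID string

// weightedAggregator is initialized with node <-> weight mappings, so
// quorum checks need no extra per-call context.
type weightedAggregator struct {
	weights     map[NodeID]uint64
	totalWeight uint64
}

// NewSignatureAggregator (hypothetical name) captures the validator
// weights at construction time.
func NewSignatureAggregator(weights map[NodeID]uint64) *weightedAggregator {
	var total uint64
	for _, w := range weights {
		total += w
	}
	return &weightedAggregator{weights: weights, totalWeight: total}
}

// IsQuorum reports whether the approvers hold strictly more than 2/3 of
// the total stake (illustrative threshold, an assumption).
func (a *weightedAggregator) IsQuorum(approvers []NodeID) bool {
	var sum uint64
	for _, id := range approvers {
		sum += a.weights[id]
	}
	return 3*sum > 2*a.totalWeight
}

func main() {
	agg := NewSignatureAggregator(map[NodeID]uint64{"a": 10, "b": 10, "c": 10})
	fmt.Println(agg.IsQuorum([]NodeID{"a", "b", "c"})) // all three nodes approve
	fmt.Println(agg.IsQuorum([]NodeID{"a"}))           // a single node approves
}
```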
```go
seq := pmd.Seq

if seq == 0 {
	return fmt.Errorf("attempted to build a genesis inner block")
```
This is in verification, though, not block building.
Yeah, but the idea is "hey, someone tried to build a genesis block."
Also, it shouldn't say "inner": that's a copy-paste error I thought I had fixed.
I'll change it to something more reasonable, like "received a genesis block".
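The corrected check, as described, might read like this. The function name and surrounding shape are assumptions; only the error wording comes from the discussion.

```go
package main

import (
	"errors"
	"fmt"
)

// verifySeq (hypothetical name) sketches the fixed check: in verification
// the error should say a genesis block was *received*, not built.
func verifySeq(seq uint64) error {
	if seq == 0 {
		return errors.New("received a genesis block")
	}
	return nil
}

func main() {
	fmt.Println(verifySeq(0)) // rejected: genesis blocks are never verified here
	fmt.Println(verifySeq(5)) // passes this check
}
```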
```go
}
simplexEpochInfo := constructSimplexZeroBlock(pChainHeight, newValidatorSet, prevVMBlockSeq)

return sm.buildBlockImpatiently(ctx, parentBlock, simplexMetadata, simplexBlacklist, simplexEpochInfo, pChainHeight)
```
🤔
The caller, BuildBlock, is only invoked when we are selected to build a block. We also do the "wait impatiently" thing here, but more often than not we will not actually have a block to build. I say this because when we start a chain, we will probably not begin issuing transactions right away.
Therefore, more often than not the first Simplex block will just be an empty one produced right after the simplex chain is created. This also means that most block "zero"s will be the same.
Because most zero blocks will be the same, I'm thinking we could hardcode the zeroth block and remove a lot of the branching caused by treating zeroth blocks so specially (no need for the build-block-zero method, the verify method becomes simple, and a lot of branching is removed).
Another possibility: why rush to build the zeroth block? Can't we just wait, like for normal blocks?
Maybe my logic is wrong somewhere, but I have a general feeling there is a much simpler solution for handling the zeroth case.
> Because most zero blocks will be the same, I'm thinking we could hardcode the zeroth block and remove a lot of the branching caused by treating zeroth blocks so specially.

We can't hardcode the zero block because it has metadata fields we cannot precompute, such as the ICM epoch and timestamp, the previous VM block seq, etc.

> Another possibility: why rush to build the zeroth block? Can't we just wait, like for normal blocks?

So we can easily expand the validator set. Essentially, if we spin up a single-node network, we can then add a node, then another, then another, all by interacting with the P-chain, and we will never need any user traffic.
If we needed user traffic to build the zero epoch, we'd be stuck.
```go
case stateBuildBlockNormalOp:
	return sm.buildBlockNormalOp(ctx, parentBlock, simplexMetadataBytes, simplexBlacklistBytes, prevBlockSeq)
case stateBuildCollectingApprovals:
	return sm.buildBlockCollectingApprovals(ctx, parentBlock, simplexMetadataBytes, simplexBlacklistBytes, prevBlockSeq)
```
What happens if we are collecting approvals for a validator set change v1, but the P-chain height has advanced to a new validator set v2? Am I correct to say the current code would first finish collecting approvals for v1 and then transition to v2 after all approvals are collected?
Would it be better to stop the transition to v1 when we notice v2? Curious about your thoughts.
Once we have the next P-chain reference height (we moved from normal block building to collecting approvals), we are essentially "locked" on that P-chain height for the epoch transition and cannot change it.

> Am I correct to say the current code would first finish collecting approvals for v1 and then transition to v2 after all approvals are collected?

Yes, in such a case we'll have two epoch changes, one after the other.

> Would it be better to stop the transition to v1 when we notice v2? Curious about your thoughts.

I did think about this "problem" as well when I designed the MSM, but I don't think it's a use case we should optimize for.
We might be able to solve it passively by taking the P-chain height from ICM epochs, because they advance their P-chain height slowly, once per time frame (I think 30 seconds or so), which implicitly batches P-chain changes.
I'll look at the code once I re-introduce ICM epoching and see if I can make it more robust.
Thanks for asking this 👍
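The "locked on a P-chain height" behavior described above could be sketched like this. The state names come from the diff; the lock field, method, and transition are assumptions made for illustration.

```go
package main

import "fmt"

type buildState int

const (
	stateBuildBlockNormalOp buildState = iota
	stateBuildCollectingApprovals
)

// msm sketches the locking behavior: once we leave normal operation and
// start collecting approvals, the epoch transition is pinned to the
// P-chain height observed at that moment, even if the P-chain advances
// to a newer validator set (v2) in the meantime.
type msm struct {
	state              buildState
	lockedPChainHeight uint64
}

// observePChainHeight (hypothetical) returns the height the transition
// is bound to: the first observation locks it; later ones are ignored
// until the current transition completes.
func (m *msm) observePChainHeight(h uint64) uint64 {
	if m.state == stateBuildBlockNormalOp {
		m.state = stateBuildCollectingApprovals
		m.lockedPChainHeight = h
	}
	return m.lockedPChainHeight
}

func main() {
	m := &msm{state: stateBuildBlockNormalOp}
	fmt.Println(m.observePChainHeight(100)) // locks on 100 (v1)
	fmt.Println(m.observePChainHeight(200)) // still 100: v2 waits its turn
}
```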
```go
NextEpochApprovals *NextEpochApprovals `canoto:"pointer,7"`
// SealingBlockSeq is the block sequence of the sealing block of the current epoch.
// It defines the validator set of the next epoch.
SealingBlockSeq uint64 `canoto:"uint,8"`
```
Is this only set when we are building Telocks?
Yes. It's set once we build the first Telock, and then we copy over the previous Telock's sealing block seq by induction.
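That propagation could be sketched as follows. Only the SealingBlockSeq field name comes from the diff; the struct, function, and "parent" plumbing are assumptions.

```go
package main

import "fmt"

// telockMetadata sketches the part of the block metadata relevant here:
// SealingBlockSeq is set on the first Telock of an epoch transition and
// copied forward from the parent Telock afterwards.
type telockMetadata struct {
	SealingBlockSeq uint64
}

// nextTelock (hypothetical) builds metadata for the next Telock: the
// first one records the sealing block's sequence; later ones inherit it.
func nextTelock(parent *telockMetadata, sealingSeq uint64) telockMetadata {
	if parent == nil {
		return telockMetadata{SealingBlockSeq: sealingSeq} // first Telock
	}
	return telockMetadata{SealingBlockSeq: parent.SealingBlockSeq} // inherit
}

func main() {
	first := nextTelock(nil, 42)
	second := nextTelock(&first, 999) // 999 ignored: inherited instead
	fmt.Println(first.SealingBlockSeq, second.SealingBlockSeq)
}
```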
```go
}

// TODO: This P-chain height should be taken from the ICM epoch
childBlock, err := sm.BlockBuilder.BuildBlock(ctx, sm.GetPChainHeight())
```
🤔 We are building a new block for the epoch, but a couple of things can change between epochs:
- Validator set ordering: we might have been the leader for seq x in epoch 1, but now that the epoch has changed, we may not be the leader for seq x in epoch 2.
- Our node may be kicked out of the validator set, so we shouldn't be building a block anyway.
> We are building a new block for the epoch, but a couple of things can change between epochs.

To be precise, we're building a block for a new epoch.

> Validator set ordering: we might have been the leader for seq x in epoch 1, but now that the epoch has changed, we may not be the leader for seq x in epoch 2. Our node may be kicked out of the validator set, so we shouldn't be building a block anyway.

A new epoch, a new Epoch instance, right?
Why does it matter which node is the leader in the current round? The MSM assumes nothing about the relative position of the node within the validator set. Eventually there will be a non-faulty node that becomes leader and executes this line. When a node executes this line, it will already have the new Epoch instance with the new validator set, right?
Keep in mind that the API of the MSM is very primitive: we can either BuildBlock or VerifyBlock. There is no assumption of any kind of continuation. We can only build a single block in any given round.
So, if the first time we execute this function we haven't sealed our epoch (finalized the sealing block), we are running under the context of the previous epoch. Otherwise, if the epoch has been sealed, this executes under the context of the new epoch, and therefore a node that has been evicted isn't expected to execute it (and even if it did, no one would accept such a block).
Does that make sense?
Ah, makes sense, thanks for the explanation 👍