Explaining the Namada 0.13.3 consensus fork
Namada 0.13.3 was released on January 25 to address a testnet halt at the start of epoch 35. The cause of the halt was a rounding error when small bonds were converted into Tendermint voting power. Tendermint crashed when Namada attempted to include a new validator with zero voting power in the validator set.
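The rounding problem can be illustrated with a minimal sketch. The scaling constant and function names below are hypothetical and are not taken from the Namada codebase, and the zero-power filter shows only one possible guard, not necessarily the change shipped in 0.13.3.

```rust
/// Hypothetical scaling factor: tokens per unit of voting power.
/// The real value used by Namada is not stated in this post.
const TOKENS_PER_VOTING_POWER: u64 = 1_000_000;

/// Convert bonded tokens to voting power with integer division.
/// Integer division rounds down, so any bond smaller than the
/// scaling factor yields a voting power of zero.
fn voting_power(bonded_tokens: u64) -> u64 {
    bonded_tokens / TOKENS_PER_VOTING_POWER
}

/// Build the list of new validators to hand to the consensus engine,
/// skipping any whose voting power rounds down to zero.
fn new_validator_updates(bonds: &[(&str, u64)]) -> Vec<(String, u64)> {
    bonds
        .iter()
        .map(|(name, tokens)| (name.to_string(), voting_power(*tokens)))
        .filter(|(_, power)| *power > 0)
        .collect()
}
```

With these assumed numbers, a bond of 999,999 tokens converts to zero voting power, so the corresponding validator is left out of the update rather than being passed on with a power of zero.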
After the upgrade to 0.13.3, however, the validator set forked and the chain was unable to proceed. We compared storage dumps from each side of the split to determine where the state had diverged, and found that the way the node writes new conversions for shielded pool incentives at each epoch was subtly incorrect, a flaw that the earlier chain halt brought to the surface.
Specifically, the new conversions were committed to storage as soon as they were computed, before the block itself had been committed by consensus. As a result, most validators updated the conversion set at least twice at the start of epoch 35, because updates from failed attempts at block 3501 during the halt were persisted as well. Some validators, however, resynced their nodes from scratch when updating to 0.13.3 and thus replayed the entire chain, correctly updating the conversions only once. The two groups of validators therefore ended up with different state.
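The divergence can be reproduced in a toy model. Everything in this sketch is hypothetical (the `Storage` type, the `conversion_updates` key, and the function names); it only mimics the ordering problem described above, with a counter standing in for the conversion state.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for a node's persistent storage (hypothetical).
#[derive(Default)]
struct Storage {
    data: BTreeMap<String, u64>,
}

/// Conversion update that writes straight to storage, mirroring the
/// problematic ordering: the write is persisted immediately.
fn update_conversions_eagerly(storage: &mut Storage) {
    *storage
        .data
        .entry("conversion_updates".to_string())
        .or_insert(0) += 1;
}

/// One attempt at processing the first block of an epoch.
/// `succeeds` says whether consensus finally commits the block.
fn attempt_block(storage: &mut Storage, succeeds: bool) -> bool {
    // Persisted even if this attempt later fails.
    update_conversions_eagerly(storage);
    succeeds
}

/// Run a sequence of attempts and report how many conversion updates
/// ended up in storage.
fn updates_after(attempts: &[bool]) -> u64 {
    let mut storage = Storage::default();
    for &ok in attempts {
        attempt_block(&mut storage, ok);
    }
    storage.data.get("conversion_updates").copied().unwrap_or(0)
}
```

In this model a node that saw one failed attempt followed by a successful one, `updates_after(&[false, true])`, records two updates, while a node that replays only the committed block, `updates_after(&[true])`, records one. That mismatch is the kind of state difference that splits a validator set.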
The running testnet can be resumed using Namada 0.13.3 if a quorum of validators resync from scratch, arriving at the correct conversion state in the process.
We have also developed a fix in which the conversion update is committed only together with the block commit. Instead of writing directly to storage, the update is written to the write log, which holds state changes that are complete but not yet committed.
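A minimal sketch of the write-log idea follows. It is not Namada's actual implementation: the `Node` type, its methods, and the `conversion_updates` key are hypothetical, and it assumes that a failed block attempt discards the pending write log.

```rust
use std::collections::BTreeMap;

/// Toy node with committed storage plus a write log of pending changes.
#[derive(Default)]
struct Node {
    storage: BTreeMap<String, u64>,
    write_log: BTreeMap<String, u64>,
}

impl Node {
    /// Read through the write log first, then fall back to storage.
    fn read(&self, key: &str) -> u64 {
        self.write_log
            .get(key)
            .or_else(|| self.storage.get(key))
            .copied()
            .unwrap_or(0)
    }

    /// Stage a change in the write log without touching storage.
    fn write(&mut self, key: &str, value: u64) {
        self.write_log.insert(key.to_string(), value);
    }

    /// Conversion update, staged rather than persisted.
    fn update_conversions(&mut self) {
        let n = self.read("conversion_updates");
        self.write("conversion_updates", n + 1);
    }

    /// On block commit, flush all pending changes to storage.
    fn commit_block(&mut self) {
        let pending = std::mem::take(&mut self.write_log);
        self.storage.extend(pending);
    }

    /// On a failed attempt, drop the pending changes (an assumption here).
    fn abort_block(&mut self) {
        self.write_log.clear();
    }
}

/// Run a sequence of block attempts and report the committed update count.
fn committed_updates(attempts: &[bool]) -> u64 {
    let mut node = Node::default();
    for &ok in attempts {
        node.update_conversions();
        if ok {
            node.commit_block();
        } else {
            node.abort_block();
        }
    }
    node.storage.get("conversion_updates").copied().unwrap_or(0)
}
```

Under these assumptions, `committed_updates(&[false, false, true])` and `committed_updates(&[true])` both yield a single update, so a node that lived through failed attempts and a node that replayed the chain from scratch arrive at the same state.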