Namada Testnet v0.13.0 Upgrade Postmortem
Namada’s height-activated protocol upgrade failed to activate on the testnet at block height #37370 (on the 18th of January 2023, between 19:00 and 21:00 UTC). This post is a diagnosis of what happened and the steps we plan to take to avoid this kind of issue in the future.
Transactions rejected by the validity predicates
The most recent Namada public testnet spawned on the 12th of January 2023 at 17:00 UTC following a decentralised genesis process, with Namada’s release version
The network was running reliably after genesis, but many community members reported issues with transactions. The core team investigated and traced the problem to two fields included in the genesis files, called
whitelist_tx. Transactions were accepted and added to the mempool, but once they were executed and included in blocks, the validity predicates rejected them.
These fields contain the hashes of the allowed transaction and validity predicate WASM files. The problem was that these two arrays stored the hashes in lowercase. Before including transactions, Namada checks that the hashes of the executed transaction and validity predicate are present in the whitelists. Specifically,
&tx_hash.to_string() return the uppercase hash, so the checks failed. The fix consisted of including the hashes in uppercase in the genesis file. In the code, the fix was to write the two arrays to storage in lowercase and to make the checks case-insensitive.
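To make the failure mode concrete, here is a minimal sketch of a case-insensitive whitelist check. It is not Namada's actual code: the function names and the use of a `HashSet<String>` are assumptions made for illustration, and the hashes in the usage note are made up.

```rust
use std::collections::HashSet;

/// Normalise a hex-encoded hash so that letter case never affects comparisons.
/// (Hypothetical helper, not taken from the Namada codebase.)
fn normalize_hash(hash: &str) -> String {
    hash.trim().to_ascii_lowercase()
}

/// Build the whitelist the way the fix describes: every entry stored in lowercase.
fn build_whitelist(genesis_hashes: &[&str]) -> HashSet<String> {
    genesis_hashes.iter().map(|h| normalize_hash(h)).collect()
}

/// Case-sensitive lookup, shown only to reproduce the failure mode.
fn is_whitelisted_naive(whitelist: &HashSet<String>, hash: &str) -> bool {
    whitelist.contains(hash)
}

/// Case-insensitive lookup: the candidate is normalised before comparing.
fn is_whitelisted(whitelist: &HashSet<String>, hash: &str) -> bool {
    whitelist.contains(&normalize_hash(hash))
}
```

With a whitelist built from `["ab12cd", "EF34"]`, the naive lookup rejects `"AB12CD"` while the normalised lookup accepts it, which mirrors the mismatch between lowercase genesis entries and uppercase hash strings described above.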
To deploy this fix, the network required an upgrade to the patched protocol version
v0.13.1. To give enough time to coordinate a decentralised upgrade without halting the network, the upgrade was scheduled to happen in the future at block height
Protocol release process issue
On the 17th of January, the Namada community was informed about the release and the instructions to upgrade the network with the fix. The issue was that there were two release versions published
v0.13.1-hardfork. Several clarifications were issued asking validators to upgrade their nodes to
v0.13.1-hardfork. Unfortunately, some validators upgraded to
v0.13.1, while others upgraded to
v0.13.1-hardfork, which resulted in two different state roots at block height
37370. The network forked into two, with neither fork having sufficient voting power (2/3) to make progress and continue producing blocks.
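The stall can be illustrated with a small sketch of the quorum arithmetic. It assumes the usual BFT rule that a fork needs strictly more than two thirds of the total voting power to commit blocks. The function names are hypothetical, and the numbers in the usage note are illustrative, not the testnet's real distribution.

```rust
/// True when `power` is strictly more than two thirds of `total`.
/// Integer arithmetic avoids floating-point rounding at the boundary.
fn has_quorum(power: u64, total: u128) -> bool {
    (power as u128) * 3 > total * 2
}

/// True when at least one fork holds enough voting power to keep producing blocks.
/// `fork_powers` lists the voting power that ended up on each fork.
fn can_make_progress(fork_powers: &[u64]) -> bool {
    // Sum in u128 so that large voting powers cannot overflow.
    let total: u128 = fork_powers.iter().map(|&p| p as u128).sum();
    fork_powers.iter().any(|&p| has_quorum(p, total))
}
```

For example, a 55/45 split leaves both forks below the threshold, so `can_make_progress(&[55, 45])` is false, whereas a 70/30 split would let the larger fork continue.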
During the investigation, the team also found that both protocol versions contained another bug that would’ve prevented nodes from syncing from scratch.
Fixed protocol version and deployment paths
A new protocol version containing the right code was created:
v0.13.2. After careful assessment, the team found two options for deploying it: upgrading or restarting the network.
Option 1) Recover the network with another upgrade
This path required coordinating with the validators in the community that were operating v0.13.1 so that they would upgrade to v0.13.2 (but not with the ones operating v0.13.1-hardfork). When possible, recovering the network is always the preferred option, but in this case it carried a lot of complexity and hence risk of failure, as it involved: creating a release version able to resync from 0 (2.5-3 hours, only relevant for nodes on v0.13.1) and then activate the hardfork; simulating a hardfork in a devnet to test the release; and communicating with all affected validators and waiting for them to upgrade to the correct version.
Option 2) Restarting the network
This option required releasing v0.13.2, testing the version in a devnet, and restarting the network with the same genesis validator set as on the 12th of January 2023, using the correct version of the protocol.
Given the coordination complexity and risks of option 1 in a decentralised network, the team proposed to proceed with restarting the network. To avoid the same issues going forward, the team has agreed to the following process improvements:
- Devnet (core internal testnets for QA) configurations will be as close to those of public testnets as possible, including genesis configurations. This would’ve helped catch the issue with the hash whitelist check.
- Sticking to one release only, which would’ve significantly decreased the risk of validators upgrading to different versions.
Join Namada's Discord server for feedback and questions.