Validators on Gonka testnet were dropping out of the epoch signing group during BLS DKG because gas cost grew with each new participant. Two fixes on the release/v0.2.12 branch, landing between 2026-04-16 and 2026-04-21, removed the growth on the hot path and preemptively closed four more places that would have broken once v0.2.12 turns on transaction fees.
Timeline (UTC)
- Pre-2026-04-16 — A testnet trace captured MsgSubmitGroupKeyValidationSignature failing with gasWanted: 10000000, gasUsed: 10240546. Validators who sent that transaction silently dropped from the new epoch.
- 2026-04-16 00:29 — PR #1070 opened. Moves dealer-part storage out of the shared struct into per-participant records.
- 2026-04-17 14:48 — PR #1070 merged. Dealer-part submissions now pay the same gas regardless of order.
- 2026-04-17 20:30 — PR #1088 opened. Applies the same pattern to four more places across BLS and the bridge.
- 2026-04-21 00:17 — PR #1088 merged. All five growing-gas paths closed.
What broke
BLS DKG (distributed key generation) is the process where epoch participants jointly build a shared signing key. In an 8-participant group, the 1st dealer paid roughly 187k gas for its share. The 8th paid roughly 10.4M gas, a roughly 56x difference that is consistent with Cosmos SDK per-byte write pricing.
Gonka estimates gas by simulating the transaction beforehand. Between simulation and inclusion in a block, more dealer submissions could land, growing the stored data that the transaction had to rewrite. Actual gas then exceeded the simulated estimate, and the transaction failed with an out-of-gas error.
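The gap between estimate and execution can be illustrated with a small model. This is a sketch under stated assumptions, not Gonka's estimator: the linear cost function, the figures in `main`, and the 1.3 safety multiplier are all hypothetical.

```go
package main

import "fmt"

// costAt is a hypothetical linear cost model: a base charge plus a per-entry
// charge for every dealer part already stored when the transaction runs.
func costAt(stored int, base, perEntry uint64) uint64 {
	return base + uint64(stored)*perEntry
}

// outOfGas reports whether a transaction simulated with simStored entries,
// and given a gas limit of estimate*multiplier, exceeds that limit when it
// actually executes with execStored entries in state.
func outOfGas(simStored, execStored int, base, perEntry uint64, multiplier float64) bool {
	limit := uint64(float64(costAt(simStored, base, perEntry)) * multiplier)
	return costAt(execStored, base, perEntry) > limit
}

func main() {
	const base, perEntry = 180_000, 1_400_000 // hypothetical figures

	// Simulated when 2 parts were stored, executed after 3 more landed.
	fmt.Println("growing cost, out of gas:", outOfGas(2, 5, base, perEntry, 1.3))

	// Same timing gap, but each submission costs the same regardless of state.
	fmt.Println("flat cost, out of gas:   ", outOfGas(2, 5, base, 0, 1.3))
}
```

In this model no fixed multiplier is safe while cost depends on state that other senders can grow; once the per-entry term is zero, any multiplier of 1.0 or more is enough.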
A dealer whose submission failed was left out of CalculateSlotsWithDealerParts, fell below the ITotalSlots/2 threshold in TransitionToVerifyingPhase, and was filtered out of the new epoch's signing group. From the outside it looked like validators were "dropping" after the upgrade. In reality, their in-flight transactions ran out of gas because state had grown between simulation and inclusion.
Root cause
SubmitDealerPart rewrote the entire EpochBLSData struct on every submission. The list of dealer parts lived inside that struct and grew with each new entry. Under byte-based gas pricing, every write paid for the whole growing list — so the Nth submission cost roughly N times the first.
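A minimal sketch of why the inline layout gets more expensive with every submission. The flat and per-byte constants are modeled on the Cosmos SDK default KV gas config (2000 flat, 30 per byte), which is an assumption about what the chain uses; the header and part sizes are hypothetical, and the model ignores key bytes and read costs.

```go
package main

import "fmt"

// Assumed write pricing, modeled on the Cosmos SDK default KV gas config.
// The chain may be configured with different values.
const (
	writeCostFlat    = 2000
	writeCostPerByte = 30
)

// writeGas returns the modeled gas charged for storing a value of n bytes.
func writeGas(n int) int {
	return writeCostFlat + writeCostPerByte*n
}

// inlineWriteGas models the pre-fix layout: the k-th submission rewrites a
// struct that holds a fixed header plus all k dealer parts stored so far.
func inlineWriteGas(k, headerBytes, partBytes int) int {
	return writeGas(headerBytes + k*partBytes)
}

func main() {
	const headerBytes, partBytes = 200, 40000 // hypothetical sizes

	for k := 1; k <= 8; k++ {
		fmt.Printf("submission %d: %d gas\n", k, inlineWriteGas(k, headerBytes, partBytes))
	}
}
```

The per-byte term dominates as soon as parts are large, so the write cost tracks the total size of the list rather than the size of the entry being added.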
The same pattern existed in four other places across the BLS and bridge modules: partial signatures for group-key validation, verification submissions, dealer complaints, threshold signing requests, and the validator list on bridge transactions. All of them were waiting to misbehave once transaction fees activate in v0.2.12.
Fix
PR #1070 moved dealer-part storage out of the shared struct into per-participant records keyed as {DealerPartPrefix}{epochID}/{participantIdx}. Writing one dealer part now costs the same gas no matter how many are already stored. All existing consumers — phase_transitions.go, dispute_resolution.go, msg_server_verifier.go, bls_crypto.go — still see the slice shape they always had, because GetEpochBLSData rebuilds it on read. Pre-upgrade inline entries stay valid as a baseline, and no migration step was required.
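The split layout and the rebuild-on-read step can be sketched as follows. This is an illustration only: the map stands in for the module's KV store, the prefix value and all function names are hypothetical, and the real GetEpochBLSData may assemble its result differently.

```go
package main

import "fmt"

// store is a stand-in for a chain KV store; a real keeper would go through
// the module's store rather than a Go map.
type store map[string][]byte

// dealerPartPrefix is a hypothetical value; the source only names the
// placeholder {DealerPartPrefix}.
const dealerPartPrefix = "dealer_part/"

// dealerPartKey builds a key shaped like {DealerPartPrefix}{epochID}/{participantIdx}.
func dealerPartKey(epochID uint64, participantIdx uint32) string {
	return fmt.Sprintf("%s%d/%d", dealerPartPrefix, epochID, participantIdx)
}

// setDealerPart writes one record and returns the bytes written (key plus
// value). That number does not depend on how many other parts are stored.
func setDealerPart(s store, epochID uint64, idx uint32, part []byte) int {
	key := dealerPartKey(epochID, idx)
	s[key] = part
	return len(key) + len(part)
}

// dealerParts rebuilds the index-ordered slice that existing consumers
// expect. Participants with no record are left as nil entries.
func dealerParts(s store, epochID uint64, participants uint32) [][]byte {
	parts := make([][]byte, participants)
	for i := uint32(0); i < participants; i++ {
		parts[i] = s[dealerPartKey(epochID, i)]
	}
	return parts
}

func main() {
	s := store{}
	setDealerPart(s, 7, 2, []byte("part-2"))
	setDealerPart(s, 7, 0, []byte("part-0"))

	for i, p := range dealerParts(s, 7, 3) {
		fmt.Printf("participant %d: %q\n", i, p)
	}
}
```

Reads now touch one record per participant, so the cost moves from every write to the less frequent read path, while callers keep working with a slice.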
PR #1088 applied the same split to the four remaining places in five self-reviewed commits, one fix per commit. For nodes already running v0.2.12-rc, in-flight pre-split state is migrated inside the v0.2.12 upgrade handler.
Lessons
- Byte-priced writes turn any inline growing list into an O(N) trap. When the data lives inside a single struct, each new entry makes every future write more expensive than the last. Simulation-based estimation does not catch it, because the gap between estimate and inclusion is exactly where the list grows.
- Gas regressions surface late. The 8-participant DKG only hit the ceiling because gasWanted was set at 10M; smaller groups passed quietly. A static gas ceiling picked from observed cost leaves no headroom for late-arriving transactions.
- Transaction fees amplify silent regressions. v0.2.12 introduces consensus fees, and the four other places with the same pattern would have started failing the moment fees activate. Preemptive splitting in PR #1088 avoided a second round of DKG dropouts on MainNet.
- AI-assisted review caught this class of bug twice in one week. PR #1070 and PR #1088 both credit Claude Opus 4.7 in their AI Usage notes; related PR #1087 credits ai-reviewer on gemini_high. Each commit ships tests that pin the invariant, so any regression fails CI.
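One way to pin the flat-cost invariant mentioned in the last lesson is a check that the Nth submission costs exactly what the first did. The helper below is a hypothetical sketch, not the tests shipped in the PRs; in a real test the `submit` callback would run the message handler and return the gas it consumed.

```go
package main

import (
	"errors"
	"fmt"
)

// checkFlatCost calls submit n times and returns an error if any call
// reports a different cost than the first one.
func checkFlatCost(n int, submit func(i int) uint64) error {
	if n <= 0 {
		return errors.New("n must be positive")
	}
	first := submit(0)
	for i := 1; i < n; i++ {
		if got := submit(i); got != first {
			return fmt.Errorf("submission %d cost %d, want %d", i, got, first)
		}
	}
	return nil
}

func main() {
	stored := uint64(0)

	// Inline layout: every call pays for everything stored so far.
	inline := func(int) uint64 { stored++; return stored * 1000 }

	// Split layout: every call pays for a single record.
	split := func(int) uint64 { return 1000 }

	fmt.Println("inline:", checkFlatCost(8, inline))
	fmt.Println("split: ", checkFlatCost(8, split))
}
```

A strict equality check is the simplest form; a real test may need a small tolerance if key lengths or encodings vary slightly between participants.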