discv4: Fix Kademlia crash when trying to sync #342

jlokier · 2021-04-02T14:48:19Z

Fixes #341, fixes status-im/nimbus-eth1#489.

When using discv4 (Kademlia) to find peers, there is a crash after a few minutes. It occurs for most of us on Eth1 mainnet, and everyone on Ropsten.

The cause is findNodes being called twice in succession to the same peer, within about 5 seconds of each other. ("About" 5 seconds, because Chronos does not guarantee to run the timeout branch at a particular time, due to queuing
and clock reading delays.)

Then findNodes sends a duplicate message to the peer and calls waitNeighbours to listen for the reply. There's already a waitNeighbours callback in a shared table, so that function hits an assert failure.

Ignoring the assert would be wrong as it would break timeout logic, and sending FindNodes twice in rapid succession also makes us a bad peer.

As a simple workaround, just skip findNodes in this state and return a fake empty Neighbours reply. This is a bit of a hack as findNodes should not be called like this; there's a logic error at a higher level. But it works.

Tested for about 4 days constant operation on Ropsten. The crash which occured every few minutes no longer occurs, and discv4 keeps working.

Fixes #341, status-im/nimbus-eth1#489. When using discv4 (Kademlia) to find peers, there is a crash after a few minutes. It occurs for most of us on Eth1 mainnet, and everyone on Ropsten. The cause is `findNodes` being called twice in succession to the same peer, within about 5 seconds of each other. ("About" 5 seconds, because Chronos does not guarantee to run the timeout branch at a particular time, due to queuing and clock reading delays.) Then `findNodes` sends a duplicate message to the peer and calls `waitNeighbours` to listen for the reply. There's already a `waitNeighbours` callback in a shared table, so that function hits an assert failure. Ignoring the assert would be wrong as it would break timeout logic, and sending `FindNodes` twice in rapid succession also makes us a bad peer. As a simple workaround, just skip `findNodes` in this state and return a fake empty `Neighbours` reply. This is a bit of a hack as `findNodes` should not be called like this; there's a logic error at a higher level. But it works. Tested for about 4 days constant operation on Ropsten. The crash which occured every few minutes no longer occurs, and discv4 keeps working. Signed-off-by: Jamie Lokier <[email protected]>

kdeme

With status-im/nimbus-eth1#576 working, this can be updated to head and then merged

kdeme reviewed Apr 2, 2021

View reviewed changes

jlokier force-pushed the kademlia-fix branch from d74b586 to 24813f9 Compare April 2, 2021 19:26

kdeme merged commit e4b4b7f into master Apr 2, 2021

kdeme deleted the kademlia-fix branch April 2, 2021 21:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

discv4: Fix Kademlia crash when trying to sync #342

discv4: Fix Kademlia crash when trying to sync #342

jlokier commented Apr 2, 2021

kdeme left a comment

discv4: Fix Kademlia crash when trying to sync #342

discv4: Fix Kademlia crash when trying to sync #342

Conversation

jlokier commented Apr 2, 2021

kdeme left a comment

Choose a reason for hiding this comment