Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

discv4: Fix Kademlia crash when trying to sync #342

Merged
merged 1 commit into from
Apr 2, 2021
Merged

Conversation

jlokier
Copy link
Contributor

@jlokier jlokier commented Apr 2, 2021

Fixes #341, fixes status-im/nimbus-eth1#489.

When using discv4 (Kademlia) to find peers, there is a crash after a few minutes. It occurs for most of us on Eth1 mainnet, and everyone on Ropsten.

The cause is findNodes being called twice in succession to the same peer, within about 5 seconds of each other. ("About" 5 seconds, because Chronos does not guarantee to run the timeout branch at a particular time, due to queuing
and clock reading delays.)

Then findNodes sends a duplicate message to the peer and calls waitNeighbours to listen for the reply. There's already a waitNeighbours callback in a shared table, so that function hits an assert failure.

Ignoring the assert would be wrong as it would break timeout logic, and sending FindNodes twice in rapid succession also makes us a bad peer.

As a simple workaround, just skip findNodes in this state and return a fake empty Neighbours reply. This is a bit of a hack as findNodes should not be called like this; there's a logic error at a higher level. But it works.

Tested for about 4 days constant operation on Ropsten. The crash which occured every few minutes no longer occurs, and discv4 keeps working.

Fixes #341, status-im/nimbus-eth1#489.

When using discv4 (Kademlia) to find peers, there is a crash after a few
minutes.  It occurs for most of us on Eth1 mainnet, and everyone on Ropsten.

The cause is `findNodes` being called twice in succession to the same peer,
within about 5 seconds of each other.  ("About" 5 seconds, because Chronos does
not guarantee to run the timeout branch at a particular time, due to queuing
and clock reading delays.)

Then `findNodes` sends a duplicate message to the peer and calls
`waitNeighbours` to listen for the reply.  There's already a `waitNeighbours`
callback in a shared table, so that function hits an assert failure.

Ignoring the assert would be wrong as it would break timeout logic, and sending
`FindNodes` twice in rapid succession also makes us a bad peer.

As a simple workaround, just skip `findNodes` in this state and return a fake
empty `Neighbours` reply.  This is a bit of a hack as `findNodes` should not be
called like this; there's a logic error at a higher level.  But it works.

Tested for about 4 days constant operation on Ropsten.  The crash which occured
every few minutes no longer occurs, and discv4 keeps working.

Signed-off-by: Jamie Lokier <[email protected]>
Copy link
Contributor

@kdeme kdeme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With status-im/nimbus-eth1#576 working, this can be updated to head and then merged

@kdeme kdeme merged commit e4b4b7f into master Apr 2, 2021
@kdeme kdeme deleted the kademlia-fix branch April 2, 2021 21:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

discv4: Kademlia crash when trying to sync Kademlia crash when trying to sync with testnet or mainnet
2 participants