-
Notifications
You must be signed in to change notification settings - Fork 107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Kademlia crash when trying to sync with testnet or mainnet #489
Comments
I encountered this recently and was told it's a known issue but rarely arises, hard to reproduce. But it's not hard to reproduce for me at the moment. I can't avoid it, on all my test systems, within a few minutes of starting:
|
I get a similar assertion failure, but the preceding cause may be different. The effect is the same, crash when syncing with Without
With trace (Linux server):
With trace (MacOS):
(A side question is why is the assertion failure reported differently between these two installations (MacOS and Linux), as they are both build from the same version of Nim in |
Just running 'build/nimbus --log-level:TRACE' under runit/daemontools and save the logs after each crash. System is a debian/buster on qemu: Every 2.0s: ls -l /etc/service/svc-eth1/root curd: Tue Mar 9 14:16:28 2021 total 2592 it crases regularly ... |
blame Nim stdlib, it's a known issue that on different platform, Nim stdlib sometimes raise different exception. Usually we will report an issue to Nim compiler repo about this inconsistency. But because we also stick to Nim "stable" release such as 1.2.6 or 1.2.10, this kind of inconsistency, well, as long as it doesn't cause any significant impact on our code, it can be put aside.
I remember now, it's hard to reproduce on mainnet or goerli net, but yes it occurs frequently with other testnet. |
Two of us are finding it crashes on mainnet as well, although it takes a longer. Still within, maybe 20 minutes. |
Fixes #341, status-im/nimbus-eth1#489. When using discv4 (Kademlia) to find peers, there is a crash after a few minutes. It occurs for most of us on Eth1 mainnet, and everyone on Ropsten. The cause is `findNodes` being called twice in succession to the same peer, within about 5 seconds of each other. ("About" 5 seconds, because Chronos does not guarantee to run the timeout branch at a particular time, due to queuing and clock reading delays.) Then `findNodes` sends a duplicate message to the peer and calls `waitNeighbours` to listen for the reply. There's already a `waitNeighbours` callback in a shared table, so that function hits an assert failure. Ignoring the assert would be wrong as it would break timeout logic, and sending `FindNodes` twice in rapid succession also makes us a bad peer. As a simple workaround, just skip `findNodes` in this state and return a fake empty `Neighbours` reply. This is a bit of a hack as `findNodes` should not be called like this; there's a logic error at a higher level. But it works. Tested for about 4 days constant operation on Ropsten. The crash which occured every few minutes no longer occurs, and discv4 keeps working. Signed-off-by: Jamie Lokier <[email protected]>
Fixes #341, status-im/nimbus-eth1#489. When using discv4 (Kademlia) to find peers, there is a crash after a few minutes. It occurs for most of us on Eth1 mainnet, and everyone on Ropsten. The cause is `findNodes` being called twice in succession to the same peer, within about 5 seconds of each other. ("About" 5 seconds, because Chronos does not guarantee to run the timeout branch at a particular time, due to queuing and clock reading delays.) Then `findNodes` sends a duplicate message to the peer and calls `waitNeighbours` to listen for the reply. There's already a `waitNeighbours` callback in a shared table, so that function hits an assert failure. Ignoring the assert would be wrong as it would break timeout logic, and sending `FindNodes` twice in rapid succession also makes us a bad peer. As a simple workaround, just skip `findNodes` in this state and return a fake empty `Neighbours` reply. This is a bit of a hack as `findNodes` should not be called like this; there's a logic error at a higher level. But it works. Tested for about 4 days constant operation on Ropsten. The crash which occured every few minutes no longer occurs, and discv4 keeps working. Signed-off-by: Jamie Lokier <[email protected]>
Fixes #341, status-im/nimbus-eth1#489. When using discv4 (Kademlia) to find peers, there is a crash after a few minutes. It occurs for most of us on Eth1 mainnet, and everyone on Ropsten. The cause is `findNodes` being called twice in succession to the same peer, within about 5 seconds of each other. ("About" 5 seconds, because Chronos does not guarantee to run the timeout branch at a particular time, due to queuing and clock reading delays.) Then `findNodes` sends a duplicate message to the peer and calls `waitNeighbours` to listen for the reply. There's already a `waitNeighbours` callback in a shared table, so that function hits an assert failure. Ignoring the assert would be wrong as it would break timeout logic, and sending `FindNodes` twice in rapid succession also makes us a bad peer. As a simple workaround, just skip `findNodes` in this state and return a fake empty `Neighbours` reply. This is a bit of a hack as `findNodes` should not be called like this; there's a logic error at a higher level. But it works. Tested for about 4 days constant operation on Ropsten. The crash which occured every few minutes no longer occurs, and discv4 keeps working. Signed-off-by: Jamie Lokier <[email protected]>
…mainnet (#489) Prior to this commit, when using discv4 (Kademlia) to find peers, there is a crash after a few minutes. It occurs for most of us on Eth1 mainnet, and everyone on Ropsten. Tested for about 4 days constant operation on Ropsten. The crash which occured every few minutes no longer occurs, and discv4 keeps working. Signed-off-by: Jamie Lokier <[email protected]>
…mainnet (#489) Prior to this commit, when using discv4 (Kademlia) to find peers, there is a crash after a few minutes. It occurs for most of us on Eth1 mainnet, and everyone on Ropsten. Tested for about 4 days constant operation on Ropsten. The crash which occured every few minutes no longer occurs, and discv4 keeps working. Signed-off-by: Jamie Lokier <[email protected]>
command line example
error message:
The text was updated successfully, but these errors were encountered: