
Kademlia crash when trying to sync with testnet or mainnet #489

Closed
jangko opened this issue Apr 14, 2020 · 5 comments · Fixed by status-im/nim-eth#342
Comments


jangko commented Apr 14, 2020

Command line examples:

build/nimbus --ropsten --dataDir:build/ropsten
build/nimbus --rinkeby --dataDir:build/rinkeby

error message:

WRN 2020-04-14 17:03:19+07:00 Trying to add ourselves to the routing table topics="kademlia" tid=17008 file=kademlia.nim:194 node=Node[127.0.0.1:30303]
Traceback (most recent call last, using override)
Stack trace override procedure not registered.
Error: unhandled exception: C:\Users\Jangko\.nimble\pkgs\eth-1.0.0\eth\p2p\kademlia.nim(289, 11) `
not contains(k.neighboursCallbacks, remote)`  [AssertionError]
  • this is an eth/kademlia bug

jlokier commented Mar 3, 2021

I encountered this recently and was told it's a known issue that rarely arises and is hard to reproduce. But it's not hard to reproduce for me at the moment: I can't avoid it, on all my test systems, within a few minutes of starting:

  • On a laptop behind 4G, running macOS
  • On a well-connected, fast data centre server running Linux


jlokier commented Mar 3, 2021

I get a similar assertion failure, but the preceding cause may be different. The effect is the same: a crash when syncing with --testnet.

Without --log-level:TRACE (on a Linux server):

./build/nimbus --testnet
Traceback (most recent call last, using override)
/home/jamie/Status/nimbus-eth1/nimbus/nimbus.nim(206) nimbus
/home/jamie/Status/nimbus-eth1/nimbus/nimbus.nim(174) process
/home/jamie/Status/nimbus-eth1/vendor/nimbus-build-system/vendor/Nim/lib/system/excpt.nim(407) reportUnhandledError
/home/jamie/Status/nimbus-eth1/vendor/nimbus-build-system/vendor/Nim/lib/system/excpt.nim(358) reportUnhandledErrorAux
Error: unhandled exception: /home/jamie/Status/nimbus-eth1/vendor/nim-eth/eth/p2p/kademlia.nim(293, 11) `
not contains(k.neighboursCallbacks, remote)`  [AssertionError]

With trace (Linux server):

TRC 2021-03-03 16:47:56.115+00:00 Bonding to peer                            topics="discovery" tid=6599 file=kademlia.nim:329 n=Node[<someone's IP>:30303]
TRC 2021-03-03 16:47:56.115+00:00 Bonded with candidates                     topics="discovery" tid=6599 file=kademlia.nim:391 count=16
TRC 2021-03-03 16:47:56.296+00:00 Discovered nodes                           topics="discovery" tid=6599 file=discovery.nim:293 nodes=273
TRC 2021-03-03 16:47:56.297+00:00 Starting lookup; initial neighbours:       topics="discovery" tid=6599 file=kademlia.nim:399 closest="@[Node[<someone's IP>:30303], Node[<someone's IP>:30303], Node[<someone's IP>:30303], Node[<someone's IP>:30303], Node[<someone's IP>:30303], Node[<someone's IP>:30303], Node[<someone's IP>:30303], Node[<someone's IP>:30303], Node[<someone's IP>:30303], Node[<someone's IP>:30303], Node[<someone's IP>:30303], Node[<someone's IP>:20638], Node[<someone's IP>:20630], Node[<someone's IP>:57955], Node[<someone's IP>:30303], Node[<someone's IP>:30303]]"
TRC 2021-03-03 16:47:56.297+00:00 Node lookup; querying                      topics="discovery" tid=6599 file=kademlia.nim:402 nodesToAsk="@[Node[<someone's IP>:30303], Node[<someone's IP>:30303], Node[<someone's IP>:30303]]"
TRC 2021-03-03 16:47:56.297+00:00 >>> find_node to                           topics="discovery" tid=6599 file=discovery.nim:132 n=Node[<someone's IP>:30303]
TRC 2021-03-03 16:47:56.297+00:00 >>> find_node to                           topics="discovery" tid=6599 file=discovery.nim:132 n=Node[<someone's IP>:30303]
TRC 2021-03-03 16:47:56.297+00:00 >>> find_node to                           topics="discovery" tid=6599 file=discovery.nim:132 n=Node[<someone's IP>:30303]
Traceback (most recent call last, using override)
/home/jamie/Status/nimbus-eth1/nimbus/nimbus.nim(206) nimbus
/home/jamie/Status/nimbus-eth1/nimbus/nimbus.nim(174) process
/home/jamie/Status/nimbus-eth1/vendor/nimbus-build-system/vendor/Nim/lib/system/excpt.nim(407) reportUnhandledError
/home/jamie/Status/nimbus-eth1/vendor/nimbus-build-system/vendor/Nim/lib/system/excpt.nim(358) reportUnhandledErrorAux
Error: unhandled exception: /home/jamie/Status/nimbus-eth1/vendor/nim-eth/eth/p2p/kademlia.nim(293, 11) `
not contains(k.neighboursCallbacks, remote)`  [AssertionError]

With trace (MacOS):

...
TRC 2021-03-03 14:25:06.275+00:00 Bonding to peer                            topics="discovery" tid=9250110 file=kademlia.nim:329 n=Node[<someone's IP>:30303]
TRC 2021-03-03 14:25:06.275+00:00 Bonding to peer                            topics="discovery" tid=9250110 file=kademlia.nim:329 n=Node[<someone's IP>:30303]
TRC 2021-03-03 14:25:06.275+00:00 Bonding to peer                            topics="discovery" tid=9250110 file=kademlia.nim:329 n=Node[<someone's IP>:30303]
TRC 2021-03-03 14:25:06.275+00:00 Bonded with candidates                     topics="discovery" tid=9250110 file=kademlia.nim:391 count=2
TRC 2021-03-03 14:25:06.275+00:00 Node lookup; querying                      topics="discovery" tid=9250110 file=kademlia.nim:402 nodesToAsk="@[Node[<someone's IP>:30303], Node[<someone's IP>:30303], Node[<someone's IP>:20202]]"
TRC 2021-03-03 14:25:06.275+00:00 >>> find_node to                           topics="discovery" tid=9250110 file=discovery.nim:132 n=Node[<someone's IP>:30303]
/Users/jamie/Work/Status/nimbus-eth1/vendor/nimbus-build-system/vendor/Nim/lib/system/fatal.nim(49) sysFatal
Error: unhandled exception: /Users/jamie/Work/Status/nimbus-eth1/vendor/nim-eth/eth/p2p/kademlia.nim(293, 11) `
not contains(k.neighboursCallbacks, remote)`  [AssertionError]

(A side question is why the assertion failure is reported differently between these two installations (macOS and Linux), as they are both built from the same version of Nim in nimbus-build-system, Nim 1.2.6, same git hash bf320ed172f74f60fd274338e82bdc9ce3520dd9.)


mjfh commented Mar 9, 2021

Just running 'build/nimbus --log-level:TRACE' under runit/daemontools and saving the logs after each crash. The system is Debian buster on qemu:

Every 2.0s: ls -l /etc/service/svc-eth1/root curd: Tue Mar 9 14:16:28 2021

total 2592
-rw-r--r-- 1 root root 240960 Mar 9 12:33 20210309123339-session.xz
-rw-r--r-- 1 root root 214684 Mar 9 12:35 20210309123542-session.xz
-rw-r--r-- 1 root root 208796 Mar 9 12:37 20210309123703-session.xz
-rw-r--r-- 1 root root 241628 Mar 9 12:39 20210309123917-session.xz
-rw-r--r-- 1 root root 313912 Mar 9 12:48 20210309124817-session.xz
-rw-r--r-- 1 root root 247756 Mar 9 13:09 20210309130923-session.xz
-rw-r--r-- 1 root root 195280 Mar 9 13:13 20210309131310-session.xz
-rw-r--r-- 1 root root 236720 Mar 9 13:15 20210309131515-session.xz
-rw-r--r-- 1 root root 249456 Mar 9 13:28 20210309132853-session.xz
-rw-r--r-- 1 root root 239276 Mar 9 13:31 20210309133157-session.xz
-rw-r--r-- 1 root root 252916 Mar 9 13:33 20210309133355-session.xz

It crashes regularly ...
jordan


jangko commented Mar 10, 2021

The effect is the same, crash when syncing with --testnet.

--testnet is an alias for --ropsten; expect it to behave identically to when we use --ropsten.

A side question is why is the assertion failure reported differently between these two installations (MacOS and Linux)

Blame the Nim stdlib; it's a known issue that on different platforms the Nim stdlib sometimes raises different exceptions. Usually we report this kind of inconsistency as an issue on the Nim compiler repo. But because we also stick to Nim "stable" releases such as 1.2.6 or 1.2.10, this kind of inconsistency can be put aside, as long as it doesn't have a significant impact on our code.

I encountered this recently and was told it's a known issue but rarely arises, hard to reproduce

I remember now: it's hard to reproduce on mainnet or the Goerli net, but yes, it occurs frequently with other testnets.

jlokier changed the title from "crash when trying to sync with testnet" to "Kademlia crash when trying to sync with testnet or mainnet" on Mar 17, 2021

jlokier commented Mar 17, 2021

Two of us are finding it crashes on mainnet as well, although it takes longer. Still within maybe 20 minutes.
So I've updated the title to reflect this.

jlokier added a commit to status-im/nim-eth that referenced this issue Apr 2, 2021
Fixes #341, status-im/nimbus-eth1#489.

When using discv4 (Kademlia) to find peers, there is a crash after a few
minutes.  It occurs for most of us on Eth1 mainnet, and everyone on Ropsten.

The cause is `findNodes` being called twice in succession to the same peer,
within about 5 seconds of each other.  ("About" 5 seconds, because Chronos does
not guarantee to run the timeout branch at a particular time, due to queuing
and clock reading delays.)

Then `findNodes` sends a duplicate message to the peer and calls
`waitNeighbours` to listen for the reply.  There's already a `waitNeighbours`
callback in a shared table, so that function hits an assert failure.

Ignoring the assert would be wrong as it would break timeout logic, and sending
`FindNodes` twice in rapid succession also makes us a bad peer.

As a simple workaround, just skip `findNodes` in this state and return a fake
empty `Neighbours` reply.  This is a bit of a hack as `findNodes` should not be
called like this; there's a logic error at a higher level.  But it works.

Tested for about 4 days of constant operation on Ropsten.  The crash which occurred
every few minutes no longer occurs, and discv4 keeps working.

Signed-off-by: Jamie Lokier <[email protected]>
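The logic described in the commit message can be sketched as follows. This is a minimal illustrative model in Python, not the actual nim-eth Nim code: a table of pending neighbours callbacks keyed by peer, where a duplicate findNodes to a peer whose callback is still registered returns a fake empty Neighbours reply instead of tripping the `not contains(k.neighboursCallbacks, remote)` assertion. The class and method names here are hypothetical.

```python
class KademliaSketch:
    """Illustrative model of the waitNeighbours callback table and the
    workaround: skip a duplicate findNodes while a query is in flight."""

    def __init__(self):
        # remote peer -> callback awaiting that peer's Neighbours reply
        # (models k.neighboursCallbacks in nim-eth's kademlia.nim)
        self.neighbours_callbacks = {}

    def find_nodes(self, remote):
        if remote in self.neighbours_callbacks:
            # Old behaviour: an assert crashed here. Workaround: do not send
            # a duplicate FindNodes; hand back a fake empty Neighbours reply.
            return []
        results = []
        self.neighbours_callbacks[remote] = results.append
        # (the real code would send FindNodes on the wire here and await
        # replies or a timeout that removes the callback)
        return results

    def on_neighbours(self, remote, nodes):
        # A Neighbours reply arrived: deliver it and clear the pending entry.
        cb = self.neighbours_callbacks.pop(remote, None)
        if cb is not None:
            for n in nodes:
                cb(n)
```

In this model, the second call returns an empty list without registering a second callback, so the first query's timeout bookkeeping stays intact, matching the "skip findNodes and return a fake empty Neighbours reply" workaround described above.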
jlokier added a commit to status-im/nim-eth that referenced this issue Apr 2, 2021
kdeme pushed a commit to status-im/nim-eth that referenced this issue Apr 2, 2021
jlokier added a commit that referenced this issue Apr 7, 2021
…mainnet (#489)

Prior to this commit, when using discv4 (Kademlia) to find peers, there is a
crash after a few minutes.  It occurs for most of us on Eth1 mainnet, and
everyone on Ropsten.

Tested for about 4 days of constant operation on Ropsten.  The crash which occurred
every few minutes no longer occurs, and discv4 keeps working.

Signed-off-by: Jamie Lokier <[email protected]>
jlokier added a commit that referenced this issue Apr 7, 2021