
Kademlia crash when trying to sync with testnet or mainnet #489

Closed
jangko opened this issue Apr 14, 2020 · 5 comments · Fixed by status-im/nim-eth#342
Comments


jangko commented Apr 14, 2020

Command line examples:

build/nimbus --ropsten --dataDir:build/ropsten
build/nimbus --rinkeby --dataDir:build/rinkeby

error message:

WRN 2020-04-14 17:03:19+07:00 Trying to add ourselves to the routing table topics="kademlia" tid=17008 file=kademlia.nim:194 node=Node[127.0.0.1:30303]
Traceback (most recent call last, using override)
Stack trace override procedure not registered.
Error: unhandled exception: C:\Users\Jangko\.nimble\pkgs\eth-1.0.0\eth\p2p\kademlia.nim(289, 11) `
not contains(k.neighboursCallbacks, remote)`  [AssertionError]
  • this is an eth/kademlia bug

jlokier commented Mar 3, 2021

I encountered this recently and was told it's a known issue that rarely arises and is hard to reproduce. But it's not hard to reproduce for me at the moment: I can't avoid it, on all my test systems, within a few minutes of starting:

  • On a laptop behind 4G, running macOS
  • On a well-connected, fast data centre server running Linux


jlokier commented Mar 3, 2021

I get a similar assertion failure, but the preceding cause may be different. The effect is the same: a crash when syncing with --testnet.

Without --log-level:TRACE (on a Linux server):

./build/nimbus --testnet
Traceback (most recent call last, using override)
/home/jamie/Status/nimbus-eth1/nimbus/nimbus.nim(206) nimbus
/home/jamie/Status/nimbus-eth1/nimbus/nimbus.nim(174) process
/home/jamie/Status/nimbus-eth1/vendor/nimbus-build-system/vendor/Nim/lib/system/excpt.nim(407) reportUnhandledError
/home/jamie/Status/nimbus-eth1/vendor/nimbus-build-system/vendor/Nim/lib/system/excpt.nim(358) reportUnhandledErrorAux
Error: unhandled exception: /home/jamie/Status/nimbus-eth1/vendor/nim-eth/eth/p2p/kademlia.nim(293, 11) `
not contains(k.neighboursCallbacks, remote)`  [AssertionError]

With trace (Linux server):

TRC 2021-03-03 16:47:56.115+00:00 Bonding to peer                            topics="discovery" tid=6599 file=kademlia.nim:329 n=Node[<someone's IP>:30303]
TRC 2021-03-03 16:47:56.115+00:00 Bonded with candidates                     topics="discovery" tid=6599 file=kademlia.nim:391 count=16
TRC 2021-03-03 16:47:56.296+00:00 Discovered nodes                           topics="discovery" tid=6599 file=discovery.nim:293 nodes=273
TRC 2021-03-03 16:47:56.297+00:00 Starting lookup; initial neighbours:       topics="discovery" tid=6599 file=kademlia.nim:399 closest="@[Node[<someone's IP>:30303], Node[<someone's IP>:30303], Node[<someone's IP>:30303], Node[<someone's IP>:30303], Node[<someone's IP>:30303], Node[<someone's IP>:30303], Node[<someone's IP>:30303], Node[<someone's IP>:30303], Node[<someone's IP>:30303], Node[<someone's IP>:30303], Node[<someone's IP>:30303], Node[<someone's IP>:20638], Node[<someone's IP>:20630], Node[<someone's IP>:57955], Node[<someone's IP>:30303], Node[<someone's IP>:30303]]"
TRC 2021-03-03 16:47:56.297+00:00 Node lookup; querying                      topics="discovery" tid=6599 file=kademlia.nim:402 nodesToAsk="@[Node[<someone's IP>:30303], Node[<someone's IP>:30303], Node[<someone's IP>:30303]]"
TRC 2021-03-03 16:47:56.297+00:00 >>> find_node to                           topics="discovery" tid=6599 file=discovery.nim:132 n=Node[<someone's IP>:30303]
TRC 2021-03-03 16:47:56.297+00:00 >>> find_node to                           topics="discovery" tid=6599 file=discovery.nim:132 n=Node[<someone's IP>:30303]
TRC 2021-03-03 16:47:56.297+00:00 >>> find_node to                           topics="discovery" tid=6599 file=discovery.nim:132 n=Node[<someone's IP>:30303]
Traceback (most recent call last, using override)
/home/jamie/Status/nimbus-eth1/nimbus/nimbus.nim(206) nimbus
/home/jamie/Status/nimbus-eth1/nimbus/nimbus.nim(174) process
/home/jamie/Status/nimbus-eth1/vendor/nimbus-build-system/vendor/Nim/lib/system/excpt.nim(407) reportUnhandledError
/home/jamie/Status/nimbus-eth1/vendor/nimbus-build-system/vendor/Nim/lib/system/excpt.nim(358) reportUnhandledErrorAux
Error: unhandled exception: /home/jamie/Status/nimbus-eth1/vendor/nim-eth/eth/p2p/kademlia.nim(293, 11) `
not contains(k.neighboursCallbacks, remote)`  [AssertionError]

With trace (MacOS):

...
TRC 2021-03-03 14:25:06.275+00:00 Bonding to peer                            topics="discovery" tid=9250110 file=kademlia.nim:329 n=Node[<someone's IP>:30303]
TRC 2021-03-03 14:25:06.275+00:00 Bonding to peer                            topics="discovery" tid=9250110 file=kademlia.nim:329 n=Node[<someone's IP>:30303]
TRC 2021-03-03 14:25:06.275+00:00 Bonding to peer                            topics="discovery" tid=9250110 file=kademlia.nim:329 n=Node[<someone's IP>:30303]
TRC 2021-03-03 14:25:06.275+00:00 Bonded with candidates                     topics="discovery" tid=9250110 file=kademlia.nim:391 count=2
TRC 2021-03-03 14:25:06.275+00:00 Node lookup; querying                      topics="discovery" tid=9250110 file=kademlia.nim:402 nodesToAsk="@[Node[<someone's IP>:30303], Node[<someone's IP>:30303], Node[<someone's IP>:20202]]"
TRC 2021-03-03 14:25:06.275+00:00 >>> find_node to                           topics="discovery" tid=9250110 file=discovery.nim:132 n=Node[<someone's IP>:30303]
/Users/jamie/Work/Status/nimbus-eth1/vendor/nimbus-build-system/vendor/Nim/lib/system/fatal.nim(49) sysFatal
Error: unhandled exception: /Users/jamie/Work/Status/nimbus-eth1/vendor/nim-eth/eth/p2p/kademlia.nim(293, 11) `
not contains(k.neighboursCallbacks, remote)`  [AssertionError]

(A side question is why the assertion failure is reported differently between these two installations (macOS and Linux), as they are both built from the same version of Nim in nimbus-build-system, Nim 1.2.6, same git hash bf320ed172f74f60fd274338e82bdc9ce3520dd9.)


mjfh commented Mar 9, 2021

Just running 'build/nimbus --log-level:TRACE' under runit/daemontools and saving the logs after each crash. The system is Debian buster on qemu:

Every 2.0s: ls -l /etc/service/svc-eth1/root curd: Tue Mar 9 14:16:28 2021

total 2592
-rw-r--r-- 1 root root 240960 Mar 9 12:33 20210309123339-session.xz
-rw-r--r-- 1 root root 214684 Mar 9 12:35 20210309123542-session.xz
-rw-r--r-- 1 root root 208796 Mar 9 12:37 20210309123703-session.xz
-rw-r--r-- 1 root root 241628 Mar 9 12:39 20210309123917-session.xz
-rw-r--r-- 1 root root 313912 Mar 9 12:48 20210309124817-session.xz
-rw-r--r-- 1 root root 247756 Mar 9 13:09 20210309130923-session.xz
-rw-r--r-- 1 root root 195280 Mar 9 13:13 20210309131310-session.xz
-rw-r--r-- 1 root root 236720 Mar 9 13:15 20210309131515-session.xz
-rw-r--r-- 1 root root 249456 Mar 9 13:28 20210309132853-session.xz
-rw-r--r-- 1 root root 239276 Mar 9 13:31 20210309133157-session.xz
-rw-r--r-- 1 root root 252916 Mar 9 13:33 20210309133355-session.xz

It crashes regularly ...
jordan


jangko commented Mar 10, 2021

The effect is the same, crash when syncing with --testnet.

--testnet is an alias for --ropsten; expect it to behave identically to when we use --ropsten.

A side question is why is the assertion failure reported differently between these two installations (MacOS and Linux)

Blame the Nim stdlib; it's a known issue that on different platforms the Nim stdlib sometimes raises different exceptions. Usually we report this kind of inconsistency as an issue on the Nim compiler repo. But because we also stick to Nim "stable" releases such as 1.2.6 or 1.2.10, this kind of inconsistency can be put aside, as long as it doesn't have a significant impact on our code.

I encountered this recently and was told it's a known issue but rarely arises, hard to reproduce

I remember now: it's hard to reproduce on mainnet or the Goerli net, but yes, it occurs frequently with other testnets.

jlokier changed the title from "crash when trying to sync with testnet" to "Kademlia crash when trying to sync with testnet or mainnet" on Mar 17, 2021

jlokier commented Mar 17, 2021

Two of us are finding it crashes on mainnet as well, although it takes longer. Still within maybe 20 minutes.
So I've updated the title to reflect this.

jlokier added a commit to status-im/nim-eth that referenced this issue Apr 2, 2021
Fixes #341, status-im/nimbus-eth1#489.

When using discv4 (Kademlia) to find peers, there is a crash after a few
minutes.  It occurs for most of us on Eth1 mainnet, and everyone on Ropsten.

The cause is `findNodes` being called twice in succession to the same peer,
within about 5 seconds of each other.  ("About" 5 seconds, because Chronos does
not guarantee to run the timeout branch at a particular time, due to queuing
and clock reading delays.)

Then `findNodes` sends a duplicate message to the peer and calls
`waitNeighbours` to listen for the reply.  There's already a `waitNeighbours`
callback in a shared table, so that function hits an assert failure.

Ignoring the assert would be wrong as it would break timeout logic, and sending
`FindNodes` twice in rapid succession also makes us a bad peer.

As a simple workaround, just skip `findNodes` in this state and return a fake
empty `Neighbours` reply.  This is a bit of a hack as `findNodes` should not be
called like this; there's a logic error at a higher level.  But it works.

Tested for about 4 days of constant operation on Ropsten.  The crash which occurred
every few minutes no longer occurs, and discv4 keeps working.

Signed-off-by: Jamie Lokier <[email protected]>
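The logic described in the commit message can be sketched as follows. This is a minimal illustrative model in Python, not the actual nim-eth Nim code: a table of pending neighbours callbacks keyed by peer, where a duplicate findNodes to a peer whose callback is still registered returns a fake empty Neighbours reply instead of tripping the `not contains(k.neighboursCallbacks, remote)` assertion. The class and method names here are hypothetical.

```python
class KademliaSketch:
    """Illustrative model of the waitNeighbours callback table and the
    workaround: skip a duplicate findNodes while a query is in flight."""

    def __init__(self):
        # remote peer -> callback awaiting that peer's Neighbours reply
        # (models k.neighboursCallbacks in nim-eth's kademlia.nim)
        self.neighbours_callbacks = {}

    def find_nodes(self, remote):
        if remote in self.neighbours_callbacks:
            # Old behaviour: an assert crashed here. Workaround: do not send
            # a duplicate FindNodes; hand back a fake empty Neighbours reply.
            return []
        results = []
        self.neighbours_callbacks[remote] = results.append
        # (the real code would send FindNodes on the wire here and await
        # replies or a timeout that removes the callback)
        return results

    def on_neighbours(self, remote, nodes):
        # A Neighbours reply arrived: deliver it and clear the pending entry.
        cb = self.neighbours_callbacks.pop(remote, None)
        if cb is not None:
            for n in nodes:
                cb(n)
```

In this model, the second call returns an empty list without registering a second callback, so the first query's timeout bookkeeping stays intact, matching the "skip findNodes and return a fake empty Neighbours reply" workaround described above.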
jlokier added a commit to status-im/nim-eth that referenced this issue Apr 2, 2021
kdeme pushed a commit to status-im/nim-eth that referenced this issue Apr 2, 2021
jlokier added a commit that referenced this issue Apr 7, 2021
…mainnet (#489)

Prior to this commit, when using discv4 (Kademlia) to find peers, there is a
crash after a few minutes.  It occurs for most of us on Eth1 mainnet, and
everyone on Ropsten.

Tested for about 4 days of constant operation on Ropsten.  The crash which occurred
every few minutes no longer occurs, and discv4 keeps working.

Signed-off-by: Jamie Lokier <[email protected]>
jlokier added a commit that referenced this issue Apr 7, 2021