Answering: "What happens today when we run nimbus --goerli?" #862

Closed

jlokier opened this issue Oct 15, 2021 · 2 comments
Labels: Networking, Security (Security vulnerability or security related), Sync (Prevents or affects sync with Ethereum network)

jlokier commented Oct 15, 2021

This simple question was getting speculative answers, but for one reason or another, nobody had actually tested it in a long time.

This meta issue records what really happens when we run nimbus --goerli, with no other changes, to surface all the significant issues. See also related issue #687 "Sync to Goerli". Issues that come up are to be filed separately and fixed one by one, outside this meta issue. Ideally we will file individual issues and fixes, and update this meta issue to point to them.

Time and space required

It has proven useful to have a guideline for how much time and storage to expect, so a Goerli sync can be replicated without the tedium of trial and error, disk-full recovery efforts, etc.

Base version tested   521f29c (2021-08-24 18:30:52 +0700)
                      (Later commits are required to complete; see issues below)
Time to sync          2 days 17 hours
Storage space used    805 GB (up to block number 5631351)
Test CPU              AMD Ryzen 9 5950X 3.5 GHz
Test storage          3x RAID-0 NVMe SSD, 512k stripe size
Test network          1 Gbit/s internet, no NAT

To finish in a similar time, you will need to run nimbus --goerli in a loop that auto-restarts it when it crashes (a sketch follows below), and have enough storage space. The time shown above does not count stops during the test where Nimbus crashed and was later restarted after analysis, time to recover from disk-full conditions, or time spent syncing that was reverted to a clean storage snapshot. (True calendar time for this test was 28 days 18 hours.)
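
For reference, a minimal sketch of the kind of auto-restart loop used during the test. It is an illustration only; the binary path is an assumption (adjust it for your build), and any supervisor (systemd, a shell while loop, etc.) works just as well.

    #!/usr/bin/env python3
    # Illustrative auto-restart wrapper; the nimbus path below is an assumption.
    import subprocess
    import time

    NIMBUS = "./build/nimbus"   # assumed location of the nimbus-eth1 binary
    ARGS = [NIMBUS, "--goerli"]

    while True:
        # Run nimbus until it exits; during this test it only exited on crashes.
        result = subprocess.run(ARGS)
        print(f"nimbus exited with code {result.returncode}; restarting in 5 seconds")
        time.sleep(5)           # brief pause so a tight crash loop doesn't spin hot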

What happened

Short version:

  1. There were a number of crash types, for which the workaround is to restart nimbus in a loop.
  2. There were 4 points where sync stopped progressing. Two networking issues and two consensus logic issues.
  3. When it reaches the head block, state is fully synchronised, but after that it does not keep up in real time.

Points where bulk sync stopped

The issues in this section needed fixes before sync could reach the Goerli head block in October 2021. All are fixed now.
Note: Berlin starts at block 4460644, London starts at block 5062605.

  • Progress slowed dramatically after the start of the Berlin fork, block 4460644.

    • Most peers dropped connections after Nimbus sent the first GetBlockHeaders, but only after the Nimbus sync had reached Berlin.
    • This turned out to be due to a counter-intuitive effect of EIP-2124: "Fork identifier for chain compatibility checks" (see the fork-ID sketch after this list).
    • Nimbus was configured with the Berlin fork block but not London.
    • We didn't expect that to be relevant to syncing within the Berlin fork.
    • Somewhat counter-intuitively, the rules of EIP-2124 tell peers to drop an Eth node when it reaches the start of the Berlin fork if that node doesn't know about the London fork, but only once network consensus has passed London, which it had at the time of testing.
    • This is a little unexpected because (a) syncing into Berlin was fine before, and (b) a node that has already synced and is just keeping up will carry on until the start of London, when it stops updating.
    • This issue appeared to halt progress, but it was actually only slowing it down, a lot: peers that only know about Berlin still share their state.
    • The fix was easy: add the London fork block number for Goerli.
    • Fixed in 69f2a0f "config: replace stdlib parseOpt with nim-confutils".
  • Progress stopped at block 4463617. This block number was due to a consensus bug at block 4463713 (see next), combined with the batching logic in blockchain_sync, which processes 192 blocks at a time and aborts the whole batch when any block fails (see the batch sketch after this list).

  • Consensus bug at block 4463713. This was an incorrect gas calculation in the EVM, but it manifested as a mismatching receiptRoot because it didn't change the state, only the receipt.

    • Symptom:
      DBG 2021-09-21 12:37:36.427+07:00 wrong receiptRoot in block                 tid=12308 blockNumber=4463713 actual=FBAFEA49FD186AFE450FEC9DDD1E8CD19A6511C946C280423A79F1A5034A6B6C expected=382CC45D2242719F9F612409324273EC42140174F38E0BC9E38022F8326E84CD
      TRC 2021-09-21 12:37:36.427+07:00 peer disconnected                          tid=12308 file=blockchain_sync.nim:407 peer=<IP:PORT>
      
    • More explanation of how this only affected receiptRoot starting from Sync to Goerli #687 (comment).
    • Fixed in 908dc21 "evm: fixes EIP2929 opcodes" (evm: fixes EIP2929 opcodes #842).
  • Progress stopped at block 4494913. This was due to a sync networking bug in blockchain_sync, issuing requests for N block bodies and incorrectly handling replies which contained fewer than N bodies.

    • The effect was to repeatedly download a large amount of data without making progress. The Nimbus code did multiple requests in a batch, accepted the individual replies, then aborted the whole batch to fetch it again.
    • Aborted and restarted batches occurred with earlier blocks too, but this became more pronounced until there were no suitable peers at batch 4494913..4495104. (The number to look for in progress records is 4494721, as that's the last logged "Persisting" start value when stuck.)
    • Filed as issue Sync progress stops at Goerli block 4494913 #864 "Sync progress stops at Goerli block 4494913".
    • Fixed in PR Sync fix: GetBlockBodies logic preventing sync, dropping peers #865 "Sync fix: GetBlockBodies logic preventing sync, dropping peers".
  • Progress stopped at block 5080897. This block number was due to a consensus bug at block 5080941 (see next), combined with the same 192-block batching logic in blockchain_sync described above.

  • Consensus bug at block 5080941. This was incorrect return data when a CREATE or CREATE2 operation tried to make a contract larger than 24 kiB and the caller looked at the RETURNDATA or RETURNDATASIZE.

    • Symptom:
      TRC 2021-10-01 21:18:12.883+01:00 Persisting blocks                          tid=902365 file=persist_blocks.nim:43 fromBlock=5080897 toBlock=5081088
      ...
      DBG 2021-10-01 21:18:13.270+01:00 Contract code size exceeds EIP170          topics="vm computation" tid=902365 file=computation.nim:236 limit=24577 actual=31411
      DBG 2021-10-01 21:18:13.271+01:00 gasUsed neq cumulativeGasUsed              tid=902365 file=process_block.nim:68 block=5080941/0A3537BC5BDFC637349E1C77D9648F2F65E2BF973ABF7956618F854B769DF626 gasUsed=3129669 cumulativeGasUsed=3132615
      TRC 2021-10-01 21:18:13.271+01:00 peer disconnected                          tid=902365 file=blockchain_sync.nim:407 peer=<IP:PORT>
      
    • The 24 kiB contract size limit was exceeded as early as block 4394113 (or 4394114, logs are ambiguous), but that's fine. The bug was in RETURNDATA when this happens.
    • This issue is not related to London fork semantics. The fact it occurs for the first time in the London range is probably a coincidence.
    • More explanation starting from Sync to Goerli #687 (comment).
    • Filed as issue EIP-170 related consensus error at Goerli block 5080941 #867 "EIP-170 related consensus error at Goerli block 5080941".
    • Two fixes have been tested but are yet to be committed.
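
Two sketches to make the above concrete.

First, the EIP-2124 fork identifier. It is essentially a CRC32 checksum over the genesis hash followed by each already-passed fork block number (8-byte big-endian), plus the block number of the next known fork. The sketch below assumes the Goerli genesis hash and the Istanbul/Berlin/London activation blocks mentioned above; it shows why a node configured without London stops advertising a "next fork" once it passes Berlin, so peers that have already passed London no longer accept it under the EIP-2124 subset rule and disconnect.

    import struct
    import zlib

    # Assumed Goerli parameters (genesis hash, and fork blocks not at genesis).
    GOERLI_GENESIS = bytes.fromhex(
        "bf7e331f7f7c1dd2e05159666b3bf8bc7a8a3a9eb1d518969eab529dd9b88c1a")
    ISTANBUL, BERLIN, LONDON = 1561651, 4460644, 5062605

    def fork_id(head: int, known_forks: list[int]) -> tuple[int, int]:
        """Return (FORK_HASH, FORK_NEXT) as advertised at block `head`."""
        blob = GOERLI_GENESIS
        fork_next = 0
        for fork in known_forks:
            if fork <= head:
                blob += struct.pack(">Q", fork)  # past fork: folded into the hash
            else:
                fork_next = fork                 # first future fork we know about
                break
        return zlib.crc32(blob), fork_next

    # Same FORK_HASH in both cases, but FORK_NEXT is 0 without London configured,
    # which is what peers past London use to reject the connection.
    print(fork_id(BERLIN, [ISTANBUL, BERLIN]))          # config without London
    print(fork_id(BERLIN, [ISTANBUL, BERLIN, LONDON]))  # config with London

Second, the batching behaviour. blockchain_sync persists blocks in 192-block batches and aborts the whole batch if any block in it fails, so a consensus failure at block N makes progress appear stuck at the start of N's batch. A toy calculation (batch alignment from block 1 is an assumption, but it matches the numbers logged above):

    BATCH = 192

    def batch_start(block: int, first: int = 1) -> int:
        """Start of the 192-block batch containing `block`."""
        return first + ((block - first) // BATCH) * BATCH

    print(batch_start(4463713))  # -> 4463617, where progress appeared to stop
    print(batch_start(5080941))  # -> 5080897, matching fromBlock in the log above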

Crash types

The crashes in this section have a simple workaround: restart Nimbus-eth1. For the disk-full crash you must, of course, also add more disk space to stop it happening.

Of course, the workaround is not good enough for a release. It's useful to have this set of observations so we can address them one by one; fixes will be noted here as they are implemented. However, for the purpose of working with Nimbus-eth1, it's good to know that just restarting is a temporary solution to these issues.

  1. Symptom (seen 10 times during Goerli sync): AssertionError with no useful stacktrace.

    libbacktrace error: /home/jamie/Status/nimbus-eth1/build/nimbus (24)
    libbacktrace error: failed to read executable information (-1)
    libbacktrace error: failed to read executable information (-1)
    libbacktrace error: failed to read executable information (-1)
    libbacktrace error: failed to read executable information (-1)
    libbacktrace error: failed to read executable information (-1)
    libbacktrace error: failed to read executable information (-1)
    libbacktrace error: failed to read executable information (-1)
    libbacktrace error: failed to read executable information (-1)
    libbacktrace error: failed to read executable information (-1)
    libbacktrace error: failed to read executable information (-1)
    libbacktrace error: failed to read executable information (-1)
    Error: unhandled exception: 24 [AssertionError]
    
    • The cause is running out of file descriptors because devp2p accepts too many incoming connections.
    • That is why libbacktrace can't open /proc/self/exe to read the executable. It has been proposed to change libbacktrace, but that wouldn't address the real problem here, which is about sockets, not libbacktrace.
    • Workaround 1: Increase the number of file descriptors available; on Linux, ulimit -S -n 65536. This will allow Nimbus-eth1 to run longer between these types of crashes (see the sketch after this list for the start-up equivalent).
    • Workaround 2: Restart Nimbus-eth1. This is a recoverable event.
    • We need to fix this, but managing the number of peer connections correctly needs some deeper thought; it's not as trivial as it seems. Ratios of different kinds of peer characteristics have to be maintained in the pool, dropping some peers to allow others to connect is important (otherwise the pool can fill with bad peers), and the accept() queue complicates the accounting.
    • There might be some scenarios where RocksDB cannot perform updates and the program stops abruptly. Hopefully RocksDB is resilient enough that it doesn't lose data or corrupt the database when this happens.
    • Related issue Limit (incoming) connecting peers. nim-eth#107 "Limit (incoming) connecting peers".
    • Related issue Per IP-address peer limiting? nim-eth#111 "Per IP-address peer limiting?".
    • Related issue Improve p2p peer management nim-eth#365 "Improve p2p peer management".
  2. Symptom (seen 1 time during Goerli sync): Too many open files [ResultDefect] with no useful stacktrace.

    libbacktrace error: /home/jamie/Status/nimbus-eth1/build/nimbus (24)
    libbacktrace error: failed to read executable information (-1)
    libbacktrace error: failed to read executable information (-1)
    ...
    libbacktrace error: failed to read executable information (-1)
    libbacktrace error: failed to read executable information (-1)
    [[reraised from:
    ]]
    [[reraised from:
    ]]
    [[reraised from:
    ]]
    [[reraised from:
    ]]
    Error: unhandled exception: working database: IO error: While open a file for random read: /home/jamie/.nimbus/goerli-oldsync/nimbus/data/068761.sst: Too many open files [ResultDefect]
    
    • Same cause as the previous item, running out of file descriptors because devp2p accepts too many incoming connections.
    • This error is RocksDB unable to read from the database.
    • Same workarounds as previous item.
  3. Symptom (seen 10 times during Goerli sync): [Defect] ac.savePoint == sp and sp.state == Pending.

    /home/jamie/Status/nimbus-eth1/nimbus/nimbus.nim(269) nimbus
    /home/jamie/Status/nimbus-eth1/nimbus/nimbus.nim(231) process
    /home/jamie/Status/nimbus-eth1/vendor/nim-chronos/chronos/asyncloop.nim(279) poll
    /home/jamie/Status/nimbus-eth1/vendor/nim-chronos/chronos/asyncmacro2.nim(74) colonanonymous
    /home/jamie/Status/nimbus-eth1/nimbus/p2p/blockchain_sync.nim(289) obtainBlocksFromPeer
    /home/jamie/Status/nimbus-eth1/nimbus/p2p/blockchain_sync.nim(167) returnWorkItem
    /home/jamie/Status/nimbus-eth1/nimbus/p2p/blockchain_sync.nim(98) persistWorkItem
    /home/jamie/Status/nimbus-eth1/nimbus/p2p/chain/chain_helpers.nim(26) persistBlocks
    [[reraised from:
    /home/jamie/Status/nimbus-eth1/nimbus/nimbus.nim(269) nimbus
    /home/jamie/Status/nimbus-eth1/nimbus/nimbus.nim(231) process
    /home/jamie/Status/nimbus-eth1/vendor/nim-chronos/chronos/asyncloop.nim(279) poll
    /home/jamie/Status/nimbus-eth1/vendor/nim-chronos/chronos/asyncmacro2.nim(96) colonanonymous
    ]]
    Error: unhandled exception: /home/jamie/Status/nimbus-eth1/nimbus/db/accounts_cache.nim(105, 12) `ac.savePoint == sp and sp.state == Pending`  [Defect]
    
    • This is caused by a disk-full condition. We apologise for the lack of a useful error message...
    • Workaround is obvious: Don't run out of disk space 😄
    • Workaround: Ensure more disk space, then restart Nimbus-eth1. So far this has been a recoverable event.
    • The superficial bug is obvious: it should produce a useful error message about the disk being full.
    • The current messages do indicate one or more deeper bugs in the Nimbus-eth1 internals. The hexary tree and other writing code assume, at every level in the stack, that database writes cannot fail; when a write does fail, it should at least abort gracefully. The transaction scope error means an invariant is broken, which suggests an exception is not being handled properly or is being lost.
    • When RocksDB cannot write because the disk is full, hopefully it is resilient enough that it doesn't lose data or corrupt the database.
    • This may be related to issue AssertionError when syncing goerli #582 "AssertionError when syncing goerli", but there are enough differences that it may not be.
  4. Symptom (seen 4 times during Goerli sync): [RangeError] value out of range: -87 notin 0 .. 9223372036854775807.

    /home/jamie/Status/nimbus-eth1/nimbus/nimbus.nim(269) nimbus
    /home/jamie/Status/nimbus-eth1/nimbus/nimbus.nim(231) process
    /home/jamie/Status/nimbus-eth1/vendor/nim-chronos/chronos/asyncloop.nim(279) poll
    /home/jamie/Status/nimbus-eth1/vendor/nim-chronos/chronos/asyncmacro2.nim(74) colonanonymous
    /home/jamie/Status/nimbus-eth1/vendor/nim-eth/eth/p2p/rlpx.nim(1164) rlpxAccept
    /home/jamie/Status/nimbus-eth1/vendor/nim-eth/eth/p2p/auth.nim(469) decodeAuthMessage
    /home/jamie/Status/nimbus-eth1/vendor/nim-eth/eth/p2p/auth.nim(370) decodeAuthMessageEip8
    /home/jamie/Status/nimbus-eth1/vendor/nimbus-build-system/vendor/Nim/lib/system/chcks.nim(32) raiseRangeErrorI
    /home/jamie/Status/nimbus-eth1/vendor/nimbus-build-system/vendor/Nim/lib/system/fatal.nim(49) sysFatal
    [[reraised from:
    /home/jamie/Status/nimbus-eth1/nimbus/nimbus.nim(269) nimbus
    /home/jamie/Status/nimbus-eth1/nimbus/nimbus.nim(231) process
    /home/jamie/Status/nimbus-eth1/vendor/nim-chronos/chronos/asyncloop.nim(279) poll
    /home/jamie/Status/nimbus-eth1/vendor/nim-chronos/chronos/asyncmacro2.nim(74) colonanonymous
    /home/jamie/Status/nimbus-eth1/vendor/nim-chronos/chronos/asyncmacro2.nim(252) rlpxAccept
    ]]
    [[reraised from:
    /home/jamie/Status/nimbus-eth1/nimbus/nimbus.nim(269) nimbus
    /home/jamie/Status/nimbus-eth1/nimbus/nimbus.nim(231) process
    /home/jamie/Status/nimbus-eth1/vendor/nim-chronos/chronos/asyncloop.nim(279) poll
    /home/jamie/Status/nimbus-eth1/vendor/nim-chronos/chronos/asyncmacro2.nim(96) colonanonymous
    ]]
    Error: unhandled exception: value out of range: -87 notin 0 .. 9223372036854775807 [RangeError]
    
    • The values out of range were -87, -40 and -37 at different times.
    • The cause is an invalid incoming message from a peer, and Nimbus-eth1's devp2p code being insufficiently resilient to this.
    • Workaround: Restart Nimbus-eth1. This is a recoverable event.
    • Fix: Not done yet but it looks like an easy fix. As it relates to network security, surrounding code should be reviewed to ensure there aren't other, similar input validation issues.
    • Details in issue RLPx out of range crash triggered by message in auth handshake #766 "RLPx out of range crash triggered by message in auth handshake".
  5. Symptom (seen 10 times during Goerli sync): FutureDefect relating to rlpx.nim:

    /home/jamie/Status/nimbus-eth1/nimbus/nimbus.nim(269) nimbus
    /home/jamie/Status/nimbus-eth1/nimbus/nimbus.nim(231) process
    /home/jamie/Status/nimbus-eth1/vendor/nim-chronos/chronos/asyncloop.nim(279) poll
    /home/jamie/Status/nimbus-eth1/vendor/nim-chronos/chronos/asyncfutures2.nim(283) fail
    /home/jamie/Status/nimbus-eth1/vendor/nimbus-build-system/vendor/Nim/lib/system/gc.nim(197) checkFinished
    Error: unhandled exception: An attempt was made to complete a Future more than once. Details:
      Future ID: 162273
      Creation location:
        p2p_protocol_dsl.nim(153)    [unspecified]
      First completion location:
        rlpx.nim(77)    [unspecified]
      Second completion location:
        rlpx.nim(110)    [unspecified]
    
    [FutureDefect]
    
    • The cause is not known, but I'd guess it has to do with a timeout occurring around the same time as data arriving.
    • A "Received msg already expired" message was always found in the page of log output prior to this crash, which lends weight to that idea.
    • Workaround: Restart Nimbus-eth1. This is a recoverable event.
    • Fix: Not done yet, but it looks like a reasonably easy fix. Similar issues are already handled in other code using Future and timeouts (a minimal illustration of the guard pattern follows this list).
    • This may explain issue rlpx.nim Future double completion nim-eth#44 "rlpx.nim Future double completion".
    • This may explain issue "Attempt to complete a Future more than once" after "Transport is already closed" #565 "Attempt to complete a Future more than once" after "Transport is already closed" and "Attempt to complete a Future more than once" after "Transport is already closed" nim-eth#333 (same issue, different repo).
      • The line numbers in the log here from Chronos (p2p_protocol_dsl.nim and rlpx.nim) are equivalent to the line numbers in those issues, just a different time in repo history.
      • But there is no obviously related "Transport is already closed" in the latest logs.
    • Implications: The analogous issue may also be present in other networking code that uses timeouts, such as discovery, where it remains hidden only because it doesn't trigger in practice.
  6. Symptom (seen 2 times during earlier Goerli tests, not the latest): [IndexError] index 33 not in 0 .. 32.

    WRN 2021-07-26 10:01:00.425+01:00 Error while handling RLPx message          topics="rlpx" tid=1212253 file=rlpx.nim:620 peer=Node[210.91.82.155:9360] msg=33 err="RLPx message with an invalid id 33 on a connection supporting eth"
    /home/jamie/Status/nimbus-eth1/nimbus/nimbus.nim(226) nimbus
    /home/jamie/Status/nimbus-eth1/nimbus/nimbus.nim(188) process
    /home/jamie/Status/nimbus-eth1/vendor/nim-chronos/chronos/asyncloop.nim(279) poll
    /home/jamie/Status/nimbus-eth1/vendor/nim-chronos/chronos/asyncmacro2.nim(74) colonanonymous
    /home/jamie/Status/nimbus-eth1/vendor/nim-eth/eth/p2p/rlpx.nim(627) dispatchMessages
    /home/jamie/Status/nimbus-eth1/vendor/nimbus-build-system/vendor/Nim/lib/system/chcks.nim(23) raiseIndexError2
    /home/jamie/Status/nimbus-eth1/vendor/nimbus-build-system/vendor/Nim/lib/system/fatal.nim(49) sysFatal
    [[reraised from:
    /home/jamie/Status/nimbus-eth1/nimbus/nimbus.nim(226) nimbus
    /home/jamie/Status/nimbus-eth1/nimbus/nimbus.nim(188) process
    /home/jamie/Status/nimbus-eth1/vendor/nim-chronos/chronos/asyncloop.nim(279) poll
    /home/jamie/Status/nimbus-eth1/vendor/nim-chronos/chronos/asyncmacro2.nim(96) colonanonymous
    ]]
    Error: unhandled exception: index 33 not in 0 .. 32 [IndexError]
    
    • This stopped happening on current Goerli because it was due to a faulty peer sending invalid messages.
    • Even so, it's a bug in the devp2p networking code which should not crash when it receives a faulty message.
    • Workaround: Restart Nimbus-eth1. This is a recoverable event.
    • Fix: Not done yet but it looks like an easy fix. As it relates to network security, surrounding code should be reviewed to ensure there aren't other, similar input validation issues.
    • Details in issue RLPx out of bounds crash when downloading from Goerli #767 "RLPx out of bounds crash when downloading from Goerli".
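
For the file-descriptor exhaustion in items 1 and 2, the start-up equivalent of the ulimit workaround looks roughly like this (a Python sketch for illustration; Nimbus itself would need the same logic in Nim, and 65536 is simply the value used during the test):

    import resource

    # Raise the soft file-descriptor limit, the programmatic equivalent of
    # `ulimit -S -n 65536` (Linux/POSIX only).
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    new_soft = 65536 if hard == resource.RLIM_INFINITY else min(65536, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))

For the double-completed Future in item 5, the guard pattern alluded to ("similar issues are already handled in other code") is simply to check whether the future has already finished before completing it. A hypothetical illustration using Python's asyncio rather than Chronos:

    import asyncio

    def complete_once(fut: asyncio.Future, value) -> None:
        # A timeout and arriving data can race; only the first one may complete
        # the future, the loser must check and back off instead of completing again.
        if not fut.done():
            fut.set_result(value)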

After bulk sync is completed to the current head block of Goerli

  1. Real-time chain updates are not followed. It does pick up new blocks, but only about once every 2-17 minutes.

    • New blocks are picked up when it connects to enough new peers that it resumes the bulk sync process towards a new head.
  2. Occasional symptom in the synchronised state: FutureDefect relating to persistBlocksImpl.

    /home/jamie/Status/nimbus-eth1/nimbus/nimbus.nim(269) nimbus
    /home/jamie/Status/nimbus-eth1/nimbus/nimbus.nim(231) process
    /home/jamie/Status/nimbus-eth1/vendor/nim-chronos/chronos/asyncloop.nim(279) poll
    Error: unhandled exception: Asynchronous task [obtainBlocksFromPeer() at blockchain_sync.nim:206] finished with an exception "BlockNotFound"!
    Message: No block with hash 94926cdda9ecbf1ade5ddf23ab7ea204332c88e9dc8602a150733a715479e2ba
    Stack trace: /home/jamie/Status/nimbus-eth1/nimbus/nimbus.nim(269) nimbus
    /home/jamie/Status/nimbus-eth1/nimbus/nimbus.nim(231) process
    /home/jamie/Status/nimbus-eth1/vendor/nim-chronos/chronos/asyncloop.nim(279) poll
    /home/jamie/Status/nimbus-eth1/vendor/nim-chronos/chronos/asyncmacro2.nim(74) colonanonymous
    /home/jamie/Status/nimbus-eth1/nimbus/p2p/blockchain_sync.nim(293) obtainBlocksFromPeer
    /home/jamie/Status/nimbus-eth1/nimbus/p2p/blockchain_sync.nim(167) returnWorkItem
    /home/jamie/Status/nimbus-eth1/nimbus/p2p/blockchain_sync.nim(98) persistWorkItem
    /home/jamie/Status/nimbus-eth1/nimbus/p2p/chain/persist_blocks.nim(117) persistBlocks
    /home/jamie/Status/nimbus-eth1/nimbus/p2p/chain/persist_blocks.nim(53) persistBlocksImpl
    
    • The cause is that Nimbus-eth1's blockchain_sync sometimes cannot handle a chain reorg from the network.
    • After this happens, Nimbus-eth1 will not recover. If it is run again, the same error occurs after a few minutes.
    • The superficial bug is that BlockNotFound should be handled and the blockchain_sync work item aborted, instead of crashing.
    • The deeper bug is that BlockNotFound occurs because the new blocks received refer to a parent block that doesn't exist locally.
    • Implications: The inability to handle reorgs is concerning, and it suggests deeper architectural issues.
    • Merely rewinding the local chain to sync new blocks from an earlier point would be wrong, because it does not perform the required logic to track the chain with the highest total difficulty (or its PoA equivalent); a minimal sketch of that fork-choice rule follows this list. Essentially, blockchain_sync's block fetching strategy and data model are not designed for multiple simultaneously competing branches.
  3. The inability to handle reorgs after reaching the chain head suggests bulk sync also has a bug, one that would only rarely manifest: incorrect chain handling in the last few hundred (thousand at a push) blocks of bulk sync when competing chains are being received. Instead of switching to the winning chain, the data model and fetching strategy cause it to receive conflicting data (over time or from different peers), leaving an unfetched gap in the received blocks on the new winning chain.
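
For reference, the fork-choice rule the last two items refer to is: among the known branch tips, follow the one with the highest total difficulty accumulated from genesis (Goerli's Clique PoA uses an analogous rule). A minimal, hypothetical sketch; the Header type and in-memory map are simplified stand-ins, not Nimbus's data model:

    from dataclasses import dataclass
    from typing import Dict, Optional

    @dataclass
    class Header:
        hash: str
        parent: Optional[str]   # None marks genesis
        difficulty: int

    def total_difficulty(headers: Dict[str, Header], tip: str) -> int:
        """Sum difficulty from `tip` back to genesis."""
        td, cursor = 0, tip
        while cursor is not None:
            td += headers[cursor].difficulty
            cursor = headers[cursor].parent
        return td

    def choose_head(headers: Dict[str, Header], tips: list[str]) -> str:
        """Pick the branch tip with the highest total difficulty.

        A reorg is exactly the case where this returns a tip on a different
        branch than the current head; the sync logic then has to fetch the
        missing blocks on that branch rather than abort, which is the gap
        described above."""
        return max(tips, key=lambda t: total_difficulty(headers, t))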

zah commented Oct 19, 2021

Can the re-org problem be worked around by manually changing the head block reference in the database (e.g. by going back to an older good state)?

tersec commented May 25, 2024

tersec closed this as completed May 25, 2024