Sync fix: GetBlockBodies logic preventing sync, dropping peers
Fixes #864 "Sync progress stops at Goerli block 4494913", and the equivalent
issue on other networks.

The block body fetcher in `blockchain_sync.nim` made an incorrect assumption
about how peers respond to `GetBlockBodies`.  It issued requests for N block
bodies and mishandled replies that contained fewer than N bodies, even though
shorter replies are valid.

Having received up to 192 headers in a batch, it split the range into smaller
`GetBlockBodies` requests, fetched each reply, then combined the replies.  The
effect was that Nimbus requested block bodies in sub-batches of 128+64, got
truncated replies that left gaps in the combined sequence, then aborted the
whole batch.

That meant it repeatedly fetched data, then discarded it, and fetched it again,
dropping good peers in the process.

Aborted and restarted batches occurred with earlier blocks too, but the
problem became more pronounced until, at batch 4494913..4495104, no suitable
peers remained.

Here's a trace:

```
TRC 2021-09-29 02:40:24.977+01:00 Requesting block headers                   file=blockchain_sync.nim:224 start=4494913 count=192 peer=<ENODE>
TRC 2021-09-29 02:40:24.977+01:00 >> Sending eth.GetBlockHeaders (0x03)      file=protocol_eth65.nim:51 peer=<PEER> startBlock=4494913 max=192
TRC 2021-09-29 02:40:25.005+01:00 << Got reply eth.BlockHeaders (0x04)       file=protocol_eth65.nim:51 peer=<PEER> count=192
TRC 2021-09-29 02:40:25.007+01:00 >> Sending eth.GetBlockBodies (0x05)       file=protocol_eth65.nim:51 peer=<PEER> count=128
TRC 2021-09-29 02:40:25.209+01:00 << Got reply eth.BlockBodies (0x06)        file=protocol_eth65.nim:51 peer=<PEER> count=13
TRC 2021-09-29 02:40:25.210+01:00 >> Sending eth.GetBlockBodies (0x05)       file=protocol_eth65.nim:51 peer=<PEER> count=64
TRC 2021-09-29 02:40:25.290+01:00 << Got reply eth.BlockBodies (0x06)        file=protocol_eth65.nim:51 peer=<PEER> count=64
WRN 2021-09-29 02:40:25.306+01:00 Bodies len != headers.len                  file=blockchain_sync.nim:276 bodies=77 headers=192
TRC 2021-09-29 02:40:25.306+01:00 peer disconnected                          file=blockchain_sync.nim:403 peer=<PEER>
TRC 2021-09-29 02:40:25.306+01:00 Finished obtaining blocks                  file=blockchain_sync.nim:303 peer=<PEER>
```
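
The numbers above show the failure mode: a 128-hash request answered with only
13 bodies, so the combined 13 + 64 = 77 bodies could never match the 192
headers.  Below is a minimal, self-contained sketch of the retry idea the fix
adopts, written in plain Nim with invented names (`requestBodies`, `fetchAll`)
and a stub peer that truncates every reply; it is an illustration, not the
Nimbus code itself.

```
# Stub peer plus client loop: re-request the unanswered remainder of a
# GetBlockBodies request instead of treating a short reply as a failure.
import std/sequtils

proc requestBodies(hashes: seq[int]; softLimit: int): seq[int] =
  ## Stand-in for a remote peer: answers at most `softLimit` of the
  ## requested hashes, the way a response-size soft limit truncates replies.
  hashes[0 ..< min(hashes.len, softLimit)]

proc fetchAll(hashes: seq[int]; softLimit: int): seq[int] =
  ## Client side: keep asking for whatever has not been answered yet.
  var remaining = hashes
  while remaining.len > 0:
    let reply = requestBodies(remaining, softLimit)
    if reply.len == 0:
      break                             # peer has nothing for us; give up
    result.add reply
    remaining.delete(0, reply.len - 1)  # drop only the answered prefix

when isMainModule:
  let hashes = toSeq(1 .. 192)          # one work item of 192 blocks
  # Old logic: 128 + 64 requests, abort when 13 + 64 = 77 != 192 headers.
  # Looping until everything arrives recovers the full batch instead.
  doAssert fetchAll(hashes, 13).len == 192
```

This is the same shape as the new `fetchBodies` template and the
`while hashes.len != 0` loop in the diff below.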

In practice, with modern peers, Nimbus received shorter replies than it
assumed, depending on the sizes of the blocks on the chain.  Geth/Erigon apply
a 2 MiB soft limit to `BlockBodies` response size; OpenEthereum's limit is
4 MiB.
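
To see why the reply length tracks block sizes rather than the request count,
here is a rough responder-side sketch; it is illustration only (not actual
Geth or Nimbus code), and the ~170 KiB average body size is an invented figure:

```
# A responder applying a BlockBodies soft size limit: bodies are added to
# the reply until the accumulated size reaches the limit.
import std/sequtils

const softResponseLimit = 2 * 1024 * 1024   # ~2 MiB, Geth/Erigon-style

proc bodiesInReply(bodySizes: seq[int]): int =
  ## How many of the requested bodies end up in the reply.
  var total = 0
  for size in bodySizes:
    total += size
    inc result
    if total >= softResponseLimit:
      break

when isMainModule:
  # 128 requested bodies at ~170 KiB each (hypothetical post-Berlin sizes):
  # only 13 fit before the limit is hit, the shape of the `count=13` reply
  # in the trace above.
  echo bodiesInReply(newSeqWith(128, 170 * 1024))   # prints 13
```

So the more transaction data the blocks carry, the fewer bodies come back per
request, regardless of how many were asked for.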

Up to Berlin (EIP-2929), Nimbus's fetcher failed often, but there were still
some peers serving what Nimbus needed.

Just after the start of Berlin, at batch 4494913..4495104 on Goerli, no peers
responded with full-size replies for the whole batch, so Nimbus couldn't
progress past that point.  But the problem was already occurring before that
with large blocks, dropping good peers and repeatedly fetching the same block
data.

Signed-off-by: Jamie Lokier <[email protected]>
jlokier committed Oct 19, 2021
1 parent d93a8bc commit 40fbed4
Showing 1 changed file with 25 additions and 21 deletions.
46 changes: 25 additions & 21 deletions nimbus/p2p/blockchain_sync.nim
```diff
@@ -6,7 +6,7 @@
 # at your option. This file may not be copied, modified, or distributed except according to those terms.

 import
-  std/[sets, options, random, hashes],
+  std/[sets, options, random, hashes, sequtils],
   chronos, chronicles,
   eth/common/eth_types,
   eth/[p2p, p2p/private/p2p_types, p2p/rlpx, p2p/peer_pool],
@@ -239,8 +239,27 @@ proc obtainBlocksFromPeer(syncCtx: SyncContext, peer: Peer) {.async.} =
           count=results.get.headers.len
         shallowCopy(workItem.headers, results.get.headers)

-        var bodies = newSeq[BlockBody]()
-        var hashes = newSeq[KeccakHash]()
+        var bodies = newSeqOfCap[BlockBody](workItem.headers.len)
+        var hashes = newSeqOfCap[KeccakHash](maxBodiesFetch)
+        template fetchBodies() =
+          tracePacket ">> Sending eth.GetBlockBodies (0x05)", peer,
+            count=hashes.len
+          let b = await peer.getBlockBodies(hashes)
+          if b.isNone:
+            raise newException(CatchableError, "Was not able to get the block bodies")
+          let bodiesLen = b.get.blocks.len
+          tracePacket "<< Got reply eth.BlockBodies (0x06)", peer,
+            count=bodiesLen
+          if bodiesLen == 0:
+            raise newException(CatchableError, "Zero block bodies received for request")
+          elif bodiesLen < hashes.len:
+            hashes.delete(0, bodiesLen - 1)
+          elif bodiesLen == hashes.len:
+            hashes.setLen(0)
+          else:
+            raise newException(CatchableError, "Too many block bodies received for request")
+          bodies.add(b.get.blocks)
+
         var nextIndex = workItem.startIndex
         for i in workItem.headers:
           if i.blockNumber != nextIndex:
@@ -249,25 +268,10 @@ proc obtainBlocksFromPeer(syncCtx: SyncContext, peer: Peer) {.async.} =
             nextIndex = nextIndex + 1
           hashes.add(blockHash(i))
           if hashes.len == maxBodiesFetch:
-            tracePacket ">> Sending eth.GetBlockBodies (0x05)", peer,
-              count=hashes.len
-            let b = await peer.getBlockBodies(hashes)
-            if b.isNone:
-              raise newException(CatchableError, "Was not able to get the block bodies.")
-            tracePacket "<< Got reply eth.BlockBodies (0x06)", peer,
-              count=b.get.blocks.len
-            hashes.setLen(0)
-            bodies.add(b.get.blocks)
+            fetchBodies()

-        if hashes.len != 0:
-          tracePacket ">> Sending eth.GetBlockBodies (0x05)", peer,
-            count=hashes.len
-          let b = await peer.getBlockBodies(hashes)
-          if b.isNone:
-            raise newException(CatchableError, "Was not able to get the block bodies.")
-          tracePacket "<< Got reply eth.BlockBodies (0x06)", peer,
-            count=b.get.blocks.len
-          bodies.add(b.get.blocks)
+        while hashes.len != 0:
+          fetchBodies()

         if bodies.len == workItem.headers.len:
           shallowCopy(workItem.bodies, bodies)
```
