
RLPx out of bounds crash when downloading from Goerli #767

Closed
jlokier opened this issue Jul 24, 2021 · 3 comments · Fixed by status-im/nim-eth#438
Labels
Networking Security Security vulnerability or security related

Comments

@jlokier
Contributor

jlokier commented Jul 24, 2021

Reported by @mjfh, occurring regularly with Nimbus-eth1 on the Goerli network:

Can confirm that on my Goerli block collection test. Reproducible event:

/home/jordan/nimbus-eth1/nimbus/nimbus.nim(264) nimbus
/home/jordan/nimbus-eth1/nimbus/nimbus.nim(226) process
/home/jordan/nimbus-eth1/vendor/nim-chronos/chronos/asyncloop.nim(279) poll
/home/jordan/nimbus-eth1/vendor/nim-chronos/chronos/asyncmacro2.nim(74) colonanonymous
/home/jordan/nimbus-eth1/vendor/nim-eth/eth/p2p/rlpx.nim(619) dispatchMessages
/home/jordan/nimbus-eth1/vendor/nimbus-build-system/vendor/Nim/lib/system/chcks.nim(23) raiseIndexError2
/home/jordan/nimbus-eth1/vendor/nimbus-build-system/vendor/Nim/lib/system/fatal.nim(49) sysFatal
[[reraised from:
/home/jordan/nimbus-eth1/nimbus/nimbus.nim(264) nimbus
/home/jordan/nimbus-eth1/nimbus/nimbus.nim(226) process
/home/jordan/nimbus-eth1/vendor/nim-chronos/chronos/asyncloop.nim(279) poll
/home/jordan/nimbus-eth1/vendor/nim-chronos/chronos/asyncmacro2.nim(96) colonanonymous
]]
Error: unhandled exception: index 33 not in 0 .. 32 [IndexError]

with always the same index but varying time until the error occurs (the best run collected 500k blocks, which was enough for my replay).

@jlokier
Contributor Author

jlokier commented Jul 24, 2021

I haven't seen this when testing on Goerli. Perhaps they are rare events, so they tend not to be seen when testing on a fast-syncing network.

The out of bounds index is msgId in if peer.awaitedMessages[msgId] != nil:. That's processing incoming messages.

Message 33 is the eth protocol message Receipts. I.e., it's a valid message.

(If it was an invalid message, it would be caught by invokeThunk first, which checks if msgId is valid.)
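To illustrate the failure mode, here is a hypothetical Python sketch (not the Nim source): reading the awaited-messages table with an unchecked id one past the end raises exactly the kind of error in the trace above. The names `read_awaited` and `waiters` are illustrative.

```python
# Hypothetical sketch (not the Nim source) of the crash mechanism:
# reading the awaited-messages table with an unchecked msgId.
def read_awaited(awaited_messages, msg_id):
    # No bounds check before the read, mirroring
    # "index 33 not in 0 .. 32 [IndexError]".
    return awaited_messages[msg_id]

waiters = [None] * 33      # valid indices are 0 .. 32
try:
    read_awaited(waiters, 33)
except IndexError as exc:
    print("crash:", exc)   # Python's analogue of Nim's IndexError
```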

There is a bug (easily fixed) which is that awaitedMessages size is off by 1.

However, there is something else interesting going on here. Receipts should only ever be received in response to a GetReceipts request we have sent, and we don't send those currently. So why did we get a Receipts message, and which client is sending those unsolicited?

@jlokier
Contributor Author

jlokier commented Jul 29, 2021

Alright, after more testing:

  • The reason I didn't see this on Goerli is it doesn't happen when running with eth/65.

  • The culprit is a client called dangnn/v1.9.10-stable/windows-amd64/go1.16.2.
    @mjfh, next time when reporting a bug, the trace output shortly before a stacktrace event would be helpful.
    In this case it's:

    TRC 2021-07-24 04:17:52.264+01:00 Received Hello                             topics="rlpx" tid=1027594 file=rlpx.nim:1208 version=5 id=dangnn/v1.9.10-stable/windows-amd64/go1.16.2
    TRC 2021-07-24 04:17:52.264+01:00 devp2p handshake completed                 topics="rlpx" tid=1027594 file=rlpx.nim:1219 peer=Node[REDACTED:6709] clientId=dangnn/v1.9.10-stable/windows-amd64/go1.16.2
    TRC 2021-07-24 04:17:52.264+01:00 DevP2P local and remote protocols          topics="rlpx" tid=1027594 file=rlpx.nim:209 selected=[eth/63] localOffer=[eth/64] remoteOffer="[eth/63, eth/64]"
    TRC 2021-07-24 04:17:52.264+01:00 >> Sending eth.Status (0x00) [eth/63]      tid=1027594 file=eth65.nim:52 peer=Node[REDACTED:6709] td=2 bestHash=e9dc478239532458976ae9286ef84f3ebe345c05cc5e74d67f0c49ab37dc4805 networkId=5 genesis=bf7e331f7f7c1dd2e05159666b3bf8bc7a8a3a9eb1d518969eab529dd9b88c1a
    TRC 2021-07-24 04:17:52.474+01:00 Waiting for more peers                     tid=1027594 file=p2p.nim:115 peers=0
    WRN 2021-07-24 04:17:52.538+01:00 Error while handling RLPx message          topics="rlpx" tid=1027594 file=rlpx.nim:620 peer=Node[210.91.82.155:6709] msg=33 err="RLPx message with an invalid id 33 on a connection supporting eth"
    
  • Amazingly, the RLPx specification is wrong (PR sent).

  • I was wrong about message 33 being Receipts, due to being misled by the RLPx spec error. 33 is invalid. It's actually out of range.

  • It's not a bug in awaitedMessages size: the size is correct. The bug is that the range check in the earlier invokeThunk doesn't cause an early return, and awaitedMessages is then read with an out-of-range msgId.

  • dangnn/v1.9.10 is sending its eth.Status message with incorrect msgId 33 instead of 16. It's buggy software on the network. It also uses 33 if we negotiate eth/64 instead of eth/63.

  • Even though it's a buggy peer, it's a networking bug in Nimbus that RLPx can be crashed by an out-of-range msgId. The cause: when the range check in the earlier invokeThunk detects a bad message, execution falls through to read awaitedMessages out of range. It should probably terminate the dispatch loop instead.
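The control flow described above can be sketched in Python (illustrative only, not the Nim code): the early range check raises, the exception is downgraded to a warning, and execution falls through to an unchecked read. All names here are hypothetical.

```python
# Hypothetical sketch of the flawed flow: invoke_thunk detects the bad
# id and raises, but the dispatcher logs it as a warning and continues.
def invoke_thunk(msg_id, max_id):
    if msg_id > max_id:
        raise ValueError(f"RLPx message with an invalid id {msg_id}")

def dispatch_buggy(awaited, msg_id):
    try:
        invoke_thunk(msg_id, len(awaited) - 1)
    except ValueError as exc:
        print("WRN Error while handling RLPx message:", exc)
    # Bug: no early return, so a bad msgId reaches this read and crashes.
    return awaited[msg_id]

def dispatch_fixed(awaited, msg_id):
    try:
        invoke_thunk(msg_id, len(awaited) - 1)
    except ValueError as exc:
        print("WRN Error while handling RLPx message:", exc)
        return None   # stop handling this message instead of falling through
    return awaited[msg_id]
```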

@zah
Contributor

zah commented Jul 29, 2021

Regarding the RLPx problem, indeed, disconnecting the peer with BreachOfProtocol seems the right way to go.
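That suggestion can be sketched as follows, using the devp2p disconnect reason code 0x02 ("Breach of protocol"). The `Peer` class and `on_message` helper below are hypothetical stand-ins, not the real nim-eth API.

```python
# Sketch: on an out-of-range msgId, disconnect the offending peer with
# the devp2p "Breach of protocol" reason instead of continuing dispatch.
from enum import Enum

class DisconnectReason(Enum):
    BREACH_OF_PROTOCOL = 0x02   # devp2p disconnect reason code

class Peer:                      # hypothetical stand-in for the peer object
    def __init__(self):
        self.connected = True
        self.reason = None

    def disconnect(self, reason):
        self.connected = False
        self.reason = reason

def on_message(peer, msg_id, max_id):
    if not (0 <= msg_id <= max_id):
        peer.disconnect(DisconnectReason.BREACH_OF_PROTOCOL)
        return False             # terminate dispatch for this peer
    return True                  # continue normal message handling
```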

jlokier added a commit to status-im/nim-eth that referenced this issue Nov 30, 2021
Closes [nimbus-eth1#767](status-im/nimbus-eth1#767).

Crashes occur when certain invalid RLPx messages are received from a peer.
Specifically, `msgId` out of range.  Because any peer can easily trigger this
crash, we'd consider it a DoS vulnerability if Nimbus-eth1 were in general use.

We noticed when syncing to Goerli, there were some rare crashes with this
exception.  It turned out one peer with custom code, perhaps malfunctioning,
was sending these messages if we were unlucky enough to connect to it.

`invokeThunk` is called from `dispatchMessages` and checks the range of
`msgId`.  It correctly recognises that it's out of range, raises an exception
and produces a message.  Job done.

Except the code in `dispatchMessages` treats all that as a warning instead of
an error, and continues to process the message.  A bit lower down, `msgId` is
used again without a range check.

The trivial fix is to check array bounds before access.

--

ps. Here's the stack trace ("reraised" sections hidden):

```
WRN 2021-11-08 21:29:33.238+00:00 Error while handling RLPx message          topics="rlpx" tid=2003472 file=rlpx.nim:607 peer=Node[<IP>:45456] msg=45 err="RLPx message with an invalid id 45 on a connection supporting eth,snap"
/home/jamie/Status/nimbus-eth1/nimbus/p2p/chain/chain_desc.nim(437) main
/home/jamie/Status/nimbus-eth1/nimbus/p2p/chain/chain_desc.nim(430) NimMain
/home/jamie/Status/nimbus-eth1/nimbus/nimbus.nim(258) process
/home/jamie/Status/nimbus-eth1/vendor/nim-chronos/chronos/asyncloop.nim(279) poll
/home/jamie/Status/nimbus-eth1/vendor/nim-chronos/chronos/asyncmacro2.nim(74) colonanonymous
/home/jamie/Status/nimbus-eth1/vendor/nim-eth/eth/p2p/rlpx.nim(1218) rlpxAccept
/home/jamie/Status/nimbus-eth1/vendor/nim-chronos/chronos/asyncmacro2.nim(101) postHelloSteps
/home/jamie/Status/nimbus-eth1/vendor/nim-chronos/chronos/asyncmacro2.nim(74) colonanonymous
/home/jamie/Status/nimbus-eth1/vendor/nim-eth/eth/p2p/rlpx.nim(985) postHelloSteps
/home/jamie/Status/nimbus-eth1/vendor/nim-chronos/chronos/asyncmacro2.nim(101) dispatchMessages
/home/jamie/Status/nimbus-eth1/vendor/nim-chronos/chronos/asyncmacro2.nim(77) colonanonymous
/home/jamie/Status/nimbus-eth1/vendor/nim-eth/eth/p2p/rlpx.nim(614) dispatchMessages
/home/jamie/Status/nimbus-eth1/vendor/nimbus-build-system/vendor/Nim/lib/system/chcks.nim(23) raiseIndexError2
/home/jamie/Status/nimbus-eth1/vendor/nimbus-build-system/vendor/Nim/lib/system/fatal.nim(49) sysFatal
[[reraised from: ... ]]
[[reraised from: ... ]]
[[reraised from: ... ]]
[[reraised from: ... ]]
Error: unhandled exception: index 45 not in 0 .. 40 [IndexError]
```

Signed-off-by: Jamie Lokier <[email protected]>
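The bounds-check fix the commit message describes can be sketched like this (a minimal Python illustration, not the Nim patch; the function name is hypothetical):

```python
# Minimal sketch of the trivial fix: guard the table access itself,
# regardless of what earlier checks did or logged.
def awaited_message(awaited, msg_id):
    if 0 <= msg_id < len(awaited):   # bounds check before access
        return awaited[msg_id]
    return None                      # out-of-range id: nothing awaited
```

With a 41-slot table (indices 0 .. 40, as in the trace above), msgId 45 now yields None instead of an unhandled IndexError.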
zah pushed a commit to status-im/nim-eth that referenced this issue Nov 30, 2021