uTP / Portal stream: Investigate security issues & mitigations in implementation #1085

Open
kdeme opened this issue May 12, 2022 · 2 comments


kdeme commented May 12, 2022

e.g.:

  • Limiting the amount of open connections (total, per incoming/outgoing direction, per offer/accept or findcontent/content flow), as in the sketch below
  • Lingering connections, e.g. due to missing timeouts or only partial ones (per part / per content item read on the socket)
  • etc.
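
A minimal sketch of both mitigations, using Python's asyncio as a stand-in for the actual Nim/chronos implementation (the connection cap, timeout value, and handler shape are all assumptions for illustration, not the fluffy API):

```python
import asyncio

MAX_INCOMING = 64    # assumed cap on concurrent incoming uTP connections
READ_TIMEOUT = 10.0  # assumed per-read timeout in seconds

incoming_slots = asyncio.Semaphore(MAX_INCOMING)

async def handle_socket(reader: asyncio.StreamReader,
                        writer: asyncio.StreamWriter) -> None:
    # Refuse the connection outright when all incoming slots are taken,
    # instead of letting it queue up and linger. (Best-effort check; a
    # real implementation would acquire the slot atomically.)
    if incoming_slots.locked():
        writer.close()
        await writer.wait_closed()
        return
    async with incoming_slots:
        try:
            while True:
                # The timeout applies per read, so a peer stalling
                # mid-transfer cannot keep the connection and its
                # buffers alive forever.
                chunk = await asyncio.wait_for(reader.read(4096), READ_TIMEOUT)
                if not chunk:
                    break
                # ... process the chunk ...
        except asyncio.TimeoutError:
            pass  # lingering connection: drop it
        finally:
            writer.close()
            await writer.wait_closed()
```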

kdeme commented Sep 8, 2023

This issue has become a bit more pressing due to some OOMs of our nodes in the fleet.

Visible for example here: https://metrics.status.im/d/iWQQPuPnkadsf/nimbus-fluffy-dashboard?orgId=1&var-instance=metal-01.ih-eu-mda1.nimbus.fluffy&var-container=nimbus-fluffy-mainnet-master-01&from=1693655722341&to=1693773093911
Or here: https://metrics.status.im/d/iWQQPuPnkadsf/nimbus-fluffy-dashboard?orgId=1&var-instance=metal-01.ih-eu-mda1.nimbus.fluffy&var-container=nimbus-fluffy-mainnet-master-02&from=1693499124066&to=1693614110733

The current theory is that this occurs when a Bridge is injecting a lot of content and a large number of offers (probably outgoing ones) builds up.

It is known that the current system of AsyncQueues is not really designed to prevent this from happening.

Some probable issues:

  1. contentQueue in a PortalStream is unbounded. We might want to add a limit, but a plain await on a full queue only kicks in after all the content has already been read over the uTP stream. To avoid a build-up there, drop incoming sockets when that limit is reached, or better, avoid accepting the offer in the first place (see the first sketch below).
  2. The offerQueue in PortalProtocol does have a limit, but it is set to the same amount as the number of workers popping off the queue. The possible issue is that several running NeighborhoodGossip calls end up blocking because the offerQueue reaches its limit, especially as each incoming offer/accept/content cycle can result in up to 8 new offers (see the second sketch below).
  3. Quite a few copies of the content items and content keys appear to be made in NeighborhoodGossip, due to the different ways the stream passes the data along and how it is handed to the offerQueue. Combined with 2., this makes things much worse.
  4. This is more of an assumption, but it is possible (likely?) that quite a few duplicate offers are being accepted (and possibly gossiped) at the same time, assuming the same offers arrive from different peers around the same time. We do not avoid accepting these as long as we have not yet received and verified the data. Whether this is really an issue should be verified first (see the third sketch below).
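
For point 1, a sketch of the intended behaviour, again with asyncio.Queue standing in for the chronos AsyncQueue (the bound of 50 and the helper names are assumptions, not the actual fluffy API):

```python
import asyncio

content_queue: asyncio.Queue = asyncio.Queue(maxsize=50)  # assumed bound

def can_accept_offer() -> bool:
    # Decide *before* the uTP transfer starts: if the queue is full,
    # decline the offer so no data is read at all, rather than reading
    # everything and only then blocking on a full queue.
    return not content_queue.full()

def enqueue_content(content_keys: bytes, content_items: list[bytes]) -> None:
    try:
        # put_nowait never blocks; if the queue filled up between accepting
        # the offer and finishing the transfer, drop instead of awaiting.
        content_queue.put_nowait((content_keys, content_items))
    except asyncio.QueueFull:
        pass  # drop: the socket is closed and the content is discarded
```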
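
For point 2, one possible mitigation is to size the offer queue independently of the worker count and to drop offers instead of blocking the gossip path. A sketch under those assumptions (the 4x headroom factor and the names are illustrative):

```python
import asyncio

NUM_OFFER_WORKERS = 8  # assumed worker count
# Give the queue headroom beyond the worker count so gossip bursts
# do not immediately hit the limit.
offer_queue: asyncio.Queue = asyncio.Queue(maxsize=4 * NUM_OFFER_WORKERS)

def queue_offer(offer) -> None:
    # NeighborhoodGossip should never block on a full queue: each handled
    # offer/accept/content cycle can fan out into up to 8 new offers, so
    # awaiting here lets blocked gossip tasks pile up in memory.
    try:
        offer_queue.put_nowait(offer)
    except asyncio.QueueFull:
        pass  # shed load instead of accumulating blocked gossip tasks

async def offer_worker() -> None:
    while True:
        offer = await offer_queue.get()
        # ... send the offer to the selected peer ...
        offer_queue.task_done()
```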
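
For point 4, a sketch of a possible dedup mitigation: remember which content keys are in flight between offer acceptance and validation, so duplicates arriving from other peers in the meantime are declined (hypothetical helper names):

```python
in_flight: set[bytes] = set()  # content keys currently being transferred

def should_accept(content_key: bytes) -> bool:
    # Track a key from the moment its offer is accepted until the content
    # is received and validated, so the same item offered by several peers
    # at roughly the same time is only transferred once.
    if content_key in in_flight:
        return False
    in_flight.add(content_key)
    return True

def transfer_finished(content_key: bytes) -> None:
    # Clear the marker whether validation succeeded or failed; otherwise a
    # failed transfer would block that content key forever.
    in_flight.discard(content_key)
```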


kdeme commented Sep 16, 2023

  1. is done in Do not accept new offers if our contentQueue is full #1753 and appears to work, so this will already be merged.

  2. is done in Potential quickfix against a NHGossip offer build-up #1739, but it has not been proven to get rid of the memory build-up on its own, so it is closed for now. We might still want to add a similar version of this.
