uTP / Portal stream: Investigate security issues & mitigations in implementation #1085

Open
kdeme opened this issue May 12, 2022 · 2 comments


kdeme commented May 12, 2022

e.g.:

  • Limiting the amount of open connections (total, per incoming/outgoing direction, per offer/accept or findcontent/content flow), as in the sketch below
  • Lingering connections, e.g. due to missing timeouts or only partial ones (per part / per content item read on the socket)
  • etc.
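
A minimal sketch of both mitigations, using Python's asyncio as a stand-in for the actual Nim/chronos implementation (the connection cap, timeout value, and handler shape are all assumptions for illustration, not the fluffy API):

```python
import asyncio

MAX_INCOMING = 64    # assumed cap on concurrent incoming uTP connections
READ_TIMEOUT = 10.0  # assumed per-read timeout in seconds

incoming_slots = asyncio.Semaphore(MAX_INCOMING)

async def handle_socket(reader: asyncio.StreamReader,
                        writer: asyncio.StreamWriter) -> None:
    # Refuse the connection outright when all incoming slots are taken,
    # instead of letting it queue up and linger. (Best-effort check; a
    # real implementation would acquire the slot atomically.)
    if incoming_slots.locked():
        writer.close()
        await writer.wait_closed()
        return
    async with incoming_slots:
        try:
            while True:
                # The timeout applies per read, so a peer stalling
                # mid-transfer cannot keep the connection and its
                # buffers alive forever.
                chunk = await asyncio.wait_for(reader.read(4096), READ_TIMEOUT)
                if not chunk:
                    break
                # ... process the chunk ...
        except asyncio.TimeoutError:
            pass  # lingering connection: drop it
        finally:
            writer.close()
            await writer.wait_closed()
```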

kdeme commented Sep 8, 2023

This issue has become a bit more pressing due to some OOMs of our nodes in the fleet.

Visible for example here: https://metrics.status.im/d/iWQQPuPnkadsf/nimbus-fluffy-dashboard?orgId=1&var-instance=metal-01.ih-eu-mda1.nimbus.fluffy&var-container=nimbus-fluffy-mainnet-master-01&from=1693655722341&to=1693773093911
Or here: https://metrics.status.im/d/iWQQPuPnkadsf/nimbus-fluffy-dashboard?orgId=1&var-instance=metal-01.ih-eu-mda1.nimbus.fluffy&var-container=nimbus-fluffy-mainnet-master-02&from=1693499124066&to=1693614110733

The current theory is that this occurs when a Bridge is injecting a lot of content and a large number of offers (probably outgoing ones) builds up.

It is known that the current system of AsyncQueues is not really designed to prevent this from happening.

Some probable issues:

  1. contentQueue in a PortalStream is unbounded. We might want to add a limit, but a plain await on a full queue only kicks in after all the content has already been read over the uTP stream. To avoid a build-up there, drop incoming sockets when that limit is reached, or better, avoid accepting the offer in the first place (see the first sketch below).
  2. The offerQueue in PortalProtocol does have a limit, but it is set to the same amount as the number of workers popping off the queue. The possible issue is that several running NeighborhoodGossip calls end up blocking because the offerQueue reaches its limit, especially as each incoming offer/accept/content cycle can result in up to 8 new offers (see the second sketch below).
  3. Quite a few copies of the content items and content keys appear to be made in NeighborhoodGossip, due to the different ways the stream passes the data along and how it is handed to the offerQueue. Combined with 2., this makes things much worse.
  4. This is more of an assumption, but it is possible (likely?) that quite a few duplicate offers are being accepted (and possibly gossiped) at the same time, assuming the same offers arrive from different peers around the same time. We do not avoid accepting these as long as we have not yet received and verified the data. Whether this is really an issue should be verified first (see the third sketch below).
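
For point 1, a sketch of the intended behaviour, again with asyncio.Queue standing in for the chronos AsyncQueue (the bound of 50 and the helper names are assumptions, not the actual fluffy API):

```python
import asyncio

content_queue: asyncio.Queue = asyncio.Queue(maxsize=50)  # assumed bound

def can_accept_offer() -> bool:
    # Decide *before* the uTP transfer starts: if the queue is full,
    # decline the offer so no data is read at all, rather than reading
    # everything and only then blocking on a full queue.
    return not content_queue.full()

def enqueue_content(content_keys: bytes, content_items: list[bytes]) -> None:
    try:
        # put_nowait never blocks; if the queue filled up between accepting
        # the offer and finishing the transfer, drop instead of awaiting.
        content_queue.put_nowait((content_keys, content_items))
    except asyncio.QueueFull:
        pass  # drop: the socket is closed and the content is discarded
```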
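
For point 2, one possible mitigation is to size the offer queue independently of the worker count and to drop offers instead of blocking the gossip path. A sketch under those assumptions (the 4x headroom factor and the names are illustrative):

```python
import asyncio

NUM_OFFER_WORKERS = 8  # assumed worker count
# Give the queue headroom beyond the worker count so gossip bursts
# do not immediately hit the limit.
offer_queue: asyncio.Queue = asyncio.Queue(maxsize=4 * NUM_OFFER_WORKERS)

def queue_offer(offer) -> None:
    # NeighborhoodGossip should never block on a full queue: each handled
    # offer/accept/content cycle can fan out into up to 8 new offers, so
    # awaiting here lets blocked gossip tasks pile up in memory.
    try:
        offer_queue.put_nowait(offer)
    except asyncio.QueueFull:
        pass  # shed load instead of accumulating blocked gossip tasks

async def offer_worker() -> None:
    while True:
        offer = await offer_queue.get()
        # ... send the offer to the selected peer ...
        offer_queue.task_done()
```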
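
For point 4, a sketch of a possible dedup mitigation: remember which content keys are in flight between offer acceptance and validation, so duplicates arriving from other peers in the meantime are declined (hypothetical helper names):

```python
in_flight: set[bytes] = set()  # content keys currently being transferred

def should_accept(content_key: bytes) -> bool:
    # Track a key from the moment its offer is accepted until the content
    # is received and validated, so the same item offered by several peers
    # at roughly the same time is only transferred once.
    if content_key in in_flight:
        return False
    in_flight.add(content_key)
    return True

def transfer_finished(content_key: bytes) -> None:
    # Clear the marker whether validation succeeded or failed; otherwise a
    # failed transfer would block that content key forever.
    in_flight.discard(content_key)
```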


kdeme commented Sep 16, 2023

  1. is done in Do not accept new offers if our contentQueue is full #1753 and appears to work, so this will already be merged.

  2. is done in Potential quickfix against a NHGossip offer build-up #1739, but it has not been proven to get rid of the memory build-up on its own, so it is closed for now. We might still want to add a similar version of this.
