What are the pros and cons of using something like Bloom Filters for syncing purposes? #3

Open
viktorvsk opened this issue Aug 3, 2023 · 3 comments

Comments

@viktorvsk

Hi!
When implementing this protocol, did you consider any existing solutions? Maybe you specifically considered Bloom filters and decided against them?

What are the disadvantages of using something like the following:

  1. Relay-1 initiates a sync request using BLOOM-SCHEME
  2. If Relay-2 confirms the request, Relay-1 prepares a Bloom filter containing all event IDs it has and sends it to Relay-2
  3. Relay-2 checks its events against this Bloom filter and prepares have, need and in-doubt objects (see the sketch after this list)
  4. Then it's an implementation detail how the relays decide to actually exchange events
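
To make this concrete, here is a minimal, hypothetical Python sketch of steps 2-3. The toy BloomFilter class, relay ID sets, and bucket names are all illustrative, not part of any existing protocol; since the "in-doubt" bucket above isn't specified further, the sketch only splits into the two buckets a single membership test can give:

```python
import hashlib
import math

class BloomFilter:
    """Toy Bloom filter over event-ID strings (illustrative only)."""
    def __init__(self, n_items, error_rate):
        # standard sizing: m bits and k hash functions for a target false-positive rate
        self.m = max(8, int(-n_items * math.log(error_rate) / (math.log(2) ** 2)))
        self.k = max(1, round((self.m / n_items) * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] >> (pos % 8) & 1 for pos in self._positions(item))

# Step 2: Relay-1 builds a filter over every event ID it has and ships the raw bits.
relay1_ids = {f"event-{i:06d}" for i in range(50_000)}
bloom = BloomFilter(len(relay1_ids), error_rate=0.0001)
for eid in relay1_ids:
    bloom.add(eid)

# Step 3: Relay-2 tests each of its own events against the received filter.
relay2_ids = {f"event-{i:06d}" for i in range(25_000, 75_000)}
peer_needs = [eid for eid in relay2_ids if eid not in bloom]      # definitely missing on Relay-1
peer_probably_has = [eid for eid in relay2_ids if eid in bloom]   # "have", but may hide false positives
print(len(bloom.bits), "filter bytes;", len(peer_needs), "need;", len(peer_probably_has), "probably have")
```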

It would require a couple of MB of Bloom filter data to be sent over the network for millions of events, and if the filter's error rate is set to 0.01%, then for 100 million events roughly 10k events would still have to be checked explicitly.
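
A rough back-of-the-envelope check of those figures with the standard Bloom filter sizing formula (exact sizes depend on the implementation):

```python
import math

def bloom_filter_bytes(n, p):
    """Optimal Bloom filter size in bytes for n items at false-positive rate p."""
    return -n * math.log(p) / (math.log(2) ** 2) / 8

# "a couple of MB ... for millions of events"
print(bloom_filter_bytes(1_000_000, 0.0001) / 1e6)   # ~2.4 MB per million events at 0.01% FPR

# "for 100 million events ... roughly 10k events to check explicitly"
print(100_000_000 * 0.0001)                          # ~10,000 expected false positives to re-check
```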

@hoytech
Owner

hoytech commented Aug 15, 2023

Hi! Sorry for the late reply.

I did consider many different protocols for syncing of nostr events, and in fact I previously was using an entirely different approach: https://github.com/hoytech/quadrable#syncing

I also looked at several other algorithms for set reconciliation (including this interesting one: https://github.com/sipa/minisketch) but my conclusion was that none of them would be effective for my use-case (syncing nostr events).

I have run the numbers on bloom filters before and couldn't figure out a way for them to make sense. I believe you that you would need a filter several megabytes in size for significantly-sized DBs. If this needed to be transferred on every sync, that would be an unacceptable overhead, IMO.

Instead, with negentropy, the amount of traffic grows logarithmically with DB size and linearly with the symmetric difference between the two DBs. When the two DBs are identical, this can be determined with about 300 bytes of network traffic and one network round-trip (with the default settings). For many use-cases, I anticipate syncing of identical DBs to be the most common case (i.e. periodically syncing two replicas in case any events got missed on one of them, which would be rare).
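
For intuition only (this is not the actual negentropy wire format or fingerprint function; the split strategy, threshold, and toy hash below are made up for illustration), the range-based idea looks roughly like this: fingerprint a sorted range of IDs on both sides, stop as soon as the fingerprints match, and only split ranges that differ.

```python
import hashlib
import random

def fingerprint(ids):
    # Toy range fingerprint: a hash over the sorted IDs in the range
    # (negentropy's real fingerprint function is different).
    h = hashlib.sha256()
    for i in sorted(ids):
        h.update(i.to_bytes(8, "big"))
    return h.digest()[:8]

def reconcile(a_ids, b_ids, lo, hi, diffs, threshold=16):
    """Recursively compare the ID range [lo, hi) held by two sides, collecting the symmetric difference."""
    a = [i for i in a_ids if lo <= i < hi]
    b = [i for i in b_ids if lo <= i < hi]
    if fingerprint(a) == fingerprint(b):
        return                          # matching fingerprint: the whole range is settled in one step
    if len(a) <= threshold and len(b) <= threshold:
        diffs |= set(a) ^ set(b)        # small mismatching range: just exchange the IDs
        return
    mid = (lo + hi) // 2                # large mismatching range: split and recurse
    reconcile(a_ids, b_ids, lo, mid, diffs, threshold)
    reconcile(a_ids, b_ids, mid, hi, diffs, threshold)

random.seed(1)
side_a = {random.randrange(2**32) for _ in range(10_000)}

side_b = set(side_a)                    # identical DBs: one fingerprint comparison settles everything
diffs = set(); reconcile(side_a, side_b, 0, 2**32, diffs); print(len(diffs))   # 0

side_b = side_a - set(list(side_a)[:5]) # one side is missing a few events
diffs = set(); reconcile(side_a, side_b, 0, 2**32, diffs); print(len(diffs))   # 5
```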

I'm happy to discuss bloom filters or alternate approaches in more detail, although I am quite satisfied with how simply and effectively the range-based set reconciliation approach seems to be working so far -- not to say there isn't room for improvement! In particular, it would be great if fingerprints could be pre-computed/cached to reduce CPU and IO overhead during the sync setup.

One thing I don't quite understand is how this would work with bloom filters: you are hashing each event ID and setting that in a giant bloom filter, and the other side is doing the same, right? In this case you would have false positives that represent an event that side A incorrectly believes side B to have. How do you detect these events? False negatives would be easier to deal with, since they would be detected once you actually try to download them (assuming the error rate was low enough that this overhead is negligible).

@hoytech
Owner

hoytech commented Aug 15, 2023

I found a paper that describes a scheme that uses bloom filters, in addition to an Invertible Bloom Lookup Table (IBLT):

https://people.cs.umass.edu/~gbiss/graphene.pdf

@hoytech
Owner

hoytech commented Aug 16, 2023

I did a bit more reading, and found a good summary of the issues here: https://www.sciopen.com/article/pdf/10.1109/TST.2016.7442499.pdf

"For the SBF-based set reconciliation method, due to false positives, some different objects unique to one set or the other cannot be identified. We can decrease the number of omitted unique objects by decreasing the false positive probability, while causing more communication overhead. For the IBF-based method, due to the probability of decoding failure, this degrades to a worst-case situation in which no unique object can be decoded."

With a regular bloom filter, I guess you have to accept that some elements will just get missed, or drive the false-positive probability so far down that it effectively won't happen (which implies a very large filter size).
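
A rough illustration of that trade-off, using the same back-of-the-envelope sizing formula as above (actual implementations will vary somewhat):

```python
import math

def bloom_mb(n, p):
    # optimal Bloom filter size in megabytes for n items at false-positive rate p
    return -n * math.log(p) / (math.log(2) ** 2) / 8 / 1e6

for p in (1e-2, 1e-4, 1e-6, 1e-9):
    print(f"p={p:g}: ~{bloom_mb(100_000_000, p):.0f} MB for 100M event IDs")
# p=0.01: ~120 MB, p=0.0001: ~240 MB, p=1e-06: ~359 MB, p=1e-09: ~539 MB
```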

I also found a cool repo for benchmarking various set reconciliation schemes, and I opened an issue requesting that they add negentropy: nislab/gensync#9
