Fast channel discovery after first startup #5828
I don't know much about the channel stuff, but assuming that channel discovery is done by sending
Disclaimer: normally, you really do not want to hit maximum throughput. Filled UDP buffers lead to packet drops and broken overlay message flows. If we are bootstrapping our channels though, we don't really care about this packet drop. However, do note that you really want this filling of your buffers to stop after a while, to resume normal control messaging and nicely load-balanced communication.

If you really want maximum throughput, you have to balance out the latency of (1) adding new peers for varied information and (2) actually receiving channels from the community for information volume. I.e., once you have a decently varied group of overlay neighbors, don't spend your valuable time and bandwidth on NAT-puncturing and keeping alive a new peer; spend it on pulling in channels. It is best to get connected to a small group of other peers and have them send their knowledge to us at a high rate. This translates to a pull-based mechanism, using a "please send me everything you have for the next X seconds"-message.

Note that IPv8 already takes care of introducing you to a decently sized pool of peers to communicate with by default (disclaimer #2: what a "decent size" is depends on the application, but we know from years of gossip research papers that more than 60 neighbors gives you almost no returns, and 15 neighbors is about the start of the range you want to be in).

As mentioned by @egbertbouman, the current implementation asks any newly responding peer, once, for a single channel this peer is subscribed to:

tribler/src/tribler-core/tribler_core/modules/metadata_store/community/remote_query_community.py Line 119 in 1c4335c
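The responder side of such a "please send me everything you have for the next X seconds"-message could be as simple as a deadline object checked before each send. A minimal sketch under assumed names (this is not an existing IPv8/Tribler API):

```python
import time


class PullWindow:
    """Tracks a remote peer's "send me everything for X seconds" request.

    Hypothetical sketch, not the actual Tribler/IPv8 API: the requesting
    peer sends the window length, and the responder uses an object like
    this to decide whether to keep streaming channel entries.
    """

    def __init__(self, duration_seconds, clock=time.time):
        self.clock = clock
        self.expires_at = clock() + duration_seconds

    def is_open(self):
        # The responder stops sending once the requested window elapses,
        # so normal (polite) gossip traffic can resume.
        return self.clock() < self.expires_at
```

A responder loop would check `window.is_open()` before each batch of entries, which gives the querying side full control over how long the flood lasts.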
I agree with @egbertbouman, and I suggest instead using the aforementioned "please send me everything you have for the next X seconds"-message upon connection to a new peer, instead of this very polite "please share one of your subscribed channels"-message.

@qstokkink I assumed the
@egbertbouman I stand corrected on that then, and I also agree that one result per message should be fine, as long as you send more messages if you want more results (or, as I suggested above, a single message that leads to multiple answers). I do want to stress that the current implementation is very good about its bandwidth and load balancing, and has very readable and maintainable code. It's just not.. "mean" enough.
Not maximum; just 25 incoming UDP messages would be ideal. I see too many security risks with, say, 4 MByte/second of incoming traffic (3000 packets per second!). Not only is congestion control not done yet (packet loss, buffer fill), we also have no trust function yet. Very helpful discussion.
Perfect! Each new peer can only give us 1 entry. No compromise on security/spam. If we can talk to 25 new people per second and ask them for a random subscribed channel: done! Can we reply to incoming new strangers as well (or do we do that already..)? tribler/src/tribler-core/tribler_core/modules/metadata_store/community/gigachannel_community.py Line 98 in 1c4335c
My preference is to re-use the regular gossip mechanism. I don't like using the generic remote query mechanism (though that API is great for debugging, obviously). Can we re-use that existing code?
Well, I mean, we can. But we probably shouldn't: see https://ieeexplore.ieee.org/document/6934311. Based on both past experience and science from our lab, I would suggest depending more on information propagation than on peer churn.
Correction: it will send you all the results of the remote query, broken down into as many (lz4-compressed) UDP packets as necessary. There is a hard limit on the sender side to prevent abuse: never send more than 50 entries. For example, if a remote query results in N entries (channels) to be sent, these N entries will be sent as 1-N packets, depending on how many of them fit into a single packet. The sender side will proceed to pack the response entries into a single packet until it fills up to capacity, then it will send the packet and proceed to pack the rest of the unsent entries into the next packet, etc. So, it is pretty efficient as it is.
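The greedy packing behaviour described here can be sketched roughly as follows. Only the 50-entry cap comes from the comment above; the 1300-byte payload budget and the byte-level serialization are assumptions, not the actual Tribler serializer:

```python
def pack_entries(entries, max_packet_size=1300, max_entries=50):
    """Greedily pack serialized entries (bytes) into as few packets as possible.

    Sketch of the described sender-side behaviour: fill a packet to capacity,
    then start the next one. The 50-entry hard limit mirrors the abuse cap;
    max_packet_size is an assumed UDP payload budget, not the exact constant.
    """
    packets, current, current_size = [], [], 0
    for entry in entries[:max_entries]:  # hard limit: never send more than 50
        if current and current_size + len(entry) > max_packet_size:
            packets.append(b"".join(current))  # packet full: flush it
            current, current_size = [], 0
        current.append(entry)
        current_size += len(entry)
    if current:
        packets.append(b"".join(current))  # flush the last partial packet
    return packets
```

With ~250-byte channel entries this packs five entries per packet, so an N-entry result goes out in roughly N/5 packets.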
Oops... Actually, the refactored GigaChannelCommunity is now based on RemoteQueryCommunity and should use
Of course, but I would say we should make it pull-based instead, so the querying side can control the rate.
Also, the previous (push-based) logic packed preview content into the same packet as the gossiped channel. That's inefficient. RemoteQuery features the query-back subsystem, which can query as much preview content as needed, up to any channel depth. If you're concerned with efficiency, the better way to solve it is to just cache the query results on the sender side.
Yes. However, I do not know IPv8 well enough to even imagine what kind of tweaks should be done to e.g. increase the walk speed 10 times.
That one is easy, divide this value by 10 (from 0.5 -> 0.05):
If you make the value too low, Tribler will detect main thread congestion and auto-scale back anyway.
Is it possible to set
IPv8 was not made to have varying walk intervals per Community (it smooths out network traffic using the fixed value - which is really cool, I do recommend checking out your network traffic in the system monitor 😃). Perhaps it is indeed better to keep your own looping TaskManager task in your particular Community. This gives you the most freedom to control the intervals. That said, you could also set the `target_interval`.

**A note on `target_interval` for ultra-low `walk_interval` values**

The recommended solution for ultra-low `walk_interval` values is to also adjust the `target_interval` of the walk strategies. Concretely, this is done by changing all of these lines to match the current default minimum of:

```python
@walk_strategy(random_walk, target_interval=0.5)
```

Alternatively, you could make a partial class definition in
So, I should just create a task to trigger the walks by myself, right?
Yes, something like this (if you want to get really fancy you could even modulate your own interval on the fly - I won't show that here): tribler/src/tribler-core/tribler_core/modules/metadata_store/gigachannel_manager.py Lines 45 to 47 in 08bb535
And then, in the task itself, fire off the "pull" messages to (random?) peers from
But that is not a "walk", right? E.g. this will not speed up the discovery of new peers. Because we now query each introduced peer on introduction, there is no reason to query them again. Instead, we should get more introductions.
All the

Disclaimer: I'd still prefer depending on information propagation instead. Forcing 50000 users to hammer the bootstrap nodes is one of the few ways you can actually take our entire network down. If we do switch to "open" bootstrap nodes, this is also a good way to get IP banned.
**Detailed specification**

Connect using IPv8 to 200 unique peers and obtain from each of them 1 random subscribed channel (non-empty). Do this at a rate of roughly 20 UDP messages with signed votes per second. No invasive change. This would fill up the screen fast and is a safe incremental step forward for Tribler.
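As a sanity check on the numbers in this spec (just the implied arithmetic, not Tribler code):

```python
TARGET_PEERS = 200        # unique peers to contact, per the spec above
RATE_MSG_PER_SEC = 20     # rough outgoing UDP message budget
CHANNELS_PER_PEER = 1     # one random subscribed channel per peer

seconds_to_query_everyone = TARGET_PEERS / RATE_MSG_PER_SEC
channels_discovered = TARGET_PEERS * CHANNELS_PER_PEER

print(seconds_to_query_everyone)  # 10.0 -> all 200 peers queried in ~10 seconds
print(channels_discovered)        # 200 candidate channels to fill the screen
```

So a fresh install would, at best, see a couple hundred channels within the first ten seconds or so, well within the 30-second goal discussed below.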
Yes 👍
This was very helpful to clarify what I believe users will like.
Bump up this value from 30 to 200.
Can this be changed dynamically? So first, 20 seconds use
Yes, the script becomes about 3 lines though. I'll help with that when the time comes.
Let's only speed up the discovery process in 1 community for 10 seconds. How would you advise us to bypass the 500ms
Here's the complete example (given #5828 (comment)):

```python
import time

from ipv8.community import Community
from ipv8.peerdiscovery.discovery import RandomWalk


class TheCommunityYouWishToAddThisTo(Community):

    def __init__(self, etc):
        # Initialize some stuff
        # Example starts here:
        self.start_speed_walk = time.time()
        self.walker = RandomWalk(self, timeout=10.0, window_size=200)
        self.register_task("speed walking", self.speed_walk, interval=0.05)

    def speed_walk(self):
        self.walker.take_step()
        if time.time() - self.start_speed_walk > 10.0:  # Stop after 10 seconds.
            self.cancel_pending_task("speed walking")
```
I always thought a community only gets the first introduction from the bootstrap nodes, and the rest of the addresses/punctures are introduced directly by other peers?
No, that causes severe network partitioning/clustering. For example, in practice we experienced situations where 4 local nodes would form 2 groups of 2 (causing silly stuff). You need to periodically bootstrap to remain connected to a single cluster (aka a robust network topology). Regarding contacting bootstrap nodes, there are two extremes you don't want to be in:
Well, we have a special case here: for these "fast walks" we don't care about connectivity too much. E.g. can we disable bootstrap-node-querying when calling
Yes, in that case you can do something like this:

```python
if len(self.get_peers()) > 20:  # Stop walking to bootstrap nodes after you have 20 peers.
    self.walker.reset_chance = 0
```
I tried this, kinda works... Next, we must fine-tune the parameters of the walking process. Also, we need to decide if we want to send all the channels, not just the subscribed ones. Now, this seems to look suspiciously like the PopularityCommunity problem... @xoriole , @kozlovsky , @drew2a , may I kindly ask you guys to take over this task? You already have an infrastructure for working with this kind of experiment. Also, this one is easier than the PopCom, as the dataset is so much smaller, and we're in full control of the DB. (p.s. There is still a bug with the GUI not showing updates sometimes, and a problem with dynamically sorting stuff; I will solve these, of course.)
Quick update: I've run an experiment with the suggested tweaks, and these are the raw results. The experiment lasted one minute. Below are two runs:

In both runs, two values were measured
Tweaks implemented:

```python
class ChannelDiscoveryBoosterMixin(RemoteQueryCommunity):
    TIMEOUT_IN_SEC = 60.0
    MAX_PEERS = 200
    # Stop walking to bootstrap nodes after you have this number of peers.
    RETURN_TO_BOOTSTRAP_THRESHOLD = 20
    TAKE_STEP_INTERVAL_IN_SEC = 0.05

    def __init__(self, *args, **kwargs):
        super(ChannelDiscoveryBoosterMixin, self).__init__(*args, **kwargs)
        self.logger.info(f'Init. Timeout: {ChannelDiscoveryBoosterMixin.TIMEOUT_IN_SEC}s, '
                         f'Max peers: {ChannelDiscoveryBoosterMixin.MAX_PEERS}, '
                         f'Take step interval: {ChannelDiscoveryBoosterMixin.TAKE_STEP_INTERVAL_IN_SEC}s')

        self.walker = RandomWalk(self, timeout=ChannelDiscoveryBoosterMixin.TIMEOUT_IN_SEC,
                                 window_size=ChannelDiscoveryBoosterMixin.MAX_PEERS)
        self.register_task('take step', self.speed_walk,
                           interval=ChannelDiscoveryBoosterMixin.TAKE_STEP_INTERVAL_IN_SEC)

    def speed_walk(self):
        self.logger.info('Take a step')
        self.walker.take_step()

        # Stop walking to bootstrap nodes after you have N peers.
        if len(self.get_peers()) > ChannelDiscoveryBoosterMixin.RETURN_TO_BOOTSTRAP_THRESHOLD:
            self.logger.info('Stop returns to bootstrap')
            self.walker.reset_chance = 0
```
Quick update. The data from the experiment with almost all requested changes implemented. The experiment duration: 1 min

The remote query used:

```python
self.send_remote_select(peer=peer, metadata_type=CHANNEL_TORRENT, subscribed=True,
                        attribute_ranges=(("num_entries", 1, None),), last=1)
```

The count query used:

```python
channels_count = count(ts for ts in self.mds.ChannelNode if ts.metadata_type == CHANNEL_TORRENT)
```
```python
self.send_remote_select(peer=peer, metadata_type=CHANNEL_TORRENT, subscribed=True,
                        attribute_ranges=(("num_entries", 1, None),), last=1)
```

Well, you're missing the opportunity to put more than one channel entry into a single UDP packet. Typically, a single channel entry takes about 250 bytes. So, if you change it to

Also, keep in mind that more queries = more load on the network. So, I don't see any reason to include
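A quick back-of-the-envelope for raising `last` above 1. The ~250 bytes/entry figure comes from the comment above; the usable payload per UDP packet is an assumed round number, not the exact Tribler constant:

```python
ENTRY_SIZE_BYTES = 250     # approximate serialized channel entry (per the comment)
UDP_PAYLOAD_BYTES = 1300   # assumed usable payload per packet, not the exact constant

entries_per_packet = UDP_PAYLOAD_BYTES // ENTRY_SIZE_BYTES
print(entries_per_packet)  # 5 -> roughly 5 entries fit in one reply packet
```

In other words, raising `last` from 1 to about 5 returns several times the entries per query without increasing the packet count (and lz4 compression may fit even more).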
@ichorid thank you for the comment, I'll keep it in mind. The reason why I publish the charts and the PR is to get feedback as early as possible. Why did I use
@ichorid by the way, setting

My guess is that the ordinary user hasn't subscribed to any channels at all.
@drew2a could you include
@qstokkink yes, sure. I have a question here. From run to run I see that the "peers count" gets stuck in the range 30-60.
It's because the RandomWalk strategy picks a random peer from your known peers to get an introduction from. Basically, the more peers you get, the higher the chance you get a peer you already knew. Essentially, you need to do an exponential amount of work for every new peer. You could experiment with the EdgeWalk strategy to overcome this.
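This diminishing-returns effect can be illustrated with a toy coupon-collector style simulation (a simplified model, not IPv8 code): every step yields an introduction to a uniformly random peer out of a fixed population, and we record how many steps each successive new peer costs.

```python
import random


def steps_per_new_peer(population=100, target=50, seed=0):
    """Toy model of random-introduction walking: count the number of steps
    needed to meet each successive new peer, when every step introduces a
    uniformly random peer out of `population`."""
    rng = random.Random(seed)
    known, costs, steps = set(), [], 0
    while len(known) < target:
        steps += 1
        peer = rng.randrange(population)
        if peer not in known:       # a genuinely new peer
            known.add(peer)
            costs.append(steps)     # steps spent since the previous discovery
            steps = 0
    return costs

# Early peers arrive almost every step; later discoveries take progressively
# more steps, which is why the observed "peers count" plateaus.
```

Averaged over runs, the first few discoveries cost about one step each, while later ones cost several, matching the plateau seen in the experiments.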
@qstokkink thank you, that was my main assumption... Which arguments do you suggest for EdgeWalk?

Benchmark update:
I've never actually tried to prioritize getting a substantial amount of peers as quickly as possible. Intuitively, I'd start with a neighborhood size of 30 and an edge length of 10; the edge timeout can probably stay at the default. You'd have to experiment a bit to find the optimal values. Also, regarding the recent result, omitting
Yes, it is strange. Of course, I could have made a mistake in the query. @ichorid, could you verify the query?
Solid progress! Great measure & exploration work.
Yes |
Congratulations, sir, you found a bug! 🎉 🐛 🎉 tribler/src/tribler-core/tribler_core/modules/metadata_store/orm_bindings/metadata_node.py Line 109 in b6774ca
you guess it should be
Also, be warned that if you're going to get all the channels (and not just the subscribed ones), you lose the advantage of filtering junk. Consequently, you lose the information necessary to build the beautiful VSIDS-based channels rating.
@ichorid I understood everything, except:
When you query other peers for subscribed channels, there is a callback on receiving new entries that bumps (adds votes to) the received channels: tribler/src/tribler-core/tribler_core/modules/metadata_store/community/gigachannel_community.py Line 92 in b6774ca
In a sense, the voting system uses pull-based gossip as a "cue" about what is popular. There is no explicit "peer X voted for channel Y" message. If you are going to query remote peers for non-subscribed channels, you have to disable this callback. Otherwise, all of the received channel entries will be bumped (i.e. marked as voted by the peer who sent you these). This will result in breaking the distributed algorithm for voting, which is:
If there is no difference between "subscribed" and "not subscribed" channels, there is no difference between "voted" and "non-voted" channels. In that case, every channel's rating will eventually become (almost) the same. To circumvent this problem without changing the packet format, introducing a different community, doing Bloom filters, or whatever, the simplest thing is to simultaneously query the same host both for all the channels (w/o the vote-bumping callback) and for subscribed channels only (with the vote-bumping callback). This will still result in the propagation of junk, but at least it will keep the voting system working, which is the only thing that saves the user from being completely overwhelmed by junk.
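The dual-query idea can be sketched as a toy model (hypothetical helper names, not the actual RemoteQueryCommunity API): results of the catch-all query only fill the local store, while only results of the subscribed-only query count as an implicit vote from the sending peer.

```python
def merge_query_results(all_channels, subscribed_channels):
    """Toy model of the dual-query workaround: store everything we receive,
    but count an implicit vote only for channels returned by the
    subscribed-only query (one vote per sending peer)."""
    store = set(all_channels) | set(subscribed_channels)
    votes = {channel: 1 for channel in subscribed_channels}
    return store, votes


store, votes = merge_query_results(
    all_channels=["junk", "popular", "niche"],  # w/o the vote-bumping callback
    subscribed_channels=["popular"],            # with the vote-bumping callback
)
# "popular" gets a vote bump; "junk" is stored but never gains rating.
```

This keeps junk discoverable without letting it climb the VSIDS-style rating.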
@ichorid thank you for the explanation. It is clear now ✊.

@qstokkink regarding

I made an experiment today in which I tried to estimate the peers count. Pseudocode of the experiment is the following:

```python
for neighborhood_size in [1, 25, 50, 75, 100]:
    for edge_length in [1, 12, 25, 37, 50]:
        for repeat in range(5):
            calculate_peers_count(neighborhood_size, edge_length, timeout=30)
```

Surprisingly, the best (average) values are:
The raw data in CSV format: edge_walk.csv.zip

And this is a visualisation (though, unfortunately, it's not really obvious)
@drew2a Nice graph 👍 From your results it seems 25/25 is the most balanced choice. We want the neighborhood size as small as possible to make sure we don't depend too much on the bootstrapping mechanism (to hook in external bootstrap nodes we need to cure our performant bootstrap node addiction 😃).
The distribution of the number of subscribed channels per peer (logarithmic, slightly distorted for visibility, w/o personal channels) by @ichorid: #3615 (comment)
The flagship feature is nearly there, I guess. An experimental .deb build in the coming days/week would be great to get a feel for the 25/25 bootstrap speed...
Goal: accelerate the genesis discovery of channels.
The experience we give our users: when Tribler starts for the first time, your screen is just bombarded with newly discovered channels. It goes so fast that you are barely able to read all the information Tribler is pulling off the Internet. After some seconds, the most popular channels start to float to the top. Re-sorting based on popularity is done every second or so. Within 30 seconds, the most popular channels on Tribler are stable and fill the page. The user is able to scroll through all these new channels.
Out of scope: fast pre-preview when clicking on a new channel. Also out of scope: fast start of the channel download after subscribing.
By walking multiple times within IPv8 we can attempt to discover 25 new channels per second for the first 30 seconds of Tribler startup. The introduction of the first peer usually takes 600ms; after that we aim to discover channel entries at full speed.
https://jenkins-ci.tribler.org/job/ipv8/job/performance_peer_discovery/plot/