Gossiping torrent popularity to scale to millions of torrents #4256

Open

xoriole opened this issue Feb 26, 2019 · 13 comments
Comments

@xoriole
Contributor

xoriole commented Feb 26, 2019

A Tribler Giga channel can scale to billions of torrents (#3971) and is shared in the network. However, the health of a torrent (being a dynamic value) is not included in the channel and is not propagated. This gives access to a large number of torrents without any indication of whether they are alive. We therefore need a mechanism to share the health of the torrents in the Giga channel, so that popular content surfaces and users can browse through it.

To find out the popularity of the torrents in the Giga channel and to disseminate the popular torrents in the network, we start with a simple approach based on random selection.

Some key points:

  • Do a DHT health check for a random torrent every 4 minutes ?? There is no need to store the metainfo.
  • Update a torrent's health at most weekly.
  • Gossip the daily top 5 torrents and 5 random torrents that you have checked and that are alive (see the sketch below).
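
A minimal sketch of the random selection step, assuming a hypothetical in-memory map of infohashes to last-check timestamps (the real implementation would live in Tribler's torrent checker):

        import random
        import time

        CHECK_INTERVAL = 4 * 60        # check one random torrent every 4 minutes
        RECHECK_AFTER = 7 * 24 * 3600  # update a torrent's health at most weekly

        def pick_torrent_to_check(torrents):
            """Pick a random torrent whose health was not checked in the last week.
            `torrents` is a hypothetical dict {infohash: last_check_timestamp}."""
            now = time.time()
            stale = [infohash for infohash, last_check in torrents.items()
                     if now - last_check > RECHECK_AFTER]
            return random.choice(stale) if stale else None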

After the first deployment, we will know how well the random approach works, and optimizations can follow.

@qstokkink
Contributor

@devos50
Contributor

devos50 commented Feb 26, 2019

First results of a DHT lookup experiment to establish the amount of traffic it takes to do torrent lookups. In this small experiment, I do multiple lookups of random torrents from an academic dataset using the torrent health endpoint of Tribler (nothing else is running). After every lookup, I query how many bytes the libtorrent DHT has processed (both incoming and outgoing) using the dht.dht_bytes_in and dht.dht_bytes_out fields provided by the libtorrent session. The following graph shows the lookup ID on the horizontal axis and the bytes uploaded/downloaded, in KB, on the vertical axis.
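
For reference, a minimal sketch of reading these counters from a libtorrent session through the Python bindings; this assumes libtorrent 1.2+, where session_stats_alert exposes the counters as a name-to-value dict (attribute names may differ per version):

        import time
        import libtorrent as lt

        ses = lt.session()

        def read_dht_byte_counters(session):
            """Request session stats and read the DHT byte counters from the alert.
            Assumes session_stats_alert.values is exposed as a dict by the bindings."""
            session.post_session_stats()
            time.sleep(0.5)  # give libtorrent time to post the stats alert
            for alert in session.pop_alerts():
                if isinstance(alert, lt.session_stats_alert):
                    return (alert.values["dht.dht_bytes_in"],
                            alert.values["dht.dht_bytes_out"])
            return None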

[figure: bytes processed by the DHT per lookup]

As we see, the average number of bytes needed for a DHT lookup increases over time. In addition, we require around 8.5KB of non-DHT traffic per lookup (as reported by the net.sent_bytes and net.recv_bytes counters).

In total, the session processed around 80MB of incoming DHT traffic and 120MB of outgoing DHT traffic. Full libtorrent DHT statistics after 3638 lookups:

        "dht.dht_allocated_observers": 1023,
        "dht.dht_announce_peer_in": 9413,
        "dht.dht_announce_peer_out": 29168,
        "dht.dht_bytes_in": 81144356,
        "dht.dht_bytes_out": 123248854,
        "dht.dht_find_node_in": 89649,
        "dht.dht_find_node_out": 0,
        "dht.dht_get_in": 0,
        "dht.dht_get_out": 0,
        "dht.dht_get_peers_in": 188270,
        "dht.dht_get_peers_out": 264774,
        "dht.dht_immutable_data": 0,
        "dht.dht_invalid_announce": 415,
        "dht.dht_invalid_get": 0,
        "dht.dht_invalid_get_peers": 0,
        "dht.dht_invalid_put": 0,
        "dht.dht_messages_in": 549747,
        "dht.dht_messages_out": 630264,
        "dht.dht_messages_out_dropped": 106,
        "dht.dht_mutable_data": 0,
        "dht.dht_node_cache": 163,
        "dht.dht_nodes": 391,
        "dht.dht_peers": 267,
        "dht.dht_ping_in": 39504,
        "dht.dht_ping_out": 7798,
        "dht.dht_put_in": 0,
        "dht.dht_put_out": 0,
        "dht.dht_torrents": 182,

Net stats:

        "net.has_incoming_connections": 1,
        "net.limiter_down_bytes": 0,
        "net.limiter_down_queue": 0,
        "net.limiter_up_bytes": 0,
        "net.limiter_up_queue": 0,
        "net.on_accept_counter": 146622,
        "net.on_disk_counter": 2210,
        "net.on_disk_queue_counter": 0,
        "net.on_lsd_counter": 198,
        "net.on_lsd_peer_counter": 0,
        "net.on_read_counter": 728848,
        "net.on_tick_counter": 118829,
        "net.on_udp_counter": 1199577,
        "net.on_write_counter": 168701,
        "net.recv_bytes": 29734253,
        "net.recv_failed_bytes": 0,
        "net.recv_ip_overhead_bytes": 1796640,
        "net.recv_payload_bytes": 0,
        "net.recv_redundant_bytes": 22488,
        "net.recv_tracker_bytes": 0,
        "net.sent_bytes": 1306179,
        "net.sent_ip_overhead_bytes": 3207080,
        "net.sent_payload_bytes": 0,
        "net.sent_tracker_bytes": 0,

CPU usage during these lookups varies between 5% and 10% and is relatively stable.

@devos50
Contributor

devos50 commented Mar 5, 2019

Tomorrow I will plot all available statistics regarding DHT traffic and try to figure out why the traffic increases. A possible solution is to reset the DHT session after a specific number of lookups (see the sketch below).
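
That workaround could look roughly like the sketch below; the lookup function itself is hypothetical and only the periodic session reset is illustrated:

        import libtorrent as lt

        LOOKUPS_PER_SESSION = 500  # arbitrary threshold, to be tuned experimentally

        def run_lookups(infohashes, do_lookup):
            """Run health lookups, recreating the libtorrent session every
            LOOKUPS_PER_SESSION lookups. `do_lookup(session, infohash)` is a
            hypothetical health-check function."""
            session = lt.session()
            for i, infohash in enumerate(infohashes):
                if i and i % LOOKUPS_PER_SESSION == 0:
                    session = lt.session()  # drop the old session and its DHT state
                do_lookup(session, infohash)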

@devos50
Contributor

devos50 commented Mar 6, 2019

Findings so far:

  • Getting accurate health information from an info hash only is hard.
  • Our current way to get torrent health data, based on the information in the libtorrent DHT, is as follows: start downloading the torrent with the specific infohash in upload mode. As soon as we have received the metadata from the DHT, we take a snapshot of the current list of peers we are connected to and remove the download. This only reports (a select number of) the peers in the swarm and does not consider seeders. I've tested this manually (see the sketch after this list).
  • The get_peers method seems to be exposed. My next step is to try to use this method to get peers that are actively seeding the torrent. This also avoids joining the torrent swarm.
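
A rough sketch of the current snapshot-based approach described above; this is simplified and assumes the libtorrent 1.2 Python bindings (flag and attribute names may differ per version):

        import time
        import libtorrent as lt

        def snapshot_swarm(session, infohash_hex, timeout=30):
            """Add the torrent in upload mode, wait for its metadata, then snapshot
            the peers we are connected to and remove the download again."""
            params = lt.add_torrent_params()
            params.info_hash = lt.sha1_hash(bytes.fromhex(infohash_hex))
            params.save_path = "."
            params.flags |= lt.torrent_flags.upload_mode  # never download payload
            handle = session.add_torrent(params)

            deadline = time.time() + timeout
            while not handle.status().has_metadata and time.time() < deadline:
                time.sleep(1)

            peers = handle.get_peer_info()  # snapshot of currently connected peers
            session.remove_torrent(handle)
            return len(peers)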

@devos50
Contributor

devos50 commented Mar 6, 2019

Update: I managed to get the peers from the DHT. Now we should find a way to classify them as seeders or leechers.
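
A minimal sketch of such a DHT-only peer lookup; it assumes the Python bindings expose session.dht_get_peers and dht_get_peers_reply_alert.peers(), which depends on the libtorrent version:

        import time
        import libtorrent as lt

        def dht_peer_lookup(session, infohash_hex, timeout=15):
            """Ask the DHT for peers of an infohash without joining the swarm."""
            target = lt.sha1_hash(bytes.fromhex(infohash_hex))
            session.dht_get_peers(target)

            peers = []
            deadline = time.time() + timeout
            while time.time() < deadline:
                for alert in session.pop_alerts():
                    # A real implementation would also match the alert's info hash
                    # against the requested target.
                    if isinstance(alert, lt.dht_get_peers_reply_alert):
                        peers.extend(alert.peers())
                time.sleep(0.5)
            return peers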

@qstokkink
Contributor

qstokkink commented Mar 6, 2019

#109, also #143

@devos50
Contributor

devos50 commented Mar 6, 2019

@qstokkink thanks for the links!

@devos50
Contributor

devos50 commented Mar 18, 2019

I performed 1300 BEP33 lookups with the modified libtorrent code (see #4321 for details).

Some results:

Message sent and received

[figure: DHT messages sent and received per message type]

Here we plot the statistics on messages being sent and received in the DHT, as reported by the libtorrent session. Note that there are many outgoing get_peers messages, which is not unexpected since we are doing DHT lookups frequently. We also receive numerous incoming get_peers messages.

Bandwidth usage

[figure: DHT bandwidth usage over the lookups]

Note how BEP33 does not require any libtorrent session bandwidth outside of the DHT traffic. Since we are not actually joining the swarm, libtorrent does not have to maintain connections to other peers, only to DHT nodes. The bandwidth usage increases roughly linearly and there seems to be no additional bandwidth overhead after each lookup.

After 1300 lookups, each lookup requires 13266 bytes (13.0KB) of incoming traffic and 10731 bytes (10.5KB) of outgoing traffic. In total, each lookup requires 23997 bytes (23.4KB). Most of the traffic is probably related to the bloom filters being sent around. These bloom filters are 256 bytes and there are two of them in each incoming get_peers response.
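
For reference, a sketch of the size estimate that can be computed from such a 256-byte BEP33 bloom filter (the formula is the one given in BEP33, with m = 2048 bits and k = 2 hash functions; the clamping of the zero-bit count is my own guard against log(0)):

        import math

        M = 256 * 8  # bloom filter size in bits (256 bytes)
        K = 2        # number of hash functions used by BEP33

        def bep33_estimate(bloom_filter: bytes) -> float:
            """Estimate how many peers were inserted into a BEP33 bloom filter:
            size ~= ln(c / m) / (k * ln(1 - 1/m)), with c the number of zero bits."""
            zero_bits = sum(8 - bin(byte).count("1") for byte in bloom_filter)
            c = max(zero_bits, 1)  # avoid log(0) for a fully saturated filter
            return math.log(c / M) / (K * math.log(1 - 1 / M))

        # The BFsd (seeds) and BFpe (peers) filters from multiple get_peers
        # responses can be OR-ed together before estimating, which is what makes
        # the combined approximation reliable.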

DHT statistics

[figure: DHT routing table, peer and torrent statistics]

We now plot several statistics of the DHT, such as the number of nodes in our routing table and the number of torrents tracked. The number of peers tracked remains around 100. The number of nodes in our routing table quickly increases after startup to around 400. We do not track mutable/immutable data items.

Dead torrents

Of the 1300 lookups, 823 torrents (61.9%) did not have any seeders and are considered dead.

Conclusions

I think we should implement BEP33 lookups in Tribler. The tradeoffs are as follows:

Advantages:

  • It is a relatively easy method to estimate the size of libtorrent swarms and the results are a reliable approximation of the real size since we can combine multiple bloom filters. I manually verified this by doing lookups for some very healthy torrents (1000+ seeders/leechers), dead torrents (0 seeders) and almost dead torrents (a few seeders and/or leechers). In all situations, BEP33 gave me reliable results.
  • Results come in quickly, usually a few seconds after calling dht_get_peers.
  • BEP33 bloom filters are already filled by libtorrent and uTorrent implementations.
  • We do not actually join the swarm when doing a health lookup and libtorrent does not maintain peer connections to others.

Disadvantages:

  • It requires modifications to the libtorrent code. This means that BEP33 lookups are not supported in Debian, at least not until we support lookups in the upstream code (which requires a PR). We can however ship a modified libtorrent core on Windows and Mac (which accounts for 80-90% of our user base).
  • We require a dedicated libtorrent session for these lookups and we have to manually decode every incoming DHT packet and extract the bloom filter out of it. This might influence performance.
  • The size of the bloom filter (256 bytes) limits the estimate to at most 6000 seeders/leechers. I think this is not really a problem here, since the difference between 10000 and 6000 seeders does not really matter to the end user.

@synctext
Member

BEP33 for the win!
Given the scalability and bandwidth efficiency this is very useful. Finally we can make progress with relevance ranking through BEP33 popularity.

@devos50
Contributor

devos50 commented Mar 29, 2019

With #4434 and Tribler/gumby#409, I consider the second iteration of the PopularityCommunity done. The algorithm is relatively simple: the overlay performs a random walk through the network. Every five seconds, you gossip at most 5 random torrents you have checked, plus the top 5 torrents you have checked (based on seeders), to a randomly connected peer (sketched below).
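
In pseudo-Python, one gossip round looks roughly as follows; the peer list, the torrent records, and the send function are hypothetical stand-ins for the actual PopularityCommunity code introduced in #4434:

        import random

        GOSSIP_INTERVAL = 5.0  # seconds between gossip rounds

        def gossip_round(peers, checked_torrents, send_torrent_health):
            """Send the top-5 checked torrents by seeders plus up to 5 random other
            checked torrents to one randomly chosen connected peer."""
            if not peers or not checked_torrents:
                return
            peer = random.choice(peers)
            by_seeders = sorted(checked_torrents, key=lambda t: t["seeders"], reverse=True)
            top5 = by_seeders[:5]
            rest = by_seeders[5:]
            random5 = random.sample(rest, min(5, len(rest)))
            send_torrent_health(peer, top5 + random5)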

A new validation experiment has been added to verify the correctness of the popularity community. Below I plot the total number of torrents that have health information (based on the last_check field) over the duration of the experiment. Note that torrent health dissemination only starts after 45 seconds (when the random walk in the popularity community starts). The slow start can be explained by the fact that peers do not yet have health information for at least five torrents to gossip around. The total experiment duration is two minutes.

[figure: number of torrents with health information over the experiment duration]

I now think we should observe how this mechanism behaves in the wild.

@devos50
Contributor

devos50 commented May 13, 2019

I consider this issue appropriately addressed for the 7.3 release. As said before, we should observe how it works in the wild. I am therefore moving this issue to 7.4 so we can revisit and improve our torrent selection algorithm if necessary.

@devos50
Contributor

devos50 commented Jul 9, 2020

Since this issue is leaning more towards a full research project, I'm unassigning myself. My recommendations for the next steps:

  • Further testing and better integration of BEP33 across all platforms.
  • Set up a crawler to get some performance numbers from the deployed network, and iterate on those results.

Highly related to #3868

@synctext
Member

I would like to make this feature a key item of a next release, for example 7.6 "Bug Fix" followed by 7.7 "search & popularity".

@ichorid ichorid assigned xoriole and unassigned ichorid Jul 16, 2020
@ichorid ichorid added this to To do in GigaChannels via automation Jul 16, 2020
@drew2a drew2a added this to To do (current sprint) in Dev team via automation Oct 7, 2020
@drew2a drew2a moved this from To do (current sprint) to Bookmarks in Dev team Oct 7, 2020
@drew2a drew2a removed their assignment Sep 14, 2021
@drew2a drew2a modified the milestones: Next-next release, Backlog Sep 15, 2021