Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

slow anonymous downloads: Crypto CPU bottleneck #1882

Closed
synctext opened this issue Jan 20, 2016 · 72 comments
Closed

slow anonymous downloads: Crypto CPU bottleneck #1882

synctext opened this issue Jan 20, 2016 · 72 comments
Assignees
Milestone

Comments

@synctext
Copy link
Member

synctext commented Jan 20, 2016

The CPU seams to be the reason for slow <1 MByte/sec anonymous downloads.

possible problem
Running crypto on twisted thread blocks all other Tribler activity. Unclear if we needs 256bit GCM mode. Anything that checks a signature, decrypt a message, etc. needs to be traced.

possible long-term solution
Separate thread for the tunnel community or even an isolated thread for each community.
Low hanging fruit: parallel execution of relaying, make it multi-threaded. Real threads: Twisted reactor versus import multiprocessing...

goal

  1. benchmark raw openssl C code GCM MBps.

  2. Create a minimal benchmark comparing current situation in Tribler with alternatives. Not re-using Tribler code, but a performance test processing 10.000 UDP

  • Only generating packets and decrypting one layer, then another layer. Current approach.
  • Doing the same on a threadpool within Twisted.
  • Separate Python OS process, communicate using a ringbuffer. Twisted has build-in support for this, treating it like just another thread.

EDIT: use 10k UDP packets through exit node as benchmark.

@synctext synctext added this to the V6.6 milestone Jan 20, 2016
@whirm
Copy link
Contributor

whirm commented Jan 20, 2016

@synctext I don't think this is related to 6.6

@synctext synctext modified the milestones: V6.6 WX3, V6.7 credits Feb 16, 2016
@synctext synctext modified the milestones: Backlog, V6.7 credits Apr 19, 2016
@synctext
Copy link
Member Author

synctext commented Apr 19, 2016

Download performance insight

Within the roadmap for high-performance privacy it is important to identify bottlenecks. We do not know if the CPU, IO, database IO, buffering, event handling, queue waiting, threading, or networking is the bottleneck.

The following experiment determines the crypto performance of the HiddenTunnelCommunity at the downloading side.

Step-by-step we build and measure the raw networking throughput expressed in MByte/second in various setups. First is the minimal Dispersy NULL receiver, dropping packets immediately coming from several senders. Next is receiving the packet on the network thread and creating a Twisted event to handle this packet; again by simply dropping it. Third is using the on-data approach of the hidden community. Fourth experiment is benchmarking a single layer of Tor decryption at the receiver. Finally, the full tunnel community.
image

@synctext
Copy link
Member Author

synctext commented May 18, 2016

Connected to: #2137, #2215 #2216, #2217

Note: 4th item for thesis: torrent checker dramatic improvements/repair.

@synctext
Copy link
Member Author

Approach: our exit node helpers are maxed out in CPU. It will be extremely valuable to profile them, identify hotspots, test improvements, and benchmark various changes in the field.
This could move our donated network capacity from 9 TByte/day to perhaps even 15TByte/day.

@lfdversluis
Copy link
Contributor

lfdversluis commented May 26, 2016

I would love to profile them yes. I already have a hunch where bottlenecks are, but it would be great having a profiler output data that can be visualized by e.g. Vallgrind to verify and confirm the bottlenecks.

@synctext synctext changed the title Crypto performance experiment slow anonymous downloads: Crypto CPU bottleneck Jun 23, 2016
@synctext
Copy link
Member Author

synctext commented Jun 23, 2016

Anonymous downloads are quite slow. As of V6.6 pre-release version we have with a solid i7 CPU core at 125% only 330 KByte/sec. Confirms @whirm his view. Both the tunnel helpers and anon downloader are CPU constrained.

tribler_6 6pre_full_cpu_tunnel_downloading_kbytes

@qstokkink
Copy link
Contributor

I have made some measurements for the endpoint.py throughput per logging level:
https://gist.github.com/qstokkink/57203ced56dccd95287193d43e7e6a34

Long story short: choosing the wrong logging level can cost you around 400 MB/s throughput.

@whirm
Copy link
Contributor

whirm commented Jun 24, 2016

@qstokkink hah. Just yesterday I was talking about removing unnecessary log calls when the big cleanup happens with @devos50.

@qstokkink
Copy link
Contributor

Ok, I finally have a design I am happy with and which will provide a significant speed increase.

The main issues I faced were the following:

  1. Data passing through Dispersy clogs up Dispersy and thus the main process.
  2. As per (1), simply offloading received packets to subprocesses will only provide some benefit in encryption/decryption and add overhead for forwarding these packets.
  3. Forwarding peers' circuits to child processes (ergo the set-up happens in the main process), with their own assigned port, would require a lot of extra logic in the Tunnel Community.

So the shocking design I came up with (working with Dispersy, instead of against it, for a change), is as follows:

  • The main process does not join the Dispersy community.
  • Instead, all child processes (subprocesses) join the community, using their own Dispersy on their own process.

This means that the main process controls the circuits (the amount, their statistics and stuff) but no longer needs to be the man in the middle for data passing through. In turn, this means that the main process' Dispersy is freed up and therefore becomes a lot faster. An added benefit of this approach, is that it is a lot easier to implement than the options described in (2) and (3) too, leaving the TunnelCommunity almost completely the same.

These are the reasons for why I believe this to be the superior design, instead of the one of (2) as discussed during the last group session/meeting. Feel free to provide comments or critiques.

@synctext
Copy link
Member Author

wow, wait, what? @qstokkink So one Dispersy instance per community. So we fork a new Python process for each community, with their own IPv4 listen socket, own database, and walker.

Is the process doing the setup on a different listen port then the Tunnel community itself? How does this work with UDP puncturing. Are we still puncturing the correct listen socket?

Sounds quite interesting. I would suggestion doing quick prototyping and get the Tunnel community in it's own Dispersy process.

@qstokkink
Copy link
Contributor

@synctext Correct: one TunnelCommunity for one Dispersy with one unique listen port for one subprocess/forked process. Note that this will also be the only community loaded for a subprocess.

This should work perfectly fine with the UDP puncturing.

Because all of the TunnelCommunity messages are sent directly through the endpoint with NoAuthentication() the database of each subprocess is hardly used (only for Dispersy internals). In spite of this, there is definitely a case to be made for having a shared set of walker candidates and a shared database between subprocesses, as you suggested.
For the first version/prototype of this I will not implement this however, the result should already be very dramatic without these optimizations.

@devos50
Copy link
Contributor

devos50 commented Aug 25, 2016

@qstokkink sounds like an interesting design. I'm curious to know how this works out. I'm a huge supporter of splitting Tribler in various processes, however, we should be careful that we are not overcomplicating communication between processes since I think it's important that (external) developers can easily modify the communication protocol used between the Dispersy processes.

Another advantage of splitting it up in processes is separation of concerns: developers can focus on a single part of the system (and in the future, even implement a small part of the system in another language).

With this design, we can utilise all these additional cores in the CPU 👍

@qstokkink
Copy link
Contributor

qstokkink commented Aug 26, 2016

@devos50 If anything, this should even simplify communications. The three messages being transferred are the following:

  1. Main process to subprocess: create(<>)
  2. Subprocess to main process: stats(<>)
  3. Main process to subprocess: destroy()

That's it.

EDIT: Sorry, there is a fourth: the notifications.

@synctext
Copy link
Member Author

morning, sooo....
When selecting hops for a tunnel, we need a UDP punctured connection to it. Thus the Tunnel community needs to do all Create messages from its own external visible external IP:port.
correct?

@qstokkink
Copy link
Contributor

qstokkink commented Aug 26, 2016

@synctext Correct, each subprocess will need to be punctured separately. If this is really a problem, the design can also use a single port: some very nasty code in endpoint.py already takes care of this.

Do note that using the single port will already hit the performance pretty hard.

@synctext
Copy link
Member Author

soooo v2...
The Tunnel community has a certain UDP punctured peers. It can tunnel through them. The main Dispersy then needs to either take that into account when selecting the hops or let the Tunnel community do everything autonomously.

@qstokkink
Copy link
Contributor

qstokkink commented Oct 30, 2016

Finished the comments/style corrections and sanity check. Almost there.. I might even have the PR done by tonight. The TODO list has become quite short.

@devos50
Copy link
Contributor

devos50 commented Oct 30, 2016

@qstokkink looking forward to it!

Note that you don't have to squash everything into a single commit. Please make a logical units of works that make sense and make sure your commit history is clean (consists of distinct changes so no fixes on fixes).

@qstokkink
Copy link
Contributor

Woops, not happening tonight, I accidentally merged in some devel changes.. 😢 devos50/tribler@single_config...qstokkink:tunnelprocess_rc : I'll have to fix it up tomorrow.

@qstokkink
Copy link
Contributor

qstokkink commented Oct 31, 2016

Ok, I know this needs no further hyping up, but this is pretty cool:
perf
EDIT: Oh and I almost forgot to mention it: but this is casually running 18 circuits.

@synctext
Copy link
Member Author

Repeating: #2106 (comment)
Thesis material:

  • Understanding performance and emergent behavior using Visual Dispersy (correctness)
  • Improving performance and community coding in Dispersy using Protocol buffers (general performance)
  • Exploiting multi-core architectures for Dispersy speedups (multi-core performance)

Together:
Multi-core architecture for anonymous Internet streaming

First priority: PooledTunnelCommunity stable 👏

@synctext
Copy link
Member Author

Further future (by Yawning Angel): #1066 (comment)

@synctext
Copy link
Member Author

synctext commented Nov 1, 2016

Related work fro thesis from MIT Riffle: An Efficient Communication System With Strong Anonymity, uses central servers, lacks incentive mechanism.

@qstokkink
Copy link
Contributor

The first clinical Gumby experiment has finished, the new 1-hop max download speed with the pooled implementation seems to be around 18 MB/s for a consumer quad-core.
speed_download

@devos50
Copy link
Contributor

devos50 commented Dec 15, 2016

@qstokkink nice results, so you are using hidden seeding and not plain seeding?

@qstokkink
Copy link
Contributor

qstokkink commented Dec 15, 2016

@devos50 This is plain seeding, the y-axis is me being too lazy to edit the label.

I also ran this on the 48-core machine (ergo 48 processes per peer) with the same amount of peers (8) and the results are pretty interesting (ignore the 2-hops, which fell back to non-anonymous downloading):

In this case circuits actually seem to fight over data transmission (due to the round-robin implementation), causing the whole thing to slow down. Also note the high base-cost in download speed.

EDIT: By the way, is bbq still O.K.? The above experiment had 1152 circuits and 384 processes running concurrently on the same machine.

@synctext
Copy link
Member Author

18 MByte/sec. Our users are going to love that. Real strange scaling to 48cores. great thesis material.

@qstokkink
Copy link
Contributor

@synctext Well from the point of the 2 seeders they have to serve 6 times as many downloads/leechers. Because of the download mechanism's overhead, it is better for a seeder to serve one leecher at high speed than multiple leechers with lower speed.

This problem should disappear if the amount of seeders scales up with the amount of deployed proxies.

@qstokkink
Copy link
Contributor

qstokkink commented Dec 29, 2016

Alright, so, I just managed to lock up a 48-core Linux machine with my Gumby experiment.
bbq_status
We might need to be careful about allowing people to manage the amount of processes they spawn.

@synctext
Copy link
Member Author

:-) How do you lock a 64GByte, 48-core machine? What resource ran out?
When Tribler bsc and msc students try to mess around with process spawning, it's OK if they get bitten occasionally. Guess we don't want expose this without "seatbelts" within the Tribler Setting menu.

@qstokkink
Copy link
Contributor

@synctext I have no way of knowing what resource ran out (probably either CPU or sockets ran out). Since it hasn't come back online yet, I can only assume it was the CPU and the building has burned down.

And, yes we might have to put a warning symbol above certain settings, or along a slider.
😃💻
   ⬇️
💀🔥

@devos50
Copy link
Contributor

devos50 commented Dec 29, 2016

@synctext @qstokkink I will probably go to EWI tomorrow to reboot the thing :)

@devos50
Copy link
Contributor

devos50 commented Dec 30, 2016

@qstokkink the building did not burn down and bbq is up again :) (took me some effort to get access since my card was not working correctly...)

@qstokkink
Copy link
Contributor

@devos50 Thanks! I'll run a less flamboyant experiment from now on (which I know it can handle).

@synctext
Copy link
Member Author

synctext commented Jan 5, 2017

latest thesis results: MSc_Thesis_v2.pdf

@synctext
Copy link
Member Author

synctext commented Feb 1, 2017

Cardinal & Closing graph of thesis: fast anon download.

or extra chapter with multi-core Javascript + fancy homomorphic crypto math!

When using the 48-core BBQ and entire DAS5 together it should be possible to show nice anon download speed graphs. Goal is to have performance towards 48x on our 48-core download machine. DAS5 then acts as a dedicated seeding and relaying cluster. Showstopper (as always) : Gumby

@qstokkink
Copy link
Contributor

After some additional runs and behavior analysis, it seems like organising the DAS5-bbq experiment will require some fundamental code changes in the pooled tunnel code. Therefore, I have decided to definitively pull the plug on that and focus on the fancy crypto chapter instead.

@synctext
Copy link
Member Author

synctext commented Feb 3, 2017

in science formulas get more respect then running code..

@synctext
Copy link
Member Author

MSc_Thesis_v4.pdf
Update: New chapter with solid formulas:

  • 1.2 4th research questions, can we use Homomorphic cryptography to radically improve performance and/or security
  • chapter 2 after intro stuff "in the upcoming chapters we focus on enhancing the architecture, final chapter we replace crypto"
  • Fig 7.3: more datapoints for straight line evidence
  • Chapter 7: more mathematical depth

@qstokkink
Copy link
Contributor

(First) Release candidate:
MSc_Thesis_v5.pdf

@synctext
Copy link
Member Author

synctext commented Feb 20, 2017

Good story flow!

  • Define all symbols of equation 5.2.1 and beyond, for instance, E⊕(m1), H(M | m1, m2, ..., mk−1), .
  • Section 5.4.1, clarify. Start point and ending of circuits is concealed.
  • "presented the underappreciated technique of message serialization". We show a 80% CPU reduction.
  • please "go overboard" with formulas and repeat prior solid definition work!
  • Figure 7.3: Download speeds for differing amounts of processes, KBytes/sec, Kbits/sec
  • "up to 18 Mbps in this paper", observed once under favorable conditions.
  • remove modesty: 1st line of Conclusions. This thesis presents a solution to the anonymous streaming problem.
  • remove paragraph: Finally, this chapter will end by discussing two limitations
  • differnt
  • combine "Conclusions and Future Work" with compact, to the point writing.

@qstokkink
Copy link
Contributor

qstokkink commented Feb 21, 2017

Fixed in second release candidate:
MSc_Thesis_v6.pdf

P.S. Apparently differnt passes the spelling checker 😕

@synctext
Copy link
Member Author

Good, 6 pages with the math fundamentals.

@synctext
Copy link
Member Author

This 2016 ticket did not yet focus on the latency and uTP protocol. #2620 is dedicated to this. This 2016 ticket documents important ideas from the multi-core and crypto CPU load. Note, we still have stuff open. closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants