
end-to-end anonymous seeding and download performance test #2548

Open
synctext opened this issue Sep 21, 2016 · 49 comments
Comments

@synctext
Member

Privacy with usability is the core goal of our ongoing 11-year effort.

This end-to-end test is designed to identify problems in our whole Tribler chain, from session key exchange to rendezvous peer discovery.

Scenario:

  • Create a new file, generate magnet, and seed it
  • Start an anonymous download on a different machine and seed anonymously
  • Repeat the download start up to 8 times, starting 1 new downloader every minute
  • Disconnect the original public seeder after the 10th minute of the experiment
  • Start an anonymous download on a different machine using only a hash
  • Observe 1) the speed of discovery of hidden seeds, 2) the PEX mechanism, and 3) the total download speed
@synctext
Member Author

This 2016 issue overlaps with the newer issue #3325.

@ichorid
Contributor

ichorid commented Apr 4, 2018

I've created and run a local performance test based on LXC containers.
The test setup is:
seeder <=> router <=> leecher

  • The seeder is libtorrent's client_test utility.
  • The router is the tunnel_helper Tribler plugin for Twisted.
  • The leecher is the leecher.py script from the latency-testing repo.

My machine is a Xeon W-2133 @ 3.6 GHz.

Direct download (without the router) runs at 150 MByte/s.
Routed anonymous download runs at 1.2 MByte/s.
The router's Twisted process hogs 125% of CPU, spread across several physical cores.
Leecher/seeder CPU usage is negligible.

Our anonymous download speed is completely obliterated by Python performance on the router.

EDIT:
Leecher CPU usage is non-negligible: it is about 107% CPU, i.e. only 15% lower than the router's.

@ichorid
Contributor

ichorid commented Apr 10, 2018

To estimate the limits of Twisted networking performance, I did a small packet-forwarding test with LXC containers and the nc | pv | dd method: nc <= twisted forwarder <= nc. As an alternative forwarder, I used nc | dd | nc.

| Forwarder | /dev/zero | /dev/urandom |
| --- | --- | --- |
| nc <= twisted <= nc (TCP) | 320 MByte/s | 180 MByte/s |
| nc <= twisted <= nc (UDP) | 120 MByte/s | 120 MByte/s |
| nc <= nc <= nc (TCP) | 300~500 MByte/s | 180 MByte/s |
| nc <= nc <= nc (UDP) | 270 MByte/s | 170 MByte/s |

It is interesting to note that the nc|dd|nc sandwich uses almost 250% CPU in total, in a kind of functional parallelism. Twisted, on the other hand, uses only a single core, always at 100%.
From this table it is obvious that Twisted's UDP performance is the bottleneck in this test. However, we would be very happy to attain even 1/10th of it.
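For reference, the forwarder stage in the sandwich above can be sketched as a plain UDP relay. This is a stdlib-only illustration with invented names, not the actual test code:

```python
# Hypothetical stdlib stand-in for the "twisted forwarder" stage in the
# nc <= forwarder <= nc sandwich (names and structure are illustrative).
import socket
import threading

def make_udp_forwarder(target_addr, bufsize=65536):
    """Bind a UDP socket on an ephemeral localhost port and return
    (bound_port, pump), where pump(stop_event) relays every datagram
    received on that port to target_addr until stop_event is set."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", 0))
    sock.settimeout(0.2)  # wake up periodically to check stop_event

    def pump(stop_event):
        try:
            while not stop_event.is_set():
                try:
                    data, _ = sock.recvfrom(bufsize)
                except socket.timeout:
                    continue
                sock.sendto(data, target_addr)
        finally:
            sock.close()

    return sock.getsockname()[1], pump
```

Pointing `nc -u` at the returned port and a listener at `target_addr` reproduces the shape of the experiment on a single host.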

@synctext
Member Author

Great to hear Python+Twisted does not need replacing.

@ichorid
Contributor

ichorid commented Apr 12, 2018

Another simple experiment: CPU usage when bombing a Tribler tunnel_helper exit node with packets. The performance of nc -u -l | pv | dd of=/dev/null and of a simpler Twisted-based do-nothing socket listener is measured to estimate baseline performance:

| Rate | nc+pv+dd | Twisted listen | Tribler abcd*600 | Tribler ffffffff+abcd*600 | Tribler ffffffffffffffe+abcd*600 |
| --- | --- | --- | --- | --- | --- |
| 1000 Mbit/s | 170% | 93% | 172% | — | — |
| 100 Mbit/s | 27% | 34% | 172% | 172% | 185% |
| 50 Mbit/s | 20% | 20% | 172% | 172% | 185% |
| 25 Mbit/s | 9% | 10% | 172% | 172% | 185% |
| 10 Mbit/s | 4% | 5% | 62% | 60% | 96% |
| 5 Mbit/s | — | 5% | 31% | 32% | 50% |
| 1 Mbit/s | — | 1.3% | 8% | 8% | 10% |

CPU load is given as the sum of the CPU usage of all relevant subprocesses as printed by htop.
Tribler's 172% is the sum of 11 Python subprocesses of ~9% each and a single process of ~40%. (BTW, where does the remaining ~40% hide?)

Tribler was tested with 3 different types of packets: a "random" packet of abcd*300, and packets with two types of tunnel-community headers. The tunnel helper was not altered in any way, so it just listens on the port, picks up packets, tries to parse them, sees they are garbage, and drops them.

With the ff..fe header, Tribler tries to decode the packet even for an unknown ID (that's already a potential problem). Hitting 96% CPU at around 1 MByte/s of throughput corresponds well with my previous test of Tribler tunnels, which showed ~125% CPU load at 1.2 MByte/s (though there the exit node was both receiving and sending packets).

Aside from the potential pitfall of exception handling, it is highly improbable for a network application to perform better when it both receives and sends packets than when it only receives them. Therefore, the performance bottleneck in the packet path lies somewhere between the socket input and the tunnel_community.on_data() method.

@ichorid
Contributor

ichorid commented Apr 13, 2018

One bottleneck identified: each active community adds about 50% CPU load in the 100 Mbit/s UDP bombing test. The problem is that Endpoint.notify_listeners() notifies every listener, broadcast-style:

    def notify_listeners(self, packet):
        for listener in self._listeners:
            if listener.use_main_thread:
                reactor.callFromThread(self._deliver_later, listener, packet)
            elif reactor.running:
                reactor.callInThread(self._deliver_later, listener, packet)

This is very inefficient because each community (listener) then does its own pattern matching. It is improbable that a single packet is addressed to several communities at once; even if it were, it should be processed only by the targeted communities.

We need to implement a central tree-like packet demultiplexer, or something like that.
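A central demultiplexer of the kind proposed could look roughly like this. It is a sketch with invented names, and the 22-byte prefix length is an assumption about the community header layout, not verified against IPv8:

```python
# Sketch of a central packet demultiplexer: route each datagram to the single
# listener registered for its community prefix, instead of broadcasting it to
# every community. All names are hypothetical; prefix_len is an assumption.

class PrefixDemultiplexer:
    def __init__(self, prefix_len=22):
        self._prefix_len = prefix_len
        self._routes = {}  # prefix bytes -> listener callable

    def register(self, prefix, listener):
        if len(prefix) != self._prefix_len:
            raise ValueError("prefix must be %d bytes" % self._prefix_len)
        self._routes[prefix] = listener

    def dispatch(self, packet):
        # One dict lookup per packet; unknown prefixes are dropped cheaply
        # instead of waking up every community for pattern matching.
        listener = self._routes.get(packet[:self._prefix_len])
        if listener is None:
            return False
        listener(packet)
        return True
```

With something like this in place, notify_listeners() would shrink to a single dispatch() call per packet.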

@ichorid
Contributor

ichorid commented Apr 13, 2018

A quick hack that stops notifying all communities except the hidden tunnel community after 15 seconds of running Tribler increases torrent download speed from 0.9 MByte/s to 3.9 MByte/s. (The hack must be applied to both the leecher and the exit node.) The exit node's CPU usage drops to 98%. Of all the Twisted threads only one still generates 20% CPU load; the other threads stay idle. The rest of the load is generated by the main thread, I presume.

Apparently, we now hit another bottleneck.

@ichorid
Contributor

ichorid commented Apr 13, 2018

Another experiment: measure the pure bandwidth of the exit node without a leecher, in the direction seeder->exit_node->leecher. (Community bypass hack enabled!)
To do that, we first disable the method that takes down circuits, so the established circuit/tunnel lasts forever. Next, we initiate the usual seeder<-exit_node<-leecher download, so the circuit is set up properly by the exit node, and we sniff the exit port of the circuit's Twisted protocol. Now we take down both the seeder and leecher client apps, so only the exit node's Tribler instance remains.
Finally, we start bombing the exit node with UDP packets from the seeder side, pretending to be the seeder app, by using the correct (sniffed) src and dst UDP ports. On both the seeder and the leecher we run bmon to see the actual bandwidth.
All of this is possible because UDP is stateless.

The result:
The seeder bombs with 139 MByte/s; the leecher receives 21.4 MByte/s. The exit node is at 100% CPU.
Verdict: our next bottleneck is on the leecher's side.
screenshot from 2018-04-13 16-13-12

Some screenshots from KCachegrind'ing a yappi profile of the exit node in this mode:

Giant "unknown" blob:
exit_prof

Contents of the blob:
exit2
exit3

So, now it's mostly SSL stuff.

@synctext
Member Author

synctext commented Apr 13, 2018

Impressive progress!
You discovered an architectural violation. Communities should be completely isolated from each other, and thus only one community should process an incoming UDP packet. Sadly, I don't know if AllChannel or other communities critically depend on this feature, which is a bug/hack.

A smart twist in this experiment design! The seeder is replaced by a UDP sender with a fixed test pattern that does not do any congestion control: fixed-rate blasting at the exit node, which then uses full onion encryption to deliver the test pattern to the leecher. The leecher attempts to decipher the incoming data as Tor-like encrypted BitTorrent traffic.

What would 4 seeders on 4 cores versus 4 leechers on 4 cores do? With a single exit node and real congestion control, that would hint at the bottleneck.

@ichorid
Contributor

ichorid commented Apr 16, 2018

Experiment: multiple leechers vs multiple seeders:

  • 1 leecher x 2 seeders: 4.3 MByte/s (exit node CPU 98%)
  • 1 leecher x 4 seeders: 4.5 MByte/s (exit node CPU 98%)
  • 2 (1+1) leechers x 4 (2+2) seeders (2 independent torrents): 5.3 MByte/s (exit node CPU 120%)

These results show we have bottlenecks in both the leecher and the exit node. Most probably there are two: one in the upload direction (ACK packets from leecher to seeder), and one still in the download direction (DATA packets from seeder to leecher).

@ichorid
Contributor

ichorid commented Apr 17, 2018

A superficial analysis of the profiler charts revealed another potential problem: we recreate the crypto object every time we need to encrypt or decrypt something. This includes the very costly operation of cipher initialization in OpenSSL, which is responsible for 17% of the exit node's total CPU usage, according to the profiler.

UPDATE: we're using GMAC mode with AES encryption, so if we want each packet to be independent of the others, we must use a different IV for every packet/encryption. No low-hanging fruit there. Still, the constant calls to _CipherContext.__init__ seem suspicious. Maybe we're not doing it properly and there is room for improvement; for example, recent versions of the cryptography library recommend using the compound AESGCM primitive instead of setting up AES+GCM by hand.
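The "compound AESGCM" suggestion boils down to constructing the AEAD object once per session key and reusing it with a fresh nonce for every packet, instead of rebuilding a cipher context per call. A sketch using the cryptography library; this is not Tribler's actual tunnel code, and seal_packet/open_packet are invented names:

```python
# Build the AEAD object once per key; only the 12-byte nonce changes per
# packet. This avoids the per-packet _CipherContext.__init__ cost.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=128)
aead = AESGCM(key)  # constructed once, reused for many packets

def seal_packet(plaintext):
    nonce = os.urandom(12)  # GCM requires a unique IV per encryption
    return nonce + aead.encrypt(nonce, plaintext, None)

def open_packet(blob):
    nonce, ciphertext = blob[:12], blob[12:]
    return aead.decrypt(nonce, ciphertext, None)
```

Each packet stays independently decryptable (its nonce travels with it), while the expensive key schedule is paid only once.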

@ichorid
Contributor

ichorid commented Apr 17, 2018

Profiler screenshots showing a mysterious "cycle 10" encompassing 80% of CPU ticks (probably Twisted-related):
c10
c10-1

EDIT: according to @devos50, "cycle 10" is Twisted itself. Mystery solved.

@synctext
Member Author

Mental note: an earlier discovery was that Twisted "thread pools" cost performance. By moving to the main thread, numerous context switches are avoided and performance could double here as well. @qstokkink @devos50

@ichorid
Contributor

ichorid commented Apr 17, 2018

Currently, we poll the sockets ourselves in an endless loop.
This is clearly not what the Twisted developers expect from their users, since Twisted has facilities like datagramReceived and LoopingCall. Most probably, our "endless" loop severely messes up Twisted's polling schedule.
In a simple experiment I replaced UDPEndpoint._loop with a LoopingCall() (0.001 s period).
Performance of 1 seed vs 1 leech increased to 4.5 MByte/s, and performance of 2+2 seeds vs 1+1 leeches increased to 7.5 MByte/s. On the other hand, the "UDP bombing test" performance dropped to 2 MByte/s!
We're doing something wrong by polling manually. Recall that pure Twisted datagramReceived-based forwarding was capable of pushing 120 MByte/s.
It is clear that there is room for improvement here.

@ichorid
Contributor

ichorid commented Apr 17, 2018

@synctext, tried that. Disabling the "thread pools" does not affect performance in my tests, but it removes the "unknown" objects in the profiler and makes profiling Twisted manageable.

@qstokkink
Contributor

Your comments touch upon a great many Python and Twisted nuances and intricacies, let me try and organize this into a shortlist:

  • The UDPEndpoint does not run on a Twisted thread, but on a (built-in) Thread.
  • The UDPEndpoint does not use a socket managed by Twisted, rather it uses a (built-in) socket.
  • By using a LoopingCall you are scheduling the socket send/receive logic onto the (Twisted) main thread. This may or may not be the Python entry thread.
  • By doing this you are avoiding expensive context switches into the Twisted threadpool, but, this also means that a busy socket will slow down all of the other application logic. This is your dip in throughput under load.
  • In the Twisted back-end the Twisted threadpool is just a fixed set of (built-in) Thread objects, 20 by default, onto which contexts are loaded and run. As we have previously discovered, this is painfully slow. This packing into little balls of context is what is causing your unknown origin processing.
  • The native Python socket implementation can actually handle well over 800 MB/s.

Final points:

  • None of this would be a problem if Python had real threading instead of this GIL stuff.
  • Twisted is a performance killer. Don't let it fool you.

@ichorid
Contributor

ichorid commented Apr 17, 2018

@qstokkink, thanks for the detailed explanation!
But why would we even have the UDPEndpoint in a built-in thread if context switching makes our application (at least the exit node) slower? What's the point? We don't need 800 MByte/s of theoretical throughput in a single part of the application if the rest of it can't do more than 4 MByte/s. We want a single thread of our application to do about 20-50 MByte/s: that's enough to saturate any modern broadband connection.

@qstokkink
Contributor

@ichorid we did not have the time, nor the indication that we should do this before. If you want, you can try implementing a full Twisted-based endpoint.

@ichorid
Contributor

ichorid commented Apr 18, 2018

UDP bombing with crypto disabled: 70 MByte/s on the exit node.
Regular download with crypto disabled: 7.5 MByte/s.
Something fishy happens on the leecher side...

@ichorid
Contributor

ichorid commented Apr 18, 2018

While UDP-bombing the exit node, its memory usage remains constant. However, the leecher's memory usage grows very fast, as if it internally buffers all the packets it can't process. Even when the stream of packets ends, the leecher's CPU usage remains at 100% for a long time, until it finishes processing the buffered data; its memory usage drops somewhat along the way. This happens even with crypto disabled.
Another interesting detail is that the "bomb" packets are received and buffered by the leecher, but they are not sent down through the SOCKS proxy to libtorrent: there is no traffic on the loopback interface.
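The unbounded memory growth described above is what an unlimited inbound queue produces under overload. A bounded buffer with a drop-oldest policy, sketched below, is one standard way to shed load instead of buffering forever (hypothetical code, not the actual dispatcher):

```python
# Bounded inbound packet buffer: under overload, old packets are dropped
# instead of growing memory without limit. All names are illustrative.
from collections import deque

class BoundedPacketBuffer:
    """Hold at most maxlen undelivered packets; overflow drops the oldest."""

    def __init__(self, maxlen=1024):
        self._q = deque(maxlen=maxlen)  # deque discards from the head on overflow
        self.dropped = 0

    def push(self, packet):
        if len(self._q) == self._q.maxlen:
            self.dropped += 1  # count the packet about to be displaced
        self._q.append(packet)

    def pop(self):
        return self._q.popleft() if self._q else None
```

For a UDP relay, dropping is usually the right behaviour anyway: the transport above it already handles loss, while unbounded buffering just adds latency and memory pressure.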


@synctext
Member Author

Brainstorming towards a micro-architecture: is there a Twisted alternative? Not this one: MagicStack/uvloop#14

@ichorid
Contributor

ichorid commented Apr 25, 2018

@qstokkink, when I add this line, the download starts and dies almost immediately.

@ichorid
Contributor

ichorid commented Apr 25, 2018

@synctext, it's really a shame that uvloop does not work as a drop-in replacement for the reactor loop on Windows.
However, this does not stop us from using libuv to implement a C-based endpoint, or using uvloop to speed up Python 3.6+ performance on Linux, etc. According to its authors, uvloop does not provide any new API; it is just a more efficient replacement for the Python 3 core event loop.

@ichorid
Contributor

ichorid commented Apr 25, 2018

We have seen good enough benchmark performance for encrypted tunnel throughput (14 MByte/s). However, torrent download speed is 4 times worse. Why?

@egbertbouman
Member

Just got this when doing direct calls:

Traceback (most recent call last):
  File "C:\Users\Egbert\Desktop\git\tribler\TriblerGUI\event_request_manager.py", line 127, in on_read_data
    raise RuntimeError(json_dict["event"]["text"])
RuntimeError: [Failure instance: Traceback: <type 'exceptions.RuntimeError'>: dictionary changed size during iteration
C:\Python27\lib\site-packages\twisted\internet\base.py:1199:run
C:\Python27\lib\site-packages\twisted\internet\base.py:1208:mainLoop
C:\Python27\lib\site-packages\twisted\internet\base.py:828:runUntilCurrent
C:\Python27\lib\site-packages\twisted\internet\task.py:239:__call__
--- <exception caught here> ---
C:\Python27\lib\site-packages\twisted\internet\defer.py:150:maybeDeferred
C:\Users\Egbert\Desktop\git\tribler\Tribler\pyipv8\ipv8\messaging\anonymization\community.py:217:do_circuits
C:\Users\Egbert\Desktop\git\tribler\Tribler\pyipv8\ipv8\messaging\anonymization\community.py:242:do_remove
C:\Users\Egbert\Desktop\git\tribler\Tribler\community\triblertunnel\community.py:327:remove_circuit
C:\Users\Egbert\Desktop\git\tribler\Tribler\community\triblertunnel\dispatcher.py:87:circuit_dead
C:\Users\Egbert\Desktop\git\tribler\Tribler\community\triblertunnel\dispatcher.py:87:<genexpr>
]

@qstokkink
Contributor

The dictionary size change is due to multiple "threads" accessing the same dictionary. We could use something like in IPv8:

    from threading import RLock

    lock = RLock()

    def synchronized(f):
        def wrapper(self, *args, **kwargs):
            with lock:
                return f(self, *args, **kwargs)
        return wrapper

And then annotate the class methods with @synchronized.

@ichorid
Contributor

ichorid commented Apr 26, 2018

That's one option. Another one is to investigate what lines access the dictionary and try to remove that access completely. Or shift all the usages into a single "thread". BTW, this would help with further isolating IPv8 and preparing it to move to a separate process.

@ichorid
Contributor

ichorid commented May 3, 2018

OK, the communication scheme between the old Dispersy listener and the new IPv8 listener is completely broken and brain-dead. It messed up the results of my old experiments.
When I correctly disable Dispersy and enable the LoopingCall()-based endpoint on both router and leecher, performance goes up to 8 MByte/s! It is important to note that the broadcast fix has no effect now, and UDP bombing nets a stable 22/11 MByte/s router/leecher performance.

@ichorid
Contributor

ichorid commented May 3, 2018

The effect of different call methods with the LoopingCall endpoint:

| Way of calling deliver_later() | Leech, encrypted (MByte/s) | Router, encrypted (MByte/s) | Leak |
| --- | --- | --- | --- |
| callInThread | 1.3 | 21.5 | No |
| callFromThread | 11 | 22 | No |
| Direct call | 13 | 22 | No |

So, it's safe to use callFromThread, since there is no leak now.

Indeed, Twisted is magic.

@qstokkink
Contributor

@ichorid Cool: you just fixed 10-year-old crappy code; when are we getting the PR?

@synctext
Member Author

synctext commented May 4, 2018

8 MByte/s as leecher. Impressive. A PR and Jenkins job would be most welcome. We could institute a rule that future code contributions must not reduce anonymous download speed, with performance regression tests on all supported platforms.

@ichorid
Contributor

ichorid commented May 4, 2018

I have just reimplemented the IPv8 endpoint with Twisted's DatagramProtocol. The results are basically the same as in the previous experiment, except that performance is 10% worse, and CPU is at 100% only when actual packets are going through the system.

p.s.
GUI-based Tribler instantly jumps to 6.1 Mbytes/s download speed on Ubuntu 18.04 torrent! Yay!

@ichorid
Contributor

ichorid commented May 4, 2018

Screenshot of the day:
61

@synctext
Member Author

synctext commented Dec 3, 2018

Latest scientific related work in this area, from OSDI 2018 (still server-based; not suitable for Youtube-like systems):

Karaoke: Distributed Private Messaging Immune to Passive Traffic Analysis
Karaoke achieves high performance by addressing two challenges faced by prior systems. The first is that differential privacy requires continuously adding noise messages, which leads to high overheads. Karaoke avoids this using optimistic indistinguishability: in the common case, Karaoke reveals no information to the adversary, and Karaoke clients can detect precisely when information may be revealed (thus requiring less noise). The second challenge lies in generating sufficient noise in a distributed system where some nodes may be malicious. Prior work either required each server to generate enough noise on its own, or used expensive verifiable shuffles to prevent any message loss. Karaoke achieves high performance using efficient noise verification, generating noise across many servers and using Bloom filters to efficiently check if any noise messages have been discarded. These techniques allow our prototype of Karaoke to achieve a latency of 6.8 seconds for 2M users.

Like Stadium, Karaoke is distributed over many machines, and must ensure that malicious servers do not compromise privacy. Stadium uses zero-knowledge proofs (e.g., verifiable shuffles) for this purpose, whereas Karaoke relies on more efficient Bloom filter checks.
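The Bloom-filter check mentioned above can be illustrated with a toy implementation (toy parameters, not Karaoke's actual construction): membership tests touch only k bit positions, and combining filters from many servers is a bitwise OR, which is what makes verifying that no noise message was discarded so cheap compared to verifiable shuffles.

```python
# Toy Bloom filter: k hash positions per item over an m-bit array.
# May report false positives, never false negatives, so a "missing" answer
# for a noise message is always trustworthy.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, n_hashes=4):
        self.size = size_bits
        self.n_hashes = n_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        for i in range(self.n_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all(self.bits >> pos & 1 for pos in self._positions(item))
```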

@ichorid
Contributor

ichorid commented Feb 17, 2019

From https://magic.io/blog/uvloop-blazing-fast-python-networking/

uvloop on Python vs Go and Twisted: Raw TCP
uv

uvloop on Python vs Go: HTTP echo server
uv2

From https://www.nexedi.com/NXD-Document.Blog.UVLoop.Python.Benchmark:
uvloop on Python vs Go: HTTP server with some minimal data processing
pyuv_vs_go

This tells us the bottleneck is typically not the reactor loop, but Python itself.

@synctext
Member Author

synctext commented Feb 17, 2019

How would these graphs look with UDP usage?

uvloop with 80,000 req/s at 10 KiB each = 800 MByte/s, correct?

@ichorid
Contributor

ichorid commented Feb 17, 2019

@synctext, yes, that is 800 MByte/s. It is known that in simple networking benchmarks Go is typically 2-3 times slower than bare C/C++. That correlates well with other results obtained with libuv, e.g. CJDNS (~1 GByte/s without auth/encryption).
It is hard to say what the numbers would be with UDP. However, I'm sure the proportions would stay the same.

@synctext
Member Author

Please post a performance picture of your solution for #4567 in this issue, @dsluijk @NULLx76.

@ichorid ichorid added this to To do in Tunnels via automation Jul 16, 2020
@synctext
Member Author

We lack a structured performance-evaluation suite. The content of #3325 is not available in an easy-to-use dashboard (#4999).
This old 2016 issue is still relevant. We played around with various Twisted settings; now we are in the Python 3 asyncio era. No idea if we're still at the 6.1 MByte/s mark from May 2018, shown above. Planning: after the channels-plus-search work makes progress, this is the next thing to finish and then not revisit for a while.

Related work: see the beautiful write-up at the top NSDI conference. Zero deployment of these ideas, just simulation code: https://github.com/tschorsch/nstor. Hard to beat that with our hard work.

@xoriole
Contributor

xoriole commented Jan 25, 2024

Since @egbertbouman is responsible for the new tunnels upgrade using Rust, and there are already good signs of progress there, I'm unassigning myself and assigning him.

@xoriole xoriole assigned egbertbouman and unassigned xoriole Jan 25, 2024
7 participants