Removed turn() API on the Runtime from 0.2.0-alpha.6 to 0.2.0 #1887

Closed
sdroege opened this issue Dec 3, 2019 · 12 comments

@sdroege
Contributor

sdroege commented Dec 3, 2019

In tokio 0.2.0-alpha.6 it was still possible to construct a "Runtime" yourself by taking tokio-reactor, tokio-timer and tokio-current-thread and putting them together. It could then be run() or driven manually via turn().

With tokio 0.2.0 merging everything into a single crate and reorganizing the internals, this is no longer possible, which makes it hard to port one of our projects from the alpha version to the stable version.


I should probably start by giving some context. The project in question is a GStreamer plugin, gst-plugin-threadshare, that uses tokio as a scheduler so that fewer kernel threads are needed, for lower resource usage and higher throughput. You can also find a blog post of mine with some numbers and more details.

Now, the reasons for putting together our own runtime here were the following:

  1. We want to throttle (per runtime) the number of calls to epoll() or similar, i.e. the reactor. By doing so we reduce the number of wakeups (and thus context switches), which considerably reduces CPU usage and increases throughput. See my blog post for some details. Maybe this is a feature that would also be useful in tokio?
    1.1 Because of the throttling it was necessary to implement our own timer infrastructure, as tokio's timers don't know anything about the throttling and would usually be triggered much later than needed. By knowing the throttling interval, our own timers trigger at most half an interval too early or too late, instead of on average half an interval too late (see the sketch after this list). Also, back in tokio 0.1 it seemed like the tokio interval timers were actually drifting when throttled.
    1.2 The custom timer implementation had to be wrapped around the calls to turn() and also needed a way to unpark() the reactor whenever the list of timers changed such that the next wakeup would have to happen earlier.
  2. We want to use a single thread for the whole runtime (executor, reactor, timers) instead of having it distributed over multiple threads. The reason is again resource usage and the overhead from context switches. You can see from the numbers in my blog post that this also made quite a difference.
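
To make 1.1 more concrete, here is a small, hypothetical helper (not code from the plugin) showing how a deadline can be snapped to the nearest throttle tick so that the error is bounded by half the interval:

```rust
use std::time::{Duration, Instant};

// Hypothetical helper (not from the plugin): round a timer deadline to the
// *nearest* reactor tick, given the throttle interval, so the timer fires at
// most half an interval early or late instead of up to a whole interval late.
fn nearest_tick(base: Instant, interval: Duration, deadline: Instant) -> Instant {
    let elapsed = deadline.saturating_duration_since(base);
    // Integer rounding to the nearest multiple of the interval.
    let ticks = (elapsed.as_nanos() + interval.as_nanos() / 2) / interval.as_nanos();
    base + interval * ticks as u32
}
```

For example, with a 20ms interval a timer requested for 47ms after the base fires at the 40ms tick (7ms early) rather than at the 60ms tick (13ms late).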

From what I can see, 2. is not necessary anymore nowadays with the basic_scheduler() option of the runtime Builder. 1. is still necessary.
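
For reference, a minimal sketch of what 2. looks like with the tokio 0.2 Builder (assuming the standard 0.2 API, nothing plugin-specific): executor, I/O driver and timers all run on the thread that calls block_on().

```rust
use tokio::runtime::Builder;

fn main() -> std::io::Result<()> {
    // Single-threaded runtime: executor, I/O driver and timers all run on the
    // thread that calls block_on().
    let mut rt = Builder::new()
        .basic_scheduler()
        .enable_all() // I/O driver + timers
        .build()?;

    rt.block_on(async {
        // Tasks spawned here run on this same thread.
    });

    Ok(())
}
```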

What would you suggest for moving forward with this? Adding such a throttling feature to tokio directly? Exposing ways to hook into the runtime behaviour for implementing this outside tokio again somehow? Anything else? :)

@carllerche
Member

Interesting. Re 1. it would be useful to get some benchmarks together on tokio 0.2 to demonstrate this.

@sdroege
Contributor Author

sdroege commented Dec 3, 2019

> Interesting. Re 1. it would be useful to get some benchmarks together on tokio 0.2 to demonstrate this.

I can prepare something, but this is nothing tokio can really mitigate without throttling calls to epoll(). The problem is simply that if packets arrive randomly, you wake up your threads all the time and process a single small packet just to go back to sleep again, calling epoll() for basically every packet. With throttling you can handle lots of packets at once and only call epoll() once for the whole batch. Syscalls and context switches are expensive.

I'm unsure, however, how to prepare such a benchmark. I can show that tokio 0.2.0-alpha.6 with throttling has a lot more throughput than stable tokio 0.2.0 without throttling, but that's comparing apples and oranges :) I can also compare tokio 0.2.0-alpha.6 with vs. without throttling.

@carllerche
Member

Just a little app that demonstrates a case where throttling is helpful. That would be something to experiment with.

@carllerche
Member

After docs, I plan on setting up a benchmark suite... stuff like ^^ that demonstrates "real world" patterns would be helpful to add.

@sdroege
Contributor Author

sdroege commented Dec 3, 2019

See the blog post I linked above, but I can prepare a new version of that benchmark just on top of tokio, without other dependencies. That probably helps, and that throttling improves the situation could then be shown by simply adding a sleep() at a strategic place inside tokio.

I'll take a look at that later today or tomorrow.

@sdroege
Contributor Author

sdroege commented Dec 3, 2019

I have a small example; I'll clean it up and put it up somewhere later. With 1000 UDP sockets and one packet every 20ms on each, it gives about 22% CPU with a single basic runtime and about 23% with two basic runtimes in separate threads. With throttling so that io::driver::park is called at most once every 20ms, it gives around 16% with a single basic runtime and around 17% with two basic runtimes. This is only receiving 160-byte UDP packets and dropping them.

Compared to my benchmarks from 1.5 years ago, which were on tokio 0.1 and had additional overhead from GStreamer, these are very similar results. With even more sockets the effect will be more visible; I'll create a table with various results later.

Note: for both cases I changed MAX_TASKS_PER_TICK in the basic scheduler to infinity, otherwise it would not handle all sockets per tick.
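
The receiver side of such a benchmark might look roughly like the sketch below (tokio 0.2 API; the port range, socket count and the bounded per-socket packet count are placeholders, not taken from the actual benchmark):

```rust
use std::net::SocketAddr;
use tokio::net::UdpSocket;
use tokio::runtime::Builder;

fn main() -> std::io::Result<()> {
    // Single-threaded runtime with the I/O driver enabled.
    let mut rt = Builder::new().basic_scheduler().enable_io().build()?;

    rt.block_on(async {
        let mut handles = Vec::new();
        for i in 0..1000u16 {
            // One UDP socket per simulated stream.
            let addr = SocketAddr::from(([127, 0, 0, 1], 5000 + i));
            let mut socket = UdpSocket::bind(addr).await?;
            handles.push(tokio::spawn(async move {
                let mut buf = [0u8; 1500];
                // Receive packets and drop them immediately; bounded here only
                // to keep the sketch finite.
                for _ in 0..10_000u32 {
                    let _ = socket.recv_from(&mut buf).await;
                }
            }));
        }
        for handle in handles {
            let _ = handle.await;
        }
        Ok(())
    })
}
```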

@carllerche
Member

I agree that there is probably some strategy we could use to throttle calls to the I/O driver.

@sdroege
Contributor Author

sdroege commented Dec 3, 2019

Code can be found here. Run with `cargo run --bin sender [num-sockets]` and `cargo run --bin receiver [num-sockets] [num-threads]`.

My measurements before were slightly wrong; I had implemented the throttling incorrectly. The patch can be found at the bottom.

| Threads | Throttle | Sockets | CPU |
| --- | --- | --- | --- |
| X | 0ms | 1000 | 35% |
| 1 | 0ms | 1000 | 11% |
| 2 | 0ms | 1000 | 12% |
| 1 | 20ms | 1000 | 10% |
| 2 | 20ms | 1000 | 10% |
| X | 0ms | 2000 | 72% |
| 1 | 0ms | 2000 | 22% |
| 2 | 0ms | 2000 | 23% |
| 1 | 20ms | 2000 | 18% |
| 2 | 20ms | 2000 | 20% |
| X | 0ms | 4000 | 147% |
| 1 | 0ms | 4000 | 48% |
| 2 | 0ms | 4000 | 50% |
| 1 | 20ms | 4000 | 28% |
| 2 | 20ms | 4000 | 36% |

The X rows are with the default (threaded) runtime, which creates 4 threads here.

Patch for throttling below. It does not yet take the throttling into account for the timers (but it should):

```diff
diff --git a/tokio/src/runtime/basic_scheduler.rs b/tokio/src/runtime/basic_scheduler.rs
index c674b961..c5427f20 100644
--- a/tokio/src/runtime/basic_scheduler.rs
+++ b/tokio/src/runtime/basic_scheduler.rs
@@ -71,6 +71,8 @@ struct LocalState<P> {
 
     /// Thread park handle
     park: P,
+
+    last_tick: Option<std::time::Instant>,
 }
 
 #[derive(Debug)]
@@ -110,7 +112,7 @@ where
                 pending_drop: task::TransferStack::new(),
                 unpark: Box::new(unpark),
             }),
-            local: LocalState { tick: 0, park },
+            local: LocalState { tick: 0, park, last_tick: None },
         }
     }
 
@@ -218,7 +220,7 @@ impl Spawner {
 
 impl SchedulerPriv {
     fn tick(&self, local: &mut LocalState<impl Park>) {
-        for _ in 0..MAX_TASKS_PER_TICK {
+        loop {
             // Get the current tick
             let tick = local.tick;
 
@@ -227,10 +229,7 @@ impl SchedulerPriv {
 
             let task = match self.next_task(tick) {
                 Some(task) => task,
-                None => {
-                    local.park.park().ok().expect("failed to park");
-                    return;
-                }
+                None => break,
             };
 
             if let Some(task) = task.run(&mut || Some(self.into())) {
@@ -240,9 +239,21 @@ impl SchedulerPriv {
             }
         }
 
+        if let Some(last_tick) = local.last_tick {
+            use std::thread;
+
+            let now = std::time::Instant::now();
+            let diff = now - last_tick;
+            const WAIT: std::time::Duration = std::time::Duration::from_millis(20);
+            if diff < WAIT {
+                thread::sleep(WAIT - diff);
+            }
+        }
+        local.last_tick = Some(std::time::Instant::now());
+
         local
             .park
-            .park_timeout(Duration::from_millis(0))
+            .park()
             .ok()
             .expect("failed to park");
    }
```

@sdroege
Contributor Author

sdroege commented Dec 4, 2019

> I agree that there is probably some strategy we could use to throttle calls to the I/O driver.

Or, alternatively, it would be great if there were an API that allowed replacing the runtime, or parts of it, with a custom runtime, like there was before :)

@carllerche
Member

I'm not against it. The permutation details need to be figured out.

fengalin pushed commits to fengalin/tokio that referenced this issue 13 times between Dec 5, 2019 and Apr 11, 2020.
@Darksonn
Contributor

Darksonn commented Jul 25, 2020

This appears related to #2443, and maybe also #1583, #2545.

@carllerche
Member

Closing due to inactivity.

For future reference, I am not necessarily against adding the ability to configure throttling, but I would like to see it demonstrated that doing it at the Tokio level provides measurable benefit over implementing batching logic in userland.
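
As a rough illustration of what such userland batching could look like (a hypothetical sketch against the tokio 0.2 API, not code from this thread): instead of throttling the I/O driver itself, each receiver task wakes up on a fixed interval and drains whatever is already queued on its socket.

```rust
use std::time::Duration;
use tokio::net::UdpSocket;
use tokio::time;

// Hypothetical userland batching: the task sleeps on a fixed interval and then
// drains its socket, so the I/O driver is consulted roughly once per tick for
// this socket instead of once per packet.
async fn throttled_receiver(mut socket: UdpSocket, tick: Duration) {
    let mut interval = time::interval(tick);
    let mut buf = [0u8; 1500];
    loop {
        interval.tick().await;
        // A zero timeout turns the async recv_from() into a "poll once, then
        // give up" probe, so only packets that are already queued are taken.
        while let Ok(Ok((_len, _addr))) =
            time::timeout(Duration::from_millis(0), socket.recv_from(&mut buf)).await
        {
            // process the packet here
        }
    }
}
```

Whether batching at this level recovers the same gains as throttling inside the driver is exactly the kind of thing the benchmark above could be used to measure.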
