Large sched_yield() and threading overhead (+25-40% perf boost) #2134

Closed

AndreiLux opened this issue Jul 7, 2023 · 33 comments

@AndreiLux

Platform: macOS / Apple M2 Pro

Currently, the thread finalisation / synchronisation logic in ggml_graph_compute_thread relies on sched_yield() to spin idly while waiting for other threads to complete:

https://github.com/ggerganov/llama.cpp/blob/master/ggml.c#L16045

            // wait for other threads to finish
            const int last = node_n;
            do {
                sched_yield();
                node_n = atomic_load(&state->shared->node_n);
            } while (node_n == last);

The problem is that this causes an enormous amount of overhead: each call context-switches into the kernel, with no guarantee of when the thread will be scheduled back onto a core.

When profiling time on an M2 Pro:

./build-llvm16-native/bin/main -m ./models/7B/ggml-model-q4_0.bin -t 8 -n 512 -p "Explain a CPU microarchitecture basics." -s 3215465
main: build = 775 (d7d2e6a)
main: seed  = 3215465
llama.cpp: loading model from ./models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: mem required  = 5439.94 MB (+ 1026.00 MB per state)
llama_new_context_with_model: kv self size  =  256.00 MB

system_info: n_threads = 8 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 512, n_keep = 0


(....)

llama_print_timings:        load time =   164.84 ms
llama_print_timings:      sample time =   281.18 ms /   414 runs   (    0.68 ms per token,  1472.36 tokens per second)
llama_print_timings: prompt eval time =   397.91 ms /    11 tokens (   36.17 ms per token,    27.64 tokens per second)
llama_print_timings:        eval time = 21528.10 ms /   413 runs   (   52.13 ms per token,    19.18 tokens per second)
llama_print_timings:       total time = 22244.37 ms

Looking at where the time goes, a huge amount is wasted in swtch_pri, the system library routine behind thread yielding, which then switches into kernel mode.

56.69 s  100.0%	0 s	 	main (84121)
36.71 s   64.7%	36.71 s	 	 ggml_vec_dot_q4_0_q8_0
15.95 s   28.1%	15.95 s	 	 swtch_pri
1.82 s    3.2%	1.82 s	 	 ggml_compute_forward_mul_mat
980.30 ms    1.7%	980.30 ms	 	 ggml_vec_dot_q6_K_q8_K
445.90 ms    0.7%	445.90 ms	 	 ggml_compute_forward_dup
119.30 ms    0.2%	119.30 ms	 	 ggml_graph_compute_thread

Simply commenting out sched_yield():

            // wait for other threads to finish
            const int last = node_n;
            do {
                //sched_yield();
                node_n = atomic_load(&state->shared->node_n);
            } while (node_n == last);

Results in a massive performance boost:

llama_print_timings:        load time =   129.83 ms
llama_print_timings:      sample time =   291.24 ms /   414 runs   (    0.70 ms per token,  1421.53 tokens per second)
llama_print_timings: prompt eval time =   334.44 ms /    11 tokens (   30.40 ms per token,    32.89 tokens per second)
llama_print_timings:        eval time = 14967.86 ms /   413 runs   (   36.24 ms per token,    27.59 tokens per second)
llama_print_timings:       total time = 15630.33 ms

The new profile looks like this:

41.25 s  100.0%	0 s	 	main (84137)
33.05 s   80.1%	33.05 s	 	 ggml_vec_dot_q4_0_q8_0
4.86 s   11.7%	4.86 s	 	 ggml_graph_compute_thread
1.34 s    3.2%	1.34 s	 	 ggml_compute_forward_mul_mat
930.00 ms    2.2%	930.00 ms	 	 ggml_vec_dot_q6_K_q8_K
490.70 ms    1.1%	490.70 ms	 	 ggml_compute_forward_dup
83.40 ms    0.2%	83.40 ms	 	 llama_sample_repetition_penalty

In the above interactive Terminal command-line example we see a 43% performance boost. The percentage differs depending on where you invoke it from, due to macOS thread QoS (over ssh it is around +25%).

This looks like a huge performance issue on any operating system that actually yields back to the scheduler.
My fix of simply commenting out the yield and busy-looping probably isn't ideal for power efficiency; you'd probably want to implement some kind of tight atomic wait across the threads instead. Up to you what to do here, but almost anything is better than sched_yield().
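
For illustration, a minimal sketch of such a tight user-space wait (not ggml's code; cpu_relax() is a made-up name standing in for the platform's spin hint, e.g. _mm_pause() on x86):

// Minimal sketch only: spin on the shared counter with a CPU hint instead of a syscall.
#include <stdatomic.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define cpu_relax() _mm_pause()
#elif defined(__aarch64__)
#define cpu_relax() __asm__ volatile("yield")  // ARM spin-loop hint
#else
#define cpu_relax() ((void) 0)
#endif

// Spin until *counter moves past `last`, never leaving user space.
static int wait_for_progress(atomic_int *counter, int last) {
    int cur;
    while ((cur = atomic_load_explicit(counter, memory_order_acquire)) == last) {
        cpu_relax();  // ease off the core / SMT sibling, no context switch
    }
    return cur;
}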

Another issue is thread creation: since the threads get created and joined for each layer/operation, the cumulative overhead is extremely large:

1.02 min  100.0%	0 s	 	main (82676)
36.31 s   59.4%	36.31 s	 	 ggml_vec_dot_q4_0_q8_0
16.74 s   27.4%	16.74 s	 	 swtch_pri
4.51 s    7.3%	4.51 s	 	 thread_start
1.44 s    2.3%	1.44 s	 	 ggml_compute_forward_mul_mat
805.01 ms    1.3%	805.01 ms	 	 ggml_vec_dot_q6_K_q8_K
390.01 ms    0.6%	390.01 ms	 	 ggml_compute_forward_dup
130.01 ms    0.2%	130.01 ms	 	 ggml_graph_compute_thread

It's likely we could see another 7% performance boost here by switching the work dispatch architecture to a thread pool.
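
For reference, a rough sketch of the kind of persistent pool I mean (purely illustrative; none of these names exist in ggml, and the spin-wait refinements are elided):

// Rough sketch of a persistent thread pool: workers are created once and
// re-dispatched per job via atomics, instead of pthread_create/join per
// graph evaluation. Illustrative only; all names are made up for this sketch.
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define POOL_MAX_THREADS 16

struct pool;

typedef struct {
    struct pool *pool;
    int          ith;   // worker index, 1..n_threads-1 (slice 0 is the caller's)
} worker_arg_t;

typedef struct pool {
    pthread_t    threads[POOL_MAX_THREADS];
    worker_arg_t args[POOL_MAX_THREADS];
    int          n_threads;
    atomic_int   generation;  // bumped by the caller to publish a new job
    atomic_int   n_done;      // incremented by workers when their slice is done
    atomic_bool  stop;
    void       (*work)(int ith, int nth, void *arg);
    void        *arg;
} pool_t;

static void *pool_worker(void *p) {
    worker_arg_t *wa   = (worker_arg_t *) p;
    pool_t       *pool = wa->pool;
    int seen = 0;
    for (;;) {
        // wait for a new generation (a real version would pause / back off here)
        while (atomic_load(&pool->generation) == seen) {
            if (atomic_load(&pool->stop)) return NULL;
        }
        seen = atomic_load(&pool->generation);
        pool->work(wa->ith, pool->n_threads, pool->arg);
        atomic_fetch_add(&pool->n_done, 1);
    }
}

static void pool_start(pool_t *pool, int n_threads) {
    pool->n_threads = n_threads;
    atomic_init(&pool->generation, 0);
    atomic_init(&pool->n_done, 0);
    atomic_init(&pool->stop, false);
    for (int i = 1; i < n_threads; i++) {
        pool->args[i] = (worker_arg_t) { .pool = pool, .ith = i };
        pthread_create(&pool->threads[i], NULL, pool_worker, &pool->args[i]);
    }
}

// Dispatch one job across all threads; the caller computes slice 0 itself.
static void pool_run(pool_t *pool, void (*work)(int, int, void *), void *arg) {
    pool->work = work;
    pool->arg  = arg;
    atomic_store(&pool->n_done, 0);
    atomic_fetch_add(&pool->generation, 1);  // release the workers
    work(0, pool->n_threads, arg);
    while (atomic_load(&pool->n_done) < pool->n_threads - 1) {
        // spin (again, pause/backoff elided)
    }
}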

Thank you for the work on the otherwise great implementation!

AndreiLux changed the title from "Large schedule_yield() and threading overhead (+25-40% perf boost)" to "Large sched_yield() and threading overhead (+25-40% perf boost)" on Jul 7, 2023
@ggerganov
Owner

ggerganov commented Jul 7, 2023

The initial implementation indeed used only busy loops on atomic variables, without yield, since I had noticed that yielding can cause significant performance degradation. In #1556 we reimplemented the thread synchronization code and introduced this sched_yield() call. I tried with and without it and it didn't make any difference on my machines (one of which is an M1 Pro), so I decided to keep it, hoping it would make things slightly more energy efficient.

I just redid the tests and I don't see any performance difference with or without it.
On M1 Pro, I get ~39 ms / tok for that same model and -t 8 -n 64.

I'm surprised that your time on master is so significantly worse.
Anyway, will comment out the sched_yield() call just in case.

Regarding your point about a thread pool - I did try to implement a thread pool some time ago:

ggerganov/whisper.cpp#343

It didn't provide any improvement, but I'm open to revisit this idea today.
The only requirement is to have a minimalistic implementation of the pool - something like in the linked PR.

@mqy
Collaborator

mqy commented Jul 7, 2023

I can confirm that sched_yield() introduces significant context switching on macOS (2018 MacBook Pro).

Let me list the parts of the ggml threading changelog that I know:

  1. gg wrote the first busy-spin lock version.
  2. @Safari77 added mem_pause() in #1314 ("use pause asm insn in busyloop to run the CPU (13600K) 10 °C cooler"); the mem_pause reduced memory contention somewhat.
  3. @zrm added NUMA support and rewrote the threading in #1556, which removed atomic_fetch_add/atomic_fetch_sub from the spin loop, so the memory contention mode changed from read/write to read-only. sched_yield() should introduce some overhead, but it is not significant because the overall perf (plus NUMA) is better compared to the previous revision.

The actual behavior of sched_yield() is OS-dependent, so perhaps we cannot simply remove it without cross-platform perf testing.

@ggerganov
Owner

The actual behavior of sched_yield() is OS-dependent, so perhaps we cannot simply remove it without cross-platform perf testing.

Hm, I don't think there is a case where removing it can degrade performance. It could only potentially increase energy consumption.

@evanmiller
Collaborator

Is there a reason not to use pthread condition variables here?

@mqy
Collaborator

mqy commented Jul 7, 2023

Is there a reason not to use pthread condition variables here?

A typical compute graph contains more than a thousand tensor nodes to compute. Most of the per-node computing time is a few µs (sometimes less than 1 µs); a mul_mat for a large prompt may take more than 1 ms. A typical pthread cond wait or notify/broadcast takes at least several µs, at worst tens of µs.
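
To make those numbers concrete, this is roughly what a cond-var-based wait at this point would look like (a sketch only, with illustrative names); every node would pay a mutex lock/unlock plus a kernel wake-up on top of its own compute time:

// Illustrative only: a condition-variable version of the per-node wait.
#include <pthread.h>

typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    int             node_n;   // index of the node currently being processed
} shared_state_t;

// Worker side: block until the coordinator advances past `last`.
static int wait_for_next_node(shared_state_t *s, int last) {
    pthread_mutex_lock(&s->mutex);
    while (s->node_n == last) {
        pthread_cond_wait(&s->cond, &s->mutex);
    }
    int node_n = s->node_n;
    pthread_mutex_unlock(&s->mutex);
    return node_n;
}

// Coordinator side: publish the next node and wake everyone.
static void publish_next_node(shared_state_t *s, int node_n) {
    pthread_mutex_lock(&s->mutex);
    s->node_n = node_n;
    pthread_cond_broadcast(&s->cond);
    pthread_mutex_unlock(&s->mutex);
}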

@BrickBee
Copy link

BrickBee commented Jul 8, 2023

Hm, I don't think there is a case where removing it can degrade performance. It could only potentially increase energy consumption.

I might have found such a case:
Running with 32 threads on 32 logical AMD Ryzen 9 cores gives the fastest prompt processing speed, and I didn't observe an obvious difference in processing time with and without yielding.
Text generation was roughly 6% slower for me without yielding, compared to before the change.

There is no point in running so many threads for generation, since it's RAM-bound anyway. When running only 6 threads, there was no difference between yielding and not yielding. The text generation speed was also roughly 12% faster than in the previous test with 32 threads. The prompt processing was of course slower, independent of yielding, because it scales with the number of cores.

The energy consumption appeared surprisingly similar between the two versions. I didn't measure it precisely, but a big difference would have stood out.

Maybe others who have machines with a high number of cores can check if they also observe that effect on their systems?

@AndreiLux
Author

Hm, I don't think there is a case where removing it can degrade performance. It could only potentially increase energy consumption.

I might have found such a case: Running with 32 threads on 32 logical AMD Ryzen 9 cores gives the fastest prompt processing speed, and I didn't observe an obvious difference in processing time with and without yielding.

It highly depends on which OS you're running. If it's Linux, that's good feedback; if it's Windows, then sched_yield() just reduces to Sleep(0), so it doesn't mean much.

@wtarreau
Contributor

In any case, pause and sched_yield() ought to be called only after a failed check, and never before the first attempt, as sched_yield() is extremely expensive due to being a syscall (count ~400 cycles).

In other code I'm used to using exponential backoff: loop without any pause a few thousand times, then with an increasing count of PAUSE instructions a few tens of thousands of times, and only switch to sched_yield() as a last resort. I ran some tests here a few months ago and gained less than 1% with this, the contention being elsewhere, but eventually I should redo some tests.
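
A generic sketch of that kind of backoff (thresholds are illustrative and untuned; the pause hint shown here is the x86 one):

// Exponential backoff: busy-loop first, then bursts of pause hints, then yield.
#include <sched.h>
#include <stdatomic.h>
#include <immintrin.h>   // _mm_pause(); substitute the platform's hint elsewhere

static int wait_for_change(atomic_int *val, int last) {
    unsigned spins = 0;
    int cur;
    while ((cur = atomic_load_explicit(val, memory_order_acquire)) == last) {
        spins++;
        if (spins < 1000) {
            // phase 1: plain busy loop, cheapest when the wait is short
        } else if (spins < 50000) {
            // phase 2: a growing burst of pause hints
            for (unsigned i = 0; i < (spins >> 10); i++) {
                _mm_pause();
            }
        } else {
            // phase 3: last resort, give the core back to the OS
            sched_yield();
        }
    }
    return cur;
}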

@wtarreau
Contributor

Ouch! The removal is catastrophic here (80-core ARM Ampere Altra / Neoverse N1).

Before (commit ed9a54e with sched_yield()):

llama_print_timings:        load time =   588.23 ms
llama_print_timings:      sample time =   138.38 ms /    34 runs   (    4.07 ms per token,   245.71 tokens per second)
llama_print_timings: prompt eval time = 36128.03 ms /   842 tokens (   42.91 ms per token,    23.31 tokens per second)
llama_print_timings:        eval time =  7293.10 ms /    33 runs   (  221.00 ms per token,     4.52 tokens per second)
llama_print_timings:       total time = 43565.30 ms

After (commit 1d16309 with sched_yield() commented out):

llama_print_timings:        load time =   598.99 ms
llama_print_timings:      sample time =   122.21 ms /    30 runs   (    4.07 ms per token,   245.48 tokens per second)
llama_print_timings: prompt eval time = 42567.61 ms /   842 tokens (   50.56 ms per token,    19.78 tokens per second)
llama_print_timings:        eval time = 90640.44 ms /    29 runs   ( 3125.53 ms per token,     0.32 tokens per second)
llama_print_timings:       total time = 133335.19 ms

It got 15 times slower!

With the sched_yield() moved after the test:

llama_print_timings:        load time =   594.79 ms
llama_print_timings:      sample time =   134.46 ms /    33 runs   (    4.07 ms per token,   245.43 tokens per second)
llama_print_timings: prompt eval time = 40595.16 ms /   842 tokens (   48.21 ms per token,    20.74 tokens per second)
llama_print_timings:        eval time =  6869.41 ms /    32 runs   (  214.67 ms per token,     4.66 tokens per second)
llama_print_timings:       total time = 47604.23 ms

The performance is back (and admittedly the context-switch count is huge).

Let me check with pause instructions first.

@wtarreau
Contributor

With 10 dummy loop iterations before triggering sched_yield() I'm already getting the following, which is better than the original, with context switches cut in half:

llama_print_timings:        load time =   597.33 ms
llama_print_timings:      sample time =   195.34 ms /    48 runs   (    4.07 ms per token,   245.73 tokens per second)
llama_print_timings: prompt eval time = 36352.44 ms /   842 tokens (   43.17 ms per token,    23.16 tokens per second)
llama_print_timings:        eval time =  9840.69 ms /    47 runs   (  209.38 ms per token,     4.78 tokens per second)
llama_print_timings:       total time = 46395.99 ms

Now with a real pause instruction (dmb ish on aarch64), it's getting even better:

llama_print_timings:        load time =   595.64 ms
llama_print_timings:      sample time =   211.63 ms /    52 runs   (    4.07 ms per token,   245.71 tokens per second)
llama_print_timings: prompt eval time = 35610.44 ms /   842 tokens (   42.29 ms per token,    23.64 tokens per second)
llama_print_timings:        eval time = 10195.07 ms /    51 runs   (  199.90 ms per token,     5.00 tokens per second)
llama_print_timings:       total time = 46024.95 ms

Increasing the sched_yield() trigger threshold to 50 iterations yields even more gains:

llama_print_timings:        load time =   590.06 ms
llama_print_timings:      sample time =   349.96 ms /    86 runs   (    4.07 ms per token,   245.74 tokens per second)
llama_print_timings: prompt eval time = 35435.52 ms /   842 tokens (   42.08 ms per token,    23.76 tokens per second)
llama_print_timings:        eval time = 16199.78 ms /    85 runs   (  190.59 ms per token,     5.25 tokens per second)
llama_print_timings:       total time = 51998.77 ms

So basically we were at 4.52 tok/s with sched_yield(), 0.32 tok/s without it, and 5.25 with it slightly rearranged, hence 16% faster than the original and 16 times faster than the current master.

The patch should be cut in two parts: one to define a pause() instruction and one to use it. I've only done x86_64 and aarch64, but armv7 could benefit from it as well. I cannot test on Windows and I don't know if Windows on ARM is supported by the project, because the asm statement might differ there.

First patch to define ggml_pause():

diff --git a/ggml.c b/ggml.c
index c10877a..fcc9a5e 100644
--- a/ggml.c
+++ b/ggml.c
@@ -15887,6 +15887,14 @@ typedef pthread_t ggml_thread_t;
 
 #endif
 
+#if defined(__x86_64__) || (defined(_MSC_VER) && defined(_M_AMD64))
+#define ggml_pause()    _mm_pause()
+#elif defined(__aarch64__)
+#define ggml_pause()    __asm__("dmb ish")
+#else
+#define ggml_pause()    __asm__("")
+#endif
+

Second one to use it:

diff --git a/ggml.c b/ggml.c
index c10877a..fcc9a5e 100644
--- a/ggml.c
+++ b/ggml.c
@@ -16045,10 +16053,19 @@ static thread_ret_t ggml_graph_compute_thread(void * data) {
         } else {
             // wait for other threads to finish
             const int last = node_n;
-            do {
-                //sched_yield();
+           unsigned int retries = 0;
+           unsigned int loops;
+
+            while (1) {
                 node_n = atomic_load(&state->shared->node_n);
-            } while (node_n == last);
+                if (node_n != last)
+                    break;
+                retries++;
+                if (retries > 50)
+                    sched_yield();
+                for (loops = 0; loops < (1U << ((retries-1) & 0x7)); loops++)
+                    ggml_pause();
+            }
         }
 
         // check if we should stop

@mqy
Collaborator

mqy commented Jul 10, 2023

marl uses nop + sched_yield(). Since we are not aiming at energy savings, and pause may slow down the CPU frequency, I prefer nop.

https://github.com/google/marl/blob/aa9e85b2189d6f5dbba6909275661b37dfb5de69/src/scheduler.cpp#L640

@wtarreau
Contributor

The problem doesn't have anything to do with energy savings, but with the extreme hammering of a shared cache line that every other thread tries to get exclusive access to: nobody can work, and each finishing thread has to wait for all the other ones to give back that cache line for exclusive access just to perform the atomic_fetch_sub(). I already tried to split n_active from n_nodes just in case, but it brought very marginal gains. The right approach would be to let the threads loop on a distinct counter that serves as the start signal, reset only once everyone is ready. It must obviously be located in a distinct cache line so that it never hurts the other threads that are trying to touch it, and it is set only once and reset only once. A counter can work as well (and is often better), as only the finishing thread has to increment it.
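
A sketch of the layout I mean (illustrative only, field names made up): the counter the idle threads spin on lives in its own cache line, away from anything the working threads write.

#include <stdatomic.h>

#define CACHE_LINE 64

typedef struct {
    // written once per step by the coordinating thread, read by all spinners
    _Alignas(CACHE_LINE) atomic_int start_counter;
    char pad0[CACHE_LINE - sizeof(atomic_int)];

    // written by finishing workers, read only by the coordinating thread
    _Alignas(CACHE_LINE) atomic_int n_finished;
    char pad1[CACHE_LINE - sizeof(atomic_int)];
} sync_state_t;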

Also on arm the "dmb ish" (equivalent to pause) is much more efficient than a series of NOP because it only flushes the instruction queue (not the cache), so that effectively adds a hole in the pipeline during which a core doesn't hinder the other ones. On x86 it's very interesting because it relinquishes the execution ports for the sibling thread of the same core for a few tens to hundreds of cycles, which is great as well (better perf on the other one, no contention).

I can try to have a look at this later.

@mqy
Collaborator

mqy commented Jul 10, 2023

The right approach would be to let the threads loop on a distinct counter that serves as the start signal, reset only once everyone is ready.

A simple demo.

static thread_ret_t demo_compute_thread(void *data) {

@mqy
Collaborator

mqy commented Jul 10, 2023

Also on arm the "dmb ish" (equivalent to pause) is much more efficient than a series of NOP

I found "DMB ish ; data memory barrier" in this article: The ARM processor (Thumb-2), part 11: Atomic access and barriers. I don't understand it well yet. I'm wondering: since the default memory order of atomics is memory_order_seq_cst, do we need another layer of memory barriers?

In Rust, the spin hint on ARM is a bit more complicated than on x86-64: https://github.com/rust-lang/stdarch/blob/a4cde4a0ec9dcd05247eff5b2a23d695e46a6850/crates/core_arch/src/arm_shared/hints.rs#L62

@wtarreau
Contributor

I'm sorry, I don't understand what you're trying to illustrate. This task-allocator.c file is not the one we're working on, and the Rust link isn't relevant here. Regarding "dmb ish": while "dmb" is a data memory barrier, here it's used in conjunction with the instruction cache and has the effect of only flushing the queue of decoded instructions (think self-modifying code; it's more or less equivalent to the old jmp $+2 we used to do on x86 several decades ago, except that the equivalent b .+4 has no effect). The nice point about flushing the instruction queue is that it doesn't perform any operation, so it's extremely cheap: no bus access, etc. Instructions are just reloaded from the L1I and thus delayed for a few cycles, with no interaction with anything else. Compare this with a syscall that writes to the stack and stores/restores all registers to/from a TSS, wasting memory bandwidth at the very moment when memory contention is what is causing the thread to wait!

@BrickBee

BrickBee commented Jul 10, 2023

Interesting, it seems that different systems require different code to achieve the best token-generation performance, at least with the current threading approach. The diff with the retries seems to be most beneficial when all cores (can) run at the same speed. On the AMD 7950X3D the cores always have different frequencies by design.
I made these benchmark runs on Windows, but maybe they still illustrate something useful in this context.

The data is a bit noisy, so I made 20 runs per variant and sorted the data.
[Chart: sorted token-generation times over 20 runs per variant, 32 threads]
When running with 32 threads it's clear that the provided patch leads to better performance, but a slight modification is even better here.

  • "None": The current version, no sched_yield (which is a Sleep(0) on Windows).
  • "Sleep(0)": The previous version.
  • "YieldProcessor": Replaced Sleep(0) with YieldProcessor (_mm_pause) for comparison.
  • "Sleep-after-50 & Incr. Yield": The provided patch.
  • "Sleep-after-50 & Incr. Yield": The provided patch, but without the for-loop part.
    Not yielding seems to be faster on this system.

But there is more: Running with just 6 threads is faster than running with 32.
[Chart: the same comparison with 6 threads]
The data is noisier here, but it seems that the regular Sleep(0) from the previous version and the non-yielding loop version have the same performance.
I have manually pinned the 6 threads to different cores and CCDs to give as much cache as possible to each individual thread. Without doing so the effect of running with a lower number of threads was not that strong for me.

@wtarreau maybe you can try to reproduce this effect on your system? Repeat your test with just 10 threads and a short prompt to save time when testing the token generation, like "Write a poem consisting of three paragraphs".
If it's also faster with less threads there, then there could still be a lot of unnecessary pressure on cache or RAM, even in the modified version discussed here.

@wtarreau
Contributor

On some machines (e.g. a 24-core AMD 74F3) I already found that halving the number of threads yielded better results by spending far less time waiting on the threads, but that was about two months ago, before the changes to threading; the perf top outputs were horrible. I have not retried on that machine since, as I got much better performance on an 8-core Xeon at a higher frequency (that's classic with thread contention). I wanted to invest more time in this when I have some available, but for now I can't.

Just a point: in your case, if the time increases with successive runs, it almost certainly means your CPU has exhausted the time it is allowed to spend in the maximum frequency bins to stay under the power envelope, and is progressively lowering the frequency. This also has an effect, as with fewer cores running you'll stay longer (even forever) at the max frequency. That's especially true when AVX2 is being used.

I've rerun a few tests for you (good idea about the poem as a short prompt BTW, I'm doing that with a fixed seed), with a slight improvement over the retries above (in fact I took the EBO code from my progressive locks and tailored it for this test), and ran it at different thread counts:

  • 80 threads: 246 ms per token
  • 40 threads: 156 ms per token
  • 35 threads: 156 ms per token
  • 30 threads: 155 ms per token <-- optimal
  • 25 threads: 156 ms per token
  • 20 threads: 159 ms per token

By comparison the previous "optimal" code (the one hammering on sched_yield) at commit ed9a54e showed 156ms at 30 threads.

Please note that this machine has "only" 6 memory channels and that they will still quickly become a bottleneck. Also, in an ideal case we should probably mark these memory areas non-cacheable so that reading from them does not cause any cache eviction.

Finally, another point I'd like to study as well is the amount of time a thread spends spinning, waiting for the last thread to finish its work in ggml_compute_thread(). Couldn't we try to pull in more work to make the ready thread restart working instead of all being busy waiting ? Maybe it's not easy (very likely), but if we managed to alternate between two sets of data to compute with, it would be possible to have some threads working on one set while the other ones attack the new set, and only when the first set is done would the first ready thread prepare that data again. This way there would be almost no time lost waiting for other threads. But I don't know the data model nor the memory access patterns, so I could be wrong of course.

@wtarreau
Contributor

I just realized I forgot to attach the updated patch. It might not necessarily be optimal on x86 (not tried there), though the original it's derived from was optimized on a large x86 system, so YMMV.

diff --git a/ggml.c b/ggml.c
index c10877a..537fa6b 100644
--- a/ggml.c
+++ b/ggml.c
@@ -16045,10 +16053,28 @@ static thread_ret_t ggml_graph_compute_thread(void * data) {
         } else {
             // wait for other threads to finish
             const int last = node_n;
-            do {
-                //sched_yield();
+            unsigned int wait_time = 1;
+
+            while (1) {
+                unsigned int loops = wait_time;
+
                 node_n = atomic_load(&state->shared->node_n);
-            } while (node_n == last);
+                if (node_n != last)
+                    break;
+
+                if (loops >= 8192) {
+                    for (; loops >= 65536; loops -= 32768)
+                        sched_yield();
+
+                    for (; loops >= 8192; loops -= 256)
+                        ggml_pause();
+               }
+
+                for (; loops >= 100; loops -= 100)
+                    __asm__("");
+
+                wait_time = ((wait_time + (wait_time >> 1)) + 2) & 0x3ffff;
+            }
         }
 
         // check if we should stop

@BrickBee

The updated patch performs worse on my system than the simple "Sleep after 50" without _mm_pause. When running with 32 threads it's just a tiny bit slower than the previous variants with and without _mm_pause.
When running with 6 threads the new patch leads to performance that's slightly worse than everything else that I measured for 6 threads so far. When removing the _mm_pause it's just slightly slower than the original Sleep(0) or the variant with 50 retries first.

Finally, another point I'd like to study as well is the amount of time a thread spends spinning, waiting for the last thread to finish its work in ggml_compute_thread(). Couldn't we try to pull in more work to make the ready thread restart working instead of all being busy waiting ?

#2026 does that. I think it came up due to the E-Cores where the effect would be the most noticeable.

In your case if the time increases with successive runs, it certainly means your CPU has exhausted the max allocated time at the maximum frequency bins to stay under the power envelope, and is progressively shrinking the frequency. This does also have an effect, as with less cores running you'll stay longer (even forever) at the max frequency.

Yes, that's what happens during prompt processing. During token generation my CPU stays way below the thermal and power limits. It seems my "sorted the data" remark was not clear enough. The original data of the interleaved runs with pauses between runs and batches was noisy with no visible pattern. So, I sorted the measurements for the graph to provide a better overview, instead of squeezing all 20 measurements into a single candlestick per variant with error bars.

In any case, we can see that there can be small performance differences of the different approaches depending on the system, yet the removal of the sched_yield() doesn't seem to do any good, aside from the M2 Pro case.

@wtarreau
Contributor

The updated patch performs worse on my system than the simple "Sleep after 50" without _mm_pause. When running with 32 threads it's just a tiny bit slower than the previous variants with and without _mm_pause. When running with 6 threads the new patch leads to performance that's slightly worse than everything else that I measured for 6 threads so far. When removing the _mm_pause it's just slightly slower than the original Sleep(0) or the variant with 50 retries first.

OK, thanks for testing! As often, optimizing for the last drop on a specific arch makes things worse on another one.

Finally, another point I'd like to study as well is the amount of time a thread spends spinning, waiting for the last thread to finish its work in ggml_compute_thread(). Couldn't we try to pull in more work to make the ready thread restart working instead of all being busy waiting ?

#2026 does that. I think it came up due to the E-Cores where the effect would be the most noticeable.

Good point indeed! That would be the same on big.little ARM configurations.

It seems my "sorted the data" remark was not clear enough. The original data of the interleaved runs with pauses between runs and batches was noisy with no visible pattern. So, I sorted the measurements for the graph to provide a better overview, instead of squeezing all 20 measurements into a single candlestick per variant with error bars.

Ah, got it now, indeed I didn't understand you had sorted it, yeah that does help to compare, and it makes sense.

In any case, we can see that there can be small performance differences of the different approaches depending on the system, yet the removal of the sched_yield() doesn't seem to do any good, aside from the M2 Pro case.

Given that it divides the whole performance by 16 on some machines, it's a significant problem. We'd need to figure a simple balanced solution that's reasonably good everywhere and from there try to do better with pipelined work.

@henk717

henk717 commented Jul 12, 2023

On my Ryzen 1700X system using 7 threads, without sched_yield I get 4 t/s on a 30B Q4_0 quant with all layers on the GPU via cuBLAS.
With sched_yield this becomes 10 t/s, so for me it is highly needed; otherwise I get a big speed regression.

This was tested on Windows while troubleshooting a speed regression on Koboldcpp.

@wtarreau
Contributor

Not surprised, same problem: high contention between threads, and Ryzens do not easily give up their cache lines. Could you try replacing the sched_yield() with _mm_pause(), and another test with 4 of them separated by semicolons? I guess it will already significantly improve the situation. If you're able to apply patches, the two above are a bit "smarter", combining pause and sched_yield() depending on the waiting time, and should further improve it for you.
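
Concretely, the quick test would look something like this in the wait loop quoted at the top of this issue (illustrative; _mm_pause() comes from <immintrin.h> on x86):

            // wait for other threads to finish
            const int last = node_n;
            do {
                // sched_yield();
                _mm_pause(); _mm_pause(); _mm_pause(); _mm_pause();  // 4 pause hints instead
                node_n = atomic_load(&state->shared->node_n);
            } while (node_n == last);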

@henk717

henk717 commented Jul 13, 2023

For now we decided to ship a build to our users with it restored, after multiple users complained about a 50% loss in performance on Windows, on both AMD and Intel.

If this is a very platform-dependent case, wouldn't it make sense to exclude it only for the platforms it benefits and leave it in for everything else? Because unless I missed something, my understanding is that removing it only helps on macOS.

@wtarreau
Contributor

I suspect that even on macOS it's just one loss versus another loss, and that a better approach would give better results. I don't know how often that part is called; I'd like to have a look at it maybe this weekend. I suspect that switching to a cond wait after several iterations over pause() could further improve the situation, but it's generally tricky to mix these together.

@wtarreau
Contributor

Trying with pthread_cond_wait() unfortunately gives horrible performance. Just calling the pthread API there is expensive, since the loop is fast: I measured 774k calls/s with 80 threads on a 7B model and 338k on a 13B one. But the real problem is how to wake all threads up; in practice we wake them one at a time, which is too slow. Even with pthread_cond_broadcast(), since all threads must acquire the mutex when waking up, their wakeup is heavily serialized. I thought about trying a file descriptor instead (e.g. a blocking recv(MSG_PEEK)), but any syscall I add here makes everything much worse; we don't sleep long enough in this loop. So it looks like the approach with ggml_pause() first and sched_yield() next remains the lightest.

@ggerganov
Owner

ggerganov commented Jul 14, 2023

I measured 774k calls/s with 80 threads on a 7B model and 338k on a 13B one.

There is no point in using so many threads. The memory bandwidth is easily saturated on pretty much all hardware with just 8 threads. I would be surprised if you see any performance gain from using more than 8 threads.

I don't think this change degrades the performance - show me a specific example.
There is too much info above and most of it is discussing impractical use cases of 32 threads and beyond.

@BrickBee

memory bandwidth is easily saturated on pretty much all hardware with just 8 threads

It is, and there is thus not much point in using more threads for token generation. Running with 6 threads is 0.15% slower than before the change (Sleep 0 vs None in my graph above). That's indeed not too bad.

The situation is different for prompt processing though. Not using all the cores and sticking to 6 or 8 threads for optimal token generation speed means having not even half of the possible prompt processing speed. So, it could be useful to be able to specify a different number of threads for prompt processing and token generation, to achieve the highest possible performance on both.

@wtarreau
Contributor

In fact the performance grows until 30-40 threads and then stays flat. However, I can significantly increase total throughput by running more processes in parallel, indicating that the memory bandwidth is not that bad (there are 6 DDR4 channels). Here are the numbers I'm measuring on the same test with 8 to 80 threads:

8:   263ms (3.80 t/s)
16:  156ms (6.40 t/s)
24:  142ms (7.03 t/s)
32:  141ms (7.11 t/s)
40:  141ms (7.06 t/s)
80:  150ms (6.67 t/s)

So the limit sits between 24 and 32 threads. However, if I start several processes in parallel, I can almost double the throughput, indicating that up to half of the time in the current situation is probably spent synchronizing threads. I've cut the machine into 1-4 groups of 20 threads, 1-3 groups of 26, and 5 groups of 16 to compare:

1x26:  140ms (7.14 t/s)
2x26:  161ms (12.38 t/s)
3x26:  260ms (11.51 t/s)

1x20:  146ms (6.83 t/s)
2x20:  209ms (9.56 t/s)
3x20:  268ms (11.19 t/s)
4x20:  305ms (13.12 t/s)

5x16:  408ms (12.25 t/s)

Here at 4x20 we're almost twice as fast as the fastest that a single process delivers.

I know pretty well that many workloads cannot scale linearly, for reasons ranging from data dependencies to the cost of distributing the work. I'm not trying to squeeze the last ounce out of it, just to make sure we use the hardware as efficiently as we can. In any case, 8 threads is far too low here.

With sched_yield() removed, I saw the performance totally ruined (divided by 16) due to extreme contention between the threads. I can't reproduce it at the moment, but I know it depends a lot on where the data are mapped in memory with respect to the various channels.

At the very least, a pause instruction is definitely needed to make sure we don't hammer the other cores' caches while they're trying to use them.

Also, I don't know how this affects prompt processing, as I'm also particularly interested in the speed of prompt processing to analyse contents.

@wtarreau
Contributor

And here are some measurements on a large prompt (I'm using the engine to analyze Git commit messages).

The table below shows execution times in seconds at 20, 40 and 80 threads. Note that at 80 threads on master, without sched_yield(), I had to press Ctrl-C after several minutes as it was producing something like one word every 3 seconds.

(Note: ed9a54e is before the removal of sched_yield, master is 1d16309 after the removal, pause_only is a single pause instruction in the loop after the atomic_load, and pause+sched is the patch above calling multiple pauses then sched_yield if it lasts too long.)

        Execution time in seconds, lower is better

         ed9a54e    master    pause_only   pause+sched
20t       112        112        112           112
40t       73.0       73.3       72.9          72.9
80t       71.7       ----       82.2          65.2

So in this specific use case (which is my target), 80 threads are almost twice as fast as 20 when avoiding a spinning loop, but are not usable at all anymore on the current master.

If you agree I can send you a PR for the two simple patches above, but I'd first like to make sure we agree that the removal of sched_yield() here definitely causes a regression in such use cases.

@wtarreau
Contributor

Actually the test above was run with the machine configured in NUMA mode. In default (unified) mode we have:

         ed9a54e    master    pause_only   pause+sched    EBO
20t       104        104        104           104         104
40t       61.1       61.7       67.5          67.5        67.1
80t       42.2       44.4       46.0          46.0        44.9

What it shows is that supporting 80 threads is absolutely mandatory on this platform, and that the loss is not as bad when NUMA is not involved (or, conversely, as I suspected, the loss of sched_yield() really hurts certain memory access patterns). The sched_yield() version is better in every respect here, but the other ones involving a pause are not as good anymore. It does give me a better target to analyze now: I need to figure out how to recover the original performance without degrading other platforms.

I've also measured output generation per thread count at commit ed9a54e:

Threads     8     12     16     20     24     32     40     48     56     64     72     80
tokens/s    3.57  5.15   6.60   7.90   8.99   10.63  11.32  11.18  11.06  10.99  10.85  10.78

So it actually scales pretty well and does not really collapse past the inflection point, which indicates that it's worth making an effort to maintain good thread scalability.

@goerch
Collaborator

goerch commented Jul 20, 2023

It's likely we could see another 7% performance boost here by switching the work dispatch architecture to a thread pool.

I'd tend to agree.

github-actions bot added the stale label Mar 25, 2024
@Edgar-I

Edgar-I commented Apr 3, 2024

Maybe we can choose the fastest way at runtime? E.g. MAB (multi-armed bandit) optimization. IIRC it was used in ClickHouse.

Edit: I found an article: https://clickhouse.com/blog/lz4-compression-in-clickhouse. Scroll down to "Testing on different CPUs", but I suggest reading it from top to bottom.
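
As a purely illustrative sketch, an epsilon-greedy bandit choosing between wait strategies at runtime could look like this (nothing like it exists in ggml today):

// Toy epsilon-greedy bandit over wait strategies, in the spirit of the
// ClickHouse approach linked above. Illustrative only.
#include <stdlib.h>

enum { WAIT_SPIN, WAIT_PAUSE, WAIT_YIELD, WAIT_N };

typedef struct {
    double avg_ns[WAIT_N];   // running average of measured wait cost per strategy
    long   count[WAIT_N];
} bandit_t;

// Pick a strategy: explore 10% of the time, otherwise exploit the best arm.
static int bandit_pick(const bandit_t *b) {
    if (rand() % 10 == 0) return rand() % WAIT_N;
    int best = 0;
    for (int i = 0; i < WAIT_N; i++) {
        if (b->count[i] == 0) return i;          // try each arm at least once
        if (b->avg_ns[i] < b->avg_ns[best]) best = i;
    }
    return best;
}

// Fold a new measurement into the running average of the chosen arm.
static void bandit_update(bandit_t *b, int arm, double elapsed_ns) {
    b->count[arm]++;
    b->avg_ns[arm] += (elapsed_ns - b->avg_ns[arm]) / (double) b->count[arm];
}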

github-actions bot removed the stale label Apr 4, 2024
github-actions bot added the stale label May 5, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
