LLaMA NUMA could be better #1437

Closed · zrm opened this issue May 13, 2023 · 44 comments
Labels: performance (Speed related topics), stale

Comments

zrm (Collaborator) commented May 13, 2023

llama.cpp is memory bound, let's see what has a lot of memory bandwidth:

NVIDIA V100 32GB: 900GB/s
2S Epyc 9000 (12xDDR5-4800/S): 922GB/s
NVIDIA A100 40GB: 1555GB/s
2S Xeon Max (HBM): 2TB/s
NVIDIA A100 80GB: 2TB/s
8S Xeon Scalable v4 (8xDDR5-4800/S): 2.45TB/s

NUMA systems have a lot of bandwidth because each socket brings its own memory channels (or HBM, for Xeon Max). Okay, but the cheapest thing on that list is ~$6000. What if I'm not rich?

(~$350 w/ 16GB, max ~128GB) common PC (2xDDR4-3200): 51GB/s
(~$450 w/ 8GB, ~$600 w/ 16GB) Mac Mini M1: 68GB/s
(~$600 w/ 8GB, ~$800 w/ 16GB) Mac Mini M2: 100GB/s
(~$200 w/ 64GB, max ~768GB) 2S Xeon E5 v1 (4xDDR3-1600/S): 102GB/s [no F16C so f16 models slower]
(~$250 w/ 64GB, max ~768GB) 2S Xeon E5 v2 (4xDDR3-1866/S): 119GB/s
(~$350 w/ 128GB, max ~3000GB) 2S Xeon E5 v4 (4xDDR4-2400/S): 154GB/s

Hmm. Xeon E5-2690 v1 for $9 each on eBay. Let's see how we do.

$ lscpu
...
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
NUMA node(s): 1

$ ./main -m models/7B/ggml-model-q4_0.bin -n 512 -p "Building a website can be done in 10 simple steps:"
...
llama_print_timings: sample time = 406.79 ms / 512 runs ( 0.79 ms per token)
llama_print_timings: prompt eval time = 27899.73 ms / 271 tokens ( 102.95 ms per token)
llama_print_timings: eval time = 74773.93 ms / 510 runs ( 146.62 ms per token)

Not terrible for 11-year-old hardware. Let's try it with two sockets:

$ lscpu
...
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2

$ ./main -m models/7B/ggml-model-q4_0.bin -n 512 -p "Building a website can be done in 10 simple steps:"
...
llama_print_timings: sample time = 438.34 ms / 512 runs ( 0.86 ms per token)
llama_print_timings: prompt eval time = 27083.17 ms / 271 tokens ( 99.94 ms per token)
llama_print_timings: eval time = 129373.98 ms / 510 runs ( 253.67 ms per token)

Twice as many cores, twice as much memory bandwidth, and it's slower.
Oh, get_num_physical_cores() is broken: it returns only 8 of the 16 physical cores because "cpu cores" in /proc/cpuinfo is a per-socket count. I submitted a pull request.
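
The idea behind the fix is to count unique (physical id, core id) pairs across the whole of /proc/cpuinfo instead of trusting "cpu cores"; a rough sketch (not the exact code in the PR):

#include <stdbool.h>
#include <stdio.h>

// Count physical cores across all sockets by collecting unique
// (physical id, core id) pairs; "cpu cores" alone is per-socket.
static int count_physical_cores(void) {
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (!f) return -1;
    int seen[1024][2];
    int n_seen = 0;
    int phys_id = -1, core_id = -1;
    char line[256];
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "physical id : %d", &phys_id) == 1) continue;
        if (sscanf(line, "core id : %d", &core_id) != 1) continue;
        bool found = false;
        for (int i = 0; i < n_seen; i++) {
            if (seen[i][0] == phys_id && seen[i][1] == core_id) { found = true; break; }
        }
        if (!found && n_seen < 1024) {
            seen[n_seen][0] = phys_id;
            seen[n_seen][1] = core_id;
            n_seen++;
        }
    }
    fclose(f);
    return n_seen;
}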

$ ./main -m models/7B/ggml-model-q4_0.bin -n 512 -p "Building a website can be done in 10 simple steps:" -t 16
...
llama_print_timings: sample time = 451.48 ms / 512 runs ( 0.88 ms per token)
llama_print_timings: prompt eval time = 16092.04 ms / 271 tokens ( 59.38 ms per token)
llama_print_timings: eval time = 102018.05 ms / 510 runs ( 200.04 ms per token)

Well, the prompt eval time is better. Maybe it benefits from hyperthreading?

$ ./main -m models/7B/ggml-model-q4_0.bin -n 512 -p "Building a website can be done in 10 simple steps:" -t 32
...
llama_print_timings: sample time = 399.47 ms / 512 runs ( 0.78 ms per token)
llama_print_timings: prompt eval time = 14734.68 ms / 271 tokens ( 54.37 ms per token)
llama_print_timings: eval time = 97250.82 ms / 510 runs ( 190.69 ms per token)

Still something's not right.

$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 96609 MB
node 0 free: 96320 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 64506 MB
node 1 free: 60183 MB
node distances:
node 0 1
0: 10 20
1: 20 10

There it is. The whole model is loaded into the memory of one node. Let's try node interleave.

# echo 3 > /proc/sys/vm/drop_caches

$ numactl --interleave=0-1 ./main -m models/7B/ggml-model-q4_0.bin -n 512 -p "Building a website can be done in 10 simple steps:" -t 32
...
llama_print_timings: sample time = 397.83 ms / 512 runs ( 0.78 ms per token)
llama_print_timings: prompt eval time = 14894.56 ms / 271 tokens ( 54.96 ms per token)
llama_print_timings: eval time = 57045.66 ms / 510 runs ( 111.85 ms per token)

That's an improvement. Now it's >30% faster than a single socket and basically the same speed as my Ryzen 5 5600G from 2021, for about half the price. Let's see what happens on a machine with 4 NUMA nodes (16C/32T):

$ ./main -m models/7B/ggml-model-q4_0.bin -n 512 -p "Building a website can be done in 10 simple steps:" -t 16
...
llama_print_timings: sample time = 456.06 ms / 512 runs ( 0.89 ms per token)
llama_print_timings: prompt eval time = 13954.33 ms / 271 tokens ( 51.49 ms per token)
llama_print_timings: eval time = 108925.89 ms / 510 runs ( 213.58 ms per token)

$ ./main -m models/7B/ggml-model-q4_0.bin -n 512 -p "Building a website can be done in 10 simple steps:" -t 32
...
llama_print_timings: sample time = 514.30 ms / 512 runs ( 1.00 ms per token)
llama_print_timings: prompt eval time = 14288.35 ms / 271 tokens ( 52.72 ms per token)
llama_print_timings: eval time = 109354.09 ms / 510 runs ( 214.42 ms per token)

# echo 3 > /proc/sys/vm/drop_caches

$ numactl --interleave=0-3 ./main -m models/7B/ggml-model-q4_0.bin -n 512 -p "Building a website can be done in 10 simple steps:" -t 16
...
llama_print_timings: sample time = 477.99 ms / 512 runs ( 0.93 ms per token)
llama_print_timings: prompt eval time = 14164.87 ms / 271 tokens ( 52.27 ms per token)
llama_print_timings: eval time = 67402.83 ms / 510 runs ( 132.16 ms per token)

$ numactl --interleave=0-3 ./main -m models/7B/ggml-model-q4_0.bin -n 512 -p "Building a website can be done in 10 simple steps:" -t 32
...
llama_print_timings: sample time = 489.53 ms / 512 runs ( 0.96 ms per token)
llama_print_timings: prompt eval time = 14511.16 ms / 271 tokens ( 53.55 ms per token)
llama_print_timings: eval time = 48623.21 ms / 510 runs ( 95.34 ms per token)

125% faster is alright.

I can submit a pull request that does the same thing (with a dependency on libnuma) if you want it.

But this is not the best we can do. Interleave spreads the model across all nodes randomly, and there is still heavy, slow cross-node memory access; it's just better than having all the cores contend for the memory of one node.

The better way is to explicitly load 1/Nth of the model on each node and then have a thread pool per node which is assigned the operations on that subset of the model.
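
Roughly, with libnuma, that would look something like this (a rough sketch with made-up helper names, just to make the idea concrete):

#include <numa.h>
#include <pthread.h>
#include <string.h>

struct model_shard {
    void  *data;   // 1/Nth of the weights, resident on `node`
    size_t size;
    int    node;
};

// Hypothetical: copy one contiguous slice of the model onto each NUMA node.
static void shard_model(const void *model, size_t model_size,
                        struct model_shard *shards, int n_nodes) {
    size_t chunk = model_size / n_nodes;
    for (int node = 0; node < n_nodes; node++) {
        size_t sz = (node == n_nodes - 1) ? model_size - node * chunk : chunk;
        shards[node].data = numa_alloc_onnode(sz, node); // pages placed on this node
        shards[node].size = sz;
        shards[node].node = node;
        memcpy(shards[node].data, (const char *)model + node * chunk, sz);
    }
}

// Each worker in a node's thread pool restricts itself to that node's CPUs
// and only runs the ops that touch its local shard.
static void *node_worker(void *arg) {
    struct model_shard *s = arg;
    numa_run_on_node(s->node);
    // ... pull ops for s->data from this node's queue and execute them ...
    return NULL;
}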

ggerganov added the performance (Speed related topics) label May 13, 2023

ggerganov (Owner) commented May 13, 2023

A dependency on libnuma in ggml is not an option. I guess it could be added as a llama.cpp example, depending on the complexity.

But this analysis alone is very informative - thank you for posting this.

Might be interesting to see how some of the AVX-512 formats in #1073 perform with this NUMA fine-tuning

ghost commented May 14, 2023

A LLAMA_NUMA=on compile option with libnuma might work for this case, considering that this looks like a decent performance improvement. I'm actually surprised that no one else spotted this, considering I've seen other 2S systems discussed in previous issues.

Also I'm finding it interesting that hyper-threading is actually improving inference speeds in this example, since the general consensus in the past was that it degrades performance. Perhaps the high memory bandwidth is a factor here?

constasmile commented May 14, 2023

I have a 2S Xeon E5-2680 v4 with 256GB DDR4-2400. I also ran into this problem. It would be great to add NUMA support. I can't help with the code (I'm not a programmer), but I can test it on my machine if needed. Maybe use something like "gpu-layers", but per NUMA node? Then the memory bandwidth would be fully utilized. Good luck to you guys.

gjmulder (Collaborator) commented

Most generic motherboards are two-channel, with one NUMA node. I have a four-channel, single-node board, but that comes at a cost. If you can afford to care about NUMA, then GPUs are always going to provide better performance, IMNSHO.

rankaiyx (Contributor) commented May 16, 2023

Why do your 2 NUMA and 4 NUMA have similar speeds?
Is the total memory bandwidth similar? Can you provide some information about the memory bandwidth of your machines?

rankaiyx mentioned this issue May 16, 2023
Sciumo commented May 16, 2023

So TR Pro 3995 shows only 1 NUMA node, 128 CPUs.
Memory bound, then?
Caches (sum of all):
L1d: 2 MiB (64 instances)
L1i: 2 MiB (64 instances)
L2: 32 MiB (64 instances)
L3: 256 MiB (16 instances)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-127

gjmulder (Collaborator) commented

So TR Pro 3995 shows only 1 NUMA node, 128 CPUs.

128 hypercores. 64 physical cores.

I've observed that using hypercores tends to slow things down. The theory behind hypercores is that if one thread is waiting on memory I/O, another thread can be scheduled; however, if all threads are memory-I/O bound, then hyperthreading simply causes a lot of context-switching overhead. Linux's perf stat can show you how much time cores are waiting on memory. IPC for memory-bound code like llama.cpp is usually around 1.0, indicating those L1d, L2 and L3 caches are all probably idling, waiting on main memory access as well.

zrm (Collaborator, Author) commented May 16, 2023

LLaMA 65B-f16 is ~122GB. You can get 128GB of DDR4-2400 for around $150 and a 2S 8-channel Xeon E5 v4 to put it in for around $200. How much is any GPU with 128GB of RAM?

Also I'm finding it interesting that hyper-threading is actually improving inference speeds in this example, since the general consensus in the past was that it degrades performance. Perhaps the high memory bandwidth is a factor here?

It's not surprising SMT usually doesn't help. On my 6-core Ryzen 5 5600G the "prompt eval" time goes up proportionally when dropping from -t 6 to -t 4, but the "eval" time stays the same; it takes -t 3 before it even goes up a little. Presumably it's completely memory bound, so what are more threads going to do? But slower CPUs with more memory channels might actually get compute-bound sometimes.

Why do your 2 NUMA and 4 NUMA have similar speeds?
Is the total memory bandwidth similar? Can you provide some information about the memory bandwidth of your machines?

The 2N and 4N systems both have 8 memory channels total. The 4N system is DDR4 instead of DDR3, so it has about 2/3rds more memory bandwidth, but because it's 4N, 3/4ths of random memory accesses will be off the local node instead of 1/2. Node interleave doesn't change that; it just stops all the accesses from hitting the same node, which is a larger improvement on the 4N system.

So TR Pro 3995 shows only 1 NUMA node, 128 CPUs.

Threadripper 3000 series sort of has two memory nodes but is effectively configured for node interleave: https://www.anandtech.com/show/15044/the-amd-ryzen-threadripper-3960x-and-3970x-review-24-and-32-cores-on-7nm/3

Epyc and older Threadripper (and basically anything with multiple sockets) show the NUMA nodes unless the user enables node interleave in the BIOS. Interleave is a bandwidth/latency trade off, but turning it on at the system level prevents applications from doing any further optimization. The performance hierarchy for multi-threaded applications is like this: worst) threads on every node accessing data on a single node; better) node interleave; best) threads on each node accessing data on the local node.

System-level interleave removes the possibility of the worst case by removing the possibility of the best case.

It looks like the biggest thing degrading performance on NUMA here is actually preload. The default memory policy is to prefer to load pages into the node of the thread that first accessed it. Call mmap() with MAP_POPULATE and the system will by default try to put the whole model on the node of the calling thread. Using 'numactl --interleave=all' changes the memory policy so that mmap stripes the allocation across all the nodes. We can get the same result by calling 'set_mempolicy(MPOL_INTERLEAVE, numa_all_nodes_ptr->maskp, numa_all_nodes_ptr->size)' at the beginning of main(), but set_mempolicy() requires -lnuma. (It's annoying that they put a syscall wrapper in a library that isn't installed by default. In theory we could use syscall(2) instead.)
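
Concretely, that would look something like this at the top of main() (sketch only; note the -lnuma requirement):

#include <numa.h>     // numa_available(), numa_max_node(), numa_all_nodes_ptr
#include <numaif.h>   // set_mempolicy(), MPOL_INTERLEAVE

int main(int argc, char **argv) {
    (void)argc; (void)argv;
    if (numa_available() >= 0 && numa_max_node() > 0) {
        // From here on, new allocations (including the model mapping that
        // MAP_POPULATE faults in) are interleaved across all NUMA nodes,
        // the same effect as running under `numactl --interleave=all`.
        set_mempolicy(MPOL_INTERLEAVE, numa_all_nodes_ptr->maskp, numa_all_nodes_ptr->size);
    }
    // ... rest of main() ...
    return 0;
}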

But blind interleave still isn't optimal; it's just better than what's happening now. And we can get a similar result without -lnuma just by not using preload on NUMA systems. Then the mapping gets faulted in during first use by all the different work threads, which spreads it out across different nodes. Better yet, the way matrix multiplication is currently done by ggml is already not far from what we want: the section of the model assigned to each thread is always the same. What happens if we pin work threads to nodes with pthread_setaffinity_np() so each thread is always on the same node it was on when that page of the model was first faulted in? Now ye olde pair of $9 CPUs is here:

llama_print_timings: sample time = 399.99 ms / 512 runs ( 0.78 ms per token)
llama_print_timings: prompt eval time = 14904.62 ms / 271 tokens ( 55.00 ms per token)
llama_print_timings: eval time = 49917.59 ms / 510 runs ( 97.88 ms per token)

At least for the first batch of 512. After that it slows down some:

llama_print_timings: sample time = 1594.96 ms / 2048 runs ( 0.78 ms per token)
llama_print_timings: prompt eval time = 99318.38 ms / 1813 tokens ( 54.78 ms per token)
llama_print_timings: eval time = 224513.88 ms / 2040 runs ( 110.06 ms per token)

It could be that at the end of the first batch it generates some new soon-to-be frequently accessed data that all gets stuck on one node again.

Currently the fastest thing to do is run once without preload or interleave so the model gets paged in with some relation to the access pattern, then after the OS has it in the page cache, use interleave so mutable state gets spread across nodes:

llama_print_timings: sample time = 1593.20 ms / 2048 runs ( 0.78 ms per token)
llama_print_timings: prompt eval time = 99734.43 ms / 1813 tokens ( 55.01 ms per token)
llama_print_timings: eval time = 196720.76 ms / 2040 runs ( 96.43 ms per token)

But that's only until we can figure out which data is causing the detriment without interleave. We can preferentially load data into any node we want without third party dependencies by setting thread affinity for the current thread to the cores on that node right before first access to that memory. Or if there is some specific data which inherently has to be accessed repeatedly by threads on multiple nodes, it may be faster to keep a copy of it on each node.
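
For example, a thread can pin itself to one node's CPUs like this before it first touches its slice (sketch with libnuma; the helper name is made up):

#define _GNU_SOURCE
#include <numa.h>
#include <pthread.h>
#include <sched.h>

// Restrict the calling thread to the CPUs of `node`; first-touch allocations
// and page faults made by this thread will then land on that node.
static int pin_current_thread_to_node(int node) {
    struct bitmask *cpus = numa_allocate_cpumask();
    if (numa_node_to_cpus(node, cpus) != 0) {
        numa_free_cpumask(cpus);
        return -1;
    }
    cpu_set_t set;
    CPU_ZERO(&set);
    for (unsigned int i = 0; i < cpus->size; i++) {
        if (numa_bitmask_isbitset(cpus, i)) {
            CPU_SET(i, &set);
        }
    }
    numa_free_cpumask(cpus);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}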

I'm still trying to understand some of this code. Like this:

ggml_lock_lock  (&state_shared.spin);
ggml_lock_unlock(&state_shared.spin);

There are no operations while the lock is held?

ggerganov (Owner) commented May 17, 2023

I'm still trying to understand some of this code. Like this:

ggml_lock_lock  (&state_shared.spin);
ggml_lock_unlock(&state_shared.spin);

There are no operations while the lock is held?

These lock / unlock calls are actually no-ops (the spin variable is an unused int).
I had some early implementations where I was trying to use spin locks to synchronize the threads. Later, I dropped this idea since busy-waiting on an atomic variable gives better performance. I kept the lock / unlock calls by replacing them with macros that effectively do nothing, so you can simply delete these 2 lines and the code will be equivalent:

llama.cpp/ggml.c

Lines 13683 to 13686 in d627025

#define ggml_lock_init(x) UNUSED(x)
#define ggml_lock_destroy(x) UNUSED(x)
#define ggml_lock_lock(x) UNUSED(x)
#define ggml_lock_unlock(x) UNUSED(x)

I am following your analysis with interest.
It's still a bit over my head, but I'm planning to do some tests of these NUMA settings on a 2-socket machine soon and try to understand the effects of node interleaving better

zrm (Collaborator, Author) commented May 22, 2023

echo 1 > /proc/sys/vm/numa_interleave

$ numactl --interleave=all ./main -n 512 -m models/65B/ggml-model-q4_0.bin --ignore-eos -p "Someone told me there is a configuration option called \"/proc/sys/vm/numa_interleave\" -- what" -t 32
...
Someone told me there is a configuration option called "/proc/sys/vm/numa_interleave" -- what does this do?
Some of the kernel's resource management features have to deal with both "NUMA-ness" and "SMP-ness". In many cases, the kernel makes some tradeoffs between memory bandwidth (when using NUMA) versus cache performance (when using SMP). Since the kernel must make these tradeoffs at runtime, we need a way for it to get hints from userland as to how we should proceed.
The numa_interleave sysctl controls one such knob. It can be set to 0-100%, where a higher number implies more interleaving of SMP resources (CPU cores) with NUMA nodes. A setting of zero means "don't do any interleaving."
This option only affects process scheduling, not the placement of pages in memory when they are allocated. There is no current way to get hints from userland about page placement.
The default value of numa_interleave is 67%, which was chosen by trial and error on several NUMA systems (including ours) as giving the best performance for a variety of workloads. The system-wide default can be overridden on a per-process basis using the numa_interleave scheduling class hint, described in Documentation/scheduler/sched-numa.txt.
The kernel's NUMA resource management features are described in Documentation/kernel-api/numa.rst.
The scheduling class hints (including numa_interleave) are described in Documentation/scheduler/sched-bwc.txt.
If you want to tune this value, then please do some performance testing on your systems and share the results with us so that we can improve the default for future kernel releases. Please send email to [email protected] or file a bug at http:https://www.kernel.org/.
The value of numa_interleave was chosen by trial and error, and it's likely that some other value would work better on your systems. Unfortunately, I don't have access to any of the original performance testing data used to choose this number.
As for how you should test performance: a simple benchmark like bonnie++ or sysbench is probably fine. See Documentation/admin-guide/benchmarking.rst for more information about some useful performance

llama_print_timings: load time = 13041.53 ms
llama_print_timings: sample time = 466.49 ms / 512 runs ( 0.91 ms per token)
llama_print_timings: prompt eval time = 140717.11 ms / 284 tokens ( 495.48 ms per token)
llama_print_timings: eval time = 333853.21 ms / 510 runs ( 654.61 ms per token)
llama_print_timings: total time = 475246.94 ms

$ ls /proc/sys/vm/numa_interleave
ls: cannot access '/proc/sys/vm/numa_interleave': No such file or directory

/proc/sys/kernel/numa_balancing is enabled by default. Let's try turning it off:

# echo 0 > /proc/sys/kernel/numa_balancing

Now I'm not seeing any further benefit from numactl --interleave. Once you stop prefetching the model (0d23f8c), de facto turning that off was apparently the source of its advantage.

Later, I dropped this idea since busy-waiting on an atomic variable gives better performance.

The trouble is you're threading something with a lot of serial dependencies in it, so the amount of work each thread can do before it has to synchronize with the other threads is small, and you end up doing thread synchronization thousands of times a second.
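
For reference, the busy-wait in question is essentially a spin barrier on an atomic counter; a simplified sketch (not ggml's actual code) looks like:

#include <stdatomic.h>

typedef struct {
    atomic_int n_arrived;
    atomic_int phase;
    int        n_threads;
} spin_barrier_t;

// Every thread calls this between steps; nobody sleeps, everybody spins.
static void spin_barrier_wait(spin_barrier_t *b) {
    int my_phase = atomic_load_explicit(&b->phase, memory_order_acquire);
    if (atomic_fetch_add_explicit(&b->n_arrived, 1, memory_order_acq_rel) == b->n_threads - 1) {
        // last thread to arrive: reset the counter and release the others
        atomic_store_explicit(&b->n_arrived, 0, memory_order_relaxed);
        atomic_fetch_add_explicit(&b->phase, 1, memory_order_release);
    } else {
        // busy-wait until the phase advances; cheap per call, but it runs
        // thousands of times per second when the serial sections are short
        while (atomic_load_explicit(&b->phase, memory_order_acquire) == my_phase) {
            // spin
        }
    }
}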

rankaiyx (Contributor) commented May 22, 2023

I was fooled by ChatGPT's fictional method: it told me that "echo 1 > /proc/sys/vm/numa_interleave" was a kernel-level alternative to "numactl --interleave=all", but that setting doesn't actually exist. When I realized this, I deleted my previous comments.

The method in your pull request should be the right solution. I am excited about your improvement. I'm assembling a NUMA computer with 2 CPUs and 8 channels of memory (E5-2698B v3 x2 / DDR3-1866 32GB x8), and I'll test it later.

ggerganov (Owner) commented

2S Epyc 9000 (12xDDR5-4800/S): 922GB/s
2S Xeon Max (HBM): 2TB/s
8S Xeon Scalable v4 (8xDDR5-4800/S): 2.45TB/s

Does anybody know if I can rent one of these in a cloud to test?

Sciumo commented Jun 2, 2023

I couldn't find any with that exact CPU config.

I'm going to try multiple processes with: https://github.com/huggingface/text-generation-inference

AphidGit commented Jun 13, 2023

If there is enough memory, it might be faster to load the model weights into memory on both NUMA nodes using numa_alloc_onnode from libnuma, then bind half the threads to node #1, and the other half to node #2. That way there isn't much crosstalk during inference, but it does consume 2x more memory.

If batch size > num_nodes, then there isn't any need for cross-node traffic at all (each generation can run in parallel).

Edit: In simpler terms: Load the model into memory twice. Once on node A, once on node B. Would that be doable?

rankaiyx (Contributor) commented

2S Epyc 9000 (12xDDR5-4800/S): 922GB/s
2S Xeon Max (HBM): 2TB/s
8S Xeon Scalable v4 (8xDDR5-4800/S): 2.45TB/s

Does anybody know if I can rent one of these in a cloud to test?

There should be a high-performance storage device to save time loading the model.

shaonianyr commented

So what's new in here? @zrm @ggerganov

q5sys commented Aug 27, 2023

I was considering nabbing a 2nd gen EPYC system for running models, but without better NUMA support that's not going to be anywhere near as effective as it could be. I've got a TB of DDR4 RAM that needs to be used in something...

ggerganov mentioned this issue Aug 13, 2023

ggerganov (Owner) commented

NUMA support has been merged for a while: #1556

Add the --numa flag to main to enable it. See the discussion in the PR for more analysis.

Ph0rk0z commented Sep 10, 2023

Should I also be running with no mmap when using numa?

bmtwl (Contributor) commented Mar 5, 2024

@ggerganov
I'm trying to create a branch that improves NUMA performance, but am having trouble adapting the existing CPU memory buffer allocation code due to lack of familiarity with the codebase.
The basic idea is that I want to:

  1. Split the existing giant cpu buffer into one buffer per layer
  2. Assign the memory buffers across the numa nodes round-robin with numa_alloc_onnode()
  3. Mark each buffer with the numa node it was associated with
  4. Assign threads to cpus on the correct numa nodes

That means we'll need to know how many numa nodes exist in llm_load_tensors, and which cpuset to assign in the set_numa_thread_affinity function... I don't know if there's a best way to do this, whether it's passing variables through structures/functions, or having some kind of mapping in hparams/g_state maybe?

I was hoping to figure out something on my own as a challenge, but I'm burning a lot of time trying to puzzle my way around. Any hints as to where/how to apply these strategies, or ways in which my thinking is fundamentally wrong, would be very appreciated.

I think making this approach general, as opposed to another #ifdef custom backend, would be a net positive. Most of the new AMD chips above entry level are starting to expose CCXs as NUMA nodes to NUMA-aware apps. It is shaping up to be an industry trend, with a lot of chiplet-based chips on the horizon.

Also, if you want to test any massively parallel code branches, I'm happy to run them on my rig and report results.

ggerganov (Owner) commented

  1. Split the existing giant cpu buffer into one buffer per layer

To do that, you can check this example from the whisper.cpp repo:

https://github.com/ggerganov/whisper.cpp/pull/1763/files

Here, we limit each memory buffer to a maximum of 1GB of data, but it can be easily modified so that each buffer contains a single tensor.

Hope this is enough to get things started, though I don't have a full vision of how to implement the entire thing

bmtwl (Contributor) commented Apr 6, 2024

@slaren Pinging you for help. I've been at this for over two months now, and have failed in my attempts to even force through a dirty mechanism just to prove what the performance benefit of memory locality would be; I've yet to produce a usable branch. The gap between my knowledge of the codebase and the surgery required to pull this off has led me to failure in dozens of attempts now.
I've had some help from JohannesGaessler, who gave up some of his time to get on a call with me and explain some of the structure and approaches of llama.cpp, but the memory alloc code and cpu scheduling code isn't something he was as familiar with as yourself.
There's a lot of plumbing in llama.cpp to allow multiple backends and layer splitting, but wrapping my head around enough of the llama.cpp/ggml.c/backends code to create something that works is something I'm going to need a hand with. It's frustrating because it feels like it wouldn't take more than a few tweaks in the right structs and code paths to make this work.
Based on non-llama.cpp POC code I've done, I believe there is a significant speedup for chiplet and multisocket machines here, so I don't think it would be a waste of your time. I'm willing to put in the work, but I need some kind of mentoring from someone who has a holistic vision of this project because I'm just drowning here...

slaren (Collaborator) commented Apr 6, 2024

If you have any questions just ask. For llama.cpp and other projects using ggml-backend, the memory of the tensors is allocated in a ggml_backend_buffer, typically in ggml_backend_cpu_buffer_type_alloc_buffer but there are other buffer types.

bmtwl (Contributor) commented Apr 6, 2024

If you have any questions just ask. For llama.cpp and other projects using ggml-backend, the memory of the tensors is allocated in a ggml_backend_buffer, typically in ggml_backend_cpu_buffer_type_alloc_buffer but there are other buffer types.

Thanks @slaren.
Thanks to stepping through a debugger for countless hours, I was already able to mostly figure that out, and even to allocate smaller buffers per numa node, but the problems I'm having are how to properly and cleanly:

  • determine which buffers are model inference tensor data vs other metadata
  • be able to keep some metadata with the buffers that indicates which numa node each buffer was allocated on (I kind of have this, but it's not working how I expect). I think buffer might not be the right level for this metadata?
  • making sure that all the buffers/tensors that a pthread needs to do work on are all on the same numa node

I'm really struggling to understand the relationship between a model, context (seems to have multiple meanings based on which part of the code?), tensors, cgraphs, plans, layers, bufs/bufts and other structs and variables.

I finally tried to cheese it by straight up creating one model/context object per numa node and attempting to reference the right model's data based on the pthread's CPU affinity, but couldn't reason my way through the different structs and the ways they are transformed as the model/context tuple is passed down through from main.cpp, into llama.cpp and eventually into ggml.c and the scheduled threads. Also, how to keep things consistent when re-integrating the hidden/shared context at the end so results are useful.

Laying it all out like this, it seems like it should be so simple, but I'm unable to crack it, and I feel like I'm still failing to ask the right questions somehow. Does any of this make sense?

slaren (Collaborator) commented Apr 6, 2024

  • determine which buffers are model inference tensor data vs other metadata
  • be able to keep some metadata with the buffers that indicates which numa node each buffer was allocated on (I kind of have this, but it's not working how I expect). I think buffer might not be the right level for this metadata?

Buffers that are used for model weights have the flag GGML_BACKEND_BUFFER_USAGE_WEIGHTS in buffer->usage. However that is generally decoupled from the buffer implementation, so it may not be very easy to use this. You can create a new buffer type specifically for this and add any state to it that you want, and handle the allocation, tensor initialization, etc in it (see the code at the bottom). Then you would need to modify llm_load_tensors in llama.cpp to use this buffer type.

  • making sure that all the buffers/tensors that a pthread needs to do work on are all on the same numa node

That's not really feasible because all the threads work on the same tensors, just different slices of it.

llama.cpp/ggml-backend.c

Lines 547 to 669 in 54ea069

GGML_CALL static const char * ggml_backend_cpu_buffer_name(ggml_backend_buffer_t buffer) {
    return "CPU";

    GGML_UNUSED(buffer);
}

GGML_CALL static void * ggml_backend_cpu_buffer_get_base(ggml_backend_buffer_t buffer) {
    uintptr_t data = (uintptr_t)buffer->context;

    // align the buffer
    if (data % TENSOR_ALIGNMENT != 0) {
        data = GGML_PAD(data, TENSOR_ALIGNMENT);
    }

    return (void *)data;
}

GGML_CALL static void ggml_backend_cpu_buffer_free_buffer(ggml_backend_buffer_t buffer) {
    free(buffer->context);
}

GGML_CALL static void ggml_backend_cpu_buffer_set_tensor(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor, const void * data, size_t offset, size_t size) {
    memcpy((char *)tensor->data + offset, data, size);

    GGML_UNUSED(buffer);
}

GGML_CALL static void ggml_backend_cpu_buffer_get_tensor(ggml_backend_buffer_t buffer, const struct ggml_tensor * tensor, void * data, size_t offset, size_t size) {
    memcpy(data, (const char *)tensor->data + offset, size);

    GGML_UNUSED(buffer);
}

GGML_CALL static bool ggml_backend_cpu_buffer_cpy_tensor(ggml_backend_buffer_t buffer, const struct ggml_tensor * src, struct ggml_tensor * dst) {
    if (ggml_backend_buffer_is_host(src->buffer)) {
        memcpy(dst->data, src->data, ggml_nbytes(src));
        return true;
    }
    return false;

    GGML_UNUSED(buffer);
}

GGML_CALL static void ggml_backend_cpu_buffer_clear(ggml_backend_buffer_t buffer, uint8_t value) {
    memset(buffer->context, value, buffer->size);
}

static struct ggml_backend_buffer_i cpu_backend_buffer_i = {
    /* .get_name    = */ ggml_backend_cpu_buffer_name,
    /* .free_buffer = */ ggml_backend_cpu_buffer_free_buffer,
    /* .get_base    = */ ggml_backend_cpu_buffer_get_base,
    /* .init_tensor = */ NULL, // no initialization required
    /* .set_tensor  = */ ggml_backend_cpu_buffer_set_tensor,
    /* .get_tensor  = */ ggml_backend_cpu_buffer_get_tensor,
    /* .cpy_tensor  = */ ggml_backend_cpu_buffer_cpy_tensor,
    /* .clear       = */ ggml_backend_cpu_buffer_clear,
    /* .reset       = */ NULL,
};

// for buffers from ptr, free is not called
static struct ggml_backend_buffer_i cpu_backend_buffer_i_from_ptr = {
    /* .get_name    = */ ggml_backend_cpu_buffer_name,
    /* .free_buffer = */ NULL, // ptr is not owned by the buffer, so it does not need to be freed
    /* .get_base    = */ ggml_backend_cpu_buffer_get_base,
    /* .init_tensor = */ NULL, // no initialization required
    /* .set_tensor  = */ ggml_backend_cpu_buffer_set_tensor,
    /* .get_tensor  = */ ggml_backend_cpu_buffer_get_tensor,
    /* .cpy_tensor  = */ ggml_backend_cpu_buffer_cpy_tensor,
    /* .clear       = */ ggml_backend_cpu_buffer_clear,
    /* .reset       = */ NULL,
};

GGML_CALL static const char * ggml_backend_cpu_buffer_type_get_name(ggml_backend_buffer_type_t buft) {
    return "CPU";

    GGML_UNUSED(buft);
}

GGML_CALL static ggml_backend_buffer_t ggml_backend_cpu_buffer_type_alloc_buffer(ggml_backend_buffer_type_t buft, size_t size) {
    size += TENSOR_ALIGNMENT;   // malloc may return an address that is not aligned
    void * data = malloc(size); // TODO: use GGML_ALIGNED_MALLOC (move to ggml-impl.h)
    if (data == NULL) {
        fprintf(stderr, "%s: failed to allocate buffer of size %zu\n", __func__, size);
        return NULL;
    }

    return ggml_backend_buffer_init(buft, cpu_backend_buffer_i, data, size);
}

GGML_CALL static size_t ggml_backend_cpu_buffer_type_get_alignment(ggml_backend_buffer_type_t buft) {
    return TENSOR_ALIGNMENT;

    GGML_UNUSED(buft);
}

GGML_CALL static bool ggml_backend_cpu_buffer_type_supports_backend(ggml_backend_buffer_type_t buft, ggml_backend_t backend) {
    return ggml_backend_is_cpu(backend);

    GGML_UNUSED(buft);
}

GGML_CALL static bool ggml_backend_cpu_buffer_type_is_host(ggml_backend_buffer_type_t buft) {
    return true;

    GGML_UNUSED(buft);
}

GGML_CALL ggml_backend_buffer_type_t ggml_backend_cpu_buffer_type(void) {
    static struct ggml_backend_buffer_type ggml_backend_cpu_buffer_type = {
        /* .iface = */ {
            /* .get_name         = */ ggml_backend_cpu_buffer_type_get_name,
            /* .alloc_buffer     = */ ggml_backend_cpu_buffer_type_alloc_buffer,
            /* .get_alignment    = */ ggml_backend_cpu_buffer_type_get_alignment,
            /* .get_max_size     = */ NULL, // defaults to SIZE_MAX
            /* .get_alloc_size   = */ NULL, // defaults to ggml_nbytes
            /* .supports_backend = */ ggml_backend_cpu_buffer_type_supports_backend,
            /* .is_host          = */ ggml_backend_cpu_buffer_type_is_host,
        },
        /* .context = */ NULL,
    };

    return &ggml_backend_cpu_buffer_type;
}

slaren (Collaborator) commented Apr 6, 2024

I'm really struggling to understand the relationship between a model, context (seems to have multiple meanings based on which part of the code?), tensors, cgraphs, plans, layers, bufs/bufts and other structs and variables.

When using ggml-backend, as is the case in llama.cpp, ggml_context only holds the tensor "metadata" (i.e. the contents of the struct ggml_tensor), but not the tensor data. The tensor data is allocated from a ggml_backend_buffer. Buffer types (ggml_backend_buffer_type) are mainly used as a way to allocate different types of buffers in a generic way. So you can use the same interface to allocate a buffer for the CPU, or CUDA, or Vulkan, etc backends.

ggml_cplan holds some information about the resources necessary to run a graph on the CPU backend, like a work buffer. A ggml_cgraph is mostly just a list of operations to run to perform some evaluation. Each operation is a ggml_tensor with an operation set in tensor->op. The layers in llama.cpp are each of the transformer blocks that compose a model. In some places the operations that are run before and after the transformer blocks are also called the input and output layers.

bmtwl (Contributor) commented Apr 6, 2024

That's not really feasible because all the threads work on the same tensors, just different slices of it.

Does that mean that my current approach of attempting to ensure memory locality at the buffer/tensor level is in vain? Or that I need to figure out a way to pre-determine which slices of the tensors will be worked on per thread? Is that related to layer or row splitting in any way?
Or should I create another backend type and try to split across NUMA nodes, à la GPU splitting?
Perhaps there's a reasonable way to make my other attempted approach work, mirroring the entire model x times and feeding each thread its own locally allocated model?

slaren (Collaborator) commented Apr 6, 2024

Is it possible to allocate a contiguous amount of memory and assign different slices of it to different NUMA nodes?

bmtwl (Contributor) commented Apr 6, 2024

Is it possible to allocate a contiguous amount of memory and assign different slices of it to different NUMA nodes?

As far as I am aware, no. Allocating on different nodes will result in non-contiguous memory addresses

bmtwl (Contributor) commented Apr 6, 2024

Buffer types (ggml_backend_buffer_type) are mainly used as a way to allocate different types of buffers in a generic way. So you can use the same interface to allocate a buffer for the CPU, or CUDA, or Vulkan, etc backends.

This was actually the mechanism I tried to make use of on my very first attempt (ggml_backend_numa_buffer_type). I found it difficult to get over the initial learning curve of putting together new versions of all the nested objects, structs and function calls needed to even get a toy version of a new backend to compile. Maybe I should attempt this again if it's the most likely to succeed?

slaren (Collaborator) commented Apr 6, 2024

I think you can do it with move_pages. You can try to implement a buffer type for the weights as discussed earlier, and use move_pages to move the rows to different nodes in init_tensor. It is not going to be perfect, because the rows are not always going to start on page boundaries, but it could be a start.
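
A rough sketch of that step (page-granular, so rows that don't start on page boundaries are only approximately placed, as noted):

#define _GNU_SOURCE
#include <numaif.h>   // move_pages(), MPOL_MF_MOVE
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

// Migrate the pages backing [addr, addr + len) to the given NUMA node.
static int move_range_to_node(void *addr, size_t len, int node) {
    long page_size = sysconf(_SC_PAGESIZE);
    uintptr_t start = (uintptr_t)addr & ~((uintptr_t)page_size - 1); // round down to a page
    uintptr_t end   = (uintptr_t)addr + len;
    size_t n_pages  = (end - start + page_size - 1) / page_size;

    void **pages  = malloc(n_pages * sizeof(void *));
    int   *nodes  = malloc(n_pages * sizeof(int));
    int   *status = malloc(n_pages * sizeof(int));
    for (size_t i = 0; i < n_pages; i++) {
        pages[i] = (void *)(start + i * page_size);
        nodes[i] = node;
    }
    int ret = move_pages(0 /* this process */, n_pages, pages, nodes, status, MPOL_MF_MOVE);
    free(pages); free(nodes); free(status);
    return ret;
}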

bmtwl (Contributor) commented Apr 6, 2024

If models are divisible by rows then I could potentially use numa_alloc_onnode (which forces page alignment) to place them on the correct node at time of buffer allocation? Or do you think using move_pages after the fact might be easier due to the single large buffer that is allocated before loading tensors?
In any case, I'll look into init_tensor and see what can be achieved with minimal impact to code complexity.
Thanks for all the help so far!

slaren (Collaborator) commented Apr 6, 2024

The CPU backend expects the tensors to be in a contiguous buffer. The only alternative is to write a new backend and implement a new matrix multiplication algorithm that can take the data in a different way. move_pages should allow you to assign different parts of a contiguous buffer to different NUMA nodes. Frankly, I would be amazed if you figure out an easier way to do it than that.

You can simply add a init_tensor function to the CPU buffer in cpu_backend_buffer_i for testing, currently it is set to NULL since the CPU backend doesn't require any initialization.

bmtwl (Contributor) commented Apr 7, 2024

That's definitely one side of the problem taken care of... the other side is how to marry up the pages on each NUMA node with the threads scheduled on the same node. I didn't see any obvious way to communicate this information in the existing code paths. Would a tensor-to-page map be workable?

slaren (Collaborator) commented Apr 7, 2024

Take a look at the implementation of ggml_compute_forward_mul_mat to figure out which parts of the tensor are used by each thread. You shouldn't worry about other operations for now; this is where most of the time is spent.
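
Roughly, the static split looks like this (simplified; the real logic is in ggml_compute_forward_mul_mat):

// Simplified version of the per-thread row split used for mul_mat:
// thread `ith` of `nth` processes rows [ir0, ir1) of the `nr` output rows.
static void rows_for_thread(int nr, int ith, int nth, int *ir0, int *ir1) {
    const int dr = (nr + nth - 1) / nth; // rows per thread, rounded up
    *ir0 = dr * ith;
    *ir1 = *ir0 + dr > nr ? nr : *ir0 + dr;
}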

penghongbo (Contributor) commented

@bmtwl Two more questions related to the NUMA support. Can we add fine-grained binding in set_numa_thread_affinity & clear_numa_thread_affinity so that each thread binds to one logical CPU only? Or could we pass an argument to explicitly set the binding for each thread?

The current binding binds the threads to nodes (DISTRIBUTE), to the current node (ISOLATE), or to the cpuset numactl gives to llama.cpp (NUMACTL). I found this sometimes causes high CPU usage in ggml_graph_compute_thread. But if I use fine-grained binding, it helps to reduce the time spent in ggml_graph_compute_thread.

cpumaxx (Contributor) commented Apr 12, 2024

@bmtwl Two more questions related to the NUMA support. Can we add fine-grained binding in set_numa_thread_affinity & clear_numa_thread_affinity so that each thread binds to one logical CPU only? Or could we pass an argument to explicitly set the binding for each thread?

The current binding binds the threads to nodes (DISTRIBUTE), to the current node (ISOLATE), or to the cpuset numactl gives to llama.cpp (NUMACTL). I found this sometimes causes high CPU usage in ggml_graph_compute_thread. But if I use fine-grained binding, it helps to reduce the time spent in ggml_graph_compute_thread.

Yes, I think this is a very good idea, as memory locality from the caches all the way up is the key to high performance. I've had a similar scheme in mind for a while now, but I don't see a simple way to add that without other changes farther up the stack, as thread scheduling appears to be ad hoc, assigning work to threads as work appears for threads to do. Maybe @slaren has ideas for a low-impact way to do this in the codebase as it exists today?

I think that there will be an evolution towards moving the higher-level NUMA allocation and scheduling decisions out of ggml.c and into llama.cpp. That context feels like the best place to globally ensure optimal data and execution locality, but I'm still trying to acquire "the knowledge" of this project to the point where I can really understand where and how the changes should be introduced to both perform well and fit into the coding philosophy to be maintainable/extensible long term.

At a high level I think this means that a scheduling pipeline needs to be written to take all these factors into account, really before any inference starts, but that seems like an ambitious goal. I'm going to keep focused on my buffer-to-thread memory locality goal for now.

Sciumo commented Apr 12, 2024

llm.c made me curious about OMP: https://github.com/karpathy/llm.c/blob/6396e393e319f899bb61ba53f8b70c22cf3b038b/Makefile#L17

OpenMP appears to maybe only be enabled for Darwin and Raspberry Pi?
#2164

I see no reference to ggml_mpi_init anywhere.

struct ggml_mpi_context * ggml_mpi_init(void);

Thread affinity in Slurm looks promising, but we're already stumbling on IPC, which is the issue at hand I guess.
https://hpc-tutorials.llnl.gov/openmp/ProcessThreadAffinity.pdf

cpumaxx (Contributor) commented Apr 12, 2024

MPI

I think MPI has been broken for a long time. I got it working off an old branch as one of my first experiments way back when, but the performance wasn't good enough to pursue further.
I'd be interested in combining it with the improved NUMA affinity controls, but there isn't a way to run it on master as far as I know.

Sciumo commented Apr 12, 2024

Yeah, I was really referencing OMP.
I doubt @karpathy easily makes mistakes.
OpenMP appears dead simple:

https://github.com/karpathy/llm.c/blob/6396e393e319f899bb61ba53f8b70c22cf3b038b/Makefile#L34-L35

  • it does use a few OpenMP pragmas because this is a large speedup at very low cost

#pragma omp parallel for collapse(2)

easy to diagnose
https://www.openmp.org/spec-html/5.0/openmpse61.html

I'll try that with llm.c

bmtwl (Contributor) commented May 6, 2024

Update to this thread (I had posted a few updates in related PR #6915)
@slaren after further testing I found I could get the results I wanted with a shorter load time by using mbind instead of move_pages.
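
Roughly what that looks like (a simplified sketch, not the exact code in my branch):

#include <numa.h>     // numa_allocate_nodemask(), numa_bitmask_setbit(), numa_free_nodemask()
#include <numaif.h>   // mbind(), MPOL_BIND, MPOL_MF_MOVE

// Bind [addr, addr + len) to `node`; addr and len should be page-aligned.
// MPOL_MF_MOVE also migrates pages that were already faulted in elsewhere.
static int bind_range_to_node(void *addr, size_t len, int node) {
    struct bitmask *mask = numa_allocate_nodemask();
    numa_bitmask_setbit(mask, node);
    int ret = mbind(addr, len, MPOL_BIND, mask->maskp, mask->size + 1, MPOL_MF_MOVE);
    numa_free_nodemask(mask);
    return ret;
}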

Even without changing any CPU affinity code there is a huge shift in performance characteristics. I get a lot better efficiency with a low number of threads (40% speedup over master at 16 threads), probably because I'm better able to saturate the interconnects evenly. Ramping up the number of threads doesn't improve performance to match master with mmap, though.
If I'm now able to modify the code to schedule each thread to be local to the part of the tensor it's going to work on, I think I'll start to see some real performance improvements from this branch.
I'm going to dig into debugging this over the next few days, but any hints you could give would be appreciated.
Specifically: what is the relationship between thread "ith" (or anything else available locally in ggml_graph_compute_thread) and the part of the tensor buffer that is worked on?

My branch with the in-progress changes is at https://github.com/bmtwl/llama.cpp/tree/numamovepages, but it is very messy and specific to my setup for now.

slaren (Collaborator) commented May 6, 2024

Specifically: what is the relationship between thread "ith" (or anything else available locally in ggml_graph_compute_thread) and the part of the tensor buffer that is worked on?

After #6915 is merged, there won't be a fixed relationship, it will be dynamic. You will need to write a different scheduling mechanism.

kunnis (Contributor) commented May 6, 2024

ith = 0 is the main thread; it does most of the work for everything... except matrix multiplies. Most of the code is effectively single-threaded because it checks if ith != 0 and returns. But very little time is spent in those operations.

During matrix multiplies, ith lets you figure out which section of the data the thread is supposed to work on. Before #6915, there's a formula that maps each ith to a specific section of the output matrix; the thread iterates through that section and writes the data. After #6915, the threads basically check out an index of work to do. It would need to use ith to figure out which NUMA node it needs to work from. current_chunk (the shared counter for the work queue) could be made into an array sized to the maximum number of NUMA nodes. Each NUMA node could have its own queue, based on whatever data is local to it.
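
Something like this (illustrative names, not the actual ggml variables):

#include <stdatomic.h>

#define MAX_NUMA_NODES 8

// One "next chunk" counter per NUMA node instead of a single global current_chunk.
static atomic_int next_chunk[MAX_NUMA_NODES];

// A worker pinned to `node` only checks out chunks whose data lives on that node;
// it returns -1 once its node-local queue is drained.
static int checkout_chunk(int node, int n_chunks_on_node) {
    int chunk = atomic_fetch_add_explicit(&next_chunk[node], 1, memory_order_relaxed);
    return chunk < n_chunks_on_node ? chunk : -1;
}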

github-actions bot added the stale label Jun 6, 2024

github-actions bot commented

This issue was closed because it has been inactive for 14 days since being marked as stale.
