Enable NUMA feature for llama_cpp_python #4040

Merged (9 commits) on Sep 27, 2023

Conversation

@StoyanStAtanasov (Contributor) commented Sep 23, 2023

This is a POC implementation of this feature. Hopefully it is correct; if not, it is at least here to get things started. I'm also not sure whether the feature works, because I don't know what to expect when it does.

Trying to fix #3444

@StoyanStAtanasov changed the title from "Enable NUMA feature fro llama_cpp_python" to "Enable NUMA feature for llama_cpp_python" on Sep 23, 2023
@oobabooga (Owner)

What is NUMA and what does it do?

@StoyanStAtanasov (Contributor, Author)

@oobabooga NUMA (non-uniform memory access) should speed up inference on systems with multiple CPUs, or more complex CPUs, where memory speed differs depending on which core is accessing it. Since memory speed is very important for LLMs, I hope this can bring up to a 10-20% speedup on servers.

Here is the PR that implemented NUMA support in llama.cpp: ggerganov/llama.cpp#1556

Here is an extract from the llama.cpp help:

NUMA support

  • --numa: Attempt optimizations that help on some systems with non-uniform memory access. This currently consists of pinning an equal proportion of the threads to the cores on each NUMA node, and disabling prefetch and readahead for mmap. The latter causes mapped pages to be faulted in on first access instead of all at once, and in combination with pinning threads to NUMA nodes, more of the pages end up on the NUMA node where they are used. Note that if the model is already in the system page cache, for example because of a previous run without this option, this will have little effect unless you drop the page cache first. This can be done by rebooting the system or on Linux by writing '3' to '/proc/sys/vm/drop_caches' as root.
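
As an illustration of the cache-dropping step mentioned above, here is a minimal sketch of my own (not from llama.cpp), assuming Linux and root privileges:

    # Sketch only: drop the Linux page cache so a fresh run with --numa is not
    # skewed by pages already cached from a previous run. Requires root; Linux-only.
    import subprocess

    def drop_page_cache() -> None:
        subprocess.run(["sync"], check=True)  # flush dirty pages first
        with open("/proc/sys/vm/drop_caches", "w") as f:
            f.write("3\n")  # 3 = drop page cache, dentries and inodes

    if __name__ == "__main__":
        drop_page_cache()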

@Ph0rk0z (Contributor) commented Sep 23, 2023

For me, in the simplest terms, it meant that the whole model didn't end up loaded into one CPU's RAM.

@oobabooga (Owner)

Could you add a checkbox to the UI for this parameter? Just look for all occurrences of "mlock" under modules/*.py and add similar entries for numa.
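
(For illustration only, a rough, self-contained sketch of the kind of plumbing this implies: a numa flag exposed alongside mlock and forwarded to the loader parameters. The names below are assumptions for the example, not the repository's actual modules/*.py code.)

    # Hypothetical sketch: expose a "numa" flag next to "mlock" and pass both
    # through to the llama.cpp loader parameters. Not the real webui code.
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument('--mlock', action='store_true',
                        help='Force the system to keep the model in RAM.')
    parser.add_argument('--numa', action='store_true',
                        help='Attempt NUMA optimizations in llama.cpp.')
    args = parser.parse_args([])  # empty argv just for the example

    # The model loader would then forward the flags much like mlock is today.
    loader_params = {'use_mlock': args.mlock, 'numa': args.numa}
    print(loader_params)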

@StoyanStAtanasov (Contributor, Author) commented Sep 24, 2023 via email

using mlock as an example

@StoyanStAtanasov (Contributor, Author) commented Sep 24, 2023

@oobabooga Added the UI elements, using mlock as an example. I have not tested it, as I have always run the .zip packages on Windows and Linux; I have yet to try running it from source. I guess I'd better check out the Docker way.

@Ph0rk0z (Contributor) commented Sep 24, 2023

I'm not sure why you added it as a second init. I think the llama.cpp Python package can init itself; numa is just another param since the update. It will warn you if you haven't disabled kernel NUMA balancing (echo 0 > /proc/sys/kernel/numa_balancing), and that's how you know it's working.
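
(As a minimal sketch of that check, assuming Linux, one could verify the kernel setting before loading the model:)

    # Sketch: warn if automatic NUMA balancing is still enabled (Linux-only).
    # Disabling it requires root: echo 0 > /proc/sys/kernel/numa_balancing
    from pathlib import Path

    balancing = Path("/proc/sys/kernel/numa_balancing")
    if balancing.exists() and balancing.read_text().strip() != "0":
        print("Warning: kernel NUMA balancing is enabled; "
              "llama.cpp NUMA pinning may be less effective.")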

According to the commit, only the low-level API has to enable it the way you did: abetlen/llama-cpp-python@f4090a0

I did it like this: Ph0rk0z@bab1491

@StoyanStAtanasov (Contributor, Author)

@Ph0rk0z Hey, I don't know what I am doing :) (I'm a TypeScript dev.)

I thought it had to be set in the initializer. I guess I misunderstood the parameter description:
numa: Enable NUMA support. (NOTE: The initial value of this parameter is used for the remainder of the program as this value is set in llama_backend_init)

So what do we do? Did you manage to test it? Was everything OK? If so, then maybe it is better to merge your code 👍 Or should I change my code?

@Ph0rk0z (Contributor) commented Sep 24, 2023

I tested mine a while ago, but I thought nobody here would want NUMA or GPU selection. It works fine as another loading parameter. If you re-initialize the backend, I don't know what happens; maybe it works, maybe it does something weird.

Most llama loader code uses the high-level Python API, which already performs the init you called, inside llama.py:

    if not Llama.__backend_initialized:
        # llama_backend_init() runs only once per process; the numa value passed
        # here is what the backend keeps for the rest of the program.
        if self.verbose:
            llama_cpp.llama_backend_init(numa)
        else:
            # Suppress llama.cpp's stdout/stderr chatter when not verbose.
            with suppress_stdout_stderr():
                llama_cpp.llama_backend_init(numa)
        Llama.__backend_initialized = True

This numa param is the same as n_batch or anything else we already set here when loading. When you run it without NUMA balancing disabled, it will print a warning message, so you know it's enabled.
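
(For reference, a minimal usage sketch of passing numa through the high-level API; the model path is a placeholder, and the keyword reflects llama-cpp-python as of this discussion:)

    # Minimal sketch: numa is passed like any other loading parameter; the
    # library calls llama_backend_init(numa) itself on the first Llama().
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/your-model.gguf",  # placeholder path
        n_batch=512,
        use_mlock=False,
        numa=True,
    )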

@StoyanStAtanasov (Contributor, Author)

I have tested it on Windows; the UI works, and hopefully the NUMA functionality does as well.

@oobabooga merged commit 7e6ff8d into oobabooga:main on Sep 27, 2023
@oobabooga (Owner)

Thanks for the PR @stoianchoo

Merging this pull request may close: Add numa support when using llama.cpp