Enable NUMA feature for llama_cpp_python #4040
Conversation
What is NUMA and what does it do?
@oobabooga NUMA (Non-Uniform Memory Access) should speed up inference on systems with multiple CPUs, or with more complex CPUs where memory speed differs depending on which core is used. Since memory speed is very important for LLMs, I hope this can bring a 10-20% speedup on servers. Here is the PR implementing NUMA in llama.cpp: ggerganov/llama.cpp#1556. Here is an extract of the llama.cpp help for NUMA support
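(For context, a minimal sketch of what that flag corresponds to on the Python side, assuming the installed llama-cpp-python build mirrors the C signature `llama_backend_init(bool numa)` that the llama.cpp PR introduced; the call below is illustrative and version-dependent.)

```python
# Sketch only: assumes a llama-cpp-python build whose low-level binding mirrors
# the C API `void llama_backend_init(bool numa)` added by ggerganov/llama.cpp#1556.
import llama_cpp

# Request ggml's NUMA-aware optimizations when the backend is initialized.
llama_cpp.llama_backend_init(True)
```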
For me, in the simplest terms, it meant that the whole model didn't end up loaded into one CPU's RAM.
Could you add a checkbox to the UI for this parameter? Just look for all occurrences of "mlock" under modules/*.py and add similar entries for numa.
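(As a rough illustration of "add similar entries for numa", the sketch below mirrors a typical mlock wiring with an argparse flag plus a Gradio checkbox; the file paths in the comments and the exact identifiers are assumptions, not the repository's actual code.)

```python
# Hypothetical sketch of mirroring the existing mlock plumbing for a numa flag.
import argparse
import gradio as gr

# modules/shared.py (assumed): a command-line flag right next to --mlock.
parser = argparse.ArgumentParser()
parser.add_argument('--mlock', action='store_true', help='Force the system to keep the model in RAM.')
parser.add_argument('--numa', action='store_true', help='Enable NUMA optimizations in llama.cpp.')
args = parser.parse_args([])  # parse nothing here so the sketch runs standalone

# modules/ui.py (assumed): a checkbox bound to the same setting, like the mlock one.
numa_checkbox = gr.Checkbox(value=args.numa, label='numa')
```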
using mlock as an example
@oobabooga Added the UI elements using mlock as an example. I have not tested it, as I have always run the .zip packages on Windows and Linux; I have yet to try running it from source. I guess I'd better check out the Docker way.
I'm not sure why you added it as a second init. I think the llama.cpp Python package can init itself; numa is just another param since the update. It will warn you if you didn't disable NUMA balancing. According to the commit, only low-level API users must enable it explicitly like you did: abetlen/llama-cpp-python@f4090a0. I did it like this: Ph0rk0z@bab1491
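(A hedged sketch of the distinction being drawn here, assuming the installed llama-cpp-python version accepts a `numa` keyword on `Llama` (which the linked commit is understood to add) and exposes the low-level `llama_backend_init` binding; the model path is a placeholder.)

```python
# Sketch only: the numa kwarg and the low-level binding are version-dependent
# assumptions; "./model.gguf" is a placeholder path.
import llama_cpp
from llama_cpp import Llama

# Option A, high-level API: the package initializes the backend itself, so numa
# is just another constructor parameter, like n_batch or use_mlock.
llm = Llama(model_path="./model.gguf", numa=True)

# Option B, low-level API: callers that bypass the Llama class must initialize
# the backend themselves and request NUMA optimizations at that point.
llama_cpp.llama_backend_init(True)
```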
@Ph0rk0z Hey, I don't know what I am doing :) (I'm a TypeScript dev.) I thought it had to be set in the initializer; I guess I misunderstood the parameter description. So what do we do? Did you manage to test it? Was everything OK? If so, then maybe it is better to merge your code 👍 Or should I change my code?
I tested mine a while ago, but I thought nobody would want NUMA or GPU selection here. It works fine as another loading parameter. If you re-initialize the backend, I don't know what happens; maybe it works, maybe it does something weird. Most llama loader code uses the high-level Python API, which already does the init you called inside llama.py.
This numa param is the same as n_batch or anything else we already set here when loading. When you run it without NUMA balancing disabled, it will print a warning message so you know it's enabled.
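(For reference, a small hedged check of the Linux setting that warning refers to; `/proc/sys/kernel/numa_balancing` is the standard sysctl path, but the exact message llama.cpp prints is not reproduced here.)

```python
# Check whether automatic NUMA balancing is enabled; llama.cpp warns about it
# because the kernel's automatic page migration can work against manual NUMA placement.
from pathlib import Path

balancing = Path("/proc/sys/kernel/numa_balancing")
if balancing.exists() and balancing.read_text().strip() != "0":
    # Disabling it requires root, e.g.: sysctl -w kernel.numa_balancing=0
    print("kernel.numa_balancing is enabled; the NUMA optimizations may be less effective.")
```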
I have tested it on Windows; the UI works, and hopefully the NUMA functionality does as well.
Thanks for the PR @stoianchoo
This is a POC implementation of this feature. Hopefully it is correct; otherwise, it is at least here to get things started. Also, I'm not sure whether the feature works, because I don't know what to expect when it does.
Trying to fix #3444