CPU inference a lot slower than llama.cpp #10
Comments
This project is based on ggml. Maybe you are using GPU in llama.cpp?
Hi Foldl: I'm using an AMD EPYC 9654 with 96 cores; I don't have a GPU in the system. My WeChat ID is 719784. Warm Regards
Hi Foldl:
I found that this project runs Yi-34b-chat Q4 a lot slower than the latest llama.cpp. Is that because it is not optimized for CPUs?
For example, is it missing AVX, AVX2, and AVX512 support for x86 architectures, or the
1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use? (A quick compile-time check for the SIMD part is sketched below.)
Thanks
Yuming
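
One way to narrow down the SIMD question is to check which x86 extensions the compiler actually enables for the flags a build uses. Below is a minimal, generic C++ sketch (not part of chatllm.cpp or llama.cpp; the file name and build command are just examples) that reports this via the compiler's predefined macros:

```cpp
// simd_check.cpp — minimal sketch: report which x86 SIMD extensions the
// compiler enabled for this translation unit. Generic check, not tied to
// chatllm.cpp or llama.cpp. Build it with the same flags as the project, e.g.:
//   g++ -O2 -march=native simd_check.cpp -o simd_check && ./simd_check
#include <cstdio>

int main() {
#if defined(__AVX512F__)
    std::puts("AVX-512F: enabled");
#else
    std::puts("AVX-512F: not enabled");
#endif
#if defined(__AVX2__)
    std::puts("AVX2:     enabled");
#else
    std::puts("AVX2:     not enabled");
#endif
#if defined(__AVX__)
    std::puts("AVX:      enabled");
#else
    std::puts("AVX:      not enabled");
#endif
    return 0;
}
```

For comparison, llama.cpp prints a `system_info` line at startup listing the SIMD features its build enables (AVX, AVX2, AVX512, FMA, etc.), which makes it easier to confirm that both builds use the same instruction sets before comparing token throughput.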