CPU inferencing a lot slower than llama.cpp #10

Open
netspym opened this issue Apr 2, 2024 · 2 comments

netspym commented Apr 2, 2024

Hi Foldl:

I found that this project runs Yi-34B-Chat Q4 a lot slower than the latest llama.cpp. Is that because it is not optimized for CPUs?

For example, does it lack AVX, AVX2, and AVX-512 support on x86 architectures, or the 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization that llama.cpp offers for faster inference and reduced memory use?
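
For reference, here is a quick sketch (just my own check, not code from either project) that uses the GCC/Clang builtin `__builtin_cpu_supports` to confirm which AVX levels the CPU reports at runtime; whether each binary was actually compiled with those instructions enabled still has to be checked in the build flags.

```cpp
// Quick runtime check of which AVX levels the CPU exposes, using the
// GCC/Clang builtin __builtin_cpu_supports. This only shows what the
// hardware supports; the compile flags decide whether ggml actually uses it.
#include <cstdio>

int main() {
    __builtin_cpu_init();
    printf("AVX:     %s\n", __builtin_cpu_supports("avx")     ? "yes" : "no");
    printf("AVX2:    %s\n", __builtin_cpu_supports("avx2")    ? "yes" : "no");
    printf("AVX512F: %s\n", __builtin_cpu_supports("avx512f") ? "yes" : "no");
    return 0;
}
```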

Thanks
Yuming

foldl (Owner) commented Apr 2, 2024

This project is based on ggml, the same as llama.cpp, so CPU performance of the two implementations should be similar.

Maybe you are using a GPU with llama.cpp?

netspym commented Apr 4, 2024

Hi Foldl:

I'm using an AMD EPYC 9654 with 96 cores, and there is no GPU in the system.
The inference speed of this project is only 20-30% of llama.cpp or ollama.
I guess the llama.cpp project has received more optimization over the past 6 months?
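
One other thing I can check on my side (just a guess, not something from either project) is whether both tools are launched with a comparable number of threads, since their defaults may differ. A minimal sketch to print how many hardware threads this machine exposes:

```cpp
// Minimal sketch: print the number of hardware threads the OS reports,
// so the thread settings used for this project and for llama.cpp/ollama
// can be compared against it (a 96-core EPYC 9654 reports 192 with SMT).
#include <iostream>
#include <thread>

int main() {
    std::cout << "hardware threads: " << std::thread::hardware_concurrency() << "\n";
    return 0;
}
```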

My WeChat ID is: 719784

Warm Regards
Yuming
