CPU inferencing a lot slower than llama.cpp #10

Open
netspym opened this issue Apr 2, 2024 · 2 comments

netspym commented Apr 2, 2024

Hi Foldl:

I found that this project runs Yi-34B-Chat Q4 a lot slower than the latest llama.cpp. Is that because it is not optimized for CPUs?

For example, does it lack AVX, AVX2, and AVX-512 support on x86 architectures, or the 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization that llama.cpp offers for faster inference and reduced memory use?
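
For reference, here is a quick sketch (just my own check, not code from either project) that uses the GCC/Clang builtin `__builtin_cpu_supports` to confirm which AVX levels the CPU reports at runtime; whether each binary was actually compiled with those instructions enabled still has to be checked in the build flags.

```cpp
// Quick runtime check of which AVX levels the CPU exposes, using the
// GCC/Clang builtin __builtin_cpu_supports. This only shows what the
// hardware supports; the compile flags decide whether ggml actually uses it.
#include <cstdio>

int main() {
    __builtin_cpu_init();
    printf("AVX:     %s\n", __builtin_cpu_supports("avx")     ? "yes" : "no");
    printf("AVX2:    %s\n", __builtin_cpu_supports("avx2")    ? "yes" : "no");
    printf("AVX512F: %s\n", __builtin_cpu_supports("avx512f") ? "yes" : "no");
    return 0;
}
```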

Thanks
Yuming

foldl (Owner) commented Apr 2, 2024

This project is based on ggml, the same as llama.cpp, so CPU performance of the two implementations should be similar.

Maybe you are using a GPU with llama.cpp?

netspym commented Apr 4, 2024

Hi Foldl:

I'm using an AMD EPYC 9654 with 96 cores, and there is no GPU in the system.
The inference speed of this project is only 20-30% of llama.cpp or ollama.
I guess the llama.cpp project has received more optimization over the past 6 months?
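
One other thing I can check on my side (just a guess, not something from either project) is whether both tools are launched with a comparable number of threads, since their defaults may differ. A minimal sketch to print how many hardware threads this machine exposes:

```cpp
// Minimal sketch: print the number of hardware threads the OS reports,
// so the thread settings used for this project and for llama.cpp/ollama
// can be compared against it (a 96-core EPYC 9654 reports 192 with SMT).
#include <iostream>
#include <thread>

int main() {
    std::cout << "hardware threads: " << std::thread::hardware_concurrency() << "\n";
    return 0;
}
```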

My WeChat ID is: 719784

Warm Regards
Yuming
