bge-reranker is extremely slow #24
Comments
This is strange. You can check your CPU utilization, and try
I wrote a simple test to reproduce the issue: #25. P.S. I accidentally created a PR, which has been closed; please just ignore the related notifications.
I tested with
I have tested this data, using "hello" as the question. With Q8 quantization, it took less than 2 sec on an 8-core 7735. Let me assume the model file is saved on an SSD; the result (>6 sec) looks impossible.
Could you share the SHA256 of the checkpoint you are using, so that I can check whether my conversion is valid? The bin file was converted using your
You can find some quantized models (BGE-Reranker included) here: https://modelscope.cn/models/judd2024/chatllm_quantized_models/files I have tested both Q8 and Q4_1. This model is very small, so throughput should be much higher.
Well, I downloaded the models you mentioned and re-ran the tests. With the Q4 variants of the BGE models, latency is indeed around 2 s; I was testing with Q8 models. https://github.com/RobinQu/chatllm.cpp/blob/perf/test.cpp
BTW, it seems that more than 12 threads doesn't help much in terms of latency. I would try a multi-instance setup for higher throughput in production. Any other advice?
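For reference, a minimal sketch of the kind of timing harness the linked test.cpp uses; `rerank_score` here is a hypothetical placeholder for the library's actual scoring call (the stub body only exists so the sketch compiles), and only the measurement pattern is the point:

```cpp
#include <chrono>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stand-in for the library's (query, document) scoring call;
// replace the stub body with the real API.
static double rerank_score(const std::string &query, const std::string &doc) {
    (void)query; (void)doc;
    return 0.0;
}

int main() {
    const std::string query = "hello";
    // Dummy documents standing in for real ~686-token inputs.
    const std::vector<std::string> docs(8, std::string(2000, 'x'));

    rerank_score(query, docs[0]); // warm-up so one-time setup cost is not measured

    const auto t0 = std::chrono::steady_clock::now();
    for (const auto &doc : docs) rerank_score(query, doc);
    const auto t1 = std::chrono::steady_clock::now();

    const double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::printf("%zu pairs in %.1f ms (%.2f ms/pair)\n",
                docs.size(), ms, ms / docs.size());
    return 0;
}
```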
On a 96-core machine, the data shows that throughput saturates at just 48 threads, so RAM bandwidth is the bottleneck now. A simple calculation: assuming the model file is 2 GB, 400 tokens per second requires RAM throughput > 800 GB/s.
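The back-of-envelope bound in code (a sketch; the 2 GB model size is the assumption stated above):

```cpp
#include <cstdio>

int main() {
    // Assumption from above: token generation streams the full weight file
    // through RAM once per generated token.
    const double model_size_gb = 2.0;   // assumed model file size
    const double tokens_per_sec = 400;  // target decode rate
    std::printf("required RAM bandwidth: %.0f GB/s\n",
                model_size_gb * tokens_per_sec); // prints 800 GB/s
    return 0;
}
```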
Oh, that calculation applies to token generation, but not to batch prompt evaluation.
The EPYC 9004 series claims 460 GB/s of memory bandwidth in a single-socket configuration. But the benchmarks show that inference doesn't benefit much from more than 48 threads, or from multiple instances. So I think you are right about the RAM-bandwidth bottleneck. Maybe an optimization like FlashAttention should be considered, but I am not sure whether it performs well on CPU.
With 686 tokens, a single run takes more than 6 seconds on a 96-core machine.
Here is the profiling data for the compute graph.
bge-reranker-dump.txt
Any advice for better performance?