bge-reranker is extremely slow #24
Comments
This is strange. You can check your CPU utilization, and try
I wrote a simple test to reproduce the issue: #25. P.S. I accidentally created a PR, which has been closed; please just ignore the related notifications.
I tested with
I have tested this data, using "hello" as the question. With Q8 quantization, it took less than 2 sec on an 8-core 7735. Let me assume the model file is saved on an SSD; the result (>6 sec) looks impossible.
Could you share the SHA256 of the checkpoint you are using, so that I can check whether my conversion is valid? The bin file was converted using your
You can find some quantized models (BGE-Reranker included) here: https://modelscope.cn/models/judd2024/chatllm_quantized_models/files I have tested both Q8 and Q4_1. This model is very small, so throughput should be much higher.
Well, I downloaded the models you mentioned and re-ran the tests. With the Q4 variants of the BGE models, latency is indeed around 2 s; I was testing with Q8 models. https://github.com/RobinQu/chatllm.cpp/blob/perf/test.cpp
BTW, it seems that more than 12 threads doesn't help much in terms of latency. I would try a multi-instance setup for higher throughput in production. Any other advice?
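For reference, a minimal sketch of the kind of timing harness the linked test.cpp uses; `rerank_score` here is a hypothetical placeholder for the library's actual scoring call (the stub body only exists so the sketch compiles), and only the measurement pattern is the point:

```cpp
#include <chrono>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stand-in for the library's (query, document) scoring call;
// replace the stub body with the real API.
static double rerank_score(const std::string &query, const std::string &doc) {
    (void)query; (void)doc;
    return 0.0;
}

int main() {
    const std::string query = "hello";
    // Dummy documents standing in for real ~686-token inputs.
    const std::vector<std::string> docs(8, std::string(2000, 'x'));

    rerank_score(query, docs[0]); // warm-up so one-time setup cost is not measured

    const auto t0 = std::chrono::steady_clock::now();
    for (const auto &doc : docs) rerank_score(query, doc);
    const auto t1 = std::chrono::steady_clock::now();

    const double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::printf("%zu pairs in %.1f ms (%.2f ms/pair)\n",
                docs.size(), ms, ms / docs.size());
    return 0;
}
```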
On a 96-core machine, the data shows that throughput saturates at just 48 threads, so RAM bandwidth is the bottleneck now. A simple calculation: assuming the model file is 2 GB, 400 tokens per second requires RAM throughput > 800 GB/s.
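The back-of-envelope bound in code (a sketch; the 2 GB model size is the assumption stated above):

```cpp
#include <cstdio>

int main() {
    // Assumption from above: token generation streams the full weight file
    // through RAM once per generated token.
    const double model_size_gb = 2.0;   // assumed model file size
    const double tokens_per_sec = 400;  // target decode rate
    std::printf("required RAM bandwidth: %.0f GB/s\n",
                model_size_gb * tokens_per_sec); // prints 800 GB/s
    return 0;
}
```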
Oh, that calculation applies to token generation, but not to batch prompt evaluation.
The EPYC 9004 series claims 460 GB/s of memory bandwidth in a single-socket configuration. But the benchmarks show that inference doesn't benefit much from more than 48 threads, or from multiple instances. So I think you are right about the RAM-bandwidth bottleneck. Maybe an optimization like FlashAttention should be considered, but I am not sure whether it performs well on CPU.
With 686 tokens, a single run takes more than 6 seconds on a 96-core machine.
Here is the profiling data for the compute graph.
bge-reranker-dump.txt
Any advice for better performance?