bge-reranker is extremely slow #24

Open
RobinQu opened this issue Jun 20, 2024 · 10 comments


RobinQu commented Jun 20, 2024

With 686 tokens, a single run takes more than 6 seconds on a 96-core machine.

Here is the profiling data for the compute graph:
bge-reranker-dump.txt

Any advice for better performance?

RobinQu changed the title from "bge-reranker is considerably slow" to "bge-reranker is extremely slow" on Jun 20, 2024
foldl (Owner) commented Jun 20, 2024

This is strange. You can check your CPU utilization, and try -n 96.

RobinQu (Author) commented Jun 20, 2024

> This is strange. You can check your CPU utilization, and try -n 96.

I wrote a simple test to reproduce the issue: #25

PS: I accidentally created a PR, which has since been closed. Please ignore the related notifications.

RobinQu (Author) commented Jun 20, 2024

I tested with num_thread=1 and num_thread=96. The single-thread setup is slower than the 96-thread setup. Within a loop of 100 iterations, all cores are fully utilized, so I believe the work is scheduled correctly.

foldl (Owner) commented Jun 21, 2024

I have tested with this data, using "hello" as the question. With Q8 quantization, it took less than 2 seconds on an 8-core 7735.

https://raw.githubusercontent.com/huggingface/hf-endpointsdocumentation/main/docs/source/guides/create_endpoint.mdx

Assuming the model file is saved on an SSD, the result (>6 sec) looks impossible.

RobinQu (Author) commented Jun 21, 2024

> I have tested with this data, using "hello" as the question. With Q8 quantization, it took less than 2 seconds on an 8-core 7735.
>
> https://raw.githubusercontent.com/huggingface/hf-endpointsdocumentation/main/docs/source/guides/create_endpoint.mdx
>
> Assuming the model file is saved on an SSD, the result (>6 sec) looks impossible.

Could you share the SHA256 of the checkpoint you are using, so that I can check whether my conversion is valid?

The bin file was converted using your convert.py script, and its SHA256 digest is b3e05dbe06c0aa52fd974d9c9dedbc51292b81f2f285d56113c060a0931a7f0f.

foldl (Owner) commented Jun 21, 2024

You can find some quantized models (BGE-Reranker included) here:

https://modelscope.cn/models/judd2024/chatllm_quantized_models/files

I have tested both Q8 and Q4_1. This model is very small, so throughput should be much higher.

RobinQu (Author) commented Jun 23, 2024

> You can find some quantized models (BGE-Reranker included) here:
>
> https://modelscope.cn/models/judd2024/chatllm_quantized_models/files
>
> I have tested both Q8 and Q4_1. This model is very small, so throughput should be much higher.

Well, I downloaded the models you mentioned and re-ran the tests.

With the Q4 variants of the BGE models, latency is indeed around 2 s. I was testing with Q8 models before.

https://github.com/RobinQu/chatllm.cpp/blob/perf/test.cpp

qa_rank: num_threads=192, elapsed=1785
qa_rank: num_threads=96, elapsed=758
qa_rank: num_threads=48, elapsed=762
qa_rank: num_threads=24, elapsed=1013
qa_rank: num_threads=12, elapsed=1545
qa_rank: num_threads=6, elapsed=2439
qa_rank: num_threads=3, elapsed=4390
qa_rank: num_threads=1, elapsed=11932
text_embedding: num_threads=192, elapsed=3444
text_embedding: num_threads=96, elapsed=745
text_embedding: num_threads=48, elapsed=754
text_embedding: num_threads=24, elapsed=1017
text_embedding: num_threads=12, elapsed=1546
text_embedding: num_threads=6, elapsed=2432
text_embedding: num_threads=3, elapsed=4374
text_embedding: num_threads=1, elapsed=11878
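
For reference, the sweep above comes from a loop roughly like the following minimal sketch; the inner call is only a placeholder comment standing in for the actual chatllm.cpp qa_rank invocation in test.cpp, not its real API.

```cpp
#include <chrono>
#include <cstdio>

int main() {
    // Thread counts swept in the benchmark above.
    for (int n : {192, 96, 48, 24, 12, 6, 3, 1}) {
        const auto t0 = std::chrono::steady_clock::now();
        // Placeholder for the real call: re-create the chatllm.cpp pipeline with
        // num_threads = n, then run qa_rank(question, document) once.
        const auto t1 = std::chrono::steady_clock::now();
        const long ms =
            std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        std::printf("qa_rank: num_threads=%d, elapsed=%ld\n", n, ms);
    }
    return 0;
}
```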

BTW, it seems that using more than 12 threads doesn't help much in terms of latency. I would try a multi-instance setup for higher throughput in production.

Any other advice?

foldl (Owner) commented Jun 23, 2024

On a 96-core machine, the data shows that throughput saturates at just 48 threads, so RAM bandwidth is the bottleneck now. A simple calculation: assuming the model file is 2 GB, 400 tokens per second requires RAM bandwidth > 800 GB/s.
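
A minimal sketch of that back-of-the-envelope estimate; the 2 GB model size and 400 tokens/s are just the figures assumed above.

```cpp
#include <cstdio>

// During token generation, every generated token re-reads the whole set of
// weights, so required bandwidth ~= model_size * tokens_per_second.
int main() {
    const double model_size_gb  = 2.0;   // assumed quantized model size
    const double tokens_per_sec = 400.0; // assumed generation rate
    const double required_gb_per_s = model_size_gb * tokens_per_sec;
    std::printf("required RAM bandwidth ~ %.0f GB/s\n", required_gb_per_s); // 800 GB/s
    return 0;
}
```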

foldl (Owner) commented Jun 24, 2024

Oh, such a calculation applies to token generation, but not to batched prompt evaluation, where the weights are read once and reused across all tokens in the batch.

RobinQu (Author) commented Jun 24, 2024

The EPYC 9004 series claims 460 GB/s of memory bandwidth in a single-socket configuration. But the benchmarks show that inference doesn't benefit much from more than 48 threads, or from multiple instances. So I think you are right about RAM bandwidth being the bottleneck.

Maybe optimizations like FlashAttention should be considered, but I am not sure whether it performs well on CPU.
