
Low performance with Sycl Backend #5480

Closed
chsasank opened this issue Feb 13, 2024 · 4 comments

@chsasank

I am working on ollama/ollama#2458 and ran some benchmarks to test the performance. I compiled with commit 3bdc4cd0; the build segfaults on master, as in #5469.

I benchmarked Mistral 7B int4 on an M2 Air, an Intel 12400, and an Arc 770 16GB. I used llama-bench with the Mistral 7B model from here to measure prompt processing and text generation tok/s. My llama-bench command is:

./build/bin/llama-bench -m models/mistral-7b-v0.1.Q4_0.gguf -p 128,256,512 -n 128,256,512

On M2 Air

| model         |     size | params | backend | ngl | test   |           t/s |
| ------------- | -------: | -----: | ------- | --: | ------ | ------------: |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal   |  99 | pp 128 | 144.47 ± 0.22 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal   |  99 | pp 256 | 142.95 ± 1.17 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal   |  99 | pp 512 | 141.36 ± 0.67 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal   |  99 | tg 128 |  20.06 ± 0.66 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal   |  99 | tg 256 |  20.26 ± 0.17 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | Metal   |  99 | tg 512 |  13.96 ± 1.62 |

On Intel 12400 (compiled with SYCL, but with num-gpu-layers (ngl) set to 0, so it runs on the CPU):

| model         |     size | params | backend | ngl | test   |          t/s |
| ------------- | -------: | -----: | ------- | --: | ------ | -----------: |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | SYCL    |   0 | pp 128 | 18.60 ± 3.07 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | SYCL    |   0 | pp 256 | 20.82 ± 0.14 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | SYCL    |   0 | pp 512 | 22.48 ± 0.16 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | SYCL    |   0 | tg 128 | 10.78 ± 0.02 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | SYCL    |   0 | tg 256 | 10.76 ± 0.02 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | SYCL    |   0 | tg 512 | 10.69 ± 0.01 |

On Arc 770

| model         |     size | params | backend | ngl | test   |            t/s |
| ------------- | -------: | -----: | ------- | --: | ------ | -------------: |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | SYCL    |  99 | pp 128 | 407.14 ± 58.05 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | SYCL    |  99 | pp 256 | 583.57 ± 78.24 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | SYCL    |  99 | pp 512 |  757.99 ± 1.48 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | SYCL    |  99 | tg 128 |   24.74 ± 0.27 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | SYCL    |  99 | tg 256 |   24.65 ± 0.20 |
| llama 7B Q4_0 | 3.83 GiB | 7.24 B | SYCL    |  99 | tg 512 |   21.46 ± 2.39 |

The good news is that prompt processing speed is reasonably high. The bad news is that text generation speed on Arc GPUs is very low.

This is much slower than I expected, because the Arc 770 has significantly higher FLOPs and bandwidth than both the M2 and the 12400. You can see my benchmarks of FLOPs and bandwidth here: https://github.com/chsasank/device-benchmarks
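To make the comparison concrete, a simple memory-bandwidth roofline is a useful sanity check: each generated token has to stream the full weight file from memory, so tg tok/s is bounded by bandwidth / model size. The sketch below uses approximate spec-sheet bandwidth numbers (my assumptions, not measured values), and the model size from the tables above:

```python
# Rough memory-bandwidth roofline for token generation (tg):
# tok/s <= memory bandwidth / model size, since every token
# streams the whole quantized weight file from memory.

GIB = 1024 ** 3
model_size_bytes = 3.83 * GIB  # mistral-7b-v0.1.Q4_0.gguf size from llama-bench

# Approximate peak bandwidths (assumed spec-sheet figures, bytes/s)
devices = {
    "M2 (unified memory)":      100e9,  # ~100 GB/s
    "Intel 12400 (DDR4-3200)":   50e9,  # ~50 GB/s dual-channel
    "Arc 770 16GB (GDDR6)":     560e9,  # ~560 GB/s
}

for name, bw in devices.items():
    ceiling = bw / model_size_bytes
    print(f"{name}: <= {ceiling:.0f} tok/s theoretical ceiling")
```

On these assumptions the M2 (~24 tok/s ceiling) and 12400 (~12 tok/s) are already close to their rooflines in the tables above, while the Arc 770 (~136 tok/s ceiling) achieves well under a fifth of its ceiling, which is what makes its tg numbers look so low.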

@NeoZhangJianyu
Collaborator

@chsasank
Thank you for your feedback!
Currently, SYCL backend development is focused on functional issues.
We try our best to avoid performance regressions during development,
but performance optimization has not started yet.

We encourage all developers to engage in this effort.

Thank you!

@github-actions github-actions bot added the stale label Apr 11, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@jwhitehorn

> @chsasank Thank you for your feedback! Currently, SYCL backend development is focused on functional issues. We try our best to avoid performance regressions during development, but performance optimization has not started yet. We encourage all developers to engage in this effort. Thank you!

@NeoZhangJianyu , what is the best way to get involved in this effort?

From the performance testing I've done, the A770 runs about 2 to 3 times slower than my M1 Max. If there's anything I could do directly to help improve the SYCL backend, I'd love to contribute. I just don't know where to begin.

@NeoZhangJianyu
Collaborator

@jwhitehorn
It's great to know you are interested in the SYCL backend.
The M1 is an SoC, while the Arc 770 is a discrete GPU.
They are not the same type of device, so it's not fair to compare their performance directly.

  1. The current SYCL backend runs SYCL code on the EUs (Vector Engines, XVE). The Arc 770 also has Matrix Engines (XMX), which the SYCL backend does not use.
  2. ESIMD is a low-level programming technology for unlocking Intel GPU performance.

Neither of these powerful technologies is used in the SYCL backend yet.
So I think there is huge potential to improve the SYCL backend on Intel GPUs.

For now, I'm still working on functionality and bug fixes rather than performance,
because I think most users run llama.cpp with a single request, i.e. it serves one client at a time.
Since a human reading text has a speed limit, a very fast response (say, <20 ms/token) won't bring much extra value to a single user.
I think the Arc 770's performance is in fact good enough for a single user.
That's why performance is not high on my priority list.
Of course, I will do some performance tuning step by step in the future.

If you want to work on performance, you could profile the bottlenecks of the SYCL backend with Intel VTune (included in the Intel oneAPI Base Toolkit), then optimize the hottest functions first.
For LLMs, the bottlenecks are in both compute and memory bandwidth.

Thank you!
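For anyone picking this up, the profiling workflow suggested above might look roughly like the following. This is a sketch under the assumption that the oneAPI Base Toolkit is installed at its default location and that the build and model paths from earlier in this thread are used; it uses VTune's standard `gpu-hotspots` collection:

```shell
# Load the oneAPI environment (assumed default install path)
source /opt/intel/oneapi/setvars.sh

# Collect GPU hotspots while llama-bench runs a short test
vtune -collect gpu-hotspots -result-dir vtune_sycl \
    -- ./build/bin/llama-bench -m models/mistral-7b-v0.1.Q4_0.gguf -p 128 -n 128

# Summarize the hottest kernels from the collected result
vtune -report hotspots -result-dir vtune_sycl
```

The hotspot report should show which SYCL kernels (e.g. the quantized matrix-vector multiplies during tg) dominate, which is where optimization effort would pay off first.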
