Replies: 13 comments 27 replies
-
I'd like to handle: multi-card support, and the CI test error when more than one GPU is detected and used.
-
For code cleanup and sanitization of the compiler runtime, I will be adding a patch after the previous changes.
-
Would like to see support for SOTA 2-bit quantized models (GGML_TYPE_IQ2_XXS, GGML_TYPE_IQ2_XS, GGML_TYPE_IQ3_XXS). I've been trying to do this myself for the past hour or two; using dpct isn't as trouble-free as it makes you think.
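For readers unfamiliar with why these quant types are fiddly to port: the real GGML IQ2/IQ3 formats use per-block scales and shared codebooks, but every dequantize kernel for them bottoms out in sub-byte bit manipulation, which is where naive dpct migration tends to go wrong. A minimal, purely illustrative sketch of that core step (not the actual GGML layout):

```cpp
#include <cstdint>
#include <vector>

// Illustrative only: unpack four 2-bit values from each packed byte.
// The real IQ2_XXS/IQ2_XS/IQ3_XXS formats add per-block scales and
// codebook lookups on top of this kind of extraction.
std::vector<uint8_t> unpack_2bit(const std::vector<uint8_t>& packed) {
    std::vector<uint8_t> out;
    out.reserve(packed.size() * 4);
    for (uint8_t b : packed) {
        for (int shift = 0; shift < 8; shift += 2) {
            out.push_back((b >> shift) & 0x3);  // extract one 2-bit lane
        }
    }
    return out;
}
```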
-
Before airMeng:sycl_fix_max_alloc_size (I don't know if these numbers can be considered meaningful when the model produces useless output):
Using device 0 (Intel(R) Arc(TM) A770 Graphics) as main device
After:
Latest Vulkan:
-
I wonder why the build fails; I think this GPU supports f16. But anyway, I'm getting 16 tokens/sec for Q4_K_M for 7B with all layers on the GPU. With batched bench, Mistral-7B Q4_K_M:
-
I would like to know if there are plans to support quantization types that are not currently supported, like IQ3 and IQ4.
-
With the recent changes, the model nous-hermes-2-34b-2.69 has seen a significant speedup: from an unusable 3-4 tokens per second, it now reaches 7-8.
detect 1 SYCL GPUs: [0,] with Max compute units:512
build: 21b0867 (2345)
For comparison, Vulkan:
Vulkan0: Intel(R) Arc(TM) A770 Graphics | uma: 0 | fp16: 1 | warp size: 32
build: 82cb31e (2348)
I hope other quantization methods will also see an improvement; for now, they perform more or less similarly to Vulkan.
-
I have a question. During prompt processing or generation, llama.cpp's SYCL backend seems to use only one of the (I am assuming XMX) engines of my GPU, although they are tagged 'unknown' in intel_gpu_top. Does anyone know why? And is it possible to parallelize across all of them? On Linux, I can't monitor either the VRAM usage or the temperatures, which is surprising because it has been a while since this GPU was launched. All my hopes are on the new Xe driver.
-
@akarshanbiswas can you try https://github.com/intel/xpumanager to monitor the usage?
-
Adding this here; it may add to the list of todos and fixes. Credit goes to Gemini 1.5 Pro 1 million. :)
I will update when I find anything else with my limited knowledge.
-
The SYCL backend should be updated to adopt these changes in ggml-backend:
-
Just an update here: I did not use llama.cpp for a few days because I was busy. I ran it today to test Llama 3 and found that it hangs every time, with every model, right here:
I am running with --no-mmap. In the logs I found:
Not sure if this is because of an update that I received on Arch Linux, but not a single model runs with the same binary that used to work before. Update: Not related to llama.cpp; JAX with intel-extension-for-openxla hangs too (now confirmed). Update 2: Came across this: intel/compute-runtime#497
-
Latest SYCL is broken due to #7640 (comment). I am looking into it and hope to fix it soon.
-
Feel free to drop a note; let us know if you have any feature requests or bugs (even unconfirmed).
The current code returns all SYCL devices, including CPU, GPU (Level Zero, OpenCL) and FPGA, but the SYCL backend only supports GPU. So when CI tests run on other devices, they will fail.
There is a known issue in SYCL: memcpy() from host (mmap) to device will hang in some cases. It is not resolved yet. A workaround is to not use mmap; I have handled it in llama-bench (added a --mmap parameter). We need to add it to more applications in examples.
I suggest handling it after multi-card support is finished. Lots of such unused code will be useful for the multi-card feature.
Also, let us know if you have taken any tasks here.
cc @NeoZhangJianyu @luoyu-intel @abhilash1910
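To illustrate the fix: in real SYCL code the enumeration can be restricted up front with sycl::device::get_devices(sycl::info::device_type::gpu). The filtering idea itself, sketched in plain C++ with hypothetical stand-in types (DeviceType and Device are not the actual backend types):

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in types for illustration; the real backend works
// with sycl::device and sycl::info::device_type instead.
enum class DeviceType { CPU, GPU, FPGA };
struct Device {
    DeviceType type;
    std::string name;
};

// Keep only GPU devices, so CI never selects a CPU or FPGA that the
// SYCL backend cannot drive.
std::vector<Device> gpu_devices(const std::vector<Device>& all) {
    std::vector<Device> gpus;
    for (const auto& d : all) {
        if (d.type == DeviceType::GPU) gpus.push_back(d);
    }
    return gpus;
}
```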
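To make the workaround concrete: the no-mmap path reads the model file into an ordinary heap buffer, so later host-to-device copies come from heap memory rather than an mmap()-ed region. A minimal sketch of such a loader (load_no_mmap is a hypothetical helper, not a llama.cpp function):

```cpp
#include <cstdio>
#include <stdexcept>
#include <vector>

// Sketch of a no-mmap load path: read the whole file into a heap buffer
// instead of mmap()-ing it, so copies to the device come from ordinary
// host memory and avoid the mmap-to-device hang mentioned in this comment.
std::vector<char> load_no_mmap(const char* path) {
    std::FILE* f = std::fopen(path, "rb");
    if (!f) throw std::runtime_error("cannot open file");
    std::fseek(f, 0, SEEK_END);
    long size = std::ftell(f);
    std::fseek(f, 0, SEEK_SET);
    std::vector<char> buf(size > 0 ? (size_t)size : 0);
    if (size > 0 && std::fread(buf.data(), 1, (size_t)size, f) != (size_t)size) {
        std::fclose(f);
        throw std::runtime_error("short read");
    }
    std::fclose(f);
    return buf;
}
```

In llama-bench, the --mmap parameter mentioned above toggles exactly this choice between the mmap path and a heap-buffer load.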