Replies: 13 comments 27 replies
-
I'd like to handle: multi-card support, and the CI test error when more than one GPU is detected and used.
-
For code cleanup and sanitization of the compiler runtime, I will be adding a patch after the previous changes.
-
Would like to see support for SOTA 2-bit quantized models (GGML_TYPE_IQ2_XXS, GGML_TYPE_IQ2_XS, GGML_TYPE_IQ3_XXS). I've been trying to do this myself for the past hour or two; using dpct isn't as trouble-free as it makes you think.
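For readers unfamiliar with why these quant types are fiddly to port: the real GGML IQ2/IQ3 formats use per-block scales and shared codebooks, but every dequantize kernel for them bottoms out in sub-byte bit manipulation, which is where naive dpct migration tends to go wrong. A minimal, purely illustrative sketch of that core step (not the actual GGML layout):

```cpp
#include <cstdint>
#include <vector>

// Illustrative only: unpack four 2-bit values from each packed byte.
// The real IQ2_XXS/IQ2_XS/IQ3_XXS formats add per-block scales and
// codebook lookups on top of this kind of extraction.
std::vector<uint8_t> unpack_2bit(const std::vector<uint8_t>& packed) {
    std::vector<uint8_t> out;
    out.reserve(packed.size() * 4);
    for (uint8_t b : packed) {
        for (int shift = 0; shift < 8; shift += 2) {
            out.push_back((b >> shift) & 0x3);  // extract one 2-bit lane
        }
    }
    return out;
}
```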
-
Before airMeng:sycl_fix_max_alloc_size (I don't know if these numbers can be considered meaningful when the model produces useless output):
Using device 0 (Intel(R) Arc(TM) A770 Graphics) as main device
After:
Latest Vulkan:
-
I wonder why the build fails; I think this GPU supports f16. But anyway, I'm getting 16 tokens/sec for Q4_K_M for 7B with all layers on the GPU. With batched bench, Mistral-7B Q4_K_M:
-
I would like to know if there are plans to support quantization types that are not currently supported, like IQ3 and IQ4.
-
With the recent changes, the model nous-hermes-2-34b-2.69 has seen a significant speedup: from an unusable 3-4 tokens per second, it now reaches 7-8.
detect 1 SYCL GPUs: [0,] with Max compute units:512
build: 21b0867 (2345)
For comparison, Vulkan:
Vulkan0: Intel(R) Arc(TM) A770 Graphics | uma: 0 | fp16: 1 | warp size: 32
build: 82cb31e (2348)
I hope other quantization methods will also see an improvement; for now, they perform more or less similarly to Vulkan.
-
I have a question. During prompt processing or generation, llama.cpp's SYCL backend seems to use only one of the (I am assuming XMX) engines of my GPU, although they are tagged 'unknown' in intel_gpu_top. Does anyone know why? And is it possible to parallelize across all of them? On Linux, I can't monitor either the VRAM usage or the temperatures, which is surprising because it has been a while since this GPU was launched. All my hopes are on the new Xe driver.
-
@akarshanbiswas can you try https://github.com/intel/xpumanager to monitor the usage?
-
Adding this here; it may add to the list of todos and fixes. Credit goes to Gemini 1.5 Pro 1 million. :)
I will update when I find anything else with my limited knowledge.
-
The SYCL backend should be updated to adopt these changes in ggml-backend:
-
Just an update here: I did not use llama.cpp for a few days because I was busy. I ran it today to test Llama 3 and found that it hangs every time, with every model, right here:
I am running with --no-mmap. In the logs I found:
Not sure if this is because of an update that I received on Arch Linux, but not a single model runs with the same binary that used to work before. Update: Not related to llama.cpp; JAX with intel-extension-for-openxla hangs too (now confirmed). Update 2: Came across this: intel/compute-runtime#497
-
Latest SYCL is broken due to #7640 (comment). I am looking into it and hope to fix it soon.
-
Feel free to drop a note; let us know if you have any feature requests or bugs (even unconfirmed).
The current code returns all SYCL devices, including CPU, GPU (Level Zero, OpenCL) and FPGA, but the SYCL backend only supports GPU. So when CI tests run on other devices, they will fail.
There is a known issue in SYCL: memcpy() from host (mmap) to device will hang in some cases. It is not resolved yet. A workaround is to not use mmap; I have handled it in llama-bench (added a --mmap parameter). We need to add it to more applications in examples.
I suggest handling it after multi-card support is finished. Lots of such unused code will be useful for the multi-card feature.
Also, let us know if you have taken any tasks here.
cc @NeoZhangJianyu @luoyu-intel @abhilash1910
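To illustrate the fix: in real SYCL code the enumeration can be restricted up front with sycl::device::get_devices(sycl::info::device_type::gpu). The filtering idea itself, sketched in plain C++ with hypothetical stand-in types (DeviceType and Device are not the actual backend types):

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in types for illustration; the real backend works
// with sycl::device and sycl::info::device_type instead.
enum class DeviceType { CPU, GPU, FPGA };
struct Device {
    DeviceType type;
    std::string name;
};

// Keep only GPU devices, so CI never selects a CPU or FPGA that the
// SYCL backend cannot drive.
std::vector<Device> gpu_devices(const std::vector<Device>& all) {
    std::vector<Device> gpus;
    for (const auto& d : all) {
        if (d.type == DeviceType::GPU) gpus.push_back(d);
    }
    return gpus;
}
```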
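To make the workaround concrete: the no-mmap path reads the model file into an ordinary heap buffer, so later host-to-device copies come from heap memory rather than an mmap()-ed region. A minimal sketch of such a loader (load_no_mmap is a hypothetical helper, not a llama.cpp function):

```cpp
#include <cstdio>
#include <stdexcept>
#include <vector>

// Sketch of a no-mmap load path: read the whole file into a heap buffer
// instead of mmap()-ing it, so copies to the device come from ordinary
// host memory and avoid the mmap-to-device hang mentioned in this comment.
std::vector<char> load_no_mmap(const char* path) {
    std::FILE* f = std::fopen(path, "rb");
    if (!f) throw std::runtime_error("cannot open file");
    std::fseek(f, 0, SEEK_END);
    long size = std::ftell(f);
    std::fseek(f, 0, SEEK_SET);
    std::vector<char> buf(size > 0 ? (size_t)size : 0);
    if (size > 0 && std::fread(buf.data(), 1, (size_t)size, f) != (size_t)size) {
        std::fclose(f);
        throw std::runtime_error("short read");
    }
    std::fclose(f);
    return buf;
}
```

In llama-bench, the --mmap parameter mentioned above toggles exactly this choice between the mmap path and a heap-buffer load.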