
Feature: Integrate with unified SYCL backend for Intel GPUs #2690

Merged: 92 commits into ggerganov:master on Jan 28, 2024

Conversation

@abhilash1910 (Collaborator) commented Aug 21, 2023

Motivation:
Thanks for creating llama.cpp. There has already been quite an effort to integrate an OpenCL runtime for AVX instruction sets. However, running on Intel graphics cards requires an additional SYCL runtime ported over the OpenCL runtime.
This PR enables that feature and is now in its final stages, with community feedback expected on performance and improvements.
Co-authored with @NeoZhangJianyu, @airMeng, and @luoyu-intel; thanks to @AidanBeltonS (Codeplay) for suggestions and recommendations, and thanks to everyone who helped improve and shape the PR through feedback and future performance work.
Thanks to @jacob1218 for running initial benchmarks:

(benchmark results screenshot)

Since the development is based on the SYCLomatic runtime, which is still evolving with the latest upgrades, feedback, suggestions, and comments are welcome.

Tagging @ggerganov.

@abhilash1910 marked this pull request as draft on August 21, 2023
@ggerganov (Owner)

This looks interesting, but I need some more context and numbers. What hardware is this useful for?

@Jacoby1218 commented Oct 4, 2023

> This looks interesting, but I need some more context and numbers. What hardware is this useful for?

Discrete Intel GPUs (Intel Arc and the professional variants). Very interested to see performance numbers for this vs. CLBlast.

@abhilash1910 (Collaborator, Author)

> This looks interesting, but I need some more context and numbers. What hardware is this useful for?

@ggerganov Yes, this targets Intel dGPUs (Max and Flex), including Arc GPUs, which rely on the SYCL backend. OpenCL is already supported, but this PR is raised for better performance and optimization (from Intel LLVM). I am currently testing its performance and stabilizing it; I will flag it for review once it is properly stable.

@unbrice commented Oct 18, 2023

I'm also interested in this feature. @abhilash1910 are you actively working on it, or is it up for grabs?

@abhilash1910 (Collaborator, Author) commented Oct 18, 2023

@unbrice Yes, it is under development, but if you are able to compile it, great. There are some configurations and tasks that are still pending.

@shibe2 linked an issue on Oct 20, 2023 that may be closed by this pull request
@itlackey commented Dec 3, 2023

> This looks interesting, but I need some more context and numbers. What hardware is this useful for?

I have put together a repo that shows an example of building llama.cpp with OpenCL and running it on an Intel A770 via Docker. The Dockerfile and all associated scripts showing how the container is built, run, and tested are included. There is an example log file that shows more of the console logs from the docker container, including the response to the curl command in the test.sh file.

https://github.com/itlackey/llama.cpp-opencl

I have the A770 and a 4060 Ti 16GB running in the same machine. Below are examples of the output when running the same model on either card. The 4060 is 10x faster than the Arc. This is not the case when running things like Intel Extension for PyTorch; these cards should perform very similarly when running optimally. This leads me to believe that the OpenCL support in llama.cpp is not using the card to its fullest potential. Hopefully adding SYCL (or Vulkan) support will bring the Arc up to speed.

Hopefully this is helpful.

A770 Logs:
print_timings: prompt eval time = 871.14 ms / 14 tokens ( 62.22 ms per token, 16.07 tokens per second)
print_timings: eval time = 14133.84 ms / 128 runs ( 110.42 ms per token, 9.06 tokens per second)
print_timings: total time = 15004.98 ms
slot 0 released (143 tokens in cache)
{"timestamp":1701583081,"level":"INFO","function":"log_server_request","line":2601,"message":"request","remote_addr":"172.17.0.1","remote_port":43572,"status":200,"method":"POST","path":"/completion","params":{}}

4060ti Logs:

llama_print_timings: load time = 808.00 ms
llama_print_timings: sample time = 49.21 ms / 128 runs ( 0.38 ms per token, 2600.99 tokens per second)
llama_print_timings: prompt eval time = 86.05 ms / 14 tokens ( 6.15 ms per token, 162.71 tokens per second)
llama_print_timings: eval time = 2867.97 ms / 127 runs ( 22.58 ms per token, 44.28 tokens per second)
llama_print_timings: total time = 3033.07 ms
{"timestamp":1701583788,"level":"INFO","function":"log_server_request","line":1244,"message":"request","remote_addr":"127.0.0.1","remote_port":57650,"status":200,"method":"POST","path":"/completion","params":{}}

@JohnnyOpcode

I'm starting my SYCL research and development here, and it's looking like a decent-sized effort. Macs might not play well with this.

https://github.com/JohnnyOpcode/ggml-sycl

@ggerganov (Owner)

Btw, the existing OpenCL implementation offloads only the matrix multiplications to the GPU - the rest of the ops are still running on the CPU and there is overhead from constantly moving the activations back and forth between host and device memory.

Ideally, the entire graph computation should be offloaded, similar to the CUDA and Metal backends.
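
To make the distinction concrete, here is a conceptual sketch only; the function names are hypothetical stand-ins, not the actual ggml/llama.cpp API. With per-op offload every matrix multiplication pays a host-to-device and device-to-host transfer, while with whole-graph offload the tensors stay resident on the device and only the final output is copied back.

```cpp
// Conceptual sketch: per-op offload vs. whole-graph offload.
// All names below (upload_to_device, device_matmul, download_to_host,
// backend_graph_compute) are hypothetical, not the real ggml API.
#include <cstddef>

struct device_buf;       // opaque handle to device memory
struct compute_graph;    // opaque handle to a full computation graph

device_buf *upload_to_device(const float *host, std::size_t n_floats);
void        download_to_host(float *host, const device_buf *dev, std::size_t n_floats);
device_buf *device_matmul(const device_buf *a, const device_buf *b, int m, int n, int k);
void        backend_graph_compute(compute_graph *g);  // runs every op on the device

// Per-op offload (current OpenCL path): each mat-mul round-trips the activations.
void matmul_offloaded(const float *a, const float *b, float *c, int m, int n, int k) {
    device_buf *da = upload_to_device(a, std::size_t(m) * k);
    device_buf *db = upload_to_device(b, std::size_t(k) * n);
    device_buf *dc = device_matmul(da, db, m, n, k);   // GPU does the mat-mul
    download_to_host(c, dc, std::size_t(m) * n);       // activations come back to the host
}

// Whole-graph offload (CUDA/Metal style): all ops run on the device,
// with no per-op host<->device round-trips in between.
void graph_offloaded(compute_graph *g) {
    backend_graph_compute(g);
}
```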

@AlexFierro9

> @unbrice Yes, it is under development, but if you are able to compile it, great. There are some configurations and tasks that are still pending.

@abhilash1910 Do you need any support with adding any remaining configurations, or is it complete?

@netrunnereve (Contributor)

@itlackey Have you tried the WIP Vulkan backend at #2059?

@itlackey commented Dec 3, 2023

> @itlackey Have you tried the WIP Vulkan backend at #2059?

I have not. I might try to pull the fork down and see what the performance looks like. Any idea if this will be merged soon?

@netrunnereve (Contributor) commented Dec 7, 2023

> I have not. I might try to pull the fork down and see what the performance looks like. Any idea if this will be merged soon?

I have no idea, but it's been working perfectly for me with Llama and Mistral models. While I don't think there are shaders for all the ops yet, Vulkan uses 100% of my GPU (unlike OpenCL) and it runs 2x faster.

@abhilash1910 (Collaborator, Author)

@ggerganov could you help trigger the CI? Thanks.

@koech-v commented Jan 14, 2024

🤞

@mgolub2 commented Jan 17, 2024

> I have not. I might try to pull the fork down and see what the performance looks like. Any idea if this will be merged soon?
>
> I have no idea, but it's been working perfectly for me with Llama and Mistral models. While I don't think there are shaders for all the ops yet, Vulkan uses 100% of my GPU (unlike OpenCL) and it runs 2x faster.

I've been trying to get this branch to build to play around with my A770, but so far have had no luck. What environment/dependencies does one need to build this? I've tried various oneAPI containers, but none seem to be able to find SYCL during cmake configuration.

Edit: I realize now you were talking about the Vulkan fork, not this one, sorry.

@AidanBeltonS (Collaborator) left a comment

Some small comments regarding the README and examples.

Review comments on: CMakeLists.txt, README.md, README_sycl.md, examples/sycl/README.md, examples/sycl/ls-sycl-device.cpp
@AidanBeltonS (Collaborator) left a comment

Some comments from trying to compile this application with the open-source DPC++ release.

Review comments on: common/common.cpp, llama.cpp, README_sycl.md, ggml-sycl.h, ggml-sycl.cpp
@Jacoby1218 commented Jan 22, 2024

This still doesn't compile.

  138 |         return src->buffer->iface.cpy_tensor(dst_buf, src, dst);
      |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/jacoby1218/llama.cpp-sycl/ggml-backend.c:494:30: error: incompatible function pointer types initializing 'void (*)(ggml_backend_buffer_t, const struct ggml_tensor *, struct ggml_tensor *)' (aka 'void (*)(struct ggml_backend_buffer *, const struct ggml_tensor *, struct ggml_tensor *)') with an expression of type 'bool (ggml_backend_buffer_t, const struct ggml_tensor *, struct ggml_tensor *)' (aka 'bool (struct ggml_backend_buffer *, const struct ggml_tensor *, struct ggml_tensor *)') [-Wincompatible-function-pointer-types]
  494 |     /* .cpy_tensor      = */ ggml_backend_cpu_buffer_cpy_tensor,
      |                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/jacoby1218/llama.cpp-sycl/ggml-backend.c:507:30: error: incompatible function pointer types initializing 'void (*)(ggml_backend_buffer_t, const struct ggml_tensor *, struct ggml_tensor *)' (aka 'void (*)(struct ggml_backend_buffer *, const struct ggml_tensor *, struct ggml_tensor *)') with an expression of type 'bool (ggml_backend_buffer_t, const struct ggml_tensor *, struct ggml_tensor *)' (aka 'bool (struct ggml_backend_buffer *, const struct ggml_tensor *, struct ggml_tensor *)') [-Wincompatible-function-pointer-types]
  507 |     /* .cpy_tensor      = */ ggml_backend_cpu_buffer_cpy_tensor,
      |                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@NeoZhangJianyu (Collaborator)

> This still doesn't compile. […]

Yes, I'm fixing this issue. We found that this issue appeared after rebasing onto the latest code.
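
For anyone hitting the same error: the buffer interface struct declares cpy_tensor with one return type, while the function used to initialize it now returns another. A reduced sketch of the mismatch and the kind of signature alignment that resolves it (types simplified here; the real definitions live in ggml-backend.c):

```cpp
// Reduced reproduction of the incompatible-function-pointer-types error
// (simplified types for illustration; not the real ggml headers).
struct ggml_tensor;                                    // opaque for this sketch
typedef struct ggml_backend_buffer *ggml_backend_buffer_t;

// Stale interface: the callback is declared to return void ...
struct buffer_iface_old {
    void (*cpy_tensor)(ggml_backend_buffer_t, const ggml_tensor *, ggml_tensor *);
};

// ... but after the rebase the implementation returns bool, so using it as the
// initializer no longer matches the declared function-pointer type.
static bool cpu_buffer_cpy_tensor(ggml_backend_buffer_t buf,
                                  const ggml_tensor *src, ggml_tensor *dst) {
    (void)buf; (void)src; (void)dst;
    return false;
}

// Fix: make the declared callback type and the implementation agree (here, bool).
struct buffer_iface_new {
    bool (*cpy_tensor)(ggml_backend_buffer_t, const ggml_tensor *, ggml_tensor *);
};

static const buffer_iface_new iface = {
    /* .cpy_tensor = */ cpu_buffer_cpy_tensor,
};
```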

@NeoZhangJianyu (Collaborator)

@ggerganov
It's great to see your approval!
Before merging, does the CI need to be triggered?

@NeoZhangJianyu (Collaborator)

@ggerganov
Could you merge the PR? It looks like everything is ready.

@ggerganov (Owner)

Will likely merge later today or tomorrow.

Review comment on: README_sycl.md
@ngxson (Collaborator) commented Jan 27, 2024

Thanks for all your hard work, guys! I've been able to easily compile and run it using the intel/hpckit docker image without any problem.

When running inside a container, you can pass the GPU through to the container with these arguments: --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card1:/dev/dri/card1

I'm using an iGPU (Intel(R) Iris(R) Xe Graphics) and am able to utilize 100% of its power, though unfortunately the performance is not better than just using the CPU. I should definitely get myself an external GPU.

I'll update the guide for compiling & running with docker in the future.

@NeoZhangJianyu (Collaborator) commented Jan 28, 2024

> Thanks for all your hard work, guys! I've been able to easily compile and run it using the intel/hpckit docker image without any problem. […]

Thank you for the docker update!

Intel iGPUs have a limited number of EUs; in general, an iGPU includes 32 EUs, so it's slow. If you try it on the iGPU of Meteor Lake (the new Intel Core iGPU), or on an Intel Arc/Flex/Max dGPU, the performance is good.

@ggerganov merged commit 0f64857 into ggerganov:master on Jan 28, 2024
46 checks passed
@sorasoras

@abhilash1910
I'm having a hard time compiling this on my Windows machine.

cmake .. -G "Ninja" -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER="C:/Program Files (x86)/Intel/oneAPI/compiler/2024.0/bin/compiler/clang.exe" -DCMAKE_CXX_COMPILER="C:/Program Files (x86)/Intel/oneAPI/compiler/2024.0/bin/compiler/clang++.exe"
-- The C compiler identification is IntelLLVM 2024.0.2 with GNU-like command-line
-- The CXX compiler identification is IntelLLVM 2024.0.2 with GNU-like command-line
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - failed
-- Check for working C compiler: C:/Program Files (x86)/Intel/oneAPI/compiler/2024.0/bin/compiler/clang.exe
-- Check for working C compiler: C:/Program Files (x86)/Intel/oneAPI/compiler/2024.0/bin/compiler/clang.exe - broken
CMake Error at C:/Strawberry/c/share/cmake-3.26/Modules/CMakeTestCCompiler.cmake:67 (message):
  The C compiler

    "C:/Program Files (x86)/Intel/oneAPI/compiler/2024.0/bin/compiler/clang.exe"

  is not able to compile a simple test program.

  It fails with the following output:

    Change Dir: W:/git/llama.cpp/sycl/CMakeFiles/CMakeScratch/TryCompile-vi8w7r

    Run Build Command(s):C:/Strawberry/c/bin/ninja.exe -v cmTC_b1bda && [1/2] C:\PROGRA~2\Intel\oneAPI\compiler\2024.0\bin\compiler\clang.exe  /nologo   /DWIN32 /D_WINDOWS /W3  /MDd /Zi /Ob0 /Od /RTC1 -QMD -QMT CMakeFiles\cmTC_b1bda.dir\testCCompiler.c.obj -QMF CMakeFiles\cmTC_b1bda.dir\testCCompiler.c.obj.d /FoCMakeFiles\cmTC_b1bda.dir\testCCompiler.c.obj /FdCMakeFiles\cmTC_b1bda.dir\ -c W:\git\llama.cpp\sycl\CMakeFiles\CMakeScratch\TryCompile-vi8w7r\testCCompiler.c
    FAILED: CMakeFiles/cmTC_b1bda.dir/testCCompiler.c.obj
    C:\PROGRA~2\Intel\oneAPI\compiler\2024.0\bin\compiler\clang.exe  /nologo   /DWIN32 /D_WINDOWS /W3  /MDd /Zi /Ob0 /Od /RTC1 -QMD -QMT CMakeFiles\cmTC_b1bda.dir\testCCompiler.c.obj -QMF CMakeFiles\cmTC_b1bda.dir\testCCompiler.c.obj.d /FoCMakeFiles\cmTC_b1bda.dir\testCCompiler.c.obj /FdCMakeFiles\cmTC_b1bda.dir\ -c W:\git\llama.cpp\sycl\CMakeFiles\CMakeScratch\TryCompile-vi8w7r\testCCompiler.c
    clang: error: unknown argument: '-QMD'
    clang: error: unknown argument: '-QMT'
    clang: error: unknown argument: '-QMF'
    clang: error: no such file or directory: '/nologo'
    clang: error: no such file or directory: '/DWIN32'
    clang: error: no such file or directory: '/D_WINDOWS'
    clang: error: no such file or directory: '/W3'
    clang: error: no such file or directory: '/MDd'
    clang: error: no such file or directory: '/Zi'
    clang: error: no such file or directory: '/Ob0'
    clang: error: no such file or directory: '/Od'
    clang: error: no such file or directory: '/RTC1'
    clang: error: no such file or directory: 'CMakeFiles\cmTC_b1bda.dir\testCCompiler.c.obj'
    clang: error: no such file or directory: 'CMakeFiles\cmTC_b1bda.dir\testCCompiler.c.obj.d'
    clang: error: no such file or directory: '/FoCMakeFiles\cmTC_b1bda.dir\testCCompiler.c.obj'
    clang: error: no such file or directory: '/FdCMakeFiles\cmTC_b1bda.dir\'
    ninja: build stopped: subcommand failed.





  CMake will not be able to correctly generate this project.
Call Stack (most recent call first):
  CMakeLists.txt:2 (project)


-- Configuring incomplete, errors occurred!

Do you have any suggestions?

@abhilash1910 (Collaborator, Author)

Yes @sorasoras, Windows build support is next in our development plan. We are working to provide the build option.

@NeoZhangJianyu (Collaborator)

> @abhilash1910 I'm having a hard time compiling this on my Windows machine. […]

The Windows build work is in its final stage. We will create a PR soon.

@sorasoras

Cool, can't wait to test this against the Vulkan build.

@NeoZhangJianyu (Collaborator)

> Cool, can't wait to test this against the Vulkan build.

The Windows build PR has been created: #5208

Please join the review.

@mudler (Contributor) commented Feb 1, 2024

Is this supposed to work with laptop/low-end iGPUs? I was getting some acceleration with OpenBLAS and wanted to give this a shot locally, but it fails with:

$ GGML_SYCL_DEBUG=1 GGML_SYCL_DEVICE=0 ./bin/main -m ../../../../../models/c0c3c83d0ec33ffe925657a56b06771b -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 > logs.txt 2>&1

Log start
main: build = 2038 (ce320601)
main: built with Intel(R) oneAPI DPC++/C++ Compiler 2024.0.2 (2024.0.2.20231213) for x86_64-unknown-linux-gnu
main: seed  = 1706787641
ggml_init_sycl: GGML_SYCL_FP16:   no
ggml_init_sycl: SYCL_USE_XMX: yes
found 2 SYCL devices:
  Device 0: 12th Gen Intel(R) Core(TM) i7-1280P,	compute capability 3.0,
	max compute_units 20,	max work group size 8192,	max sub group size 64,	global mem size 67084083200
  Device 1: Intel(R) FPGA Emulation Device,	compute capability 1.2,
	max compute_units 20,	max work group size 67108864,	max sub group size 64,	global mem size 67084083200
Using device 0 (12th Gen Intel(R) Core(TM) i7-1280P) as main device
llama_model_loader: loaded meta data with 20 key-value pairs and 325 tensors from ../../../../../models/c0c3c83d0ec33ffe925657a56b06771b (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi2
llama_model_loader: - kv   1:                               general.name str              = Phi2
llama_model_loader: - kv   2:                        phi2.context_length u32              = 2048
llama_model_loader: - kv   3:                      phi2.embedding_length u32              = 2560
llama_model_loader: - kv   4:                   phi2.feed_forward_length u32              = 10240
llama_model_loader: - kv   5:                           phi2.block_count u32              = 32
llama_model_loader: - kv   6:                  phi2.attention.head_count u32              = 32
llama_model_loader: - kv   7:               phi2.attention.head_count_kv u32              = 32
llama_model_loader: - kv   8:          phi2.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv   9:                  phi2.rope.dimension_count u32              = 32
llama_model_loader: - kv  10:                          general.file_type u32              = 7
llama_model_loader: - kv  11:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,51200]   = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,51200]   = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,50000]   = ["Ġ t", "Ġ a", "h e", "i n", "r e",...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 50256
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 50256
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 50256
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  195 tensors
llama_model_loader: - type q8_0:  130 tensors
llm_load_vocab: mismatch in special tokens definition ( 910/51200 vs 944/51200 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phi2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 51200
llm_load_print_meta: n_merges         = 50000
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 2560
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 32
llm_load_print_meta: n_embd_head_k    = 80
llm_load_print_meta: n_embd_head_v    = 80
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 2560
llm_load_print_meta: n_embd_v_gqa     = 2560
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 10240
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 2.78 B
llm_load_print_meta: model size       = 2.75 GiB (8.51 BPW) 
llm_load_print_meta: general.name     = Phi2
llm_load_print_meta: BOS token        = 50256 '<|endoftext|>'
llm_load_print_meta: EOS token        = 50256 '<|endoftext|>'
llm_load_print_meta: UNK token        = 50256 '<|endoftext|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_tensors: ggml ctx size =    0.25 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:            buffer size =  2686.46 MiB
llm_load_tensors:        CPU buffer size =   132.81 MiB
............................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:            KV buffer size =   160.00 MiB
llama_new_context_with_model: KV self size  =  160.00 MiB, K (f16):   80.00 MiB, V (f16):   80.00 MiB
llama_new_context_with_model:        CPU input buffer size   =     6.01 MiB
llama_new_context_with_model:            compute buffer size =   121.00 MiB
llama_new_context_with_model:        CPU compute buffer size =     5.50 MiB
llama_new_context_with_model: graph splits (measure): 3
GGML_SYCL_DEBUG=1
call ggml_sycl_norm
The program was built for 1 devices
Build program log for '12th Gen Intel(R) Core(TM) i7-1280P':
Compilation started
Compilation done
Linking started
Linking done
Device build started
Options used by backend compiler: 
Failed to build device program
CompilerException Failed to lookup symbol _ZTSZZL13norm_f32_syclPKfPfiifPN4sycl3_V15queueEENKUlRNS3_7handlerEE0_clES7_EUlNS3_7nd_itemILi3EEEE_
JIT session error: Symbols not found: [ _Z11fmax_commonDv32_fS_S_ ]
Failed to materialize symbols: { (main, { _ZTSZL17sum_rows_f32_syclPKfPfiiPN4sycl3_V15queueEEUlNS3_7nd_itemILi3EEEE_, _ZGVdN32uuuuuuu__ZTSZZL17soft_max_f32_syclPKfS0_PfiiifPN4sycl3_V15queueEENKUlRNS3_7handlerEE_clES7_EUlNS3_7nd_itemILi3EEEE_, _ZTSZZL13norm_f32_syclPKfPfiifPN4sycl3_V15queueEENKUlRNS3_7handlerEE_clES7_EUlNS3_7nd_itemILi3EEEE_, _ZTSZZL13norm_f32_syclPKfPfiifPN4sycl3_V15queueEENKUlRNS3_7handlerEE0_clES7_EUlNS3_7nd_itemILi3EEEE_, _ZGVdN32uuuuuu__ZTSZZL17rms_norm_f32_syclPKfPfiifPN4sycl3_V15queueEENKUlRNS3_7handlerEE0_clES7_EUlNS3_7nd_itemILi3EEEE_, _ZTSZZL19group_norm_f32_syclPKfPfiiiPN4sycl3_V15queueEENKUlRNS3_7handlerEE0_clES7_EUlNS3_7nd_itemILi3EEEE_, _ZTSZZL17rms_norm_f32_syclPKfPfiifPN4sycl3_V15queueEENKUlRNS3_7handlerEE0_clES7_EUlNS3_7nd_itemILi3EEEE_, _ZTSZZL17soft_max_f32_syclPKfS0_PfiiifPN4sycl3_V15queueEENKUlRNS3_7handlerEE_clES7_EUlNS3_7nd_itemILi3EEEE_, _ZTSZZL17rms_norm_f32_syclPKfPfiifPN4sycl3_V15queueEENKUlRNS3_7handlerEE_clES7_EUlNS3_7nd_itemILi3EEEE_, _ZTSZZL19group_norm_f32_syclPKfPfiiiPN4sycl3_V15queueEENKUlRNS3_7handlerEE_clES7_EUlNS3_7nd_itemILi3EEEE_, _ZGVdN32uuuuuu__ZTSZZL19group_norm_f32_syclPKfPfiiiPN4sycl3_V15queueEENKUlRNS3_7handlerEE0_clES7_EUlNS3_7nd_itemILi3EEEE_, _ZGVdN32uuuuuu__ZTSZZL13norm_f32_syclPKfPfiifPN4sycl3_V15queueEENKUlRNS3_7handlerEE0_clES7_EUlNS3_7nd_itemILi3EEEE_ }) }

 -11 (PI_ERROR_BUILD_PROGRAM_FAILURE)Exception caught at file:/home/mudler/_git/LocalAI/backend/cpp/llama/llama.cpp/ggml-sycl.cpp, line:12651

Does that mean that the op is not supported by the onboard GPU? If so, I'd be happy to add it to the known issues in the docs.

@airMeng (Collaborator) commented Feb 1, 2024

> found 2 SYCL devices:
> Device 0: 12th Gen Intel(R) Core(TM) i7-1280P, compute capability 3.0,
> max compute_units 20, max work group size 8192, max sub group size 64, global mem size 67084083200
> Device 1: Intel(R) FPGA Emulation Device, compute capability 1.2,
> max compute_units 20, max work group size 67108864, max sub group size 64, global mem size 67084083200

@mudler It seems there is an issue with your oneAPI installation: no GPU device is detected. Can you run ./build/bin/ls-sycl-device and sycl-ls, then paste the output here?

For example, if the iGPU is detected, there will be separate entries for the iGPU (Intel(R) Graphics [0x7d55]) and the CPU, and the SYCL backend can work on the iGPU.

hengyume@9049fa09fde4:~$ sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Core(TM) Ultra 7 1003H OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Graphics [0x7d55] OpenCL 3.0 NEO  [23.43.27642.21]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Graphics [0x7d55] 1.3 [1.3.27642]
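
If sycl-ls only shows CPU and FPGA entries, a tiny standalone SYCL program can confirm whether the runtime can see the GPU at all. This is a generic SYCL 2020 sketch (not the ls-sycl-device tool added by this PR); if no [gpu] line shows up here either, the problem is in the driver or oneAPI setup rather than in llama.cpp.

```cpp
// Generic SYCL 2020 device listing (not this PR's ls-sycl-device tool).
// Build with a SYCL-enabled compiler, e.g. `icpx -fsycl list_devices.cpp`.
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    for (const auto &dev : sycl::device::get_devices()) {
        const char *kind = dev.is_gpu() ? "[gpu]" : dev.is_cpu() ? "[cpu]" : "[other]";
        std::cout << kind << ' '
                  << dev.get_info<sycl::info::device::name>() << " ("
                  << dev.get_platform().get_info<sycl::info::platform::name>() << ")\n";
    }
    return 0;
}
```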

@mudler (Contributor) commented Feb 1, 2024

> @mudler It seems there is an issue with your oneAPI installation: no GPU device is detected. Can you run ./build/bin/ls-sycl-device and sycl-ls, then paste the output here? […]

Mmm, alright, I see. Here I just have:

$ ./bin/ls-sycl-device 
found 2 SYCL devices:
  Device 0: 12th Gen Intel(R) Core(TM) i7-1280P,        compute capability 3.0,
        max compute_units 20,   max work group size 8192,       max sub group size 64,  global mem size 67084083200
  Device 1: Intel(R) FPGA Emulation Device,     compute capability 1.2,
        max compute_units 20,   max work group size 67108864,   max sub group size 64,  global mem size 67084083200
$ sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, 12th Gen Intel(R) Core(TM) i7-1280P OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
$ clinfo -l                          
Platform #0: Intel(R) OpenCL
 `-- Device #0: 12th Gen Intel(R) Core(TM) i7-1280P
Platform #1: Intel(R) FPGA Emulation Platform for OpenCL(TM)
 `-- Device #0: Intel(R) FPGA Emulation Device
$ hwinfo --display
32: PCI 02.0: 0300 VGA compatible controller (VGA)              
  [Created at pci.386]
  Unique ID: _Znp.usB9nIk3U2E
  SysFS ID: /devices/pci0000:00/0000:00:02.0
  SysFS BusID: 0000:00:02.0
  Hardware Class: graphics card
  Model: "Intel VGA compatible controller"
  Vendor: pci 0x8086 "Intel Corporation"
  Device: pci 0x46a6 
  SubVendor: pci 0x1028 "Dell"
  SubDevice: pci 0x0b08 
  Revision: 0x0c
  Driver: "i915"
  Driver Modules: "i915"
  Memory Range: 0x6054000000-0x6054ffffff (rw,non-prefetchable)
  Memory Range: 0x4000000000-0x400fffffff (ro,non-prefetchable)
  I/O Ports: 0x3000-0x303f (rw)
  Memory Range: 0x000c0000-0x000dffff (rw,non-prefetchable,disabled)
  IRQ: 187 (44937037 events)
  Module Alias: "pci:v00008086d000046A6sv00001028sd00000B08bc03sc00i00"
  Driver Info #0:
    Driver Status: i915 is active
    Driver Activation Cmd: "modprobe i915"
  Config Status: cfg=new, avail=yes, need=no, active=unknown

Primary display adapter: #32

So maybe something is wrong with my setup (even if I see all the drivers loaded 🙄). Anyway, thanks for double-checking! Maybe we can add a mention in the docs that a GPU device (with the [opencl:gpu:..] prefix) should be listed, and that this is the error you can otherwise get by selecting the wrong device.

@airMeng (Collaborator) commented Feb 1, 2024

@mudler Have you added yourself to the video and render groups, as described below?

b. Add user to group: video, render.

sudo usermod -aG render username
sudo usermod -aG video username

If the problem still exists, could you raise an issue so we can talk there instead of in this closed PR?

jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Feb 3, 2024
* first update for migration

* update init_cublas

* add debug functio, commit all help code

* step 1

* step 2

* step3 add fp16, slower 31->28

* add GGML_LIST_DEVICE function

* step 5 format device and print

* step6, enhance error check, remove CUDA macro, enhance device id to fix none-zero id issue

* support main device is non-zero

* step7 add debug for code path, rm log

* step 8, rename all macro & func from cuda by sycl

* fix error of select non-zero device, format device list

* ren ggml-sycl.hpp -> ggml-sycl.h

* clear CMAKE to rm unused lib and options

* correct queue: rm dtct:get_queue

* add print tensor function to debug

* fix error: wrong result in 658746bb26702e50f2c59c0e4ada8e9da6010481

* summary dpct definition in one header file to replace folder:dpct

* refactor device log

* mv dpct definition from folder dpct to ggml-sycl.h

* update readme, refactor build script

* fix build with sycl

* set nthread=1 when sycl, increase performance

* add run script, comment debug code

* add ls-sycl-device tool

* add ls-sycl-device, rm unused files

* rm rear space

* dos2unix

* Update README_sycl.md

* fix return type

* remove sycl version from include path

* restore rm code to fix hang issue

* add syc and link for sycl readme

* rm original sycl code before refactor

* fix code err

* add know issue for pvc hang issue

* enable SYCL_F16 support

* align pr4766

* check for sycl blas, better performance

* cleanup 1

* remove extra endif

* add build&run script, clean CMakefile, update guide by review comments

* rename macro to intel hardware

* editor config format

* format fixes

* format fixes

* editor format fix

* Remove unused headers

* skip build sycl tool for other code path

* replace tab by space

* fix blas matmul function

* fix mac build

* restore hip dependency

* fix conflict

* ren as review comments

* mv internal function to .cpp file

* export funciton print_sycl_devices(), mv class dpct definition to source file

* update CI/action for sycl code, fix CI error of repeat/dup

* fix action ID format issue

* rm unused strategy

* enable llama_f16 in ci

* fix conflict

* fix build break on MacOS, due to CI of MacOS depend on external ggml, instead of internal ggml

* fix ci cases for unsupported data type

* revert unrelated changed in cuda cmake
remove useless nommq
fix typo of GGML_USE_CLBLAS_SYCL

* revert hip cmake changes

* fix indent

* add prefix in func name

* revert no mmq

* rm cpu blas duplicate

* fix no_new_line

* fix src1->type==F16 bug.

* pass batch offset for F16 src1

* fix batch error

* fix wrong code

* revert sycl checking in test-sampling

* pass void as arguments of ggml_backend_sycl_print_sycl_devices

* remove extra blank line in test-sampling

* revert setting n_threads in sycl

* implement std::isinf for icpx with fast math.

* Update ci/run.sh

Co-authored-by: Georgi Gerganov <[email protected]>

* Update examples/sycl/run-llama2.sh

Co-authored-by: Georgi Gerganov <[email protected]>

* Update examples/sycl/run-llama2.sh

Co-authored-by: Georgi Gerganov <[email protected]>

* Update CMakeLists.txt

Co-authored-by: Georgi Gerganov <[email protected]>

* Update CMakeLists.txt

Co-authored-by: Georgi Gerganov <[email protected]>

* Update CMakeLists.txt

Co-authored-by: Georgi Gerganov <[email protected]>

* Update CMakeLists.txt

Co-authored-by: Georgi Gerganov <[email protected]>

* add copyright and MIT license declare

* update the cmd example

---------

Co-authored-by: jianyuzh <[email protected]>
Co-authored-by: luoyu-intel <[email protected]>
Co-authored-by: Meng, Hengyu <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
@ElliottDyson

For better Arc support, is there a way we can have layers offload to the GPU in chunks? You see, Intel has made it so that it doesn't allow moving chunks greater than 4GB in size at any one time.

@Jacoby1218

> For better Arc support, is there a way we can have layers offload to the GPU in chunks? […]

This was already an issue; it was mentioned and fixed in #5250 and #5270, respectively.
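
For context on what offloading "in chunks" means here, a sketch of splitting one large host-to-device transfer so that no single copy exceeds the limit; device_memcpy is a hypothetical stand-in for whatever copy primitive the backend actually uses, so this is illustrative only:

```cpp
// Illustrative sketch: break one large host->device copy into pieces that each
// stay below a per-copy/per-allocation limit. device_memcpy() is hypothetical.
#include <cstddef>

void device_memcpy(void *dst, const void *src, std::size_t n);  // hypothetical primitive

constexpr std::size_t kMaxChunk = (std::size_t{4} << 30) - 1;   // stay just below 4 GiB

void copy_in_chunks(void *dst, const void *src, std::size_t n) {
    std::size_t off = 0;
    while (off < n) {
        std::size_t len = n - off;
        if (len > kMaxChunk) len = kMaxChunk;
        device_memcpy(static_cast<char *>(dst) + off,
                      static_cast<const char *>(src) + off, len);
        off += len;
    }
}
```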

hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
Labels: high priority (Very important issue), need feedback (Testing and feedback with results are needed)
Development

Successfully merging this pull request may close these issues:

Adding Support for Intel GPUs with sycl