Feature: Integrate with unified SYCL backend for Intel GPUs #2690

Merged Jan 28, 2024 (92 commits).

The diff below shows changes from 82 of the 92 commits.

Commits
- 7a4343d first update for migration (NeoZhangJianyu, Dec 27, 2023)
- 2338769 update init_cublas (NeoZhangJianyu, Dec 28, 2023)
- 0c00b4f add debug functio, commit all help code (NeoZhangJianyu, Dec 29, 2023)
- ff83711 step 1 (NeoZhangJianyu, Dec 29, 2023)
- 02dffb6 step 2 (NeoZhangJianyu, Dec 29, 2023)
- 43f2c35 step3 add fp16, slower 31->28 (NeoZhangJianyu, Dec 31, 2023)
- da752ed add GGML_LIST_DEVICE function (NeoZhangJianyu, Dec 31, 2023)
- 6dd3278 step 5 format device and print (NeoZhangJianyu, Dec 31, 2023)
- 3a9d2c5 step6, enhance error check, remove CUDA macro, enhance device id to f… (NeoZhangJianyu, Jan 4, 2024)
- 65f895d support main device is non-zero (NeoZhangJianyu, Jan 4, 2024)
- 3b1a743 step7 add debug for code path, rm log (NeoZhangJianyu, Jan 6, 2024)
- c2ef7a9 step 8, rename all macro & func from cuda by sycl (NeoZhangJianyu, Jan 7, 2024)
- 69d76c8 fix error of select non-zero device, format device list (NeoZhangJianyu, Jan 8, 2024)
- c709c3c ren ggml-sycl.hpp -> ggml-sycl.h (NeoZhangJianyu, Jan 9, 2024)
- fa3a586 clear CMAKE to rm unused lib and options (NeoZhangJianyu, Jan 9, 2024)
- 3645f25 correct queue: rm dtct:get_queue (NeoZhangJianyu, Jan 10, 2024)
- bd38129 add print tensor function to debug (NeoZhangJianyu, Jan 12, 2024)
- 5b53899 fix error: wrong result in 658746bb26702e50f2c59c0e4ada8e9da6010481 (NeoZhangJianyu, Jan 13, 2024)
- a47f5ec summary dpct definition in one header file to replace folder:dpct (NeoZhangJianyu, Jan 13, 2024)
- c67c2ab refactor device log (NeoZhangJianyu, Jan 13, 2024)
- c3c5b20 mv dpct definition from folder dpct to ggml-sycl.h (NeoZhangJianyu, Jan 15, 2024)
- ca2cb69 update readme, refactor build script (NeoZhangJianyu, Jan 15, 2024)
- 95daece fix build with sycl (NeoZhangJianyu, Jan 15, 2024)
- a8936f4 set nthread=1 when sycl, increase performance (NeoZhangJianyu, Jan 15, 2024)
- 79d30d7 add run script, comment debug code (NeoZhangJianyu, Jan 15, 2024)
- 0d6e721 add ls-sycl-device tool (NeoZhangJianyu, Jan 15, 2024)
- 7350fd4 add ls-sycl-device, rm unused files (NeoZhangJianyu, Jan 15, 2024)
- 09b5619 rm rear space (NeoZhangJianyu, Jan 15, 2024)
- d80dd65 dos2unix (NeoZhangJianyu, Jan 15, 2024)
- 593ce00 Update README_sycl.md (NeoZhangJianyu, Jan 18, 2024)
- 57e9fba fix return type (luoyu-intel, Jan 18, 2024)
- d5f7d36 remove sycl version from include path (luoyu-intel, Jan 18, 2024)
- 35a0daa restore rm code to fix hang issue (NeoZhangJianyu, Jan 18, 2024)
- ae941b1 add syc and link for sycl readme (NeoZhangJianyu, Jan 19, 2024)
- e3481fa rm original sycl code before refactor (NeoZhangJianyu, Jan 19, 2024)
- 623d803 fix code err (luoyu-intel, Jan 19, 2024)
- f396a3b add know issue for pvc hang issue (NeoZhangJianyu, Jan 20, 2024)
- f008cc7 enable SYCL_F16 support (luoyu-intel, Jan 22, 2024)
- 67e6b3c align pr4766 (airMeng, Jan 23, 2024)
- 533c647 check for sycl blas, better performance (NeoZhangJianyu, Jan 23, 2024)
- dd7f139 cleanup 1 (abhilash1910, Jan 23, 2024)
- b403784 remove extra endif (airMeng, Jan 23, 2024)
- a0a1304 add build&run script, clean CMakefile, update guide by review comments (NeoZhangJianyu, Jan 23, 2024)
- 27c08c0 Merge branch 'sycl' of https://github.com/abhilash1910/llama.cpp into… (NeoZhangJianyu, Jan 23, 2024)
- 97cbe18 rename macro to intel hardware (NeoZhangJianyu, Jan 23, 2024)
- 1ddaf44 editor config format (abhilash1910, Jan 23, 2024)
- bd716b2 format fixes (abhilash1910, Jan 23, 2024)
- be31379 format fixes (abhilash1910, Jan 23, 2024)
- d097e2a editor format fix (abhilash1910, Jan 23, 2024)
- 88f64b7 Remove unused headers (abhilash1910, Jan 23, 2024)
- 756c4ac skip build sycl tool for other code path (NeoZhangJianyu, Jan 23, 2024)
- b42a32d replace tab by space (NeoZhangJianyu, Jan 23, 2024)
- 5f83a12 fix blas matmul function (abhilash1910, Jan 23, 2024)
- d6fc1a0 fix mac build (abhilash1910, Jan 23, 2024)
- c7e745e restore hip dependency (abhilash1910, Jan 23, 2024)
- 3bfb846 fix conflict (NeoZhangJianyu, Jan 23, 2024)
- 498121b ren as review comments (NeoZhangJianyu, Jan 24, 2024)
- 91b1461 mv internal function to .cpp file (NeoZhangJianyu, Jan 24, 2024)
- 816f480 export funciton print_sycl_devices(), mv class dpct definition to sou… (NeoZhangJianyu, Jan 24, 2024)
- 7a44a95 update CI/action for sycl code, fix CI error of repeat/dup (NeoZhangJianyu, Jan 24, 2024)
- 7babd76 fix action ID format issue (NeoZhangJianyu, Jan 24, 2024)
- 04a46c4 rm unused strategy (NeoZhangJianyu, Jan 24, 2024)
- 799af05 enable llama_f16 in ci (airMeng, Jan 24, 2024)
- ec5c8bc fix conflict (NeoZhangJianyu, Jan 24, 2024)
- 22e1b45 fix build break on MacOS, due to CI of MacOS depend on external ggml,… (NeoZhangJianyu, Jan 24, 2024)
- 238ec31 Merge branch 'master' into sycl (abhilash1910, Jan 24, 2024)
- 67de350 fix ci cases for unsupported data type (NeoZhangJianyu, Jan 24, 2024)
- fb15de3 revert unrelated changed in cuda cmake (airMeng, Jan 24, 2024)
- 96186a7 revert hip cmake changes (airMeng, Jan 24, 2024)
- d07a88d fix indent (airMeng, Jan 24, 2024)
- 8dd1b60 add prefix in func name (NeoZhangJianyu, Jan 24, 2024)
- 3aabd8a revert no mmq (airMeng, Jan 24, 2024)
- 18742f7 rm cpu blas duplicate (abhilash1910, Jan 24, 2024)
- 0e235fb fix no_new_line (airMeng, Jan 24, 2024)
- 5600118 fix src1->type==F16 bug. (luoyu-intel, Jan 24, 2024)
- eef5faa pass batch offset for F16 src1 (luoyu-intel, Jan 24, 2024)
- 5bb93d4 fix batch error (luoyu-intel, Jan 24, 2024)
- 0635f84 fix wrong code (luoyu-intel, Jan 24, 2024)
- f1bab50 revert sycl checking in test-sampling (airMeng, Jan 25, 2024)
- 66e24c2 pass void as arguments of ggml_backend_sycl_print_sycl_devices (airMeng, Jan 25, 2024)
- b06dca6 remove extra blank line in test-sampling (airMeng, Jan 25, 2024)
- 05b7f9b revert setting n_threads in sycl (airMeng, Jan 25, 2024)
- d6a6505 implement std::isinf for icpx with fast math. (luoyu-intel, Jan 26, 2024)
- 174c9a0 Update ci/run.sh (abhilash1910, Jan 26, 2024)
- c08fec2 Update examples/sycl/run-llama2.sh (abhilash1910, Jan 26, 2024)
- 2cba564 Update examples/sycl/run-llama2.sh (abhilash1910, Jan 26, 2024)
- f707051 Update CMakeLists.txt (abhilash1910, Jan 26, 2024)
- 45b0618 Update CMakeLists.txt (abhilash1910, Jan 26, 2024)
- 5531754 Update CMakeLists.txt (abhilash1910, Jan 26, 2024)
- b9ffaab Update CMakeLists.txt (abhilash1910, Jan 26, 2024)
- 2ab9715 add copyright and MIT license declare (NeoZhangJianyu, Jan 26, 2024)
- d394ca7 update the cmd example (NeoZhangJianyu, Jan 28, 2024)
41 changes: 41 additions & 0 deletions .github/workflows/build.yml

@@ -143,6 +143,47 @@ jobs:
        cd build
        ctest --verbose

  ubuntu-22-cmake-sycl:
    runs-on: ubuntu-22.04

    continue-on-error: true

    steps:
      - uses: actions/checkout@v2

      - name: add oneAPI to apt
        shell: bash
        run: |
          cd /tmp
          wget https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
          sudo apt-key add GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
          rm GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
          sudo add-apt-repository "deb https://apt.repos.intel.com/oneapi all main"

      - name: install oneAPI dpcpp compiler
        shell: bash
        run: |
          sudo apt update
          sudo apt install intel-oneapi-compiler-dpcpp-cpp

      - name: install oneAPI MKL library
        shell: bash
        run: |
          sudo apt install intel-oneapi-mkl-devel

      - name: Clone
        id: checkout
        uses: actions/checkout@v3

      - name: Build
        id: cmake_build
        run: |
          source /opt/intel/oneapi/setvars.sh
          mkdir build
          cd build
          cmake -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx ..
          cmake --build . --config Release -j $(nproc)

  # TODO: build with LLAMA_NO_METAL because test-backend-ops fail on "Apple Paravirtual device" and I don't know
  #       how to debug it.
  #       ref: https://github.com/ggerganov/llama.cpp/actions/runs/7131777249/job/19420981052#step:5:1124
49 changes: 44 additions & 5 deletions CMakeLists.txt

@@ -1,5 +1,6 @@
cmake_minimum_required(VERSION 3.14) # for add_link_options and implicit target directories.
project("llama.cpp" C CXX)
include(CheckIncludeFileCXX)

set(CMAKE_EXPORT_COMPILE_COMMANDS ON)

@@ -103,6 +104,8 @@ option(LLAMA_METAL_NDEBUG "llama: disable Metal debugging"
option(LLAMA_METAL_SHADER_DEBUG "llama: compile Metal with -fno-fast-math" OFF)
option(LLAMA_MPI "llama: use MPI" OFF)
option(LLAMA_QKK_64 "llama: use super-block size of 64 for k-quants" OFF)
option(LLAMA_SYCL "llama: use SYCL" OFF)
option(LLAMA_SYCL_F16 "llama: use 16 bit floats for sycl calculations" OFF)

option(LLAMA_BUILD_TESTS "llama: build tests" ${LLAMA_STANDALONE})
option(LLAMA_BUILD_EXAMPLES "llama: build examples" ${LLAMA_STANDALONE})
@@ -121,8 +124,12 @@ include(${CMAKE_CURRENT_SOURCE_DIR}/scripts/build-info.cmake)
#
# Compile flags
#
if (LLAMA_SYCL)
set(CMAKE_CXX_STANDARD 17)
Owner:

How deep is the C++17 dependency in the SYCL backend?

It's okay to optionally include it like this, but I'm wondering if it is realistic to implement this in C++11 at some point - it would be in better harmony with the rest of the codebase.

Collaborator (Author):

Actually, the icpx compiler expects the C++17 standard, and SYCL depends on that version. We considered keeping C++11 here as well, but it causes compilation errors due to dependencies on C++17 headers.
@NeoZhangJianyu, @AidanBeltonS and others can add on this.

Contributor:

Just to add a bit more info, it is not just that the SYCL compiler, icpx, expects C++17. C++17 is a core aspect within the SYCL open standard. Any SYCL2020 code is expected to be C++17 conformant, so the relationship is deeper than just the specific implementation of the Khronos specification. I would say the dependency between SYCL and C++17 is hard, and it would likely not work well if SYCL specific features were compiled with C++11.

From the spec: https://registry.khronos.org/SYCL/specs/sycl-2020/pdf/sycl-2020.pdf
The SYCL specification is now based on the core language of C++17, as described in Section 3.9.1. Features of C++17 are now enabled within the specification, such as deduction guides for class template argument deduction
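
To see the dependency in code, here is a minimal SYCL 2020 sketch (standard API only; the kernel itself is illustrative and not taken from this PR). Both `sycl::buffer buf(data)` and `sycl::accessor acc(buf, h, ...)` rely on class template argument deduction, the C++17 feature cited above, so this cannot compile as C++11:

```cpp
// Minimal SYCL 2020 example that requires C++17.
#include <sycl/sycl.hpp>
#include <vector>

int main() {
    std::vector<float> data(1024, 1.0f);

    sycl::queue q;           // default device selection
    sycl::buffer buf(data);  // CTAD deduces sycl::buffer<float, 1>

    q.submit([&](sycl::handler &h) {
        // CTAD plus the access-mode tag deduce the accessor type
        sycl::accessor acc(buf, h, sycl::read_write);
        h.parallel_for(sycl::range<1>(data.size()), [=](sycl::id<1> i) {
            acc[i] *= 2.0f;  // simple elementwise kernel
        });
    }).wait();

    return 0;
}
```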

else()
set(CMAKE_CXX_STANDARD 11)
endif()

set(CMAKE_CXX_STANDARD 11)
set(CMAKE_CXX_STANDARD_REQUIRED true)
set(CMAKE_C_STANDARD 11)
set(CMAKE_C_STANDARD_REQUIRED true)
@@ -454,6 +461,35 @@ if (LLAMA_HIPBLAS)
endif()
endif()


if (LLAMA_SYCL)
    if (NOT DEFINED ENV{ONEAPI_ROOT})
        message(FATAL_ERROR "Not detect ENV {ONEAPI_ROOT}, please install oneAPI & source it, like: source /opt/intel/oneapi/setvars.sh")
    endif()
    #todo: AOT

    find_package(IntelSYCL REQUIRED)
    if (LLAMA_SYCL_F16)
        add_compile_definitions(GGML_SYCL_F16)
    endif()
    add_compile_definitions(GGML_USE_SYCL)

    add_compile_options(-I./) #include DPCT
    add_compile_options(-I/${SYCL_INCLUDE_DIR})

    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wno-narrowing")
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3")
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsycl -L${MKLROOT}/lib")

    set(GGML_HEADERS_SYCL ggml.h ggml-sycl.h)
    set(GGML_SOURCES_SYCL ggml-sycl.cpp)

    set(LLAMA_EXTRA_LIBS ${LLAMA_EXTRA_LIBS} sycl OpenCL mkl_core pthread m dl mkl_sycl_blas mkl_intel_ilp64 mkl_tbb_thread)
endif()
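
For context on the `LLAMA_SYCL_F16` option and the `GGML_SYCL_F16` definition it sets, here is a hedged sketch of how such a compile definition can gate half-precision math in a SYCL backend; the actual ggml-sycl.cpp internals are not reproduced here:

```cpp
// Hedged sketch: compile-time selection of the compute type,
// assuming a GGML_SYCL_F16-style define (illustrative only).
#include <sycl/sycl.hpp>

#ifdef GGML_SYCL_F16
using compute_t = sycl::half;  // 16-bit floats when LLAMA_SYCL_F16=ON
#else
using compute_t = float;       // default: 32-bit floats
#endif

// Every kernel written against compute_t then switches precision
// with the build flag, without touching the kernel body.
compute_t scale(compute_t x) { return x * compute_t(0.5f); }
```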



function(get_flags CCID CCVER)
set(C_FLAGS "")
set(CXX_FLAGS "")
@@ -479,10 +515,12 @@ function(get_flags CCID CCVER)
set(CXX_FLAGS ${CXX_FLAGS} -Wextra-semi)
endif()
elseif (CCID MATCHES "Intel")
# enable max optimization level when using Intel compiler
set(C_FLAGS -ipo -O3 -static -fp-model=fast -flto -fno-stack-protector)
set(CXX_FLAGS -ipo -O3 -static -fp-model=fast -flto -fno-stack-protector)
add_link_options(-fuse-ld=lld -static-intel)
if (NOT LLAMA_SYCL)
# enable max optimization level when using Intel compiler
set(C_FLAGS -ipo -O3 -static -fp-model=fast -flto -fno-stack-protector)
set(CXX_FLAGS -ipo -O3 -static -fp-model=fast -flto -fno-stack-protector)
add_link_options(-fuse-ld=lld -static-intel)
endif()
endif()

set(GF_C_FLAGS ${C_FLAGS} PARENT_SCOPE)
@@ -795,6 +833,7 @@ add_library(ggml OBJECT
${GGML_SOURCES_METAL} ${GGML_HEADERS_METAL}
${GGML_SOURCES_MPI} ${GGML_HEADERS_MPI}
${GGML_SOURCES_EXTRA} ${GGML_HEADERS_EXTRA}
${GGML_SOURCES_SYCL} ${GGML_HEADERS_SYCL}
)

target_include_directories(ggml PUBLIC . ${LLAMA_EXTRA_INCLUDES})
11 changes: 10 additions & 1 deletion README.md

@@ -63,7 +63,7 @@ The main goal of `llama.cpp` is to run the LLaMA model using 4-bit integer quant
- AVX, AVX2 and AVX512 support for x86 architectures
- Mixed F16 / F32 precision
- 2-bit, 3-bit, 4-bit, 5-bit, 6-bit and 8-bit integer quantization support
- CUDA, Metal and OpenCL GPU backend support
- CUDA, Metal, OpenCL, SYCL GPU backend support

The original implementation of `llama.cpp` was [hacked in an evening](https://github.com/ggerganov/llama.cpp/issues/33#issuecomment-1465108022).
Since then, the project has improved significantly thanks to many contributions. This project is mainly for educational purposes and serves
@@ -597,6 +597,15 @@ Building the program with BLAS support may lead to some performance improvements

You can get a list of platforms and devices from the `clinfo -l` command, etc.

- #### SYCL

SYCL is a higher-level programming model that improves programming productivity across various hardware accelerators.

The SYCL build of llama.cpp supports Intel GPUs (Data Center Max series, Flex series, Arc series, built-in GPU and iGPU).

For detailed info, please refer to [llama.cpp for SYCL](README_sycl.md).
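
The PR also adds an `ls-sycl-device` tool (see the commit list above) for inspecting available devices. As a hedged sketch, a device listing can be built on the standard SYCL 2020 API as follows; the output format is illustrative, not the tool's exact output:

```cpp
// Enumerate all SYCL platforms and their devices (standard SYCL 2020 API).
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    int id = 0;
    for (const auto &platform : sycl::platform::get_platforms()) {
        std::cout << "Platform: "
                  << platform.get_info<sycl::info::platform::name>() << '\n';
        for (const auto &device : platform.get_devices()) {
            std::cout << "  [" << id++ << "] "
                      << device.get_info<sycl::info::device::name>()
                      << " | compute units: "
                      << device.get_info<sycl::info::device::max_compute_units>()
                      << '\n';
        }
    }
    return 0;
}
```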


### Prepare Data & Run
