[experimental] backend: add new oneDNN backend #855
base: master
Conversation
Looks interesting, though if it is not possible to implement support for quantized types using oneDNN, its usefulness may be limited.
This will be supported through:
- Operation fusing: there isn't a common framework to implement this at the moment, but it is something that we would like to do in the future. For now, you could analyze the graph and look for opportunities to fuse multiple operations in the call to
- Weights pre-packing: in principle it should be possible to do any transformations to the data during the call to
Very good idea @rfsaliev.
Thank you @slaren for your response.
oneDNN supports at least int8 quantization. Unfortunately, oneDNN's quantization methods (per-tensor or per-dimension) differ from GGML's (per-block). Anyway, I will look for opportunities to support quantization.
Thanks, it looks possible to do some fusing like MatMul+BiasAdd in
In the case of oneDNN, the weights buffer layout depends on the operation type that uses the weights. Can you please point me to a method I can follow to identify the user operation in
src/CMakeLists.txt
Outdated
set(GGML_HEADERS_DNNL ggml-dnnl.h)
set(GGML_SOURCES_DNNL ggml-dnnl.cpp)
set(GGML_EXTRA_INCS ${GGML_EXTRA_INCS} ${CLBLAST_INC} ${OPENCL_INC})
CLBLAST vars look out of place here
Thank you - it was copy-pasted by mistake.
I've fixed it, along with some other parts of this file.
Still does not work when `backend_kv = DNNL`.
examples/gpt-2-sched functionality done.
* use dnnl::inner_product_forward for 2D matrix multiplication.
* gpt-2 sample hacked to enforce FP32 weights in the case of GGML_USE_DNNL.
This PR is a proof of concept for integrating the oneDNN (DNNL) library into GGML.
I created this PR rather than an issue to start the discussion about an oneDNN backend from a working demo.
Motivation: oneDNN is optimized for Intel(R) Architecture Processors, Intel Graphics, and Arm* 64-bit Architecture (AArch64)-based processors. The backend will allow GGML to utilize the latest Intel CPU/GPU instruction-set performance features (e.g. AMX) out of the box.
Known issues and TODOs:
Functionality:
Performance:
@ggerganov, @slaren, can you please advise on a proper method to effectively implement operation fusing and weights pre-packing?
Some technical details:
- The backend is implemented in ggml-dnnl.h and ggml-dnnl.cpp. The backend re-uses the CPU buffer type; a custom buffer type is under development and is wrapped by the USE_DNNL_BACKEND macro.
- The backend is enabled with the GGML_DNNL configuration option.
- The gpt2-sched example is modified to convert model weights from FP16 to FP32 if the DNNL backend is enabled: the current oneDNN release version does not support MatMul cases with src_type=dst_type=f32 and weights_type=fp16.