
[experimental] backend: add new oneDNN backend #855

Draft · wants to merge 9 commits into master
Conversation

rfsaliev

This PR is a proof of concept for integrating the oneDNN (DNNL) library into GGML.

I created this PR rather than an issue to start the discussion about a oneDNN backend from a working demo.

Motivation: oneDNN is optimized for Intel(R) Architecture Processors, Intel Graphics, and Arm* 64-bit Architecture (AArch64)-based processors. The backend will allow GGML to utilize the latest Intel CPU/GPU instruction-set performance features (e.g. AMX) out of the box.

Known issues and TODOs:

  • Functionality:

    • Only a limited set of operations is implemented - just enough to support the GPT-2 sample model.

    It would be great if a backend were able to delegate/offload unsupported operations to other backends: CPU, SYCL, OpenCL, etc.

    • This PoC supports the CPU engine only.
    • By default, the backend uses the CPU buffer type - a dedicated buffer type is under development.
  • Performance:

    • Operation fusing is not implemented - oneDNN allows fusing several operations into a single call, which significantly improves performance by reducing read/write memory access.
    • oneDNN MatMul and InnerProduct (Linear) primitives are executed in a non-optimal mode because weights are provided in a plain memory layout. To gain maximum performance, it is recommended to 'reorder' at least the weights to a blocked layout, which is efficient for memory access and AI acceleration instructions; a minimal sketch of such a reorder follows this list. See [oneDNN Memory Format Propagation](https://oneapi-src.github.io/oneDNN/page_memory_format_propagation_cpp.html).
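For illustration, a minimal sketch of such a one-time weights reorder with the oneDNN C++ API (v3.x) could look like the following. It assumes the matmul primitive descriptor was created with the weights memory descriptor set to format_tag::any so the library can pick its preferred blocked layout; the function name and caching strategy are assumptions, not code from this PR.

```cpp
// Illustrative sketch (not part of this PR): reorder plain row-major weights
// into the blocked layout preferred by a oneDNN matmul primitive descriptor
// that was created with the weights descriptor set to format_tag::any.
#include <dnnl.hpp>
using namespace dnnl;

memory reorder_weights_for_matmul(engine &eng, stream &strm,
                                  memory plain_weights,
                                  const matmul::primitive_desc &pd) {
    // If the implementation already prefers the plain layout, keep it as-is.
    if (pd.weights_desc() == plain_weights.get_desc()) {
        return plain_weights;
    }
    // Allocate a buffer with the preferred (typically blocked) layout and run
    // a one-time reorder; the result can be cached and reused for all matmuls.
    memory blocked_weights(pd.weights_desc(), eng);
    reorder(plain_weights, blocked_weights)
        .execute(strm, plain_weights, blocked_weights);
    strm.wait();
    return blocked_weights;
}
```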

@ggerganov, @slaren, can you please advise on the proper way to effectively implement operation fusing and weight pre-packing?

Some technical details:

  • Added files: ggml-dnnl.h, ggml-dnnl.cpp. The backend re-uses the CPU buffer type - a custom buffer type is under development and wrapped by the USE_DNNL_BACKEND macro.
  • CMake files are modified to support the GGML_DNNL configuration option.
  • gpt2-sched is modified to convert model weights from FP16 to FP32 if the DNNL backend is enabled - the current oneDNN release does not support MatMul cases with src_type=dst_type=f32 and weights_type=fp16.

slaren (Collaborator) commented Jun 12, 2024

Looks interesting, though if it is not possible to implement support for quantized types using oneDNN, its usefulness may be limited.

It would be great if a backend were able to delegate/offload unsupported operations to other backends: CPU, SYCL, OpenCL, etc.

This will be supported through ggml_backend_sched after ggerganov/llama.cpp#6210 is merged.

@ggerganov, @slaren, can you please advise on the proper way to effectively implement operation fusing and weight pre-packing?

Operation fusing: there isn't a common framework to implement this at the moment, but it is something that we would like to do in the future. For now, you could analyze the graph and look for opportunities to fuse multiple operations in the call to graph_compute.
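As a rough illustration of that kind of scan (a sketch only, assuming ggml's cgraph layout with n_nodes/nodes; compute_matmul_with_bias and compute_node are hypothetical helpers, not existing ggml API):

```cpp
// Sketch: detect a MatMul immediately followed by a bias ADD inside
// graph_compute, so both nodes can be handled by one fused backend call.
// A real implementation would also verify that the MatMul output has no
// other consumers before fusing.
#include "ggml.h"

static bool dnnl_is_mul_mat_add_pair(const struct ggml_cgraph * gf, int i) {
    if (i + 1 >= gf->n_nodes) {
        return false;
    }
    const struct ggml_tensor * mm  = gf->nodes[i];
    const struct ggml_tensor * add = gf->nodes[i + 1];
    return mm->op  == GGML_OP_MUL_MAT &&
           add->op == GGML_OP_ADD     &&
           (add->src[0] == mm || add->src[1] == mm);
}

// Inside graph_compute:
//   for (int i = 0; i < gf->n_nodes; i++) {
//       if (dnnl_is_mul_mat_add_pair(gf, i)) {
//           compute_matmul_with_bias(gf->nodes[i], gf->nodes[i + 1]);
//           i++; // the ADD node was consumed by the fused call
//       } else {
//           compute_node(gf->nodes[i]);
//       }
//   }
```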

Weights pre-packing: in principle it should be possible to do any transformations to the data during the call to set_tensor by creating a new buffer type. For example, the CUDA backend has a split buffer type that splits the tensors between multiple GPUs. Since this buffer type would only be used to store weights, in most cases it would be ok to leave some functionality unimplemented, such as support for creating views, or reading data back through get_tensor.
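A pre-packing buffer type's set_tensor could then look roughly like the sketch below. It assumes tensor->data points into host memory (as with the CPU buffer this PoC currently reuses); is_dnnl_packed_weight and dnnl_pack_weights are hypothetical helpers, not existing ggml or oneDNN API.

```cpp
// Rough sketch of weight pre-packing inside a custom buffer type's set_tensor.
#include <string.h>
#include "ggml-backend-impl.h"

// Hypothetical helpers (declarations only for this sketch).
bool   is_dnnl_packed_weight(const struct ggml_tensor * tensor);
void * dnnl_pack_weights(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor,
                         const void * data, size_t offset, size_t size);

static void ggml_backend_dnnl_buffer_set_tensor(ggml_backend_buffer_t buffer,
                                                struct ggml_tensor * tensor,
                                                const void * data,
                                                size_t offset, size_t size) {
    if (is_dnnl_packed_weight(tensor)) {
        // Incoming data is in ggml's plain layout; repack it once into the
        // blocked layout oneDNN prefers and keep a handle in tensor->extra.
        tensor->extra = dnnl_pack_weights(buffer, tensor, data, offset, size);
    } else {
        // Plain copy for everything else, as the CPU buffer does.
        memcpy((char *) tensor->data + offset, data, size);
    }
}
```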

WilliamTambellini (Contributor)

Very good idea @rfsaliev.
Most recent Intel CPUs support bf16.
int8 should be easy to add support for (VNNI).

rfsaliev (Author)

Thank you @slaren for your response.

Looks interesting, though if it is not possible to implement support for quantized types using oneDNN, its usefulness may be limited.

oneDNN supports at least int8 quantization. Unfortunately, oneDNN's quantization method (per-tensor or per-dimension) differs from GGML's (per-block). Anyway, I will look for opportunities to support quantization.

Operation fusing: there isn't a common framework to implement this at the moment, but it is something that we would like to do in the future. For now, you could analyze the graph and look for opportunities to fuse multiple operations in the call to graph_compute.

Thanks, it looks like it is possible to do some fusing like MatMul+BiasAdd in graph_compute (see the sketch below). IMHO, full support of graph_plan_create + graph_plan_compute would give the best opportunities for backend-side optimizations.
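For reference, a MatMul+BiasAdd pair can map onto a single oneDNN matmul primitive with a bias argument; a minimal sketch with the v3.x C++ API (the function name and argument handling are illustrative only, not code from this PR):

```cpp
// Illustrative sketch: one fused oneDNN matmul call covering MatMul + BiasAdd.
#include <unordered_map>
#include <dnnl.hpp>
using namespace dnnl;

void fused_matmul_bias(engine &eng, stream &strm,
                       const memory &src, const memory &weights,
                       const memory &bias, const memory &dst) {
    // Passing a bias descriptor makes the bias addition part of the matmul,
    // avoiding a separate elementwise pass over the output.
    matmul::primitive_desc pd(eng,
                              src.get_desc(), weights.get_desc(),
                              bias.get_desc(), dst.get_desc());
    matmul(pd).execute(strm, {{DNNL_ARG_SRC,     src},
                              {DNNL_ARG_WEIGHTS, weights},
                              {DNNL_ARG_BIAS,    bias},
                              {DNNL_ARG_DST,     dst}});
    strm.wait();
}
```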

Weights pre-packing: in principle it should be possible to do any transformations to the data during the call to set_tensor by creating a new buffer type.

In the case of oneDNN, the weights buffer layout depends on the type of operation that uses the weights. Can you please point me to a method I can follow to identify the consuming operation in the set_tensor call?
I found that buffer.init_tensor is called for every operation during model execution - should I rely on this behavior, or will it change in the future? I mean, should I expect that init_tensor will be called for e.g. all MatMul operations assigned to the backend? If yes, are there any design rules that prevent me from changing/replacing op->src with my own buffer?

set(GGML_HEADERS_DNNL ggml-dnnl.h)
set(GGML_SOURCES_DNNL ggml-dnnl.cpp)

set(GGML_EXTRA_INCS ${GGML_EXTRA_INCS} ${CLBLAST_INC} ${OPENCL_INC})
Owner

CLBLAST vars look out of place here

Author

Thank you - it was copy-pasted by mistake.
I've fixed it, along with some other parts of this file.
