
[experimental] backend: add new oneDNN backend #855

Draft · wants to merge 9 commits into master
Conversation

rfsaliev

This PR is a proof of concept for integrating the oneDNN (DNNL) library into GGML.

I created this PR rather than an issue to start the discussion about a oneDNN backend from a working demo.

Motivation: oneDNN is optimized for Intel(R) Architecture Processors, Intel Graphics, and Arm* 64-bit Architecture (AArch64)-based processors. The backend will allow GGML to utilize the latest Intel CPU/GPU instruction-set performance features (e.g. AMX) out of the box.

Known issues and TODOs:

  • Functionality:

    • Only a limited set of operations is implemented - just enough to support the GPT-2 sample model.

    It would be great if a backend were able to delegate/offload unsupported operations to other backends: CPU, SYCL, OpenCL, etc.

    • This PoC supports the CPU engine only.
    • By default, the backend uses the CPU buffer type - a dedicated buffer type is under development.
  • Performance:

    • Operation fusing is not implemented - oneDNN allows fusing several operations into a single call, which significantly improves performance by reducing read/write memory access.
    • oneDNN MatMul and InnerProduct (Linear) primitives are executed in a non-optimal mode because weights are provided in a plain memory layout. To gain maximum performance, it is recommended to 'reorder' at least the weights to a blocked layout, which is efficient for memory access and AI acceleration instructions; a minimal sketch of such a reorder follows this list. See [oneDNN Memory Format Propagation](https://oneapi-src.github.io/oneDNN/page_memory_format_propagation_cpp.html).
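For illustration, a minimal sketch of such a one-time weights reorder with the oneDNN C++ API (v3.x) could look like the following. It assumes the matmul primitive descriptor was created with the weights memory descriptor set to format_tag::any so the library can pick its preferred blocked layout; the function name and caching strategy are assumptions, not code from this PR.

```cpp
// Illustrative sketch (not part of this PR): reorder plain row-major weights
// into the blocked layout preferred by a oneDNN matmul primitive descriptor
// that was created with the weights descriptor set to format_tag::any.
#include <dnnl.hpp>
using namespace dnnl;

memory reorder_weights_for_matmul(engine &eng, stream &strm,
                                  memory plain_weights,
                                  const matmul::primitive_desc &pd) {
    // If the implementation already prefers the plain layout, keep it as-is.
    if (pd.weights_desc() == plain_weights.get_desc()) {
        return plain_weights;
    }
    // Allocate a buffer with the preferred (typically blocked) layout and run
    // a one-time reorder; the result can be cached and reused for all matmuls.
    memory blocked_weights(pd.weights_desc(), eng);
    reorder(plain_weights, blocked_weights)
        .execute(strm, plain_weights, blocked_weights);
    strm.wait();
    return blocked_weights;
}
```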

@ggerganov, @slaren, can you please advise on the proper way to effectively implement operation fusing and weight pre-packing?

Some technical details:

  • Added files: ggml-dnnl.h, ggml-dnnl.cpp. The backend re-uses the CPU buffer type - a custom buffer type is under development and wrapped by the USE_DNNL_BACKEND macro.
  • CMake files are modified to support the GGML_DNNL configuration option.
  • gpt2-sched is modified to convert model weights from FP16 to FP32 if the DNNL backend is enabled - the current oneDNN release does not support MatMul cases with src_type=dst_type=f32 and weights_type=fp16.

slaren (Collaborator) commented Jun 12, 2024

Looks interesting, though if it is not possible to implement support for quantized types using oneDNN, its usefulness may be limited.

It would be great if a backend were able to delegate/offload unsupported operations to other backends: CPU, SYCL, OpenCL, etc.

This will be supported through ggml_backend_sched after ggerganov/llama.cpp#6210 is merged.

@ggerganov, @slaren, can you please advise on the proper way to effectively implement operation fusing and weight pre-packing?

Operation fusing: there isn't a common framework to implement this at the moment, but it is something that we would like to do in the future. For now, you could analyze the graph and look for opportunities to fuse multiple operations in the call to graph_compute.
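As a rough illustration of that kind of scan (a sketch only, assuming ggml's cgraph layout with n_nodes/nodes; compute_matmul_with_bias and compute_node are hypothetical helpers, not existing ggml API):

```cpp
// Sketch: detect a MatMul immediately followed by a bias ADD inside
// graph_compute, so both nodes can be handled by one fused backend call.
// A real implementation would also verify that the MatMul output has no
// other consumers before fusing.
#include "ggml.h"

static bool dnnl_is_mul_mat_add_pair(const struct ggml_cgraph * gf, int i) {
    if (i + 1 >= gf->n_nodes) {
        return false;
    }
    const struct ggml_tensor * mm  = gf->nodes[i];
    const struct ggml_tensor * add = gf->nodes[i + 1];
    return mm->op  == GGML_OP_MUL_MAT &&
           add->op == GGML_OP_ADD     &&
           (add->src[0] == mm || add->src[1] == mm);
}

// Inside graph_compute:
//   for (int i = 0; i < gf->n_nodes; i++) {
//       if (dnnl_is_mul_mat_add_pair(gf, i)) {
//           compute_matmul_with_bias(gf->nodes[i], gf->nodes[i + 1]);
//           i++; // the ADD node was consumed by the fused call
//       } else {
//           compute_node(gf->nodes[i]);
//       }
//   }
```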

Weights pre-packing: in principle it should be possible to do any transformations to the data during the call to set_tensor by creating a new buffer type. For example, the CUDA backend has a split buffer type that splits the tensors between multiple GPUs. Since this buffer type would only be used to store weights, in most cases it would be ok to leave some functionality unimplemented, such as support for creating views, or reading data back through get_tensor.
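A pre-packing buffer type's set_tensor could then look roughly like the sketch below. It assumes tensor->data points into host memory (as with the CPU buffer this PoC currently reuses); is_dnnl_packed_weight and dnnl_pack_weights are hypothetical helpers, not existing ggml or oneDNN API.

```cpp
// Rough sketch of weight pre-packing inside a custom buffer type's set_tensor.
#include <string.h>
#include "ggml-backend-impl.h"

// Hypothetical helpers (declarations only for this sketch).
bool   is_dnnl_packed_weight(const struct ggml_tensor * tensor);
void * dnnl_pack_weights(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor,
                         const void * data, size_t offset, size_t size);

static void ggml_backend_dnnl_buffer_set_tensor(ggml_backend_buffer_t buffer,
                                                struct ggml_tensor * tensor,
                                                const void * data,
                                                size_t offset, size_t size) {
    if (is_dnnl_packed_weight(tensor)) {
        // Incoming data is in ggml's plain layout; repack it once into the
        // blocked layout oneDNN prefers and keep a handle in tensor->extra.
        tensor->extra = dnnl_pack_weights(buffer, tensor, data, offset, size);
    } else {
        // Plain copy for everything else, as the CPU buffer does.
        memcpy((char *) tensor->data + offset, data, size);
    }
}
```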

WilliamTambellini (Contributor)

Very good idea @rfsaliev.
Most recent Intel CPUs support bf16.
int8 should be easy to add support for (VNNI).

rfsaliev (Author)

Thank you @slaren for your response.

Looks interesting, though if it is not possible to implement support for quantized types using oneDNN, its usefulness may be limited.

oneDNN supports at least int8 quantization. Unfortunately, oneDNN's quantization method (per-tensor or per-dimension) differs from GGML's (per-block). Anyway, I will look for opportunities to support quantization.

Operation fusing: there isn't a common framework to implement this at the moment, but it is something that we would like to do in the future. For now, you could analyze the graph and look for opportunities to fuse multiple operations in the call to graph_compute.

Thanks, it looks like it is possible to do some fusing like MatMul+BiasAdd in graph_compute (see the sketch below). IMHO, full support of graph_plan_create + graph_plan_compute would give the best opportunities for backend-side optimizations.
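For reference, a MatMul+BiasAdd pair can map onto a single oneDNN matmul primitive with a bias argument; a minimal sketch with the v3.x C++ API (the function name and argument handling are illustrative only, not code from this PR):

```cpp
// Illustrative sketch: one fused oneDNN matmul call covering MatMul + BiasAdd.
#include <unordered_map>
#include <dnnl.hpp>
using namespace dnnl;

void fused_matmul_bias(engine &eng, stream &strm,
                       const memory &src, const memory &weights,
                       const memory &bias, const memory &dst) {
    // Passing a bias descriptor makes the bias addition part of the matmul,
    // avoiding a separate elementwise pass over the output.
    matmul::primitive_desc pd(eng,
                              src.get_desc(), weights.get_desc(),
                              bias.get_desc(), dst.get_desc());
    matmul(pd).execute(strm, {{DNNL_ARG_SRC,     src},
                              {DNNL_ARG_WEIGHTS, weights},
                              {DNNL_ARG_BIAS,    bias},
                              {DNNL_ARG_DST,     dst}});
    strm.wait();
}
```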

Weights pre-packing: in principle it should be possible to do any transformations to the data during the call to set_tensor by creating a new buffer type.

In the case of oneDNN, the weights buffer layout depends on the type of operation that uses the weights. Can you please point me to a method I can follow to identify the consuming operation in the set_tensor call?
I found that buffer.init_tensor is called for every operation during model execution - should I rely on this behavior, or will it change in the future? I mean, should I expect that init_tensor will be called for e.g. all MatMul operations assigned to the backend? If yes, are there any design rules that prevent me from changing/replacing op->src with my own buffer?

set(GGML_HEADERS_DNNL ggml-dnnl.h)
set(GGML_SOURCES_DNNL ggml-dnnl.cpp)

set(GGML_EXTRA_INCS ${GGML_EXTRA_INCS} ${CLBLAST_INC} ${OPENCL_INC})
Owner

CLBLAST vars look out of place here

Author

Thank you - it was copy-pasted by mistake.
I've fixed it, along with some other parts of this file.
