rfc: proposal for block level APIs #1852

Open · wants to merge 7 commits into base: rfcs
Conversation

@mgouicem (Contributor) commented Apr 1, 2024:

Proposal to expose block-level APIs on CPU, namely brgemm and transforms.

Rendered version

@vpirogov added the RFC (A design document) label on Apr 1, 2024.
@jgong5 left a comment:

Can you also add sample code that uses the APIs to demonstrate the usage flow?

Comment on lines 169 to 171
> Given the small number of layouts, we will expose a layout tag enum
> (next section will address the lack of need for metadata handling by
> oneDNN brgemm).
Comment:

Will these layouts be well documented and exposed publicly?

@mgouicem (author) replied:

Yes, those are part of the external API. Note that those layouts only account for the packing required by the HW (so packed along K in 32-bit or 64-bit groups). They do not account for blocking sizes like oneDNN primitives do.
For best performance, the brgemm user can apply the same brgemm blocking to weights packing.

Comment:

Yes, the blocking will be handled on the caller side. Meanwhile, since it is defined by the HW architecture, the layout can be well defined and documented to align with the HW spec.

> The recommendation as a starting point is to make query mandatory: no
> layout forcing on user side. This simplifies the configurations
> supported by brgemm. Given that we have no support internally for
> transa/tranb, those will not be exposed as well. We will also support […]
Comment:

Do you mean no support for transa/transb in the layout conversion API or in the BRGEMM API? It would be beneficial to have transa/transb (or only transa when B is VNNI) supported in the BRGEMM API.

@mgouicem (author) replied:

The conversion API will have support for transpose, but not the BRGEMM API.
The main reason is that there is no support for fused transposition in the current implementation. We could use that information to generate different kernels internally, but in practice it will likely not be faster than doing the transposition into a temporary buffer before calling brgemm.
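
To illustrate, a minimal sketch of that transpose-then-brgemm pattern against the `transform` and `brgemm` interfaces from this proposal (datatype arguments, the encoding of the transposed source, and all shapes/buffers are assumptions for illustration):

```cpp
// Assumed defined by the caller: dim_t M, N, K; void *A_src, *B_src, *C_buf, *scratch.
// B arrives transposed; the conversion API is said to support transposition,
// so it lands B in the packed layout brgemm expects, via a temporary buffer.
std::vector<char> B_tmp(static_cast<size_t>(K) * N * sizeof(uint16_t)); // bf16 storage

transform trans(K, N, data_type::bf16,
        pack_type::plain, /*ld_src=*/K,       // transposed source view (assumed encoding)
        pack_type::packed_32, /*ld_dst=*/N);  // HW-mandated packing for VNNI/AMX
trans.execute(B_src, B_tmp.data());

brgemm brg(/*batch=*/1, M, N, K,
        data_type::bf16, data_type::bf16, data_type::f32); // dtypes assumed
brg.execute({{A_src, B_tmp.data()}}, C_buf, scratch);
```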

@jgong5 commented (Apr 7, 2024):

> The conversion API will have support for transpose, but not the BRGEMM API. The main reason is that there is no support for fused transposition in the current implementation. We could use that information to generate different kernels internally, but in practice it will likely not be faster than doing the transposition into a temporary buffer before calling brgemm.

Would it be more flexible to have the transposition flags supported at the API level? BTW, in scaled-dot-product-attention, the first bmm has B transposed.

@mgouicem (author) replied:

Currently, we have no support whatsoever in the implementation for fusing transposition, which is why the proposal does not include it.

If there is value in adding this feature in the future, we would extend the API to support that.


> Note that current oneDNN implementation just takes ISA as a maximum
> ISA. To simplify the programming model, isa selection will not appear
> in block level APIs. However, `dnnl::set_max_cpu_isa` will still be […]
Comment:

In this context, if there is no ISA setting at the BRGEMM level, I don't think we will use this global API to configure things. I assume oneDNN would make a wise decision to choose the right ISA given the input shapes, is that correct?

@mgouicem (author) replied:

Yes, the plan is to pick the best available ISA under the hood. If ISA control is needed (there are some corner cases), that global API can be used.
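
For reference, the global knob mentioned here is a shipped oneDNN API; a minimal sketch of the corner-case usage (the specific ISA value is illustrative):

```cpp
#include "oneapi/dnnl/dnnl.hpp"

int main() {
    // Corner case: cap JIT generation at AVX-512 even on AMX-capable parts.
    // This must run before the first kernel is generated to take effect.
    dnnl::set_max_cpu_isa(dnnl::cpu_isa::avx512_core);
    // ... create and execute brgemm objects as usual ...
    return 0;
}
```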

A contributor commented:

Using the global API will cause problems with parallel compilation, e.g. one thread wants to generate an AVX512 brgemm while another thread wants to generate an AMX brgemm. I suggest having another way to specify the ISA per brgemm creation.

Comment:

> Yes, the plan is to pick the best available ISA under the hood. If ISA control is needed (there are some corner cases), that global API can be used.

What do you mean by "best available ISA"? If AVX is a better choice for a particular shape than AMX, will you choose AVX or AMX?

@mgouicem (author) replied:

@ZhennanQin do you have a usage scenario for using different ISAs in a single process?

@jgong5 best available means the highest-performing ISA on the platform that is creating that brgemm object (e.g. AMX > AVX512 > AVX2, ...; more details here).

In the current proposal, we cannot use the shape to select the ISA, since ISA selection can impact data layout (e.g. we currently might not require a 32-bit packed/VNNI layout when a lower ISA requires emulation).
We could make the layout independent of the ISA, but this would require changes to the current oneDNN kernel-generation logic (or disabling some datatype/ISA combinations).

Comment:

> In the current proposal, we cannot use the shape to select the ISA, since ISA selection can impact data layout (e.g. we currently might not require a 32-bit packed/VNNI layout when a lower ISA requires emulation).
> We could make the layout independent of the ISA, but this would require changes to the current oneDNN kernel-generation logic (or disabling some datatype/ISA combinations).

Suppose we want to compute with int8 (a typical case is WOQ int4 with int8 compute). INT8 VNNI might work better than INT8 AMX for smaller shapes, while both would have the same VNNI4 weight layout. How do you decide the ISA in this case?

A contributor replied:

> @ZhennanQin do you have a usage scenario for using different ISAs in a single process?

A valid usage scenario: if the input shape is not divisible by the tiling size, we may use a small brgemm to handle the tail. This small brgemm only executes once, and AVX512-VNNI may be faster than AMX considering the tile-reconfiguration overhead. Then, for this matmul, both AMX and AVX512-VNNI brgemms are used.
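
A hedged sketch of that main-tile/tail-tile split against the proposed interface (shapes and constructor datatypes are illustrative assumptions; note the proposal offers no per-object ISA knob, which is the point being raised):

```cpp
dim_t M = 70, N = 64, K = 512, BM = 32;
dim_t M_tail = M % BM; // 6 leftover rows

brgemm brg_main(/*batch=*/1, BM, N, K,
        data_type::bf16, data_type::bf16, data_type::f32);
brgemm brg_tail(/*batch=*/1, M_tail, N, K,
        data_type::bf16, data_type::bf16, data_type::f32);
// Ideally the library would JIT brg_main for AMX and brg_tail for
// AVX512-VNNI; with only a process-global ISA control, that mix cannot
// be requested per object.
```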

@mgouicem (author) replied:

@jgong5 in the case you provide, there is no implication of ISA wrt layout. Currently, ISA has layout implications mostly when f16 emulation happens. But there are other scenarios where some architectures have different instructions for matrix-multiplication acceleration that require a different layout (e.g. bfmmla and SME on Arm).

> How do you decide the ISA in this case?

We just use the highest available ISA for this datatype, independently of shape. See here for the exact logic.

> Then, for this matmul, both AMX and AVX512-VNNI brgemms are used.

Makes sense, but why would you create those in separate threads?

A contributor replied:

> Makes sense, but why would you create those in separate threads?

MLIR supports multi-threaded compilation; we can't guarantee whether the brgemm creations happen within a single thread or in separate threads.

Comment on lines +265 to +266
> We also have internal attributes to control prefetch hints, prefetch
> distances and non-temporal stores. However those are not yet fully […]
Comment:

Will you consider exposing these in the future? These knobs give us flexibility to further tune the performance.

@mgouicem (author) replied (Apr 5, 2024):

So far, there is no plan to expose those, since there is no support internally.
The only prefetching we do internally is to prefetch the destination buffer (matrix C) at the beginning of the accumulation loop so that it is ready to accumulate results into. And this can be enabled under the hood, without exposing an API.

For other prefetching, we would have to assess potential speedups and use cases before considering exposing them.

Comment:

How about temporal/non-temporal loads/stores?

@mgouicem (author) replied:

Same: there is no support for temporal/non-temporal loads/stores as of now.
But that can be considered at a later time depending on the potential upside.



> To make a first version of the API publicly available faster,
> we will start by relying on existing dnnl::primitive_attr objects.
Comment:

I guess starting with a separate attribute abstraction would leave more flexibility to revise in the future?

@mgouicem (author) replied:

New attributes would require new entry points, testing, etc., which would delay the first release of the feature; hence the suggestion to start with the existing primitive attributes for the first iteration, and then migrate to dedicated attributes.
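
For context, a hedged sketch of what reusing the existing attributes could look like (`dnnl::primitive_attr` and `dnnl::post_ops` are shipped oneDNN APIs; passing the attribute into the brgemm constructor is an assumption of this proposal):

```cpp
#include "oneapi/dnnl/dnnl.hpp"

dnnl::primitive_attr make_brgemm_attr() {
    // Shipped attribute machinery, reused as-is for the first iteration.
    dnnl::post_ops po;
    po.append_eltwise(dnnl::algorithm::eltwise_relu, 0.f, 0.f);

    dnnl::primitive_attr attr;
    attr.set_post_ops(po);
    attr.set_fpmath_mode(dnnl::fpmath_mode::bf16);
    return attr;
}
// Assumed: the advanced brgemm variant would accept this attr at creation,
// e.g. brgemm brg(batch, M, N, K, dt_a, dt_b, dt_c, dt_d, make_brgemm_attr());
```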

> […] change the API when we introduce the feature, or add a placeholder
> parameter if we are confident this feature will gain traction soon.
>
> Also, note that currently, no kernel-level cache exists for block […]
Comment:

Caching support is a must for performance. Frameworks would have to provide their own cache support if oneDNN doesn't; oneDNN supporting it from the beginning would be ideal.

> // Execution function for the advanced brgemm variant
> // Here the C matrix is just an input to accumulation
> // final result after postop/conversion will be in D
> // Computes D = \beta C + \alpha \sum_i A_i \cdot B_i + bias
Comment:

So here, D has a different dtype from C?

@mgouicem (author) replied:

Yes, D is used for post-op and conversion fusion, which is why it can have a different datatype.
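
A hedged illustration of the C/D split (buffer sizes and the exact execute signature are assumed from the excerpt above; `brg`, `A_B`, `M`, `N`, and `scratch` are assumed to exist):

```cpp
// C accumulates in f32; D receives the converted, post-op'ed bf16 result.
// Computes D = beta*C + alpha*sum_i A_i*B_i + bias, then post-ops/conversion.
std::vector<float> C_acc(M * N, 0.f);   // accumulation buffer, f32
std::vector<uint16_t> D_out(M * N);     // final destination, bf16 storage

brg.execute(A_B, C_acc.data(), D_out.data(), scratch);
```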

Comment on lines +150 to +152
> - in practice, when data is pre-packed, blocking information (so brgemm
> object creation parameters) have to be known in order to layout data
> in an optimal way.
Comment:

The problem is that the input shape is unknown in the prepacking phase, so we are not able to build the right brgemm object then.

Comment:

Also, adding to that, how do you envision those objects' handling from the point of view of a compiler that lowers to brgemm calls? Would the compiler need to allocate memory, call constructors, and then call methods? Or is it expected that those will be preconfigured separately, and the compiler just makes calls using some kind of naming scheme?

@mgouicem (author) replied:

> The problem is that the input shape is unknown in the prepacking phase, so we are not able to build the right brgemm object then.

There are two possibilities here:

  • brgemm parameters are independent of your input shapes. For example, you only ever use blocks of size 32x32 and 1x32. In that case, you can create the brgemm objects independently of the input shapes and do the blocking with respect to those (see the sketch below). This is typically what oneDNN primitives do with blocked layouts; you can see that those are independent of the input shape (e.g. AB16b64a uses 16x64 blocks in plain format, AB4b64a4b uses 16x64 blocks in VNNI layout, ...).
  • brgemm parameters are dependent on your input shapes. In that case, you cannot do efficient pre-packing of the weights; you can only put them in VNNI format or not.

If you are in the latter situation, we can make these static, with two implications:

  • oneDNN brgemm will never be able to do ISA selection based on shapes (e.g. it might be faster to use a lower ISA for some shapes that are memory-bound);
  • you will not get the best performance from pre-packing.
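
A hedged sketch of the first, shape-independent possibility (constructor datatypes are assumed, since only the shape parameters appear in the proposal excerpt; `K_blk` is illustrative):

```cpp
// Shape-independent kernels: only 32x32 and 1x32 blocks exist, created once.
brgemm brg_full(/*batch=*/1, /*M=*/32, /*N=*/32, /*K=*/K_blk,
        data_type::bf16, data_type::bf16, data_type::f32);
brgemm brg_row(/*batch=*/1, /*M=*/1, /*N=*/32, /*K=*/K_blk,
        data_type::bf16, data_type::bf16, data_type::f32);
// Any runtime M is then covered by tiling with brg_full plus brg_row for
// leftover rows, and weights can be pre-packed for the 32-wide blocking
// ahead of time, before the shapes are known.
```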

@mgouicem (author) added:

@kurapov-peter both options are possible, as laid out above.

Now, there is a balance between jitting time (which increases with the number of brgemm objects created) and execution time (specialized objects will typically provide better performance). That balance will typically depend on the application.

In any case, hoisting the kernel creation and memory allocation (e.g. through the use of preallocated memory pools) out of hot loops is desirable to avoid overheads.

A contributor commented:

If it can be guaranteed that brgemm parameters are independent of input shapes, I vote for option 2, as querying with each brgemm object is redundant. Also, I think this API is to let the user understand the preference of the brgemm implementation, not something mandatory to follow. The user has more understanding of what the whole computation looks like and can make the final decision of which layout should be used and pass it during brgemm creation.

Comment:

@mgouicem as we synced offline, we can safely assume that the layout of B is independent from the input shapes, and we can create a brgemm object with arbitrary input shapes (M in particular) to query the layout for B, correct? If so, please get these assumptions documented. Thanks.

@mgouicem (author) replied:

> Also, I think this API is to let the user understand the preference of the brgemm implementation, not something mandatory to follow. The user has more understanding of what the whole computation looks like and can make the final decision of which layout should be used and pass it during brgemm creation.

I guess there are two parts to the memory layout:

  • one aligned with the computation blocking. For this one you are right: the brgemm user has better knowledge of which block sizes make sense, and they can use that block size for pre-packing matrices and for brgemm computation shapes;
  • one aligned with HW capabilities. It is an implementation detail whether the brgemm API will use instructions requiring plain, packed_32, packed_64 or any other layout.

> as we synced offline, we can safely assume that the layout of B is independent from the input shapes, and we can create a brgemm object with arbitrary input shapes (M in particular) to query the layout for B, correct?

Yes, we can achieve that. However, note that if we do so, we lose the opportunity to select ISA/layout based on brgemm input shapes. Is that OK?

Comment:

> Yes, we can achieve that. However, note that if we do so, we lose the opportunity to select ISA/layout based on brgemm input shapes. Is that OK?

If not, how do you suggest we decide the layout in the weight prepacking stage when the input sizes are unknown?

@mgouicem (author) replied:

> If not, how do you suggest we decide the layout in the weight prepacking stage when the input sizes are unknown?

If shapes are unknown, the layout cannot depend on shape (e.g. the same layout is required for gemv and large-block gemm). Here it would likely depend on hardware capabilities (e.g. always use packed_32 when AMX/VNNI is available).
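
A sketch of the query-first flow agreed on in this thread (the `get_B_pack_type` accessor is hypothetical; the proposal only establishes that the layout is queried from a created brgemm object, and all buffers/shapes here are assumed):

```cpp
// Prepacking time: the real M is unknown, so create a probe object with an
// arbitrary M (the layout of B is assumed shape-independent, per this thread).
brgemm probe(/*batch=*/1, /*M=*/16, N, K,
        data_type::bf16, data_type::bf16, data_type::f32);
pack_type b_layout = probe.get_B_pack_type(); // hypothetical query accessor

// Pack B once, offline, into the queried layout.
transform pack(K, N, data_type::bf16,
        pack_type::plain, /*ld_src=*/N,
        b_layout, /*ld_dst=*/N);
pack.execute(B_plain, B_packed); // buffers assumed allocated by the caller
```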

> […] applicability, it seems unnecessary to expose it.
>
> To make the public brgemm API simpler, we propose to expose only the array
> of pointers flavor.

Comment:

I'd assume this is also the best option performance-wise if the user can provide those as compile-time constants. Or would the effects of the choice be negligible here?

@mgouicem (author) replied (Apr 5, 2024):

Could you clarify what you mean by compile-time constants? Do you mean brgemm object creation time?

In general, a few things:

  • if brgemm object creation takes pointers as parameters, it will severely reduce the possibility of reusing brgemm objects/kernels across an application, and introduce jitting overheads. In that respect, base pointer + array of offsets might be more reusable;
  • even if the pointers are constant, we typically still need to put those pointers into registers. IIRC, the fastest way to put a 64-bit value into a register on x64 is actually to do a lea, in which case whether the pointer is constant or not should not matter. Am I missing something here?
  • the pointers are typically not constant at brgemm object creation time. However, the offsets might be. In that case, base pointer + array of offsets would be the fastest option, as it can remove some overhead related to pointer arithmetic (by reusing a buffer of offsets without resetting it before every brgemm call).

Actually, this discussion is making me want to revise to using base pointer + array of offsets :).
Adding @ankalinin and @nivas-x86 to the discussion.
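
For concreteness, the three addressing flavors discussed in this thread might look as follows (all signatures hypothetical; only the array-of-pointers flavor is in the current proposal):

```cpp
// 1. Array of pointers: one (A_i, B_i) pointer pair per batch element.
void execute_ptrs(const std::vector<std::pair<void *, void *>> &A_B,
        void *C, void *scratch);

// 2. Base pointer + fixed stride: A_i = A + i * stride_A, likewise for B.
void execute_strided(const void *A, dim_t stride_A,
        const void *B, dim_t stride_B, dim_t batch, void *C, void *scratch);

// 3. Base pointer + array of offsets: offsets can be precomputed once and
//    reused across calls when they are constant.
void execute_offsets(const void *A, const dim_t *off_A,
        const void *B, const dim_t *off_B, dim_t batch,
        void *C, void *scratch);
```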

Comment:

> the pointers are typically not constant at brgemm object creation time. However, the offsets might be. In that case, base pointer + array of offsets would be the fastest option, as it can remove some overhead related to pointer arithmetic

Yup, that was my line of thinking. Take the inference case as an example: the compiler would be able to prepopulate the weight-matrix offsets in advance, cutting the cost of address computation. While good on paper, it is unclear to me whether that would make any real difference, hence the question.

@mgouicem (author) replied:

Fair observation. The usages I saw in IPEX, OpenVINO and oneDNN all rely on the pointer-based flavor.
I will let @nivas-x86 and @ankalinin comment on potential speedups when using precomputed offsets.

A contributor commented:

Graph Compiler uses the base pointer + fixed stride flavor for its matmul implementation and the array-of-pointers flavor for convolution. For most DL compilers, it's not easy to describe pointer arithmetic. Take brgemm in the Linalg dialect (https://mlir.llvm.org/docs/Dialects/Linalg/#linalgbatch_reduce_matmul-linalgbatchreducematmulop) as an example: base pointer + fixed stride perfectly matches its semantics. I suggest providing both base pointer + fixed stride for ease of use and array of pointers for advanced, flexible usage.

Comment:

Echo @ZhennanQin.

@nivas-x86 (Contributor) commented (Apr 8, 2024):

Having ptr+offsets helps in the case of brgemm kernels where the work done by brgemm is small.
In depthwise conv (brdgmm) we cache the entire batch (all combinations of bs_size and offsets required), and it seemed to provide some perf benefit.
But the argument for having constant pointers is counter-intuitive to me: no benefit and limited flexibility.

@mgouicem (author) replied:

I don't think there is any argument for having constant pointers, quite the opposite.
Offsets can be constants, but pointers will always have to be recomputed, even when the offsets are constant.

> For the brgemm API specifically, an array of equal sizes for A and B
> blocks are passed. To guarantee these are of equal size, and simplify
> pointer arithmetic logic inside generated kernels, we will take an
> array of pairs of pointers.

Comment:

Can this be a potential problem in the case of some lookup pointer arrays to both A and B? I'm imagining some cases of embeddings. In that case, an API that accepts a vector of pairs would likely force some memory movement.

@mgouicem (author) replied:

I am not sure I understand. Could you give an example or more detail on the concern?

Because of the batch-reduction semantics, we have to take as many A_i blocks as B_i blocks. The two options (sketched below) are:

  • take two arrays of equal size (one for A_i, one for B_i);
  • take a single array of pairs (interleaving pointers to A_i/B_i).
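
Sketched out, the two options look like this (hypothetical signatures):

```cpp
// Option 1: two parallel arrays; nothing structurally ties their lengths.
void execute(const void **A, const void **B, dim_t batch,
        void *C, void *scratch);

// Option 2: one interleaved array of pairs; equal A_i/B_i counts hold by
// construction.
void execute(const std::vector<std::pair<void *, void *>> &A_B,
        void *C, void *scratch);
```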

A contributor commented:

I also don't understand why merging A and B into a single array guarantees they are of equal size. What if the array size is odd? The sizes of the A and B pointer arrays are specified by the batch parameter; the user should guarantee that both the A and B arrays contain batch pointers.

A contributor commented:

@mgouicem, FWIW, taking two vectors will save us from having to copy pointers from the pairs into another C-like array to pass them on to the C API, which means less overhead.

@mgouicem (author) replied:

@densamoilov I had that exact conversation with @dzarukin :-).

Currently, the internal implementation takes a single array of structures containing the A_i and B_i pointers. The interleaving of pointers allows saving on GPR usage, IIRC.

Regarding extra copies, there are two sources:

  • C++ to C API. Whether we use two vectors or a vector of pairs, we can pass those directly. Albeit cleaner with two separate vectors, it is still doable without a copy for a vector of pairs (see here, and the sketch below);
  • external API to internal implementation. This one is tricky since, internally, we use a multipurpose structure (see brgemm_batch_element_t), so we have to copy anyway because of it (until we clean that up).

Now, I think the vector-of-pairs version is preferable, since there is a path to removing all copies without modifying kernel generation (by aligning the internal structure to the size of a pair).
If we go with the two-separate-vectors solution, removing all copies would require modifications to internal structures and APIs, as well as to the kernels.
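
A minimal sketch of the no-copy pass-through under discussion; it relies on std::pair<void *, void *> laying out its two pointers contiguously, which (as noted in the next comment) the standard does not guarantee. The C entry point is hypothetical:

```cpp
#include <utility>
#include <vector>

// Hypothetical C entry point taking a flat array of interleaved A/B pointers.
extern "C" void brgemm_execute_c(const void *const *a_b_flat, int batch);

void call_through(const std::vector<std::pair<void *, void *>> &A_B) {
    static_assert(sizeof(std::pair<void *, void *>) == 2 * sizeof(void *),
            "relies on std::pair being exactly two packed pointers");
    // Reinterpret the pair array as a flat pointer array: no copy, but
    // formally relying on implementation details of std::pair.
    const void *const *flat
            = reinterpret_cast<const void *const *>(A_B.data());
    brgemm_execute_c(flat, static_cast<int>(A_B.size()));
}
```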

@densamoilov (Contributor) replied (Apr 9, 2024):

> C++ to C API. Whether we use two vectors or a vector of pairs, we can pass those directly. Albeit cleaner with two separate vectors, it is still doable without a copy for a vector of pairs (see here)

@mgouicem, I think there are two problems with your example:

  • You are assuming that the data members of std::pair are contiguous in memory. The standard doesn't seem to guarantee that. Your example works because of the way std::pair is implemented (link): in our case, std::pair is just a struct with two pointers, therefore no padding is required, hence the example works.
  • You are basically passing a C++ object (std::pair) as void * to the C API. By design, we forbid C++ objects from crossing the library boundary (a.k.a. the C API) unless there is no way around it (e.g. the SYCL interop API).

@mgouicem (author) replied:

I agree the standard does not guarantee that std::pair is contiguous in memory. However, all compilers that I know of implement it this way. Same as type punning or reinterpret_cast: wrt the standard they are not safe to use, but in practice compilers tend to preserve behaviors that are not required by the standard.

In any case, we can replace std::pair with a dedicated struct, no big deal.
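
Such a dedicated struct might look like the following (name hypothetical). As a standard-layout aggregate, its members are laid out in declaration order, so it avoids the std::pair concern:

```cpp
// Hypothetical replacement for std::pair<void *, void *> at the API boundary.
struct brgemm_ab_pair {
    const void *A; // pointer to the i-th A block
    const void *B; // pointer to the i-th B block
};
static_assert(sizeof(brgemm_ab_pair) == 2 * sizeof(void *),
        "two pointers, no padding expected");
```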

@densamoilov (Contributor) replied (Apr 9, 2024):

> In any case, we can replace std::pair with a dedicated struct, no big deal.

If we are dead set on passing pairs of pointers, then I think that's what we need to do (I prefer to err on the side of caution); this way we won't rely on the implementation details of std::pair.

Comment:

So, is the decision here to go with an array of pointer pairs, accepting potentially higher overhead?

Comment:

Also, as a follow-up to the offline conversation, expanding a little on the original comment:

> Can this be a potential problem in the case of some lookup pointer arrays to both A and B? I'm imagining some cases of embeddings. In that case, an API that accepts a vector of pairs would likely force some memory movement.

Consider an example where one of the inputs is a weight matrix and the other is accessed indirectly, as you'd do in an embedding lookup (there are other similar cases in beam search in LLMs, image attention masks, and roi_heads in detection). Restricting the API to accept pairs of pointers hinders optimization possibilities. So those cases could be addressed with a separate interface, but the feasibility should also be estimated.

> #### How to express HW specific layouts
>
> There are two main ways oneDNN can expose those layouts:
> - through layout tags. This has the advantage of being very simple, […]

Comment:

Could you please elaborate a bit on what those tags would look like?

@mgouicem (author) replied:

Yes, they are specified in the API proposal below. They will only express the HW layout requirement, not blocking logic.
So for now, there would only be plain, packed_32 (values packed in 32-bit groups, needed for VNNI/AMX), and packed_64 (values packed in 64-bit groups, maps to Arm *MMLA).
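
A hedged sketch of what that tag enum could look like (names taken from this discussion; the exact spelling in the final API may differ):

```cpp
// Layout tags express only what the ISA mandates, not cache blocking.
enum class pack_type {
    plain,     // dense row-major block, no HW-mandated packing
    packed_32, // elements packed along K in 32-bit groups (VNNI/AMX)
    packed_64, // elements packed along K in 64-bit groups (Arm *MMLA)
};
```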


> There are two options here:
> - expose two APIs, `brgemm::set_hw_context()` and […]
A contributor commented:

If we need to run two brgemm kernels, do we need to run brgemm::release_hw_context() and brgemm::set_hw_context() in between? Can we merge them into brgemm::reset_hw_context() to reduce the overhead?

@mgouicem (author) replied:

There is a comment at the API level stating that no release is needed between two calls to set_hw_context().
However, adding reset_hw_context is cleaner, so I am adding that to the proposal. Thanks.
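
A hedged sketch of the resulting calling pattern (kernels and buffers are assumed to exist; the rest follows the proposal's signatures):

```cpp
brg_main.set_hw_context();           // configure the core once (e.g. AMX palette)
for (size_t i = 0; i < n_blocks; ++i)
    brg_main.execute(A_B[i], C, scratch);

brg_tail.set_hw_context();           // switching kernels: no release required
brg_tail.execute(A_B_tail, C_tail, scratch);

brgemm::release_hw_context();        // static: independent of how it was set
```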

> The recommendation is to go with the first option, with explicit set
> and release functions, that the user can hoist as they see fit.
>
> With respect to OS syscall, we recommend to make it transparent to […]
A contributor commented:

How about the AMX palette-buffer management? Is it possible to expose this to the user, so that they can implement their own brgemm::set_hw_context() and brgemm::release_hw_context() and avoid those function calls?

Comment:

What's the problem with hiding this inside oneDNN?

A contributor replied:

It's related to an optimization opportunity. Based on the current design, you must reconfigure the HW context if multiple brgemm kernels are used. But if we can manage the AMX palette ourselves, we don't need to reconfigure the HW context when those brgemms use the same AMX palette.

@mgouicem (author) replied:

@ZhennanQin, we would like the API to be as ISA-agnostic as possible.
Internally, we will not reconfigure if the same palette is already loaded on the core.


> ## Handling of attributes
>
> First, here is the list of attributes we plan on exposing: […]
A contributor commented:

I suggest adding ISA and accumulation_datatype as attributes as well, for the end user to control.

> // The batch size is the size of the vector.
> // pointers are void*, datatypes are specified in constructor
> // Computes C = \beta C + \alpha \sum_i A_i \cdot B_i
> void execute(const std::vector<std::pair<void *, void *>> &A_B, […]
@ZhennanQin (Contributor) commented (Apr 7, 2024):

It's very difficult to handle STL containers inside MLIR. Will you provide a C interface as well? I want to check what the C interface looks like.

Comment:

Agreed. Also, std::vector might incur overhead in its memory management.

> // - Release is static
> // - No release necessary between different calls to set_hw_context
> void set_hw_context() const;
> static void release_hw_context() const;
A contributor commented:

  1. "set_hw_context()" seems very generic (maybe intentionally). But it raises a question about what is being modified (mxcsr, etc.. ). Maybe provide a documentation of what features it is allowed to modify, for transparency?
  2. By static do you mean, it is safe to call it regardless of a previous call to set_hw_context()?

@mgouicem (author) replied:

  1. This is a valid point, and I left it fuzzy on purpose :-). In general, it just changes the state of the core the function is executed on. If need be, we can be more specific in the documentation and specify, for each ISA, what is being changed.
  2. Just that it can be called without a brgemm object. This is to highlight that releasing the HW context should be independent of how it was set.


> ## All-in-all
>
> ### Interface proposal
@huanghaixin008 (Contributor) commented (Apr 11, 2024):

Glad to see the new APIs hide the details of palette management. I still have a few concerns regarding future API usage in Graph Compiler.

Currently, Graph Compiler calls internal BRGEMM APIs from the compiler-IR world in both the base pointer + fixed stride and array-of-pointers flavors. The compiler IR only provides very limited grammar, and the internal APIs require complex structs as parameters and notable low-level control, which is beyond the capability of the IR. Hence we built a few runtime wrappers to manage things like parameter construction, kernel reuse, and palette management.

With Graph Compiler transitioning to MLIR, it is no longer feasible to maintain a compiler-specific downstream runtime for upstream MLIR, so we tend to remove the runtime wrappers entirely. It seems that runtime wrappers would still be inevitable with the new APIs:

  • They still take fairly complex structs as parameters, which are hard to construct from IR, e.g. converting base pointer + fixed stride to array of pointers, and the construction of brgemm::desc;
  • The APIs are C++-based, which requires extra handling for name mangling and STL structures, as mentioned by @ZhennanQin before.

I think this is also a common issue for invocation from the compiler-IR world. Is it possible to provide an alternative set of C APIs taking only plain parameters and executing in a one-line fashion? That would make our lives a lot easier.

For example:

```c
uint64_t brgemm_dispatch(dim_t batch, dim_t M, dim_t N, dim_t K, ...); // returns a handle to the JITed kernel

void set_hw_context(uint64_t kernel_handle);
void release_hw_context(uint64_t kernel_handle);

// `array of pointers` interface
void execute(uint64_t kernel_handle, void *A[], void *B[], ...);
// helper for the `base pointer + fixed stride` usage; converts to the
// `array of pointers` interface internally (distinct name, since C has
// no overloading)
void execute_strided(uint64_t kernel_handle, void *A, void *B,
        uint32_t batch_size, uint64_t stride_A, uint64_t stride_B, ...);
```

A contributor replied:

@huanghaixin008 oneDNN never provides a C++ API without a C API layer underneath it. You can always switch to that if it's a better choice for MLIR.


> // we run it
> brg.execute(A_B, c_buff, dst + (r[0] * BM * N + r[1] * BN) * sizeof(bfloat16_t), scratch);
> });
Comment:

Will the original value of c_buff be accumulated into the final result? Is it possible to let the user control whether the original value of C participates in the accumulation? This could save the overhead of setting c_buff to 0.

@mgouicem (author) replied:

Yes, this is controlled by the add_C attribute. When set to false, it will just overwrite C with the result (same as if c_buff were 0).

> // both src, dst share same dt, no fused conversion
> transform(dim_t M, dim_t N, data_type dt,
>         pack_type tag_src, dim_t ld_src,
>         pack_type tag_dst, dim_t ld_dst);
@CaoE commented (Apr 25, 2024):

Will there still be limitations on the leading dimensions of transform?

@mgouicem (author) replied (Apr 25, 2024):

No API limitations. However, we will likely have some implementation limitations in the first release.
Those will be removed as the implementation matures.


> First, here is the list of attributes we plan on exposing:
> - fp_math mode, in order for user to control input conversions semantics;
> - scales/zero-point in order to support upconversion before computation.
@Valentine233 commented (Apr 28, 2024):

I suppose it is impossible to do scales/zero-points with BRGEMM_POST_OP, because it only accepts one binary memory during execution. So here are two questions:

  1. Would both BRGEMM and BRGEMM_POST_OP support scales/zero-points?
  2. How many pairs of scales/zero-points can we support? Taking the uint8-in/uint8-out case as an example, we need to feed two pairs, for inputs A and B respectively, and receive one pair for output C.


> struct brgemm {
>     // Vanilla version of brgemm with no post-op or destination conversion.
>     brgemm(dim_t batch, dim_t M, dim_t N, dim_t K, […]

Comment:

Given dtA and dtB are both dt::u8, what dtC and dtD (if with post-ops) are supported for now?

> […] void *C, void *scratch = nullptr);
>
> // Execution function for the advanced brgemm<> variant
> // Here the C matrix is just an input to accumulation […]

Comment:

I suppose the accumulation buffer is just an intermediate buffer. When the accumulation and the output have the same type, why do we need to create an additional accumulation buffer?

@vpirogov added this to the v3.5 milestone on Jul 16, 2024.