rfc: proposal for block level APIs #1852

Open · wants to merge 7 commits into base: rfcs
Conversation

@mgouicem (Contributor) commented Apr 1, 2024:

Proposal to expose block-level APIs on CPU, namely brgemm and transforms.

Rendered version

@vpirogov added the RFC (A design document) label on Apr 1, 2024.
@jgong5 left a comment:

Can you also add sample code that uses the APIs to demonstrate the usage flow?

Comment on lines 169 to 171
> Given the small number of layouts, we will expose a layout tag enum
> (next section will address the lack of need for metadata handling by
> oneDNN brgemm).
Comment:

Will these layouts be well documented and exposed publicly?

@mgouicem (author) replied:

Yes, those are part of the external API. Note that those layouts only account for the packing required by the HW (so packed along K in 32-bit or 64-bit groups). They do not account for blocking sizes like oneDNN primitives do.
For best performance, the brgemm user can apply the same brgemm blocking to weights packing.

Comment:

Yes, the blocking will be handled on the caller side. Meanwhile, since it is defined by the HW architecture, the layout can be well defined and documented to align with the HW spec.

> The recommendation as a starting point is to make query mandatory: no
> layout forcing on user side. This simplifies the configurations
> supported by brgemm. Given that we have no support internally for
> transa/tranb, those will not be exposed as well. We will also support […]
Comment:

Do you mean no support for transa/transb in the layout conversion API or in the BRGEMM API? It would be beneficial to have transa/transb (or only transa when B is VNNI) supported in the BRGEMM API.

@mgouicem (author) replied:

The conversion API will have support for transpose, but not the BRGEMM API.
The main reason is that there is no support for fused transposition in the current implementation. We could use that information to generate different kernels internally, but in practice it will likely not be faster than doing the transposition into a temporary buffer before calling brgemm.
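
To illustrate, a minimal sketch of that transpose-then-brgemm pattern against the `transform` and `brgemm` interfaces from this proposal (datatype arguments, the encoding of the transposed source, and all shapes/buffers are assumptions for illustration):

```cpp
// Assumed defined by the caller: dim_t M, N, K; void *A_src, *B_src, *C_buf, *scratch.
// B arrives transposed; the conversion API is said to support transposition,
// so it lands B in the packed layout brgemm expects, via a temporary buffer.
std::vector<char> B_tmp(static_cast<size_t>(K) * N * sizeof(uint16_t)); // bf16 storage

transform trans(K, N, data_type::bf16,
        pack_type::plain, /*ld_src=*/K,       // transposed source view (assumed encoding)
        pack_type::packed_32, /*ld_dst=*/N);  // HW-mandated packing for VNNI/AMX
trans.execute(B_src, B_tmp.data());

brgemm brg(/*batch=*/1, M, N, K,
        data_type::bf16, data_type::bf16, data_type::f32); // dtypes assumed
brg.execute({{A_src, B_tmp.data()}}, C_buf, scratch);
```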

@jgong5 commented (Apr 7, 2024):

> The conversion API will have support for transpose, but not the BRGEMM API. The main reason is that there is no support for fused transposition in the current implementation. We could use that information to generate different kernels internally, but in practice it will likely not be faster than doing the transposition into a temporary buffer before calling brgemm.

Would it be more flexible to have the transposition flags supported at the API level? BTW, in scaled-dot-product-attention, the first bmm has B transposed.

@mgouicem (author) replied:

Currently, we have no support whatsoever in the implementation for fusing transposition, which is why the proposal does not include it.

If there is value in adding this feature in the future, we would extend the API to support that.


> Note that current oneDNN implementation just takes ISA as a maximum
> ISA. To simplify the programming model, isa selection will not appear
> in block level APIs. However, `dnnl::set_max_cpu_isa` will still be […]
Comment:

In this context, if there is no ISA setting at the BRGEMM level, I don't think we will use this global API to configure things. I assume oneDNN would make a wise decision to choose the right ISA given the input shapes, is that correct?

@mgouicem (author) replied:

Yes, the plan is to pick the best available ISA under the hood. If ISA control is needed (there are some corner cases), that global API can be used.
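
For reference, the global knob mentioned here is a shipped oneDNN API; a minimal sketch of the corner-case usage (the specific ISA value is illustrative):

```cpp
#include "oneapi/dnnl/dnnl.hpp"

int main() {
    // Corner case: cap JIT generation at AVX-512 even on AMX-capable parts.
    // This must run before the first kernel is generated to take effect.
    dnnl::set_max_cpu_isa(dnnl::cpu_isa::avx512_core);
    // ... create and execute brgemm objects as usual ...
    return 0;
}
```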

A contributor commented:

Using the global API will cause problems with parallel compilation, e.g. one thread wants to generate an AVX512 brgemm while another thread wants to generate an AMX brgemm. I suggest having another way to specify the ISA per brgemm creation.

Comment:

> Yes, the plan is to pick the best available ISA under the hood. If ISA control is needed (there are some corner cases), that global API can be used.

What do you mean by "best available ISA"? If AVX is a better choice for a particular shape than AMX, will you choose AVX or AMX?

@mgouicem (author) replied:

@ZhennanQin do you have a usage scenario for using different ISAs in a single process?

@jgong5 best available means the highest-performing ISA on the platform that is creating that brgemm object (e.g. AMX > AVX512 > AVX2, ...; more details here).

In the current proposal, we cannot use the shape to select the ISA, since ISA selection can impact data layout (e.g. we currently might not require a 32-bit packed/VNNI layout when a lower ISA requires emulation).
We could make the layout independent of the ISA, but this would require changes to the current oneDNN kernel-generation logic (or disabling some datatype/ISA combinations).

Comment:

> In the current proposal, we cannot use the shape to select the ISA, since ISA selection can impact data layout (e.g. we currently might not require a 32-bit packed/VNNI layout when a lower ISA requires emulation).
> We could make the layout independent of the ISA, but this would require changes to the current oneDNN kernel-generation logic (or disabling some datatype/ISA combinations).

Suppose we want to compute with int8 (a typical case is WOQ int4 with int8 compute). INT8 VNNI might work better than INT8 AMX for smaller shapes, while both would have the same VNNI4 weight layout. How do you decide the ISA in this case?

A contributor replied:

> @ZhennanQin do you have a usage scenario for using different ISAs in a single process?

A valid usage scenario: if the input shape is not divisible by the tiling size, we may use a small brgemm to handle the tail. This small brgemm only executes once, and AVX512-VNNI may be faster than AMX considering the tile-reconfiguration overhead. Then, for this matmul, both AMX and AVX512-VNNI brgemms are used.
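
A hedged sketch of that main-tile/tail-tile split against the proposed interface (shapes and constructor datatypes are illustrative assumptions; note the proposal offers no per-object ISA knob, which is the point being raised):

```cpp
dim_t M = 70, N = 64, K = 512, BM = 32;
dim_t M_tail = M % BM; // 6 leftover rows

brgemm brg_main(/*batch=*/1, BM, N, K,
        data_type::bf16, data_type::bf16, data_type::f32);
brgemm brg_tail(/*batch=*/1, M_tail, N, K,
        data_type::bf16, data_type::bf16, data_type::f32);
// Ideally the library would JIT brg_main for AMX and brg_tail for
// AVX512-VNNI; with only a process-global ISA control, that mix cannot
// be requested per object.
```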

@mgouicem (author) replied:

@jgong5 in the case you provide, there is no implication of ISA wrt layout. Currently, ISA has layout implications mostly when f16 emulation happens. But there are other scenarios where some architectures have different instructions for matrix-multiplication acceleration that require a different layout (e.g. bfmmla and SME on Arm).

> How do you decide the ISA in this case?

We just use the highest available ISA for this datatype, independently of shape. See here for the exact logic.

> Then, for this matmul, both AMX and AVX512-VNNI brgemms are used.

Makes sense, but why would you create those in separate threads?

A contributor replied:

> Makes sense, but why would you create those in separate threads?

MLIR supports multi-threaded compilation; we can't guarantee whether the brgemm creations happen within a single thread or in separate threads.

Comment on lines +265 to +266
> We also have internal attributes to control prefetch hints, prefetch
> distances and non-temporal stores. However those are not yet fully […]
Comment:

Will you consider exposing these in the future? These knobs give us flexibility to further tune the performance.

@mgouicem (author) replied (Apr 5, 2024):

So far, there is no plan to expose those, since there is no support internally.
The only prefetching we do internally is to prefetch the destination buffer (matrix C) at the beginning of the accumulation loop so that it is ready to accumulate results into. And this can be enabled under the hood, without exposing an API.

For other prefetching, we would have to assess potential speedups and use cases before considering exposing them.

Comment:

How about temporal/non-temporal loads/stores?

@mgouicem (author) replied:

Same: there is no support for temporal/non-temporal loads/stores as of now.
But that can be considered at a later time depending on the potential upside.



> To make a first version of the API publicly available faster,
> we will start by relying on existing dnnl::primitive_attr objects.
Comment:

I guess starting with a separate attribute abstraction would leave more flexibility to revise in the future?

@mgouicem (author) replied:

New attributes would require new entry points, testing, etc., which would delay the first release of the feature; hence the suggestion to start with the existing primitive attributes for the first iteration, and then migrate to dedicated attributes.
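
For context, a hedged sketch of what reusing the existing attributes could look like (`dnnl::primitive_attr` and `dnnl::post_ops` are shipped oneDNN APIs; passing the attribute into the brgemm constructor is an assumption of this proposal):

```cpp
#include "oneapi/dnnl/dnnl.hpp"

dnnl::primitive_attr make_brgemm_attr() {
    // Shipped attribute machinery, reused as-is for the first iteration.
    dnnl::post_ops po;
    po.append_eltwise(dnnl::algorithm::eltwise_relu, 0.f, 0.f);

    dnnl::primitive_attr attr;
    attr.set_post_ops(po);
    attr.set_fpmath_mode(dnnl::fpmath_mode::bf16);
    return attr;
}
// Assumed: the advanced brgemm variant would accept this attr at creation,
// e.g. brgemm brg(batch, M, N, K, dt_a, dt_b, dt_c, dt_d, make_brgemm_attr());
```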

> […] change the API when we introduce the feature, or add a placeholder
> parameter if we are confident this feature will gain traction soon.
>
> Also, note that currently, no kernel-level cache exists for block […]
Comment:

Caching support is a must for performance. Frameworks would have to provide their own cache support if oneDNN doesn't; oneDNN supporting it from the beginning would be ideal.

> // Execution function for the advanced brgemm variant
> // Here the C matrix is just an input to accumulation
> // final result after postop/conversion will be in D
> // Computes D = \beta C + \alpha \sum_i A_i \cdot B_i + bias
Comment:

So here, D has a different dtype from C?

@mgouicem (author) replied:

Yes, D is used for post-op and conversion fusion, which is why it can have a different datatype.
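
A hedged illustration of the C/D split (buffer sizes and the exact execute signature are assumed from the excerpt above; `brg`, `A_B`, `M`, `N`, and `scratch` are assumed to exist):

```cpp
// C accumulates in f32; D receives the converted, post-op'ed bf16 result.
// Computes D = beta*C + alpha*sum_i A_i*B_i + bias, then post-ops/conversion.
std::vector<float> C_acc(M * N, 0.f);   // accumulation buffer, f32
std::vector<uint16_t> D_out(M * N);     // final destination, bf16 storage

brg.execute(A_B, C_acc.data(), D_out.data(), scratch);
```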

Comment on lines +150 to +152
> - in practice, when data is pre-packed, blocking information (so brgemm
> object creation parameters) have to be known in order to layout data
> in an optimal way.
Comment:

The problem is that the input shape is unknown in the prepacking phase, so we are not able to build the right brgemm object then.

Comment:

Also, adding to that, how do you envision those objects' handling from the point of view of a compiler that lowers to brgemm calls? Would the compiler need to allocate memory, call constructors, and then call methods? Or is it expected that those will be preconfigured separately, and the compiler just makes calls using some kind of naming scheme?

@mgouicem (author) replied:

> The problem is that the input shape is unknown in the prepacking phase, so we are not able to build the right brgemm object then.

There are two possibilities here:

  • brgemm parameters are independent of your input shapes. For example, you only ever use blocks of size 32x32 and 1x32. In that case, you can create the brgemm objects independently of the input shapes and do the blocking with respect to those (see the sketch below). This is typically what oneDNN primitives do with blocked layouts; you can see that those are independent of the input shape (e.g. AB16b64a uses 16x64 blocks in plain format, AB4b64a4b uses 16x64 blocks in VNNI layout, ...).
  • brgemm parameters are dependent on your input shapes. In that case, you cannot do efficient pre-packing of the weights; you can only put them in VNNI format or not.

If you are in the latter situation, we can make these static, with two implications:

  • oneDNN brgemm will never be able to do ISA selection based on shapes (e.g. it might be faster to use a lower ISA for some shapes that are memory-bound);
  • you will not get the best performance from pre-packing.
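
A hedged sketch of the first, shape-independent possibility (constructor datatypes are assumed, since only the shape parameters appear in the proposal excerpt; `K_blk` is illustrative):

```cpp
// Shape-independent kernels: only 32x32 and 1x32 blocks exist, created once.
brgemm brg_full(/*batch=*/1, /*M=*/32, /*N=*/32, /*K=*/K_blk,
        data_type::bf16, data_type::bf16, data_type::f32);
brgemm brg_row(/*batch=*/1, /*M=*/1, /*N=*/32, /*K=*/K_blk,
        data_type::bf16, data_type::bf16, data_type::f32);
// Any runtime M is then covered by tiling with brg_full plus brg_row for
// leftover rows, and weights can be pre-packed for the 32-wide blocking
// ahead of time, before the shapes are known.
```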

@mgouicem (author) added:

@kurapov-peter both options are possible, as laid out above.

Now, there is a balance between jitting time (which increases with the number of brgemm objects created) and execution time (specialized objects will typically provide better performance). That balance will typically depend on the application.

In any case, hoisting the kernel creation and memory allocation (e.g. through the use of preallocated memory pools) out of hot loops is desirable to avoid overheads.

A contributor commented:

If it can be guaranteed that brgemm parameters are independent of input shapes, I vote for option 2, as querying with each brgemm object is redundant. Also, I think this API is to let the user understand the preference of the brgemm implementation, not something mandatory to follow. The user has more understanding of what the whole computation looks like and can make the final decision of which layout should be used and pass it during brgemm creation.

Comment:

@mgouicem as we synced offline, we can safely assume that the layout of B is independent from the input shapes, and we can create a brgemm object with arbitrary input shapes (M in particular) to query the layout for B, correct? If so, please get these assumptions documented. Thanks.

@mgouicem (author) replied:

> Also, I think this API is to let the user understand the preference of the brgemm implementation, not something mandatory to follow. The user has more understanding of what the whole computation looks like and can make the final decision of which layout should be used and pass it during brgemm creation.

I guess there are two parts to the memory layout:

  • one aligned with the computation blocking. For this one you are right: the brgemm user has better knowledge of which block sizes make sense, and they can use that block size for pre-packing matrices and for brgemm computation shapes;
  • one aligned with HW capabilities. It is an implementation detail whether the brgemm API will use instructions requiring plain, packed_32, packed_64 or any other layout.

> as we synced offline, we can safely assume that the layout of B is independent from the input shapes, and we can create a brgemm object with arbitrary input shapes (M in particular) to query the layout for B, correct?

Yes, we can achieve that. However, note that if we do so, we lose the opportunity to select ISA/layout based on brgemm input shapes. Is that OK?

Comment:

> Yes, we can achieve that. However, note that if we do so, we lose the opportunity to select ISA/layout based on brgemm input shapes. Is that OK?

If not, how do you suggest we decide the layout in the weight prepacking stage when the input sizes are unknown?

@mgouicem (author) replied:

> If not, how do you suggest we decide the layout in the weight prepacking stage when the input sizes are unknown?

If shapes are unknown, the layout cannot depend on shape (e.g. the same layout is required for gemv and large-block gemm). Here it would likely depend on hardware capabilities (e.g. always use packed_32 when AMX/VNNI is available).
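
A sketch of the query-first flow agreed on in this thread (the `get_B_pack_type` accessor is hypothetical; the proposal only establishes that the layout is queried from a created brgemm object, and all buffers/shapes here are assumed):

```cpp
// Prepacking time: the real M is unknown, so create a probe object with an
// arbitrary M (the layout of B is assumed shape-independent, per this thread).
brgemm probe(/*batch=*/1, /*M=*/16, N, K,
        data_type::bf16, data_type::bf16, data_type::f32);
pack_type b_layout = probe.get_B_pack_type(); // hypothetical query accessor

// Pack B once, offline, into the queried layout.
transform pack(K, N, data_type::bf16,
        pack_type::plain, /*ld_src=*/N,
        b_layout, /*ld_dst=*/N);
pack.execute(B_plain, B_packed); // buffers assumed allocated by the caller
```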

> […] applicability, it seems unnecessary to expose it.
>
> To make the public brgemm API simpler, we propose to expose only the array
> of pointers flavor.

Comment:

I'd assume this is also the best option performance-wise if the user can provide those as compile-time constants. Or would the effects of the choice be negligible here?

@mgouicem (author) replied (Apr 5, 2024):

Could you clarify what you mean by compile-time constants? Do you mean brgemm object creation time?

In general, a few things:

  • if brgemm object creation takes pointers as parameters, it will severely reduce the possibility of reusing brgemm objects/kernels across an application, and introduce jitting overheads. In that respect, base pointer + array of offsets might be more reusable;
  • even if the pointers are constant, we typically still need to put those pointers into registers. IIRC, the fastest way to put a 64-bit value into a register on x64 is actually to do a lea, in which case whether the pointer is constant or not should not matter. Am I missing something here?
  • the pointers are typically not constant at brgemm object creation time. However, the offsets might be. In that case, base pointer + array of offsets would be the fastest option, as it can remove some overhead related to pointer arithmetic (by reusing a buffer of offsets without resetting it before every brgemm call).

Actually, this discussion is making me want to revise to using base pointer + array of offsets :).
Adding @ankalinin and @nivas-x86 to the discussion.
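
For concreteness, the three addressing flavors discussed in this thread might look as follows (all signatures hypothetical; only the array-of-pointers flavor is in the current proposal):

```cpp
// 1. Array of pointers: one (A_i, B_i) pointer pair per batch element.
void execute_ptrs(const std::vector<std::pair<void *, void *>> &A_B,
        void *C, void *scratch);

// 2. Base pointer + fixed stride: A_i = A + i * stride_A, likewise for B.
void execute_strided(const void *A, dim_t stride_A,
        const void *B, dim_t stride_B, dim_t batch, void *C, void *scratch);

// 3. Base pointer + array of offsets: offsets can be precomputed once and
//    reused across calls when they are constant.
void execute_offsets(const void *A, const dim_t *off_A,
        const void *B, const dim_t *off_B, dim_t batch,
        void *C, void *scratch);
```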

Comment:

> the pointers are typically not constant at brgemm object creation time. However, the offsets might be. In that case, base pointer + array of offsets would be the fastest option, as it can remove some overhead related to pointer arithmetic

Yup, that was my line of thinking. Take the inference case as an example: the compiler would be able to prepopulate the weight-matrix offsets in advance, cutting the cost of address computation. While good on paper, it is unclear to me whether that would make any real difference, hence the question.

@mgouicem (author) replied:

Fair observation. The usages I saw in IPEX, OpenVINO and oneDNN all rely on the pointer-based flavor.
I will let @nivas-x86 and @ankalinin comment on potential speedups when using precomputed offsets.

A contributor commented:

Graph Compiler uses the base pointer + fixed stride flavor for its matmul implementation and the array-of-pointers flavor for convolution. For most DL compilers, it's not easy to describe pointer arithmetic. Take brgemm in the Linalg dialect (https://mlir.llvm.org/docs/Dialects/Linalg/#linalgbatch_reduce_matmul-linalgbatchreducematmulop) as an example: base pointer + fixed stride perfectly matches its semantics. I suggest providing both base pointer + fixed stride for ease of use and array of pointers for advanced, flexible usage.

Comment:

Echo @ZhennanQin.

@nivas-x86 (Contributor) commented (Apr 8, 2024):

Having ptr+offsets helps in the case of brgemm kernels where the work done by brgemm is small.
In depthwise conv (brdgmm) we cache the entire batch (all combinations of bs_size and offsets required), and it seemed to provide some perf benefit.
But the argument for having constant pointers is counter-intuitive to me: no benefit and limited flexibility.

@mgouicem (author) replied:

I don't think there is any argument for having constant pointers, quite the opposite.
Offsets can be constants, but pointers will always have to be recomputed, even when the offsets are constant.

> For the brgemm API specifically, an array of equal sizes for A and B
> blocks are passed. To guarantee these are of equal size, and simplify
> pointer arithmetic logic inside generated kernels, we will take an
> array of pairs of pointers.

Comment:

Can this be a potential problem in the case of some lookup pointer arrays to both A and B? I'm imagining some cases of embeddings. In that case, an API that accepts a vector of pairs would likely force some memory movement.

@mgouicem (author) replied:

I am not sure I understand. Could you give an example or more detail on the concern?

Because of the batch-reduction semantics, we have to take as many A_i blocks as B_i blocks. The two options (sketched below) are:

  • take two arrays of equal size (one for A_i, one for B_i);
  • take a single array of pairs (interleaving pointers to A_i/B_i).
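
Sketched out, the two options look like this (hypothetical signatures):

```cpp
// Option 1: two parallel arrays; nothing structurally ties their lengths.
void execute(const void **A, const void **B, dim_t batch,
        void *C, void *scratch);

// Option 2: one interleaved array of pairs; equal A_i/B_i counts hold by
// construction.
void execute(const std::vector<std::pair<void *, void *>> &A_B,
        void *C, void *scratch);
```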

A contributor commented:

I also don't understand why merging A and B into a single array guarantees they are of equal size. What if the array size is odd? The sizes of the A and B pointer arrays are specified by the batch parameter; the user should guarantee that both the A and B arrays contain batch pointers.

A contributor commented:

@mgouicem, FWIW, taking two vectors will save us from having to copy pointers from the pairs into another C-like array to pass them on to the C API, which means less overhead.

@mgouicem (author) replied:

@densamoilov I had that exact conversation with @dzarukin :-).

Currently, the internal implementation takes a single array of structures containing the A_i and B_i pointers. The interleaving of pointers allows saving on GPR usage, IIRC.

Regarding extra copies, there are two sources:

  • C++ to C API. Whether we use two vectors or a vector of pairs, we can pass those directly. Albeit cleaner with two separate vectors, it is still doable without a copy for a vector of pairs (see here, and the sketch below);
  • external API to internal implementation. This one is tricky since, internally, we use a multipurpose structure (see brgemm_batch_element_t), so we have to copy anyway because of it (until we clean that up).

Now, I think the vector-of-pairs version is preferable, since there is a path to removing all copies without modifying kernel generation (by aligning the internal structure to the size of a pair).
If we go with the two-separate-vectors solution, removing all copies would require modifications to internal structures and APIs, as well as to the kernels.
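
A minimal sketch of the no-copy pass-through under discussion; it relies on std::pair<void *, void *> laying out its two pointers contiguously, which (as noted in the next comment) the standard does not guarantee. The C entry point is hypothetical:

```cpp
#include <utility>
#include <vector>

// Hypothetical C entry point taking a flat array of interleaved A/B pointers.
extern "C" void brgemm_execute_c(const void *const *a_b_flat, int batch);

void call_through(const std::vector<std::pair<void *, void *>> &A_B) {
    static_assert(sizeof(std::pair<void *, void *>) == 2 * sizeof(void *),
            "relies on std::pair being exactly two packed pointers");
    // Reinterpret the pair array as a flat pointer array: no copy, but
    // formally relying on implementation details of std::pair.
    const void *const *flat
            = reinterpret_cast<const void *const *>(A_B.data());
    brgemm_execute_c(flat, static_cast<int>(A_B.size()));
}
```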

@densamoilov (Contributor) replied (Apr 9, 2024):

> C++ to C API. Whether we use two vectors or a vector of pairs, we can pass those directly. Albeit cleaner with two separate vectors, it is still doable without a copy for a vector of pairs (see here)

@mgouicem, I think there are two problems with your example:

  • You are assuming that the data members of std::pair are contiguous in memory. The standard doesn't seem to guarantee that. Your example works because of the way std::pair is implemented (link): in our case, std::pair is just a struct with two pointers, therefore no padding is required, hence the example works.
  • You are basically passing a C++ object (std::pair) as void * to the C API. By design, we forbid C++ objects from crossing the library boundary (a.k.a. the C API) unless there is no way around it (e.g. the SYCL interop API).

@mgouicem (author) replied:

I agree the standard does not guarantee that std::pair is contiguous in memory. However, all compilers that I know of implement it this way. Same as type punning or reinterpret_cast: wrt the standard they are not safe to use, but in practice compilers tend to preserve behaviors that are not required by the standard.

In any case, we can replace std::pair with a dedicated struct, no big deal.
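
Such a dedicated struct might look like the following (name hypothetical). As a standard-layout aggregate, its members are laid out in declaration order, so it avoids the std::pair concern:

```cpp
// Hypothetical replacement for std::pair<void *, void *> at the API boundary.
struct brgemm_ab_pair {
    const void *A; // pointer to the i-th A block
    const void *B; // pointer to the i-th B block
};
static_assert(sizeof(brgemm_ab_pair) == 2 * sizeof(void *),
        "two pointers, no padding expected");
```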

@densamoilov (Contributor) replied (Apr 9, 2024):

> In any case, we can replace std::pair with a dedicated struct, no big deal.

If we are dead set on passing pairs of pointers, then I think that's what we need to do (I prefer to err on the side of caution); this way we won't rely on the implementation details of std::pair.

Comment:

So, is the decision here to go with an array of pointer pairs, accepting potentially higher overhead?

Comment:

Also, as a follow-up to the offline conversation, expanding a little on the original comment:

> Can this be a potential problem in the case of some lookup pointer arrays to both A and B? I'm imagining some cases of embeddings. In that case, an API that accepts a vector of pairs would likely force some memory movement.

Consider an example where one of the inputs is a weight matrix and the other is accessed indirectly, as you'd do in an embedding lookup (there are other similar cases in beam search in LLMs, image attention masks, and roi_heads in detection). Restricting the API to accept pairs of pointers hinders optimization possibilities. So those cases could be addressed with a separate interface, but the feasibility should also be estimated.

> #### How to express HW specific layouts
>
> There are two main ways oneDNN can expose those layouts:
> - through layout tags. This has the advantage of being very simple, […]

Comment:

Could you please elaborate a bit on what those tags would look like?

@mgouicem (author) replied:

Yes, they are specified in the API proposal below. They will only express the HW layout requirement, not blocking logic.
So for now, there would only be plain, packed_32 (values packed in 32-bit groups, needed for VNNI/AMX), and packed_64 (values packed in 64-bit groups, maps to Arm *MMLA).
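
A hedged sketch of what that tag enum could look like (names taken from this discussion; the exact spelling in the final API may differ):

```cpp
// Layout tags express only what the ISA mandates, not cache blocking.
enum class pack_type {
    plain,     // dense row-major block, no HW-mandated packing
    packed_32, // elements packed along K in 32-bit groups (VNNI/AMX)
    packed_64, // elements packed along K in 64-bit groups (Arm *MMLA)
};
```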


> There are two options here:
> - expose two APIs, `brgemm::set_hw_context()` and […]
A contributor commented:

If we need to run two brgemm kernels, do we need to run brgemm::release_hw_context() and brgemm::set_hw_context() in between? Can we merge them into brgemm::reset_hw_context() to reduce the overhead?

@mgouicem (author) replied:

There is a comment at the API level stating that no release is needed between two calls to set_hw_context().
However, adding reset_hw_context is cleaner, so I am adding that to the proposal. Thanks.
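
A hedged sketch of the resulting calling pattern (kernels and buffers are assumed to exist; the rest follows the proposal's signatures):

```cpp
brg_main.set_hw_context();           // configure the core once (e.g. AMX palette)
for (size_t i = 0; i < n_blocks; ++i)
    brg_main.execute(A_B[i], C, scratch);

brg_tail.set_hw_context();           // switching kernels: no release required
brg_tail.execute(A_B_tail, C_tail, scratch);

brgemm::release_hw_context();        // static: independent of how it was set
```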

> The recommendation is to go with the first option, with explicit set
> and release functions, that the user can hoist as they see fit.
>
> With respect to OS syscall, we recommend to make it transparent to […]
A contributor commented:

How about the AMX palette-buffer management? Is it possible to expose this to the user, so that they can implement their own brgemm::set_hw_context() and brgemm::release_hw_context() and avoid those function calls?

Comment:

What's the problem with hiding this inside oneDNN?

A contributor replied:

It's related to an optimization opportunity. Based on the current design, you must reconfigure the HW context if multiple brgemm kernels are used. But if we can manage the AMX palette ourselves, we don't need to reconfigure the HW context when those brgemms use the same AMX palette.

@mgouicem (author) replied:

@ZhennanQin, we would like the API to be as ISA-agnostic as possible.
Internally, we will not reconfigure if the same palette is already loaded on the core.


> ## Handling of attributes
>
> First, here is the list of attributes we plan on exposing: […]
A contributor commented:

I suggest adding ISA and accumulation_datatype as attributes as well, for the end user to control.

> // The batch size is the size of the vector.
> // pointers are void*, datatypes are specified in constructor
> // Computes C = \beta C + \alpha \sum_i A_i \cdot B_i
> void execute(const std::vector<std::pair<void *, void *>> &A_B, […]
@ZhennanQin (Contributor) commented (Apr 7, 2024):

It's very difficult to handle STL containers inside MLIR. Will you provide a C interface as well? I want to check what the C interface looks like.

Comment:

Agreed. Also, std::vector might incur overhead in its memory management.

> // - Release is static
> // - No release necessary between different calls to set_hw_context
> void set_hw_context() const;
> static void release_hw_context() const;
A contributor commented:

  1. "set_hw_context()" seems very generic (maybe intentionally). But it raises a question about what is being modified (mxcsr, etc.. ). Maybe provide a documentation of what features it is allowed to modify, for transparency?
  2. By static do you mean, it is safe to call it regardless of a previous call to set_hw_context()?

@mgouicem (author) replied:

  1. This is a valid point, and I left it fuzzy on purpose :-). In general, it just changes the state of the core the function is executed on. If need be, we can be more specific in the documentation and specify, for each ISA, what is being changed.
  2. Just that it can be called without a brgemm object. This is to highlight that releasing the HW context should be independent of how it was set.


> ## All-in-all
>
> ### Interface proposal
@huanghaixin008 (Contributor) commented (Apr 11, 2024):

Glad to see the new APIs hide the details of palette management. I still have a few concerns regarding future API usage in Graph Compiler.

Currently, Graph Compiler calls internal BRGEMM APIs from the compiler-IR world in both the base pointer + fixed stride and array-of-pointers flavors. The compiler IR only provides very limited grammar, and the internal APIs require complex structs as parameters and notable low-level control, which is beyond the capability of the IR. Hence we built a few runtime wrappers to manage things like parameter construction, kernel reuse, and palette management.

With Graph Compiler transitioning to MLIR, it is no longer feasible to maintain a compiler-specific downstream runtime for upstream MLIR, so we tend to remove the runtime wrappers entirely. It seems that runtime wrappers would still be inevitable with the new APIs:

  • They still take fairly complex structs as parameters, which are hard to construct from IR, e.g. converting base pointer + fixed stride to array of pointers, and the construction of brgemm::desc;
  • The APIs are C++-based, which requires extra handling for name mangling and STL structures, as mentioned by @ZhennanQin before.

I think this is also a common issue for invocation from the compiler-IR world. Is it possible to provide an alternative set of C APIs taking only plain parameters and executing in a one-line fashion? That would make our lives a lot easier.

For example:

```c
uint64_t brgemm_dispatch(dim_t batch, dim_t M, dim_t N, dim_t K, ...); // returns a handle to the JITed kernel

void set_hw_context(uint64_t kernel_handle);
void release_hw_context(uint64_t kernel_handle);

// `array of pointers` interface
void execute(uint64_t kernel_handle, void *A[], void *B[], ...);
// helper for the `base pointer + fixed stride` usage; converts to the
// `array of pointers` interface internally (distinct name, since C has
// no overloading)
void execute_strided(uint64_t kernel_handle, void *A, void *B,
        uint32_t batch_size, uint64_t stride_A, uint64_t stride_B, ...);
```

A contributor replied:

@huanghaixin008 oneDNN never provides a C++ API without a C API layer underneath it. You can always switch to that if it's a better choice for MLIR.


> // we run it
> brg.execute(A_B, c_buff, dst + (r[0] * BM * N + r[1] * BN) * sizeof(bfloat16_t), scratch);
> });
Comment:

Will the original value of c_buff be accumulated into the final result? Is it possible to let the user control whether the original value of C participates in the accumulation? This could save the overhead of setting c_buff to 0.

@mgouicem (author) replied:

Yes, this is controlled by the add_C attribute. When set to false, it will just overwrite C with the result (same as if c_buff were 0).

> // both src, dst share same dt, no fused conversion
> transform(dim_t M, dim_t N, data_type dt,
>         pack_type tag_src, dim_t ld_src,
>         pack_type tag_dst, dim_t ld_dst);
@CaoE commented (Apr 25, 2024):

Will there still be limitations on the leading dimensions of transform?

@mgouicem (author) replied (Apr 25, 2024):

No API limitations. However, we will likely have some implementation limitations in the first release.
Those will be removed as the implementation matures.


> First, here is the list of attributes we plan on exposing:
> - fp_math mode, in order for user to control input conversions semantics;
> - scales/zero-point in order to support upconversion before computation.
@Valentine233 commented (Apr 28, 2024):

I suppose it is impossible to do scales/zero-points with BRGEMM_POST_OP, because it only accepts one binary memory during execution. So here are two questions:

  1. Would both BRGEMM and BRGEMM_POST_OP support scales/zero-points?
  2. How many pairs of scales/zero-points can we support? Taking the uint8-in/uint8-out case as an example, we need to feed two pairs, for inputs A and B respectively, and receive one pair for output C.


> struct brgemm {
>     // Vanilla version of brgemm with no post-op or destination conversion.
>     brgemm(dim_t batch, dim_t M, dim_t N, dim_t K, […]

Comment:

Given dtA and dtB are both dt::u8, what dtC and dtD (if with post-ops) are supported for now?

> […] void *C, void *scratch = nullptr);
>
> // Execution function for the advanced brgemm<> variant
> // Here the C matrix is just an input to accumulation […]

Comment:

I suppose the accumulation buffer is just an intermediate buffer. When the accumulation and the output have the same type, why do we need to create an additional accumulation buffer?

@vpirogov added this to the v3.5 milestone on Jul 16, 2024.