rfcs: Auto-tuning API #1764

dyoussif · 2023-11-30T18:07:09Z

rjoursler · 2023-12-04T17:50:29Z

rfcs/20231129-tuning-api/README.md

+create_primitive(prim); 
+dnnl_set_tune(true);   
+execute(prim);  //tuning happens here
+dnnl_set_tune(false); 


Using a global state to implement tuning seems like it introduces some usage limitations in regards to multi-threaded programs. In addition, I think there are some non-obvious conflicts with the primitive cache and when a tuned/non-tuned implementation are created at different times. Would it make sense to just add something like a

enum dnnl_dispatch_mode_t { dnnl_dispatch_default, dnnl_dispatch_heuristic, dnnl_dispatch_tune, }

to the primitive attributes instead? This could be extended to allow users further control around creation time and performance in the future and used to implement PR's such as #1743 (if it is accepted).

The main reason a global variable was used was to minimize coding changes for frameworks. I can check with them if they like this option better.

In addition, I think there are some non-obvious conflicts with the primitive cache and when a tuned/non-tuned implementation are created at different times.

The tuning state is set once for all primitives and configurations during tuning are cached separately from the default one. Can you give an example of a conflict?

Another potential "creation" scenario I've seen recently: a user wants to probe the primitive cache, that is skip creation (oneDNN returns an error) in case of a primitive cache miss. All this (together with tuning) could be combined under "create mode" sub-attribute.

Can you give an example of a conflict?

The core issue is that there is a significant internal state (visible via realized performance) based on execution order. This can cause reproduction issues depending the execution order. The ordering issues that come to my mind are

create(p); // Primitive cache miss, get default version tune_create(p); // Primitive cache hit, get default version

tune_create(p); // Get tuned version create(p); // Primitive cache hit, get tuned version ... // create many primitives create(p); // Primitive cache miss, get default version

This becomes significantly more complicated in a multi-threaded scenario where the ordering can vary from run to run.

The config iteration number is now part of the pd_key. Therefore any create that does not set the that iteration value will go to the default version. That version won't change until updated in the "finalize" part of tuning which will update the default key to point to the tuned implementation. In the case of a cache miss on the default version, the lookup table would be used to create the optimized implementation.

To me it seems like primitive cache management has to be addressed in a similar way regardless of the api.

echeresh · 2023-12-06T00:05:17Z

rfcs/20231129-tuning-api/README.md

+```c
+create_primitive(prim); 
+dnnl_set_tune(true);   
+execute(prim);  //tuning happens here


This scenario looks like the initially created primitive goes unused (as it's not needed for following tuning), and we have unnecessary overhead. Is it correct? Is there a way to avoid that?

It can be avoided if the tuning state is set before primitive creation or the proposed dispatch_mode_t is used.
On the other hand, there will be many primitive creations during the tuning process itself, so i don't anticipate the overhead from the first primitive creation to be a big overhead.

If there is no reason to keep the flow in this order (create -> set_tune) then I think it makes sense to flip these steps and use an attribute. In general the API would work like this:

Add a tuning knob to the primitive attribute to specify that primitive is created for tuning (let's call this primitive a "tunable" primitive)

If function API is explicitly requested - introduce dnnl_set_tune() which will modify the attribute on creation automatically

During execution of a tunable primitive (execute() call) oneDNN performs all necessary benchmarking or no benchmarking (e.g. when the primitive doesn't support tuning or when this case was already tuned). oneDNN also doesn't guarantee correctness upon execution for tunable primitives

A tunable primitive can't be used for normal execution, it must be re-created by the user (like in your example)

Do you see any issues with such API?

It looks good to me except for

If function API is explicitly requested - introduce dnnl_set_tune() which will modify the attribute on creation automatically

I believe primitive_attr_t is passed in as const during creation, so in this scenario it would be implemented the same way as currently proposed.

I thought about following the same approach as for FP math mode:

oneDNN/src/common/primitive_attr.hpp

Line 620 in c900643

, fpmath_mode_(dnnl::impl::get_fpmath_mode())

We have a global setter which sets a global value that is used as the default value during primitive attribute creation.

I see. Does this imply after tuning is unset, the primitive_attr_t needs to be recreated?

Yes, it will be required. Might be less flexible but at the same time having a distinctive property between normal and tunable primitives looks like an easy-to-understand approach.

echeresh · 2023-12-06T00:07:12Z

rfcs/20231129-tuning-api/README.md

+create_primitive(prim); 
+dnnl_set_tune(true);   
+execute(prim);  //tuning happens here
+dnnl_set_tune(false); 


Another potential "creation" scenario I've seen recently: a user wants to probe the primitive cache, that is skip creation (oneDNN returns an error) in case of a primitive cache miss. All this (together with tuning) could be combined under "create mode" sub-attribute.

echeresh · 2023-12-06T00:15:47Z

rfcs/20231129-tuning-api/README.md

+`query::tune_nconfigs`.  For each config it will create the primitive, execute it 5 times, and record the min
+time in the tune_info_t structure. In the case where number of configurations to try is greater than 40 it will stop.
+It will then recreate the primitive with tune_status set to finalize. During this call the config with the best 
+performance will be stored in a lookup table managed by the primitive and primitive cache will be updated to point


If this mechanism is to be extended to many other primitives - it may make sense to manage tuned parameters at a higher level in a generic way. Otherwise every primitive would have to implement lookup tables, logic for the finalization step, etc. But this is an implementation detail, and can be adjusted in the future if needed without API changes.

rjoursler reviewed Dec 4, 2023

View reviewed changes

echeresh reviewed Dec 6, 2023

View reviewed changes

vpirogov added the RFC A design document label Dec 14, 2023

dyoussif force-pushed the dyoussif/autotune branch 3 times, most recently from e287200 to 648d60d Compare January 31, 2024 01:29

rfcs: Auto-tuning API

84b0919

dyoussif force-pushed the dyoussif/autotune branch from 648d60d to 84b0919 Compare January 31, 2024 01:32

rjoursler mentioned this pull request Aug 29, 2024

Start enabling the "generic" OpenCL vendor #2019

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rfcs: Auto-tuning API #1764

rfcs: Auto-tuning API #1764

dyoussif commented Nov 30, 2023

rjoursler Dec 4, 2023 •

edited

Loading

dyoussif Dec 5, 2023

echeresh Dec 6, 2023

rjoursler Dec 7, 2023 •

edited

Loading

dyoussif Dec 7, 2023

echeresh Dec 6, 2023

dyoussif Dec 6, 2023

echeresh Dec 7, 2023 •

edited

Loading

dyoussif Dec 8, 2023

echeresh Dec 8, 2023

dyoussif Dec 8, 2023

echeresh Dec 8, 2023

echeresh Dec 6, 2023

echeresh Dec 6, 2023

rfcs: Auto-tuning API #1764

Are you sure you want to change the base?

rfcs: Auto-tuning API #1764

Conversation

dyoussif commented Nov 30, 2023

rjoursler Dec 4, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rjoursler Dec 7, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

echeresh Dec 7, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rjoursler Dec 4, 2023 •

edited

Loading

rjoursler Dec 7, 2023 •

edited

Loading

echeresh Dec 7, 2023 •

edited

Loading