[Inductor][CPP] Enable Quantized Linear GEMM Template with FP32 output #128825
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/128825
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (3 unrelated failures) As of commit 7334f44 with merge base dabaebd:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk.
UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
ghstack-source-id: 1876e45d603fa56589b9d83895770b05f780577a
Pull Request resolved: #128825

… FP32 output"

**Summary**
Support the int8 GEMM template with a reference MicroInt8GEMM kernel for the case:
- Activation: uint8; Weight: int8; Output: float32/bfloat16
- Without unary post-operator fusion

**Test Plan**
```
clear && python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_quantized_linear_with_pointwise
```

**Next Step**
- [ ] Unary post op fusion
- [ ] Int8 output
- [ ] AMX int8 MicroGEMM Kernel
- [ ] Binary Fusion

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang
Hi @jansel, could you kindly take a look at this PR? I have modified
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…t and Unary Post Op (#129048)

**Summary**
Based on the previous PR, add the config to support int8 output and unary post-op fusion with `ReLU` and `GeLU`:
- Activation dtype: uint8
- Weight dtype: int8
- Output dtype: float32/bfloat16/uint8
- Post Op Fusion: with unary post operator fusion

**Test Plan**
```
clear && python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_quantized_linear_with_pointwise
```

**Next Step**
- [✓] Unary post op fusion
- [✓] Int8 output
- [ ] Binary Fusion
- [ ] AMX int8 MicroGEMM Kernel

Pull Request resolved: #129048
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: #128825
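The uint8-output path with a fused unary post op described above reduces to: accumulate, apply the post op in fp32, then requantize. A minimal NumPy sketch of that epilogue (the function name, rounding mode, and fusion point are assumptions for illustration, not the template's actual code):

```python
import numpy as np

def requant_epilogue(acc_fp32, out_scale, out_zp, post_op=None):
    """Apply an optional fp32 unary post op, then requantize to uint8.

    `post_op` is any elementwise fp32 function, e.g. ReLU or an
    approximate GeLU, fused before the output requantization.
    """
    y = acc_fp32 if post_op is None else post_op(acc_fp32)
    q = np.rint(y / out_scale) + out_zp   # scale, round, shift by zero point
    return np.clip(q, 0, 255).astype(np.uint8)
```

For example, with `post_op=lambda v: np.maximum(v, 0.0)` this fuses a ReLU into the uint8 output conversion instead of materializing an intermediate fp32 tensor.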
**Summary**
We change the schema of QLinear Binary so that it is easier to enable the corresponding GEMM template:
- The extra input of the binary post op is a tensor that needs to be an input node for autotuning, so we move it in front of `output_scale`, which is a scalar.
- We also move it in front of `bias`, since `bias` is an optional tensor for this fusion while `other` is required for linear binary fusion.

**Test Plan**
```
python -u -m pytest -s -v test/quantization/core/test_quantized_op.py -k qlinear
python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k qlinear
```

Pull Request resolved: #129049
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: #128825, #129048
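The reordering can be sketched with hypothetical Python signatures (illustrative names only; the real op's exact schema is not reproduced here):

```python
# Before (hypothetical): the required binary input `other` trailed the
# scalar `output_scale` and the optional `bias`.
def qlinear_binary_old(x, w, bias, output_scale, output_zero_point, other):
    return (x, w, bias, output_scale, output_zero_point, other)

# After (hypothetical): `other` moves in front of both `output_scale`
# (a scalar) and `bias` (an optional tensor), so every required tensor
# input sits at the front -- the positions autotuning treats as input nodes.
def qlinear_binary_new(x, w, other, bias, output_scale, output_zero_point):
    tensor_inputs = [t for t in (x, w, other) if t is not None]
    return tensor_inputs, (bias, output_scale, output_zero_point)
```

The design point is that autotuning enumerates leading tensor arguments as input nodes, so required tensors must precede scalars and optionals.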
…ion (#129103)

**Summary**
Based on the previous PR, add the config to support quantized linear binary + optional(unary) post-op fusion:
- Activation dtype: uint8
- Weight dtype: int8
- Output dtype: float32/bfloat16/uint8
- Post Op Fusion: with binary and optional[unary] post operator fusion

**Test Plan**
```
clear && python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_quantized_linear_with_pointwise_binary
```

**Next Step**
- [✓] Unary post op fusion
- [✓] Int8 output
- [✓] Binary Fusion
- [ ] AMX int8 MicroGEMM Kernel

Pull Request resolved: #129103
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: #128825, #129048, #129049
**Summary**
Add the AMX micro GEMM kernel with int8 data type.

**Test Plan**
```
clear && python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_quantized_linear_amx
```

**Next Step**
- [✓] Unary post op fusion
- [✓] Int8 output
- [✓] Binary Fusion
- [✓] AMX int8 MicroGEMM Kernel

Pull Request resolved: #129220
Approved by: https://github.com/jgong5
ghstack dependencies: #128825, #129048, #129049, #129103
…129221)

**Summary**
This PR mainly refactors two things:
1. Pass in the weight's data type explicitly in `create_micro_gemm` as `input2.dtype`. When registering `CppMicroGemmConfig`, we reuse `input.dtype` if `input2.dtype` is not explicitly registered.
2. Add a utility function to derive the output data type and compute data type from the input data type.

Pull Request resolved: #129221
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: #128825, #129048, #129049, #129103, #129220
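The utility in point 2 can be pictured as a small dtype-mapping helper. This is a sketch under assumed mappings (e.g. int32 accumulation for uint8 activations, fp32 accumulation for reduced-precision floats) and is not the actual function in the codebase:

```python
def get_gemm_dtypes(input_dtype):
    """Map an activation dtype name to (output_dtype, compute_dtype).

    Plain strings keep the sketch dependency-free; the real utility
    operates on torch dtypes inside the CPP GEMM template code.
    """
    if input_dtype == "uint8":
        # int8 GEMM accumulates in int32 and emits fp32 before any epilogue.
        return "float32", "int32"
    if input_dtype in ("bfloat16", "float16"):
        # Reduced-precision float GEMMs accumulate in fp32.
        return input_dtype, "float32"
    # fp32 in: fp32 out, fp32 accumulation.
    return input_dtype, input_dtype
```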
… template (#129470)

**Summary**
Remove redundant INT8-specific logic in the INT8 GEMM template to unify the code structure with the FP32/BF16/FP16 GEMM template.

**Test Plan**
```
numactl -C 56-111 -m 1 python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_quantized_linear
```

Pull Request resolved: #129470
Approved by: https://github.com/jgong5
ghstack dependencies: #128825, #129048, #129049, #129103, #129220, #129221
Stack from ghstack (oldest at bottom):

**Summary**
Support the int8 GEMM template with a reference MicroInt8GEMM kernel for the case:
- Activation dtype: uint8
- Weight dtype: int8
- Output dtype: float32/bfloat16
- Post Op Fusion: without unary post operator fusion

**Test Plan**
```
clear && python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_quantized_linear_with_pointwise
```

**Next Step**
- [ ] Unary post op fusion
- [ ] Int8 output
- [ ] Binary Fusion
- [ ] AMX int8 MicroGEMM Kernel

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang
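The reference semantics of the uint8-activation / int8-weight linear with fp32 output can be sketched in NumPy. This is a hedged illustration of the math, not the C++ template itself; the function name and the per-output-channel weight-scale layout are assumptions:

```python
import numpy as np

def qlinear_ref(x_q, x_scale, x_zp, w_q, w_scales, bias=None):
    """Reference uint8-activation / int8-weight linear with float32 output.

    x_q:      (M, K) uint8 quantized activation, scale `x_scale`, zero point `x_zp`
    w_q:      (K, N) int8 quantized weight with per-output-channel scales `w_scales`
    """
    # Integer GEMM in int32 with the activation zero point folded out.
    acc = (x_q.astype(np.int32) - x_zp) @ w_q.astype(np.int32)
    # Dequantize: one activation scale times per-column weight scales.
    out = acc.astype(np.float32) * (x_scale * w_scales)
    if bias is not None:
        out = out + bias
    return out
```

A quick way to validate such a kernel is to compare it against an fp32 matmul on the dequantized inputs, which is what the integer path is meant to reproduce up to rounding.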