
INT4 quantization not working on MI210 #154

Closed

yafehlis opened this issue Apr 8, 2024 · 2 comments

yafehlis commented Apr 8, 2024

INT8 quantization works fine, but INT4 does not work.
[screenshot attachment: Capture]
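
For context, gpt-fast's int4 mode is groupwise 4-bit weight-only quantization. The sketch below is a minimal CPU-only illustration of that scheme, not the project's actual `quantize.py`; the group size of 32 and the helper names are assumptions for the example.

```python
# Minimal sketch of groupwise 4-bit weight-only quantization
# (illustrative; not gpt-fast's exact quantize.py code path).
import torch

def quantize_int4_groupwise(w: torch.Tensor, group_size: int = 32):
    """Quantize each group of `group_size` weights to 4-bit codes in [0, 15]."""
    assert w.shape[-1] % group_size == 0
    g = w.reshape(-1, group_size)
    w_min = g.min(dim=-1, keepdim=True).values
    w_max = g.max(dim=-1, keepdim=True).values
    scale = ((w_max - w_min) / 15.0).clamp(min=1e-8)  # 16 levels -> step size
    q = ((g - w_min) / scale).round().clamp(0, 15).to(torch.uint8)
    return q, scale, w_min  # codes, per-group scale, per-group zero point

def dequantize_int4_groupwise(q, scale, w_min, shape):
    return (q.float() * scale + w_min).reshape(shape)

w = torch.randn(4, 64)
q, s, z = quantize_int4_groupwise(w)
w_hat = dequantize_int4_groupwise(q, s, z, w.shape)
print((w - w_hat).abs().max())  # small reconstruction error
```

On the GPU, the resulting codes are packed and consumed by a fused matmul kernel, and it is that kernel which lacked a ROCm implementation at the time of this report.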

Chillee (Contributor) commented Apr 25, 2024

Yeah, int4 quantization doesn't work on AMD GPUs right now.
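
Until support lands, one way to gate the int4 path is to detect a ROCm build at runtime. This guard is an illustrative suggestion, not an official gpt-fast option:

```python
import torch

def is_rocm_build() -> bool:
    # torch.version.hip is a version string on ROCm builds and None on CUDA builds.
    return getattr(torch.version, "hip", None) is not None

# e.g. fall back to int8 weight-only quantization on ROCm until the
# int4 kernel gains AMD support:
mode = "int8" if is_rocm_build() else "int4"
```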

datagero pushed a commit to datagero/pytorch that referenced this issue Jul 10, 2024
- Add AMD support for int4 kernel
  - Only supports CDNA2 and CDNA3 GPUs for now
  - Uses `mfma_f32_16x16x16bf16` instruction for matrix multiply
  - Uses `v_and_or_b32` instruction and `__hfma2` intrinsic for unpacking bf16 values (see the sketch after this commit message)
  - Enable hipify for `__nv_bfloat16` and `__nv_bfloat162` data types
- Enable int4 unit tests for CDNA2 and CDNA3 AMD GPUs
- Fix TorchScript issues due to hipify for the `__nv_bfloat16` type
  - TorchScript has its own implementation of the bfloat16 type
    - Implemented in the `__nv_bfloat16` structure at [resource_strings.h](https://github.com/pytorch/pytorch/blob/main/torch/csrc/jit/codegen/fuser/cuda/resource_strings.h)
    - So we shouldn't hipify any reference to `__nv_bfloat16` in the TorchScript implementation
    - Hence, the direct `__nv_bfloat16` references in `codegen.cpp` and `cuda_codegen.cpp` were moved to `resource_strings.h`, which is already exempted from hipify

Fixes pytorch#124699
Fixes pytorch-labs/gpt-fast/issues/154

Co-authored-by: Nikita Shulga <[email protected]>
Pull Request resolved: pytorch#129710
Approved by: https://github.com/malfet
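
To make the `v_and_or_b32` + `__hfma2` bullet concrete: a standard trick for this kind of unpacking materializes each 4-bit code q as the bf16 value 128 + q, by AND-masking the nibble and OR-ing it into the mantissa of the bf16 pattern for 128.0 (0x4300, whose mantissa LSB has weight 1.0), then using one fused multiply-add to fold away the 128 bias while applying the scale and zero point. The Python sketch below demonstrates the bit trick on the CPU; the actual kernel's packed-register layout is more involved, and the helper names here are illustrative.

```python
# Sketch of the nibble -> bf16 bit trick behind "and/or + fused multiply-add"
# style unpacking (illustrative; the real kernel operates on packed registers).
import struct

def bf16_bits_to_float(bits: int) -> float:
    # bf16 is the top 16 bits of an IEEE-754 float32.
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]

BIAS = 0x4300  # bf16 bit pattern for 128.0; its mantissa LSB has weight 1.0

def unpack_nibble(packed: int, which: int, scale: float, zero: float) -> float:
    q = (packed >> (4 * which)) & 0xF            # the AND step: isolate the nibble
    val = bf16_bits_to_float(BIAS | q)           # the OR step: yields 128 + q
    return val * scale + (zero - 128.0 * scale)  # the FMA step removes the bias

packed = 0xB7  # two packed codes: q0 = 7, q1 = 11
for i in range(2):
    print(unpack_nibble(packed, i, scale=0.5, zero=-1.0))
# prints 7*0.5 - 1.0 = 2.5 and 11*0.5 - 1.0 = 4.5
```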
jerrymannil commented

PR merged. This issue can be closed now.

xuhancn pushed a commit to xuhancn/pytorch that referenced this issue Jul 25, 2024 (same commit message as above)