
INT4 quantization not working on MI210 #154

Closed

yafehlis opened this issue Apr 8, 2024 · 2 comments

yafehlis commented Apr 8, 2024

INT8 quantization works fine, but INT4 does not work.
[screenshot attachment: Capture]
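
For context, gpt-fast's int4 mode is groupwise 4-bit weight-only quantization. The sketch below is a minimal CPU-only illustration of that scheme, not the project's actual `quantize.py`; the group size of 32 and the helper names are assumptions for the example.

```python
# Minimal sketch of groupwise 4-bit weight-only quantization
# (illustrative; not gpt-fast's exact quantize.py code path).
import torch

def quantize_int4_groupwise(w: torch.Tensor, group_size: int = 32):
    """Quantize each group of `group_size` weights to 4-bit codes in [0, 15]."""
    assert w.shape[-1] % group_size == 0
    g = w.reshape(-1, group_size)
    w_min = g.min(dim=-1, keepdim=True).values
    w_max = g.max(dim=-1, keepdim=True).values
    scale = ((w_max - w_min) / 15.0).clamp(min=1e-8)  # 16 levels -> step size
    q = ((g - w_min) / scale).round().clamp(0, 15).to(torch.uint8)
    return q, scale, w_min  # codes, per-group scale, per-group zero point

def dequantize_int4_groupwise(q, scale, w_min, shape):
    return (q.float() * scale + w_min).reshape(shape)

w = torch.randn(4, 64)
q, s, z = quantize_int4_groupwise(w)
w_hat = dequantize_int4_groupwise(q, s, z, w.shape)
print((w - w_hat).abs().max())  # small reconstruction error
```

On the GPU, the resulting codes are packed and consumed by a fused matmul kernel, and it is that kernel which lacked a ROCm implementation at the time of this report.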

Chillee (Contributor) commented Apr 25, 2024

Yeah, int4 quantization doesn't work on AMD GPUs right now.
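
Until support lands, one way to gate the int4 path is to detect a ROCm build at runtime. This guard is an illustrative suggestion, not an official gpt-fast option:

```python
import torch

def is_rocm_build() -> bool:
    # torch.version.hip is a version string on ROCm builds and None on CUDA builds.
    return getattr(torch.version, "hip", None) is not None

# e.g. fall back to int8 weight-only quantization on ROCm until the
# int4 kernel gains AMD support:
mode = "int8" if is_rocm_build() else "int4"
```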

datagero pushed a commit to datagero/pytorch that referenced this issue Jul 10, 2024
- Add AMD support for int4 kernel
  - Only supports CDNA2 and CDNA3 GPUs for now
  - Uses `mfma_f32_16x16x16bf16` instruction for matrix multiply
  - Uses `v_and_or_b32` instruction and `__hfma2` intrinsic for unpacking bf16 values (see the sketch after this commit message)
  - Enable hipify for `__nv_bfloat16` and `__nv_bfloat162` data types
- Enable int4 unit tests for CDNA2 and CDNA3 AMD GPUs
- Fix TorchScript issues due to hipify for the `__nv_bfloat16` type
  - TorchScript has its own implementation of the bfloat16 type
    - Implemented in the `__nv_bfloat16` structure at [resource_strings.h](https://github.com/pytorch/pytorch/blob/main/torch/csrc/jit/codegen/fuser/cuda/resource_strings.h)
    - So we shouldn't hipify any reference to `__nv_bfloat16` in the TorchScript implementation
    - Hence, the direct `__nv_bfloat16` references in `codegen.cpp` and `cuda_codegen.cpp` were moved to `resource_strings.h`, which is already exempted from hipify

Fixes pytorch#124699
Fixes pytorch-labs/gpt-fast/issues/154

Co-authored-by: Nikita Shulga <[email protected]>
Pull Request resolved: pytorch#129710
Approved by: https://github.com/malfet
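
To make the `v_and_or_b32` + `__hfma2` bullet concrete: a standard trick for this kind of unpacking materializes each 4-bit code q as the bf16 value 128 + q, by AND-masking the nibble and OR-ing it into the mantissa of the bf16 pattern for 128.0 (0x4300, whose mantissa LSB has weight 1.0), then using one fused multiply-add to fold away the 128 bias while applying the scale and zero point. The Python sketch below demonstrates the bit trick on the CPU; the actual kernel's packed-register layout is more involved, and the helper names here are illustrative.

```python
# Sketch of the nibble -> bf16 bit trick behind "and/or + fused multiply-add"
# style unpacking (illustrative; the real kernel operates on packed registers).
import struct

def bf16_bits_to_float(bits: int) -> float:
    # bf16 is the top 16 bits of an IEEE-754 float32.
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]

BIAS = 0x4300  # bf16 bit pattern for 128.0; its mantissa LSB has weight 1.0

def unpack_nibble(packed: int, which: int, scale: float, zero: float) -> float:
    q = (packed >> (4 * which)) & 0xF            # the AND step: isolate the nibble
    val = bf16_bits_to_float(BIAS | q)           # the OR step: yields 128 + q
    return val * scale + (zero - 128.0 * scale)  # the FMA step removes the bias

packed = 0xB7  # two packed codes: q0 = 7, q1 = 11
for i in range(2):
    print(unpack_nibble(packed, i, scale=0.5, zero=-1.0))
# prints 7*0.5 - 1.0 = 2.5 and 11*0.5 - 1.0 = 4.5
```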
jerrymannil commented

PR merged. This issue can be closed now.

xuhancn pushed a commit to xuhancn/pytorch that referenced this issue Jul 25, 2024 (same commit message as above)