
Compile FBGEMM on H100 #2298

Closed
wants to merge 2 commits into from

Conversation

xuzhao9
Contributor

@xuzhao9 xuzhao9 commented Jun 12, 2024

H100 requires the special 9.0a entry in the CUDA arch list to compile the full set of kernels.
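For reference, a minimal sketch of how the extra arch entry is typically passed when building from source. The variable name follows the standard PyTorch/FBGEMM build convention; treat its applicability to this exact build as an assumption:

```shell
# Assumption: the build honors the standard TORCH_CUDA_ARCH_LIST convention.
# "9.0a" maps to nvcc's -gencode arch=compute_90a,code=sm_90a, which enables
# Hopper-specific instructions that the plain "9.0" target does not.
export TORCH_CUDA_ARCH_LIST="9.0a"
```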

Test plan:
https://github.com/pytorch/benchmark/actions/runs/9489468560

Benchmark result:
OSS Repro on H100 devgpu 500W:

(py311) [[email protected] ~/local/benchmark (xz9/add-fbgemm)]$ CUDA_VISIBLE_DEVICES=4 python run_benchmark.py triton --op fp8_gemm_rowwise  --m 4 --n 3584 --k 8192 --num-inputs 1 --only _triton,_cutlass

(M, N, K)    _triton-tflops    _cutlass-accuracy    _cutlass-speedup    _cutlass-tflops
---------------  ----------------  -------------------  ------------------  -----------------
(4, 3584, 8192)           2.93719                    1             2.72519             8.0044
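As a sanity check on the table above, the _cutlass-speedup column matches the ratio of the two tflops columns (this interpretation is an inference from the numbers, not stated in the tool's docs):

```shell
# _cutlass-speedup appears to be _cutlass-tflops / _triton-tflops:
awk 'BEGIN { printf "%.5f\n", 8.0044 / 2.93719 }'   # prints 2.72519
```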

Internal fbcode repro:

$ CUDA_VISIBLE_DEVICES=4 buck2 run @mode/opt  -c fbcode.nvcc_arch=h100a -c fbcode.platform010_cuda_version=12.4  //pytorch/benchmark:triton -- --op fp8_gemm_rowwise  --m 4 --n 3584 --k 8192 --num-inputs 1 --only _triton,_cutlass

      (M, N, K)    _triton-tflops    _cutlass-accuracy    _cutlass-speedup    _cutlass-tflops
---------------  ----------------  -------------------  ------------------  -----------------
(4, 3584, 8192)           6.28427                    1             1.28493            8.07484

Both runs use the same FBGEMM commit hash. The OSS Triton version seems to perform worse than the Meta internal version, but the CUTLASS kernel performance is similar.

@facebook-github-bot
Contributor

@xuzhao9 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.


@facebook-github-bot
Contributor

@xuzhao9 merged this pull request in c8d6c2a.

@xuzhao9 xuzhao9 deleted the xz9/fix-fbgemm branch June 12, 2024 22:26
@xuzhao9
Contributor Author

xuzhao9 commented Jul 4, 2024

Updated result on 20240704:
OSS:

$ CUDA_VISIBLE_DEVICES=4 python run_benchmark.py triton --op fp8_gemm_rowwise  --m 4 --n 3584 --k 8192 --num-inputs 1 --only _triton,_cutlass --metrics tflops
      (M, N, K)    _triton-tflops    _cutlass-tflops
---------------  ----------------  -----------------
(4, 3584, 8192)           10.2685            14.0767

Internal:

      (M, N, K)    _triton-tflops    _cutlass-accuracy    _cutlass-speedup    _cutlass-tflops
---------------  ----------------  -------------------  ------------------  -----------------
(4, 3584, 8192)           8.40547                    0             1.68237            14.1411
