
[reland][ROCm] TunableOp for gemm_and_bias #128919

Closed · wants to merge 11 commits

Conversation

@jeffdaily (Collaborator) commented Jun 18, 2024:

Reland of #128143, but with alpha and bias initialization added to launchTunableGemmAndBias.

Thus far TunableOp was implemented for gemm, bgemm, and scaled_mm. gemm_and_bias was notably missing. This PR closes that gap.
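For context, a minimal sketch of how TunableOp is driven from Python, using the torch.cuda.tunable API that this PR's new unit test also exercises. The shapes, dtype, and F.linear call below are illustrative assumptions, not code from this PR; a biased linear is one entry point into the gemm_and_bias path.

    import torch
    import torch.nn.functional as F

    # Enable TunableOp and keep the tuning budget small for illustration.
    torch.cuda.tunable.enable(True)
    torch.cuda.tunable.set_max_tuning_iterations(10)

    x = torch.randn(64, 128, device="cuda", dtype=torch.float16)
    w = torch.randn(256, 128, device="cuda", dtype=torch.float16)
    b = torch.randn(256, device="cuda", dtype=torch.float16)

    # A biased linear may dispatch to gemm_and_bias (the cublasLt/hipblaslt
    # epilogue path); with this PR, that call becomes tunable on ROCm as well.
    y = F.linear(x, w, b)

    torch.cuda.tunable.write_file()  # persist the solutions chosen during tuning
    torch.cuda.tunable.enable(False)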

cc @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang

@jeffdaily requested a review from eqy as a code owner, Jun 18, 2024 01:53

pytorch-bot bot commented Jun 18, 2024:

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/128919

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Unrelated Failure

As of commit 7a44c2f with merge base 1491a61:

NEW FAILURE - The following job has failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot bot added the ciflow/rocm and module: rocm (AMD GPU support for Pytorch) labels, Jun 18, 2024
@zou3519 added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module), Jun 21, 2024
@jithunnair-amd added the rocm label (This tag is for PRs from ROCm team), Jul 1, 2024
@jithunnair-amd (Collaborator) commented:

@xw285cornell Can you please review and approve if this PR looks good?

@diwdoit commented Aug 1, 2024:

Hi @xw285cornell - would appreciate your help getting the PR reviewed.

@jithunnair-amd (Collaborator) commented:

@malfet I think @xw285cornell is OOO. Can you please approve and merge this PR?

@jithunnair-amd added the rocm priority label (high priority ROCm PRs from performance or other aspects), Aug 7, 2024
@jithunnair-amd (Collaborator) commented:

@malfet re-ping

@jithunnair-amd added the keep-going label (Don't stop on first failure, keep running tests until the end), Aug 7, 2024
@jithunnair-amd (Collaborator) commented:

@jeffdaily / @naromero77amd: I see a unit test was added for this PR: test_addmm_relu_tunableop_rocm in test_linalg.py. I have added the keep-going label to this PR to ensure it runs all unit tests, since we have some unrelated failures on ROCm CI that might prevent that unit test from running in CI. Can you please post a snippet showing that the unit tests related to this PR ran successfully in the PR's CI runs?
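For reference, one way to run just that test locally on a ROCm build (a standard invocation for PyTorch's test suite, not a command from this thread):

    python test/test_linalg.py -k test_addmm_relu_tunableop_rocm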

Comment on lines +5901 to +5916
torch.cuda.tunable.enable(True)
ordinal = torch.cuda.current_device()
filename = f"tunableop_results{ordinal}.csv"
torch.cuda.tunable.set_filename(filename)
iterations = torch.cuda.tunable.get_max_tuning_iterations()
torch.cuda.tunable.set_max_tuning_iterations(10)
self._test_addmm_impl(torch._addmm_activation, "relu", device, dtype)
# clean up, remove any file that was generated
try:
    import os
    os.remove(filename)
except FileNotFoundError:
    pass
# reset back to prior settings
torch.cuda.tunable.set_max_tuning_iterations(iterations)
torch.cuda.tunable.enable(False)
A Contributor commented:

Nit: please use try/finally to avoid altering global state if the test fails or gets interrupted.

Suggested change (the block above rewritten with try/finally):

iterations = torch.cuda.tunable.get_max_tuning_iterations()
try:
    torch.cuda.tunable.enable(True)
    ordinal = torch.cuda.current_device()
    filename = f"tunableop_results{ordinal}.csv"
    torch.cuda.tunable.set_filename(filename)
    torch.cuda.tunable.set_max_tuning_iterations(10)
    self._test_addmm_impl(torch._addmm_activation, "relu", device, dtype)
finally:
    # clean up, remove any file that was generated
    try:
        import os
        os.remove(filename)
    except FileNotFoundError:
        pass
    # reset back to prior settings
    torch.cuda.tunable.set_max_tuning_iterations(iterations)
    torch.cuda.tunable.enable(False)
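One detail worth noting in the suggestion: the get_max_tuning_iterations() read is hoisted above the try block, so the finally clause restores the true prior value even if one of the earlier calls raises.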

return c10::str(transa, transb, "_", m, "_", n, "_", k);
}

size_t GetSize(bool duplicate_inputs) const {
A Contributor commented:

Nit: I thought we were using camelCase for methods and CapitalizedCamelCase for class names (i.e., getSize rather than GetSize).

@malfet (Contributor) commented Aug 20, 2024:

@pytorchbot merge -i

@pytorch-bot bot added the ciflow/trunk label (Trigger trunk jobs on your pull request), Aug 20, 2024
@pytorchmergebot (Collaborator) commented:

Merge failed

Reason: 1 job failed: trunk / win-vs2019-cpu-py3 / test (default, 3, 3, windows.4xlarge.nonephemeral)


@jeffdaily (Collaborator, Author) commented:

@pytorchbot merge

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator) commented:

Merge failed

Reason: 1 job failed: trunk / macos-py3-arm64 / test (default, 3, 3, macos-m1-stable)


@jeffdaily (Collaborator, Author) commented:

@pytorchbot merge -f "unrelated macos cpu job failed, all other CI is known flaky or passing"

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort; instead, consider -i / --ignore-current to continue the merge while ignoring current failures, which allows currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

Labels
- ciflow/rocm
- ciflow/trunk (Trigger trunk jobs on your pull request)
- keep-going (Don't stop on first failure, keep running tests until the end)
- Merged
- module: rocm (AMD GPU support for Pytorch)
- open source
- release notes: linalg_frontend (release notes category)
- rocm priority (high priority ROCm PRs from performance or other aspects)
- rocm (This tag is for PRs from ROCm team)
- triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
7 participants