
Run test_mps_allocator_module serially #129340

Closed
huydhn wants to merge 6 commits

Conversation

@huydhn (Contributor) commented Jun 24, 2024

Not sure why this test started to fail (maybe a runner update) https://hud.pytorch.org/hud/pytorch/pytorch/8a2fed7e6ab4cb6a97d92f6ca14f257370ec3a92/1?per_page=50&name_filter=mps, or why it was marked XFAIL in the old PR #97151, but the test is passing locally for me now.

@pytorch-bot commented Jun 24, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/129340

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit ee3131d with merge base c9dc988:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot added the topic: not user facing label on Jun 24, 2024
@Flamefire (Collaborator)

@pytorchbot merge

@pytorch-bot added the ciflow/trunk label (Trigger trunk jobs on your pull request) on Jun 24, 2024
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@huydhn (Contributor, Author) commented Jun 24, 2024

@pytorchbot merge -f 'Bypass Windows queue'

@pytorchmergebot (Collaborator)

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge while ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@huydhn (Contributor, Author) commented Jun 25, 2024

@pytorchbot revert -m 'The test is now failing again in trunk after a day or so of staying green, we need to continue the investigation' -c weird

An example failure https://hud.pytorch.org/pytorch/pytorch/commit/dd00f5e78d44f55595728ef03018c01296a31ec9

@pytorchmergebot (Collaborator)

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Jun 25, 2024
This reverts commit c888ee3.

Reverted #129340 on behalf of https://github.com/huydhn due to: The test is now failing again in trunk after a day or so of staying green, we need to continue the investigation ([comment](#129340 (comment)))
@pytorchmergebot (Collaborator)

@huydhn your PR has been successfully reverted.

@Flamefire (Collaborator)

> @pytorchbot revert -m 'The test is now failing again in trunk after a day or so of staying green, we need to continue the investigation' -c weird

Yes, especially the use of assertTrue instead of more specific asserts makes it very hard to investigate the failures.

The test looks very strange though:

self.assertTrue(current_alloc_before == 0)
...
self.assertTrue(current_alloc_after > current_alloc_before)

If the "before" must be zero, why the use in the 2nd assert? I'd say the first assert is the issue: Whether or not gc.collect() actually collects everything is up to the runtime (and moon phase ;-) )

@malfet (Contributor) commented Jun 26, 2024

If the "before" must be zero, why the use in the 2nd assert?

Because current_alloc_after could return zero (and I guess that's what this test was trying to do)

@malfet (Contributor) commented Jun 26, 2024

> @pytorchbot revert -m 'The test is now failing again in trunk after a day or so of staying green, we need to continue the investigation' -c weird
>
> An example failure https://hud.pytorch.org/pytorch/pytorch/commit/dd00f5e78d44f55595728ef03018c01296a31ec9

It's weird, but it is not: this test is just incompatible with parallel run mode, as it assumes no MPS allocation happens between these lines:

pytorch/test/test_mps.py

Lines 7835 to 7838 in 92be340

gc.collect()
torch.mps.empty_cache()
# measure memory allocations from MPSAllocator
current_alloc_before = torch.mps.current_allocated_memory()

But if another test runs in parallel, we cannot ensure that the allocated memory would be zero at that point. Let's just add unittest.skip for this one, and create a follow-up issue on how to run memory-measurement tests (I think for CUDA we run those in serialized mode).
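
The PR itself ends up fixing this by running the test serially. Purely as a sketch, a variant that tolerates a non-zero baseline (measuring the delta instead of asserting zero) could look roughly like the following; the test name, tensor size, and skip condition are made up, and it is still only best-effort under heavy concurrency:

import gc
import unittest

import torch


class MPSAllocatorDeltaSketch(unittest.TestCase):
    @unittest.skipUnless(torch.backends.mps.is_available(), "requires an MPS device")
    def test_allocator_tracks_new_tensor(self):
        gc.collect()
        torch.mps.empty_cache()
        # The baseline may be non-zero if other tests run in parallel,
        # so record it instead of asserting that it is zero.
        baseline = torch.mps.current_allocated_memory()

        x = torch.ones(1024, 1024, device="mps")  # roughly 4 MB of new MPS memory
        after = torch.mps.current_allocated_memory()

        # The newly created tensor should show up on top of whatever was already allocated.
        self.assertGreater(after, baseline)
        del x


if __name__ == "__main__":
    unittest.main()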

@malfet changed the title from "Remove test_mps_allocator_module XFAIL" to "Skip test_mps_allocator_module" on Jun 26, 2024
malfet added a commit that referenced this pull request Jun 26, 2024
Based on this #129340 (comment), i.e.
- `assertTrue(x == y)` -> `assertEqual(x, y)`
- `assertTrue(not x)` -> `assertFalse(x)`
- `assertTrue(x > y)` -> `assertGreater(x, y)`
@Flamefire (Collaborator)

> Because current_alloc_after could return zero (and I guess that's what this test was trying to do)

No, I meant: if you assert current_alloc_before == 0, then why not just self.assertTrue(current_alloc_after > 0) afterwards?

pytorchmergebot pushed a commit that referenced this pull request Jun 26, 2024
Based on this #129340 (comment), i.e.
- `assertTrue(x == y)` -> `assertEqual(x, y)`
- `assertTrue(not x)` -> `assertFalse(x)`
- `assertTrue(x > y)` -> `assertGreater(x, y)`

Pull Request resolved: #129569
Approved by: https://github.com/jeanschmidt, https://github.com/Skylion007
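
Applied to the two asserts quoted earlier in this thread, the mappings from that commit amount to roughly the following (a sketch with hypothetical values, not the actual diff):

import unittest


class RewrittenAssertsSketch(unittest.TestCase):
    def test_rewritten_asserts(self):
        current_alloc_before = 0      # hypothetical baseline reading
        current_alloc_after = 4096    # hypothetical reading after an allocation

        # was: self.assertTrue(current_alloc_before == 0)
        self.assertEqual(current_alloc_before, 0)
        # was: self.assertTrue(current_alloc_after > current_alloc_before)
        self.assertGreater(current_alloc_after, current_alloc_before)


if __name__ == "__main__":
    unittest.main()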
@huydhn changed the title from "Skip test_mps_allocator_module" to "Run test_mps_allocator_module serially" on Jun 27, 2024
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged while ignoring the following 3 checks: trunk / win-vs2019-cpu-py3 / test (default, 1, 3, windows.4xlarge.nonephemeral), trunk / win-vs2019-cpu-py3 / test (default, 2, 3, windows.4xlarge.nonephemeral), trunk / win-vs2019-cpu-py3 / test (default, 3, 3, windows.4xlarge.nonephemeral)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

test/test_mps.py Outdated
@@ -7829,7 +7829,7 @@ def test_device_synchronize(self):
x.backward(torch.randn_like(x))
torch.mps.synchronize()

-@unittest.expectedFailure
+@serialTest

Suggested change:
-@serialTest
+@serialTest()

@huydhn (Contributor, Author) commented Jul 1, 2024

Ouch, this makes me think I need to check whether the test is really running (in the PR).
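
For illustration, a minimal sketch of why the parentheses matter, using a hypothetical serial_test factory (the real serialTest helper in PyTorch's test utilities may differ): without "()", the test function itself is passed to the factory, the method name ends up bound to the inner decorator, and unittest reports a pass without ever executing the test body.

import unittest


# Hypothetical stand-in for a decorator factory like serialTest(); the real helper
# in PyTorch's test utilities may look different.
def serial_test(enabled=True):
    def decorator(fn):
        fn._run_serially = enabled  # tag the function so a runner could schedule it serially
        return fn
    return decorator


class ParenthesesSketch(unittest.TestCase):
    @serial_test()  # correct: calling the factory returns the actual decorator
    def test_marked(self):
        self.assertTrue(getattr(self.test_marked, "_run_serially", False))

    # The bug being illustrated: without "()", the test function itself becomes the
    # factory's `enabled` argument, the method name ends up bound to the inner
    # `decorator` closure, and unittest reports a pass without ever running this body.
    @serial_test
    def test_unmarked(self):
        raise AssertionError("this body is never reached")


if __name__ == "__main__":
    unittest.main()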

@pytorchmergebot (Collaborator)

Merge failed

Reason: Comment with id 2200612866 not found

Details for Dev Infra team: raised by workflow job

@huydhn removed the ciflow/trunk label (Trigger trunk jobs on your pull request) on Jul 1, 2024
@malfet (Contributor) left a comment

LGTM, but please undo kineto submodule update

@huydhn (Contributor, Author) commented Jul 1, 2024

@pytorchbot merge -f 'MPS tests have passed'

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge while ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.
