
Run test_mps_allocator_module serially #129340

Closed
huydhn wants to merge 6 commits

Conversation

@huydhn (Contributor) commented Jun 24, 2024

Not sure why this test started to fail (maybe a runner update) https://hud.pytorch.org/hud/pytorch/pytorch/8a2fed7e6ab4cb6a97d92f6ca14f257370ec3a92/1?per_page=50&name_filter=mps, or why it was marked XFAIL in the old PR #97151, but the test is passing locally for me now.

@pytorch-bot commented Jun 24, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/129340

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit ee3131d with merge base c9dc988:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot added the topic: not user facing label on Jun 24, 2024
@Flamefire (Collaborator)

@pytorchbot merge

@pytorch-bot added the ciflow/trunk label (Trigger trunk jobs on your pull request) on Jun 24, 2024
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@huydhn (Contributor, Author) commented Jun 24, 2024

@pytorchbot merge -f 'Bypass Windows queue'

@pytorchmergebot (Collaborator)

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge while ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@huydhn (Contributor, Author) commented Jun 25, 2024

@pytorchbot revert -m 'The test is now failing again in trunk after a day or so of staying green, we need to continue the investigation' -c weird

An example failure https://hud.pytorch.org/pytorch/pytorch/commit/dd00f5e78d44f55595728ef03018c01296a31ec9

@pytorchmergebot (Collaborator)

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Jun 25, 2024
This reverts commit c888ee3.

Reverted #129340 on behalf of https://github.com/huydhn due to: The test is now failing again in trunk after a day or so of staying green, we need to continue the investigation ([comment](#129340 (comment)))
@pytorchmergebot (Collaborator)

@huydhn your PR has been successfully reverted.

@Flamefire (Collaborator)

> @pytorchbot revert -m 'The test is now failing again in trunk after a day or so of staying green, we need to continue the investigation' -c weird

Yes, especially the use of assertTrue instead of more specific asserts makes it very hard to investigate the failures.

The test looks very strange though:

self.assertTrue(current_alloc_before == 0)
...
self.assertTrue(current_alloc_after > current_alloc_before)

If the "before" must be zero, why the use in the 2nd assert? I'd say the first assert is the issue: Whether or not gc.collect() actually collects everything is up to the runtime (and moon phase ;-) )

@malfet (Contributor) commented Jun 26, 2024

If the "before" must be zero, why the use in the 2nd assert?

Because current_alloc_after could return zero (and I guess that's what this test was trying to do)

@malfet (Contributor) commented Jun 26, 2024

> @pytorchbot revert -m 'The test is now failing again in trunk after a day or so of staying green, we need to continue the investigation' -c weird
>
> An example failure https://hud.pytorch.org/pytorch/pytorch/commit/dd00f5e78d44f55595728ef03018c01296a31ec9

It's weird, but it is not: this test is just incompatible with parallel run mode, as it assumes no MPS allocation happens between these lines:

pytorch/test/test_mps.py

Lines 7835 to 7838 in 92be340

gc.collect()
torch.mps.empty_cache()
# measure memory allocations from MPSAllocator
current_alloc_before = torch.mps.current_allocated_memory()

But if another test runs in parallel, we cannot ensure that the allocated memory would be zero at that point. Let's just add unittest.skip for this one, and create a follow-up issue on how to run memory-measurement tests (I think for CUDA we run those in serialized mode).
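
The PR itself ends up fixing this by running the test serially. Purely as a sketch, a variant that tolerates a non-zero baseline (measuring the delta instead of asserting zero) could look roughly like the following; the test name, tensor size, and skip condition are made up, and it is still only best-effort under heavy concurrency:

import gc
import unittest

import torch


class MPSAllocatorDeltaSketch(unittest.TestCase):
    @unittest.skipUnless(torch.backends.mps.is_available(), "requires an MPS device")
    def test_allocator_tracks_new_tensor(self):
        gc.collect()
        torch.mps.empty_cache()
        # The baseline may be non-zero if other tests run in parallel,
        # so record it instead of asserting that it is zero.
        baseline = torch.mps.current_allocated_memory()

        x = torch.ones(1024, 1024, device="mps")  # roughly 4 MB of new MPS memory
        after = torch.mps.current_allocated_memory()

        # The newly created tensor should show up on top of whatever was already allocated.
        self.assertGreater(after, baseline)
        del x


if __name__ == "__main__":
    unittest.main()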

@malfet changed the title from "Remove test_mps_allocator_module XFAIL" to "Skip test_mps_allocator_module" on Jun 26, 2024
malfet added a commit that referenced this pull request Jun 26, 2024
Based on this #129340 (comment), i.e.
- `assertTrue(x == y)` -> `assertEqual(x, y)`
- `assertTrue(not x)` -> `assertFalse(x)`
- `assertTrue(x > y)` -> `assertGreater(x, y)`
@Flamefire (Collaborator)

> Because current_alloc_after could return zero (and I guess that's what this test was trying to do)

No, I meant: if you assert current_alloc_before == 0, then why not just self.assertTrue(current_alloc_after > 0) afterwards?

pytorchmergebot pushed a commit that referenced this pull request Jun 26, 2024
Based on this #129340 (comment), i.e.
- `assertTrue(x == y)` -> `assertEqual(x, y)`
- `assertTrue(not x)` -> `assertFalse(x)`
- `assertTrue(x > y)` -> `assertGreater(x, y)`

Pull Request resolved: #129569
Approved by: https://github.com/jeanschmidt, https://github.com/Skylion007
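
Applied to the two asserts quoted earlier in this thread, the mappings from that commit amount to roughly the following (a sketch with hypothetical values, not the actual diff):

import unittest


class RewrittenAssertsSketch(unittest.TestCase):
    def test_rewritten_asserts(self):
        current_alloc_before = 0      # hypothetical baseline reading
        current_alloc_after = 4096    # hypothetical reading after an allocation

        # was: self.assertTrue(current_alloc_before == 0)
        self.assertEqual(current_alloc_before, 0)
        # was: self.assertTrue(current_alloc_after > current_alloc_before)
        self.assertGreater(current_alloc_after, current_alloc_before)


if __name__ == "__main__":
    unittest.main()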
@huydhn changed the title from "Skip test_mps_allocator_module" to "Run test_mps_allocator_module serially" on Jun 27, 2024
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged while ignoring the following 3 checks: trunk / win-vs2019-cpu-py3 / test (default, 1, 3, windows.4xlarge.nonephemeral), trunk / win-vs2019-cpu-py3 / test (default, 2, 3, windows.4xlarge.nonephemeral), trunk / win-vs2019-cpu-py3 / test (default, 3, 3, windows.4xlarge.nonephemeral)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

test/test_mps.py Outdated
@@ -7829,7 +7829,7 @@ def test_device_synchronize(self):
x.backward(torch.randn_like(x))
torch.mps.synchronize()

-@unittest.expectedFailure
+@serialTest

Suggested change:
-@serialTest
+@serialTest()

@huydhn (Contributor, Author) commented Jul 1, 2024

Ouch, this makes me think I need to check whether the test is really running (in the PR).
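
For illustration, a minimal sketch of why the parentheses matter, using a hypothetical serial_test factory (the real serialTest helper in PyTorch's test utilities may differ): without "()", the test function itself is passed to the factory, the method name ends up bound to the inner decorator, and unittest reports a pass without ever executing the test body.

import unittest


# Hypothetical stand-in for a decorator factory like serialTest(); the real helper
# in PyTorch's test utilities may look different.
def serial_test(enabled=True):
    def decorator(fn):
        fn._run_serially = enabled  # tag the function so a runner could schedule it serially
        return fn
    return decorator


class ParenthesesSketch(unittest.TestCase):
    @serial_test()  # correct: calling the factory returns the actual decorator
    def test_marked(self):
        self.assertTrue(getattr(self.test_marked, "_run_serially", False))

    # The bug being illustrated: without "()", the test function itself becomes the
    # factory's `enabled` argument, the method name ends up bound to the inner
    # `decorator` closure, and unittest reports a pass without ever running this body.
    @serial_test
    def test_unmarked(self):
        raise AssertionError("this body is never reached")


if __name__ == "__main__":
    unittest.main()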

@pytorchmergebot (Collaborator)

Merge failed

Reason: Comment with id 2200612866 not found

Details for Dev Infra team: raised by workflow job

@huydhn removed the ciflow/trunk label (Trigger trunk jobs on your pull request) on Jul 1, 2024
@malfet (Contributor) left a comment

LGTM, but please undo kineto submodule update

@huydhn (Contributor, Author) commented Jul 1, 2024

@pytorchbot merge -f 'MPS tests have passed'

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge while ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.
