run_test: Unset cpp stacktraces after reruns #129004

clee2000 · 2024-06-18T22:11:46Z

Rerun the failing test on its own with the cpp stack traces env var set. If it succeeds, start a new process without that env var.

We don't want to waste time generating cpp stack traces if we don't have to.

They can also show up in assertion error messages, which may cause unexpected failures if a test checks the contents of those messages.
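To illustrate the point above, here is a minimal sketch of how extra trace text can break a strict message check. The message and the trace line are hypothetical placeholders, not real PyTorch output, and the two helper functions are made up for this example.

```python
import re

# Hypothetical error text; the real format of the C++ stack trace is not shown in this PR.
BASE_MESSAGE = "expected a tensor but got None"
WITH_TRACE = BASE_MESSAGE + "\nframe #0: <hypothetical C++ frame>"


def exact_match(message):
    """Strict check: fails as soon as anything is appended to the message."""
    return message == BASE_MESSAGE


def pattern_match(message):
    """Looser check: only requires the expected text to appear somewhere."""
    return re.search(re.escape(BASE_MESSAGE), message) is not None
```

A test written in the style of `exact_match` would pass without the trace and fail with it, while one written in the style of `pattern_match` would pass in both cases.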

Adds a new --rs (run single) option, to be used the same way --scs and --sc are. It runs only the single test recorded in the stepcurrent file.

https://hud.pytorch.org/pytorch/pytorch/pull/129004?sha=2c349d3557d399020bf1f6a8b7045e2e4957ba46 has some examples of logs

In the above:

test_checkpoint_valid failed, then passed in another subprocess. Testing then continued in a new subprocess, starting from the test right after it (test_checkpointing_without_reentrant_early_free).
test_format_traceback_short failed consistently, but testing continued because keep-going was set.
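The rerun flow described above could be sketched roughly as follows. This is not the code in run_test.py: the function names and command arguments are hypothetical, and the variable name TORCH_SHOW_CPP_STACKTRACES is an assumption based on the follow-up PR mentioned later in this thread.

```python
import subprocess

# Assumed name of the env var that enables C++ stack traces.
CPP_STACKTRACES_VAR = "TORCH_SHOW_CPP_STACKTRACES"


def build_env(base_env, show_cpp_stacktraces):
    """Return a copy of base_env with the stack trace variable set or removed."""
    env = dict(base_env)
    if show_cpp_stacktraces:
        env[CPP_STACKTRACES_VAR] = "1"
    else:
        env.pop(CPP_STACKTRACES_VAR, None)
    return env


def run_step(command, base_env, show_cpp_stacktraces):
    """Run one subprocess and report whether it exited with code 0."""
    env = build_env(base_env, show_cpp_stacktraces)
    return subprocess.run(command, env=env).returncode == 0


def rerun_failed_test(single_test_cmd, remaining_tests_cmd, base_env):
    """Hypothetical flow: rerun the failing test alone, then continue without traces."""
    # Rerun only the failing test, with C++ stack traces enabled.
    if not run_step(single_test_cmd, base_env, show_cpp_stacktraces=True):
        return False
    # The test passed on rerun, so continue in a fresh process without the variable.
    return run_step(remaining_tests_cmd, base_env, show_cpp_stacktraces=False)
```

Copying the environment instead of mutating `os.environ` keeps the parent process unchanged, so each subprocess sees only the setting chosen for that step.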

pytorch-bot · 2024-06-18T22:11:48Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/129004

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 12 Unrelated Failures

As of commit a3312cf with merge base a0e1e20:

NEW FAILURE - The following job has failed:

trunk / macos-py3-arm64-mps / test (mps, 1, 1, macos-m1-14) (gh)
test_mps.py::TestMPS::test_mps_allocator_module

FLAKY - The following job failed but was likely due to flakiness present on trunk:

trunk / macos-py3-arm64-mps / test (mps, 1, 1, macos-m1-13) (gh) (matched macos rule in flaky-rules.json)
Failure: There is only 1627928KB free space left in /, which is less than the minimum requirement of

UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and have been marked as unstable:

pull / linux-focal-cuda12.1-py3.10-gcc9-experimental-split-build / test (default, 1, 5, lf.linux.4xlarge.nvidia.gpu, unstable) (gh) (#129080)
Process completed with exit code 9.
pull / linux-focal-cuda12.1-py3.10-gcc9-experimental-split-build / test (default, 2, 5, lf.linux.4xlarge.nvidia.gpu, unstable) (gh) (#129080)
Process completed with exit code 9.
pull / linux-focal-cuda12.1-py3.10-gcc9-experimental-split-build / test (default, 3, 5, lf.linux.4xlarge.nvidia.gpu, unstable) (gh) (#129080)
Process completed with exit code 9.
pull / linux-focal-cuda12.1-py3.10-gcc9-experimental-split-build / test (default, 4, 5, lf.linux.4xlarge.nvidia.gpu, unstable) (gh) (#129080)
Process completed with exit code 9.
pull / linux-focal-cuda12.1-py3.10-gcc9-experimental-split-build / test (default, 5, 5, lf.linux.4xlarge.nvidia.gpu, unstable) (gh) (#129080)
Process completed with exit code 9.
pull / linux-focal-py3.12-clang10-experimental-split-build / test (default, 1, 3, linux.2xlarge, unstable) (gh) (#129248)
Process completed with exit code 9.
pull / linux-focal-py3.12-clang10-experimental-split-build / test (default, 2, 3, linux.2xlarge, unstable) (gh) (#129248)
Process completed with exit code 9.
pull / linux-focal-py3.12-clang10-experimental-split-build / test (default, 3, 3, linux.2xlarge, unstable) (gh) (#129248)
Process completed with exit code 9.
pull / linux-focal-py3.12-clang10-experimental-split-build / test (dynamo, 1, 3, linux.2xlarge, unstable) (gh) (#129256)
Process completed with exit code 9.
pull / linux-focal-py3.12-clang10-experimental-split-build / test (dynamo, 2, 3, linux.2xlarge, unstable) (gh) (#129256)
Process completed with exit code 9.
pull / linux-focal-py3.12-clang10-experimental-split-build / test (dynamo, 3, 3, linux.2xlarge, unstable) (gh) (#129256)
Process completed with exit code 9.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

clee2000 · 2024-07-02T21:37:28Z

@pytorchbot merge

pytorchmergebot · 2024-07-02T21:38:59Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorchmergebot · 2024-07-03T00:28:40Z

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / macos-py3-arm64-mps / test (mps, 1, 1, macos-m1-14)

Details for Dev Infra team

Raised by workflow job

clee2000 · 2024-07-03T01:48:33Z

@pytorchbot merge -f "failure is present on main"

pytorchmergebot · 2024-07-03T01:50:06Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

Reenable foreach tests on non-sm86 machines. I believe I've fixed the flakes that are caused when TORCH_SHOW_CPP_STACKTRACES=1, though I know @clee2000 had also just landed #129004 for the same effect. Regardless, this makes the foreach tests more robust against future disruptions anyway. Fix similar in flavor to #129003.

Pull Request resolved: #130277
Approved by: https://github.com/soulitzer

tc

89bbb03

pytorch-bot bot added the topic: not user facing topic category label Jun 18, 2024

clee2000 changed the title ~~[experiment] run_test: Unset cpp stacktraces when possible~~ [experiment] run_test: Unset cpp stacktraces after reruns Jun 18, 2024

clee2000 added the keep-going Don't stop on first failure, keep running tests until the end label Jun 18, 2024

clee2000 added 6 commits June 20, 2024 09:46

tc

994c8d2

tc

0a7c05b

tc

ac6a608

tc

2c349d3

tc

3dc2bfb

tc

3c060b9

clee2000 marked this pull request as ready for review June 26, 2024 20:36

clee2000 requested a review from a team as a code owner June 26, 2024 20:36

clee2000 changed the title ~~[experiment] run_test: Unset cpp stacktraces after reruns~~ run_test: Unset cpp stacktraces after reruns Jun 26, 2024

tc

a3312cf

PaliC approved these changes Jul 2, 2024

pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 2, 2024

pytorchmergebot added the merging label Jul 2, 2024

pytorchmergebot removed the merging label Jul 3, 2024

pytorchmergebot added the merging label Jul 3, 2024

pytorchmergebot closed this in 91a8376 Jul 3, 2024

pytorchmergebot added Merged and removed merging labels Jul 3, 2024

janeyx99 mentioned this pull request Jul 8, 2024

Fix the rest of foreach flakers #130277

Closed
