Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

run_test: Unset cpp stacktraces after reruns #129004

Closed
wants to merge 8 commits into from

Conversation

clee2000
Copy link
Contributor

@clee2000 clee2000 commented Jun 18, 2024

Rerun the failing test singly with the env var set. If it succeeds, start a new process without the cpp stack traces env var

We don't want to waste time generating these if we don't have to

They can also show up in assertion errors, which may cause unexpected failures if a test wants to check these

Adds new --rs (run single) to be used the same way --scs and --sc are. It will only run the single test in the step current file

https://hud.pytorch.org/pytorch/pytorch/pull/129004?sha=2c349d3557d399020bf1f6a8b7045e2e4957ba46 has some examples of logs

In the above:

  • test_checkpoint_valid failed, then passed in another subprocess. The testing continued in a different new subprocess from the test right after it (test_checkpointing_without_reentrant_early_free)
  • test_format_traceback_short failed consistently, but it continued to run because keep-going was set

Copy link

pytorch-bot bot commented Jun 18, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/129004

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 12 Unrelated Failures

As of commit a3312cf with merge base a0e1e20 (image):

NEW FAILURE - The following job has failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Jun 18, 2024
@clee2000 clee2000 changed the title [experiment] run_test: Unset cpp stacktraces when possible [experiment] run_test: Unset cpp stacktraces after reruns Jun 18, 2024
@clee2000 clee2000 added the keep-going Don't stop on first failure, keep running tests until the end label Jun 18, 2024
@clee2000 clee2000 marked this pull request as ready for review June 26, 2024 20:36
@clee2000 clee2000 requested a review from a team as a code owner June 26, 2024 20:36
@clee2000 clee2000 changed the title [experiment] run_test: Unset cpp stacktraces after reruns run_test: Unset cpp stacktraces after reruns Jun 26, 2024
@clee2000
Copy link
Contributor Author

clee2000 commented Jul 2, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 2, 2024
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / macos-py3-arm64-mps / test (mps, 1, 1, macos-m1-14)

Details for Dev Infra team Raised by workflow job

@clee2000
Copy link
Contributor Author

clee2000 commented Jul 3, 2024

@pytorchbot merge -f "failure is present on main"

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

janeyx99 added a commit that referenced this pull request Jul 8, 2024
Reenable foreach tests on non-sm86 machines. I believe I've fixed the flakes that are caused when TORCH_SHOW_CPP_STACKTRACES=1, though I know clee2000 had also just landed #129004 for the same effect.

Regardless, this makes the foreach tests more robust against future disruptions anyway. Fix similar in flavor to #129003




[ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this pull request Jul 9, 2024
Reenable foreach tests on non-sm86 machines. I believe I've fixed the flakes that are caused when TORCH_SHOW_CPP_STACKTRACES=1, though I know @clee2000 had also just landed #129004 for the same effect.

Regardless, this makes the foreach tests more robust against future disruptions anyway. Fix similar in flavor to #129003

Pull Request resolved: #130277
Approved by: https://github.com/soulitzer
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/trunk Trigger trunk jobs on your pull request keep-going Don't stop on first failure, keep running tests until the end Merged topic: not user facing topic category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants