[RunAllTests] Try to fix/workaround rest of #2844: Add retry mechanism when running tests #3969
Explanation
Try to work around the last issue in #2844: a SIGSEGV that is sometimes thrown when running StateFragmentTest/StateFragmentLocalTest.
While it would be preferable to fix this issue, doing so will likely be extremely time-consuming since it'll require digging into a JVM bug. After some effort to find a root cause, I haven't been able to come up with a single dependable way to work around the issue. It mostly went away with previous mitigations, but since test sharding was added it seems to come up a lot.
This solution introduces the same while-loop retry mechanism that we use for builds (which was effective in addressing #3789). Bazel makes this work well since passing tests won't be re-run (their results are cached--see the CI results for this PR). Given that the issue seems to happen less often when StateFragmentTest/StateFragmentLocalTest runs by itself, this is a reasonable outcome (since it'll generally result in running just those tests for runs 2-5).
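For illustration, the while-loop retry described above could look roughly like the following. This is a Python sketch, not the actual CI script from this PR: the function name `run_with_retries`, the injectable `run` parameter, and the example Bazel target are all hypothetical, and the limit of 5 attempts is inferred from the "runs 2-5" wording above.

```python
import subprocess
from typing import Callable, Sequence


def _default_run(command: Sequence[str]) -> int:
    """Run the command as a subprocess and return its exit code."""
    return subprocess.run(command).returncode


def run_with_retries(
    command: Sequence[str],
    max_attempts: int = 5,  # assumed limit, inferred from "runs 2-5"
    run: Callable[[Sequence[str]], int] = _default_run,  # hypothetical hook for testing
) -> int:
    """Re-run `command` until it exits with 0 or `max_attempts` is reached.

    Returns the exit code of the last attempt.
    """
    attempt = 0
    exit_code = 1
    while attempt < max_attempts and exit_code != 0:
        attempt += 1
        exit_code = run(command)
        print(f"Attempt {attempt}/{max_attempts} exited with code {exit_code}")
    return exit_code
```

A call such as `run_with_retries(["bazel", "test", "//..."])` (the target pattern is an assumption) would re-invoke Bazel only after a failed attempt. Because results of passing tests are cached, each later attempt should mostly re-run just the tests that failed.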
That being said, I'm not a complete fan of this solution since it will:
While the second outcome is definitely worse, it also seems to happen much less often, so it seems like a worthwhile trade-off for the stability benefits that we get from this fix.
#3970 was filed to track the long-term fix of the underlying SIGSEGV so that this mechanism isn't needed for CI runs to reliably pass.
Essential Checklist
For UI-specific PRs only
N/A -- infrastructure change