
PyTorch trunk is frequently broken #128180

Open
malfet opened this issue Jun 7, 2024 · 17 comments
Assignees
Labels
high priority module: ci Related to continuous integration triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@malfet
Contributor

malfet commented Jun 7, 2024

🐛 Describe the bug

For example, looking at #128123 (which only adds a docstring to the utils and therefore could not cause any failures other than lint), Dr. CI found two unstable jobs, one flaky job, and one new failure in the inductor tests:

[screenshot of the Dr. CI failure summary]

Not sure how a new developer who wants to make a small contribution to PyTorch is supposed to navigate through those multiple layers of dilapidation...

Versions

CI

cc @ezyang @gchanan @zou3519 @kadeng @msaroufim @seemethere @pytorch/pytorch-dev-infra

@malfet
Contributor Author

malfet commented Jun 7, 2024

Another doc-only change, #128136, somehow failed while rebuilding the Docker containers:
[screenshot of the failed Docker rebuild]

@huydhn
Contributor

huydhn commented Jun 7, 2024

All these unrelated failures are linked one way or another to the ongoing CUDA 12.4 / cuDNN upgrade. So an approach could be to figure out how to roll out similar upgrades in a safer manner. cc @atalman

@huydhn
Contributor

huydhn commented Jun 7, 2024

If we zoom in on doc-only changes, there is a TD (target determination) item to skip CI jobs when a PR only updates comments / docs. However, that capability has not been built yet. cc @clee2000
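To make the idea concrete, here is a minimal sketch of the kind of check such a TD rule could run, assuming a purely filename-based notion of "doc-only"; this is not the actual PyTorch target-determination code, and the file patterns are illustrative assumptions.

```python
# Hypothetical doc-only-PR check (a sketch, not the real PyTorch TD logic).
# The DOC_PATTERNS below are assumptions for illustration.
import fnmatch
import subprocess

DOC_PATTERNS = ["*.md", "*.rst", "docs/*"]


def changed_files(base: str = "origin/main") -> list[str]:
    """Files changed on this branch relative to the merge base with main."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    )
    return [f for f in out.stdout.splitlines() if f]


def is_doc_only(files: list[str]) -> bool:
    """True if every changed file matches one of the doc patterns."""
    return bool(files) and all(
        any(fnmatch.fnmatch(f, pat) for pat in DOC_PATTERNS) for f in files
    )


if __name__ == "__main__":
    if is_doc_only(changed_files()):
        print("doc-only change: heavy CI jobs could be skipped")
    else:
        print("non-doc files touched: run full CI")
```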

@atalman
Contributor

atalman commented Jun 7, 2024

The failures with cudnn.so are related to the fact that we have two changes:

  1. The underlying Docker image
  2. The pytorch/pytorch code

These two changes land one after another, and the Docker change is picked up automatically, but people who have not rebased are missing the pytorch/pytorch change. Perhaps we should open a non-blocking CI SEV when doing such updates?

Please note this is a major update: we have not updated the cuDNN major version since at least 2021.
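As a rough illustration of the failure mode described above (a sketch under assumptions, not an official tool; the commit hash is a placeholder, not the real upgrade commit), a branch can be checked for the companion pytorch/pytorch commit that pairs with the new Docker image; rebasing onto main is exactly what makes this check pass:

```python
# Sketch only: ask git whether the pytorch/pytorch commit that pairs with the
# upgraded Docker image is already an ancestor of the current branch.
# COMPANION_COMMIT is a hypothetical placeholder, not the real upgrade commit.
import subprocess

COMPANION_COMMIT = "<sha-of-the-cudnn-upgrade-commit>"


def branch_has_commit(sha: str) -> bool:
    """True if `sha` is an ancestor of HEAD, i.e. the branch is rebased past it."""
    result = subprocess.run(
        ["git", "merge-base", "--is-ancestor", sha, "HEAD"],
        capture_output=True,
    )
    return result.returncode == 0


if __name__ == "__main__":
    if branch_has_commit(COMPANION_COMMIT):
        print("Branch contains the companion commit; it should match the new image.")
    else:
        print("Branch predates the upgrade commit; rebase onto main before rerunning CI.")
```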

@janeyx99
Contributor

janeyx99 commented Jun 7, 2024

I've noticed a higher influx of red on my PRs lately, but I had no idea that it was correlated with a cuDNN update. If we know that these big updates could disrupt CI for all devs, it would be good to announce that risk more vocally so I know what to expect.

@ZainRizvi
Contributor

@atalman, which pytorch/pytorch commit do people need to rebase past to stop getting these spurious CUDA failures?

Let's mention that as part of the mitigation steps in the SEV.

@ZainRizvi
Contributor

Are the libcudnn.so failures (marked as unstable/flaky in @malfet's screenshot) also fixed by the pytorch/pytorch rebase, or are they a symptom of a separate issue?

@ZainRizvi
Contributor

Looking through the logs, we started seeing the "ImportError: libcudnn.so.8: cannot open shared object file: No such file or directory" failures about a week ago (May 31st):

https://hud.pytorch.org/failure?name=*&jobName=manywheel-py3_8-cuda12_4-test%20%2F%20test&failureCaptures=%5B%22ImportError%3A%20libcudnn.so.8%3A%20cannot%20open%20shared%20object%20file%3A%20No%20such%20file%20or%20directory%22%5D
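For anyone reproducing this locally, a quick sanity check along these lines (a minimal sketch, not an official diagnostic) reports which cuDNN the installed torch build actually loads, which distinguishes a stale libcudnn.so.8 from the upgraded libcudnn.so.9:

```python
# Minimal sketch: print the cuDNN version the installed torch actually loads.
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.backends.cudnn.is_available():
    # e.g. 8907 for cuDNN 8.9.7, 90100 for cuDNN 9.1.0
    print("cuDNN version:", torch.backends.cudnn.version())
else:
    print("cuDNN not available -- consistent with the ImportError above")
```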

@ZainRizvi
Contributor

As per offline sync with @atalman: A rebase should fix all the errors in @malfet's screenshot

@malfet
Contributor Author

malfet commented Jun 7, 2024

I understand that everything can be solved by a rebase and a retry; I've filed this issue to highlight that this expectation to rebase seems to be becoming more and more common.

@atalman
Contributor

atalman commented Jun 10, 2024

@malfet We will start working on this: pytorch/builder#1849. Moving the Docker workflow to pytorch/pytorch and fetching the correct Docker image should take care of errors similar to the "libcudnn.so.9" one.
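One way the "fetch the correct Docker image" part can be made rebase-proof is to key the image tag off the Docker build context in the same checkout, so a job can never pull an image newer than its own source tree. The sketch below is a hypothetical illustration of that idea, not the actual pytorch/pytorch workflow; the directory and image naming are assumptions.

```python
# Hypothetical illustration (not the real pytorch/pytorch CI code): derive the
# CI Docker image tag from a hash of the Docker build context, so every job
# pulls the image that matches the Dockerfile state in its own checkout.
import hashlib
import pathlib


def docker_context_tag(context_dir: str = ".ci/docker") -> str:
    """Stable tag that changes only when files under the Docker context change."""
    digest = hashlib.sha256()
    for path in sorted(pathlib.Path(context_dir).rglob("*")):
        if path.is_file():
            digest.update(str(path).encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()[:12]


if __name__ == "__main__":
    # The image name is an assumed example; the point is that image and source
    # changes are picked up together by any job checking out this tree.
    print(f"pytorch-linux-jammy-cuda12.4:{docker_context_tag()}")
```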

@atalman
Contributor

atalman commented Jun 10, 2024

Another example is here: #127456

[screenshot of unrelated CI failures on #127456]

@bdhirsh bdhirsh added high priority module: ci Related to continuous integration labels Jun 10, 2024
@albanD
Collaborator

albanD commented Jun 10, 2024

Unrelated to the cuDNN issue, here are the stats from the last year:
[chart of trunk CI health stats over the last year]

@bdhirsh
Contributor

bdhirsh commented Jun 10, 2024

From triage review: keeping high priority to track status; this will likely be an ongoing issue since there are usually new reasons for breakages.

Andrey: CUDA Docker builds run in two steps; moving the Docker builds to pytorch/pytorch would avoid some CUDA CI errors. But there are multiple different failures going on here.

@bdhirsh bdhirsh added triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module and removed triage review labels Jun 10, 2024
@huydhn
Contributor

huydhn commented Jun 11, 2024

@atalman: Break this down into smaller actionable items before closing the issue.
@malfet: What do we think about asking people to rebase to get CI working? We need to make sure that an upgrade doesn't affect users unnecessarily or depend on them doing a rebase to get CI working.

@malfet
Contributor Author

malfet commented Jun 20, 2024

One more example from PR #128222
[screenshot of unrelated CI failures on #128222]

@clee2000
Contributor

clee2000 commented Aug 6, 2024

Are there action items here that still need to be done? I've lost context since the issue is a bit old.

Projects
Status: Prioritized
Development

No branches or pull requests

8 participants