
PyTorch trunk is frequently broken #128180

Open
malfet opened this issue Jun 7, 2024 · 17 comments
Assignees
Labels
high priority module: ci Related to continuous integration triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@malfet
Contributor

malfet commented Jun 7, 2024

🐛 Describe the bug

For example, looking at #128123 (which only adds a docstring to the utils and therefore could not cause any failures other than lint), Dr. CI found two unstable jobs, one flaky job, and one new failure in the inductor tests:

[screenshot of the Dr. CI failure summary]

Not sure how a new developer who wants to make a small contribution to PyTorch is supposed to navigate through those multiple layers of dilapidation...

Versions

CI

cc @ezyang @gchanan @zou3519 @kadeng @msaroufim @seemethere @pytorch/pytorch-dev-infra

@malfet
Contributor Author

malfet commented Jun 7, 2024

Another doc-only change, #128136, somehow failed while rebuilding the Docker containers:
[screenshot of the failed Docker rebuild]

@huydhn
Contributor

huydhn commented Jun 7, 2024

All these unrelated failures are linked one way or another to the ongoing CUDA 12.4 / cuDNN upgrade. So an approach could be to figure out how to roll out similar upgrades in a safer manner. cc @atalman

@huydhn
Contributor

huydhn commented Jun 7, 2024

If we zoom in on doc-only changes, there is a TD (target determination) item to skip CI jobs when a PR only updates comments / docs. However, that capability has not been built yet. cc @clee2000
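To make the idea concrete, here is a minimal sketch of the kind of check such a TD rule could run, assuming a purely filename-based notion of "doc-only"; this is not the actual PyTorch target-determination code, and the file patterns are illustrative assumptions.

```python
# Hypothetical doc-only-PR check (a sketch, not the real PyTorch TD logic).
# The DOC_PATTERNS below are assumptions for illustration.
import fnmatch
import subprocess

DOC_PATTERNS = ["*.md", "*.rst", "docs/*"]


def changed_files(base: str = "origin/main") -> list[str]:
    """Files changed on this branch relative to the merge base with main."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    )
    return [f for f in out.stdout.splitlines() if f]


def is_doc_only(files: list[str]) -> bool:
    """True if every changed file matches one of the doc patterns."""
    return bool(files) and all(
        any(fnmatch.fnmatch(f, pat) for pat in DOC_PATTERNS) for f in files
    )


if __name__ == "__main__":
    if is_doc_only(changed_files()):
        print("doc-only change: heavy CI jobs could be skipped")
    else:
        print("non-doc files touched: run full CI")
```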

@atalman
Contributor

atalman commented Jun 7, 2024

The failures with cudnn.so are related to the fact that we have two changes:

  1. The underlying Docker image
  2. The pytorch/pytorch code

These two changes land one after another, and the Docker change is picked up automatically, but people who have not rebased are missing the pytorch/pytorch change. Perhaps we should open a non-blocking CI SEV when doing such updates?

Please note this is a major update: we have not updated the cuDNN major version since at least 2021.
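As a rough illustration of the failure mode described above (a sketch under assumptions, not an official tool; the commit hash is a placeholder, not the real upgrade commit), a branch can be checked for the companion pytorch/pytorch commit that pairs with the new Docker image; rebasing onto main is exactly what makes this check pass:

```python
# Sketch only: ask git whether the pytorch/pytorch commit that pairs with the
# upgraded Docker image is already an ancestor of the current branch.
# COMPANION_COMMIT is a hypothetical placeholder, not the real upgrade commit.
import subprocess

COMPANION_COMMIT = "<sha-of-the-cudnn-upgrade-commit>"


def branch_has_commit(sha: str) -> bool:
    """True if `sha` is an ancestor of HEAD, i.e. the branch is rebased past it."""
    result = subprocess.run(
        ["git", "merge-base", "--is-ancestor", sha, "HEAD"],
        capture_output=True,
    )
    return result.returncode == 0


if __name__ == "__main__":
    if branch_has_commit(COMPANION_COMMIT):
        print("Branch contains the companion commit; it should match the new image.")
    else:
        print("Branch predates the upgrade commit; rebase onto main before rerunning CI.")
```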

@janeyx99
Contributor

janeyx99 commented Jun 7, 2024

I've noticed a higher influx of red on my PRs lately, but I had no idea that it was correlated with a cuDNN update. If we know that these big updates could disrupt CI for all devs, it would be good to announce that risk more vocally so I know what to expect.

@ZainRizvi
Contributor

@atalman, which pytorch/pytorch commit do people need to rebase past to stop getting these spurious CUDA failures?

Let's mention that as part of the mitigation steps in the SEV.

@ZainRizvi
Contributor

Are the libcudnn.so failures (marked as unstable/flaky in @malfet's screenshot) also fixed by the pytorch/pytorch rebase, or are they a symptom of a separate issue?

@ZainRizvi
Contributor

Looking through the logs, we started seeing the "ImportError: libcudnn.so.8: cannot open shared object file: No such file or directory" failures about a week ago (May 31st):

https://hud.pytorch.org/failure?name=*&jobName=manywheel-py3_8-cuda12_4-test%20%2F%20test&failureCaptures=%5B%22ImportError%3A%20libcudnn.so.8%3A%20cannot%20open%20shared%20object%20file%3A%20No%20such%20file%20or%20directory%22%5D
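For anyone reproducing this locally, a quick sanity check along these lines (a minimal sketch, not an official diagnostic) reports which cuDNN the installed torch build actually loads, which distinguishes a stale libcudnn.so.8 from the upgraded libcudnn.so.9:

```python
# Minimal sketch: print the cuDNN version the installed torch actually loads.
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.backends.cudnn.is_available():
    # e.g. 8907 for cuDNN 8.9.7, 90100 for cuDNN 9.1.0
    print("cuDNN version:", torch.backends.cudnn.version())
else:
    print("cuDNN not available -- consistent with the ImportError above")
```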

@ZainRizvi
Contributor

As per offline sync with @atalman: A rebase should fix all the errors in @malfet's screenshot

@malfet
Contributor Author

malfet commented Jun 7, 2024

I understand that everything can be solved by a rebase and a retry; I've filed this issue to highlight that this expectation to rebase seems to be becoming more and more common.

@atalman
Contributor

atalman commented Jun 10, 2024

@malfet We will start working on this: pytorch/builder#1849. Moving the Docker workflow to pytorch/pytorch and fetching the correct Docker image should take care of errors similar to the "libcudnn.so.9" one.
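One way the "fetch the correct Docker image" part can be made rebase-proof is to key the image tag off the Docker build context in the same checkout, so a job can never pull an image newer than its own source tree. The sketch below is a hypothetical illustration of that idea, not the actual pytorch/pytorch workflow; the directory and image naming are assumptions.

```python
# Hypothetical illustration (not the real pytorch/pytorch CI code): derive the
# CI Docker image tag from a hash of the Docker build context, so every job
# pulls the image that matches the Dockerfile state in its own checkout.
import hashlib
import pathlib


def docker_context_tag(context_dir: str = ".ci/docker") -> str:
    """Stable tag that changes only when files under the Docker context change."""
    digest = hashlib.sha256()
    for path in sorted(pathlib.Path(context_dir).rglob("*")):
        if path.is_file():
            digest.update(str(path).encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()[:12]


if __name__ == "__main__":
    # The image name is an assumed example; the point is that image and source
    # changes are picked up together by any job checking out this tree.
    print(f"pytorch-linux-jammy-cuda12.4:{docker_context_tag()}")
```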

@atalman
Contributor

atalman commented Jun 10, 2024

Another example is here: #127456

[screenshot of unrelated CI failures on #127456]

@bdhirsh bdhirsh added high priority module: ci Related to continuous integration labels Jun 10, 2024
@albanD
Collaborator

albanD commented Jun 10, 2024

Unrelated to the cuDNN issue, here are the stats from the last year:
[chart of trunk CI health stats over the last year]

@bdhirsh
Contributor

bdhirsh commented Jun 10, 2024

From triage review: keeping high priority to track status; this will likely be an ongoing issue since there are usually new reasons for breakages.

Andrey: CUDA Docker builds run in two steps; moving the Docker builds to pytorch/pytorch would avoid some CUDA CI errors. But there are multiple different failures going on here.

@bdhirsh bdhirsh added triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module and removed triage review labels Jun 10, 2024
@huydhn
Contributor

huydhn commented Jun 11, 2024

@atalman: Break this down into smaller actionable items before closing the issue.
@malfet: What do we think about asking people to rebase to get CI working? We need to make sure that an upgrade doesn't affect users unnecessarily or depend on them doing a rebase to get CI working.

@malfet
Contributor Author

malfet commented Jun 20, 2024

One more example from PR #128222
[screenshot of unrelated CI failures on #128222]

@clee2000
Contributor

clee2000 commented Aug 6, 2024

Are there action items here that still need to be done? I've lost context since the issue is a bit old.

Projects
Status: Prioritized
Development

No branches or pull requests

8 participants