CUDA nightly docker actually includes CPU build of torch #125879
/cc @atalman
@atalman can you please check what is going on here?
Is this solved?
Same problem for CUDA.
We need to reopen this ticket, as today's cuda image has the same problem.
We have regressed again. I think without regular CI testing this will always be out of control.
@bhack you're right, our nightly Docker builds are not very stable. I do see an issue here: https://github.com/pytorch/pytorch/actions/runs/9314625747/job/25670242262#step:12:6821. We would need to add a test step on top of the Docker image build.
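Something like the following post-build step would catch this class of regression before an image is published. A minimal sketch: the image tag is hypothetical, and it relies on the private `torch.cuda._is_compiled()` helper that the validation work later in this thread also uses.

```
# Smoke test: fail the job if a CUDA-tagged image ships a CPU-only torch.
# The tag below is hypothetical, for illustration only.
IMAGE="ghcr.io/pytorch/pytorch:nightly-cuda12.1-cudnn8-devel"
docker run --rm "${IMAGE}" \
  python -c "import torch; assert torch.cuda._is_compiled(), 'CPU-only torch in CUDA image'"
```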
Yes, but I cannot use release images in this specific context, as I regularly submit compiler issues against a devel/testing environment. Given how quickly the compiler work in this repo evolves, I need to follow the nightlies and be able to roll back quickly on broken days.
Also, isn't it the same Dockerfile recipe as the stable releases?
It seems fixed:

```
The following packages will be downloaded:

    package                        |            build
    -------------------------------|-----------------
    pytorch-2.4.0.dev20240601      | py3.11_cuda12.1_cudnn8.9.2_0     1.39 GB  pytorch-nightly
    pytorch-mutex-1.0              | cuda                                3 KB  pytorch-nightly
    torchaudio-2.2.0.dev20240601   | py311_cu121                       6.4 MB  pytorch-nightly
    torchtriton-3.0.0+45fff310c8   | py311                           250.5 MB  pytorch-nightly
    torchvision-0.19.0.dev20240601 | py311_cu121                       8.7 MB  pytorch-nightly
    ------------------------------------------------------------
                                                          Total:     1.65 GB

The following NEW packages will be INSTALLED:

  torchtriton  pytorch-nightly/linux-64::torchtriton-3.0.0+45fff310c8-py311

The following packages will be UPDATED:

  pytorch        2.4.0.dev20240531-py3.11_cpu_0 --> 2.4.0.dev20240601-py3.11_cuda12.1_cudnn8.9.2_0
  pytorch-mutex  1.0-cpu --> 1.0-cuda
  torchaudio     2.2.0.dev20240531-py311_cpu --> 2.2.0.dev20240601-py311_cu121
  torchvision    0.19.0.dev20240531-py311_cpu --> 0.19.0.dev20240601-py311_cu121
```
Both are OK today. We could temporarily close this and focus on pytorch/builder#1432.
@bhack agreed. I think we need to add a call to https://github.com/pytorch/builder/blob/main/.github/workflows/validate_docker_images.yml from pytorch/pytorch after building the Docker image.
As a temporary workaround it would also be nicer not to push these nightlies to the registry at all, as it is a bit confusing to have a newly pushed nightly image that contains an old nightly wheel. A gate along the lines sketched below would do it.
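A sketch of that gate, where `${IMAGE}` stands for whatever tag the nightly job is about to publish: push only when the smoke test passes.

```
# Sketch: gate the registry push on a CUDA smoke test instead of pushing unconditionally.
if docker run --rm "${IMAGE}" \
     python -c "import torch, sys; sys.exit(0 if torch.cuda._is_compiled() else 1)"; then
  docker push "${IMAGE}"
else
  echo "Skipping push: ${IMAGE} ships a CPU-only torch" >&2
fi
```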
It is happening again today:
I think the fix is specific to `pip3 install --upgrade --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124`; conda instead is still broken, trying to push a GPU-to-CPU build "upgrade".
Also today conda is upgrading to the CPU build, but pip is OK.
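Until the channel is consistent, one stopgap (a sketch; the version and build strings are the ones from the 06/01 listing above and change daily) is to pin both the version and the CUDA build string, so conda errors out instead of silently selecting the CPU variant:

```
# Pinning version AND build string makes conda fail instead of falling back to CPU.
conda install -y -c pytorch-nightly -c nvidia \
  "pytorch=2.4.0.dev20240601=py3.11_cuda12.1_cudnn8.9.2_0" \
  "pytorch-mutex=1.0=cuda"
```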
…27768) Adds validation to docker images. As discussed here: #125879. Pull Request resolved: #127768. Approved by: https://github.com/huydhn, https://github.com/Skylion007
@bhack validation for the nightly Docker images has been added to CI: https://hud2.pytorch.org/hud/pytorch/pytorch/nightly/1?per_page=50&name_filter=Build%20Official%20Docker. I will now work on keeping two days of conda builds; this should mitigate installing CPU binaries rather than CUDA ones.
It is quite strange that a pip upgrade over the conda install in the Docker image is always OK. I've tested an upgrade every day over the last 5-6 nightlies.
Do you prefer to have a separate ticket for the conda nightly?

```
The following packages will be UPDATED:

  pytorch        2.4.0.dev20240604-py3.11_cuda12.4_cud~ --> 2.4.0.dev20240610-py3.11_cpu_0
  torchaudio     2.2.0.dev20240605-py311_cu124 --> 2.4.0.dev20240610-py311_cpu
  torchvision    0.19.0.dev20240605-py311_cu124 --> 0.19.0.dev20240610-py311_cpu

The following packages will be SUPERSEDED by a higher-priority channel:

  certifi  pkgs/main/linux-64::certifi-2024.6.2-~ --> conda-forge/noarch::certifi-2024.6.2-pyhd8ed1ab_0

The following packages will be DOWNGRADED:

  pytorch-mutex  1.0-cuda --> 1.0-cpu
```
@bhack yes, pip is always OK since we are not pruning download.pytorch.org. Having two nightly builds for conda should resolve the issue of installing CPU instead of GPU packages; however, it may install the previous nightly build rather than the current one. But this is the case with pip builds as well.
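Since the wheel index is not pruned, pip can also pin an exact nightly date when the latest one is broken. A sketch, assuming the cu121 nightly index and the wheel naming shown in the logs above:

```
# Install a specific dated nightly; older wheels remain on download.pytorch.org.
pip3 install --pre "torch==2.4.0.dev20240601+cu121" \
  --index-url https://download.pytorch.org/whl/nightly/cu121
```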
OK, thanks. Do you know if the conda layer is invalidated without a new release?
@atalman Something is broken again. I cannot compile a PyTorch custom op on today's devel image:

```
==========
== CUDA ==
==========

CUDA Version 12.4.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
   Use the NVIDIA Container Toolkit to start this container with GPU support; see
   https://docs.nvidia.com/datacenter/cloud-native/ .

False
```
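One caveat when reading output like the above: with no driver mounted, `torch.cuda.is_available()` returns False even on a correct CUDA build, so it cannot distinguish a missing driver from a CPU-only wheel. Driver-independent checks (a sketch; `torch.cuda._is_compiled()` is the same private helper the validation PR below uses) look like this:

```
# These checks need no GPU or driver inside the container:
python -c "import torch; print(torch.__version__)"          # a '+cpu'/'_cpu' build string reveals a CPU wheel
python -c "import torch; print(torch.version.cuda)"         # None on a CPU-only build
python -c "import torch; print(torch.cuda._is_compiled())"  # False on a CPU-only build
```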
@atalman do we have any testing for the Docker builds? Because it does look like we've included a CPU build there:
I think we have these: https://hud2.pytorch.org/hud/pytorch/pytorch/nightly/1?per_page=50&name_filter=Build%20Official%20Docker. But have we made pushing the images to the artifact registry depend on them? @atalman wrote: this test is done post-push; it will show us if the build failed in CI.
Also, it seems the wheel publication and the Docker build-and-push to the artifact registry are not coordinated: the 06/13 images contain the 06/12 CPU wheel. I've added this layer to the last nightly image: `RUN conda install -y pytorch torchvision torchaudio pytorc...` and the proposed transaction is to upgrade from 06/12 to 06/13 on the 06/13 images (see the sketch below).
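For reference, the kind of extra layer described above can be reproduced roughly like this. A sketch only: the base tag, channels, and package list are assumptions on my side, and the actual command is the truncated `conda install` quoted above.

```
# Hypothetical: rebuild with one verification layer on top of a published nightly image.
docker build -t torch-nightly-check - <<'EOF'
FROM ghcr.io/pytorch/pytorch:2.4.0.dev20240612-cuda12.4-cudnn9-devel
RUN conda install -y -c pytorch-nightly -c nvidia pytorch torchvision torchaudio \
 && python -c "import torch; print(torch.__version__, torch.version.cuda)"
EOF
```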
This PR, pytorch/test-infra#5332, should mitigate the conda nightly CPU-install issue by keeping two torch versions available at all times. I will monitor this job for the next few days; next I will look into adding smoke tests before the push. cc @bhack
While we don't have the vision/torch builds properly synced, adding this as a mitigation. Helps mitigate this issue: pytorch/pytorch#125879
But it seems we had a valid GPU wheel for conda today, as you can see in my extra-layer log: #125879 (comment). Why didn't the image get that one instead of the CPU build from the day before?
@bhack the workflow is initiated at the same time as all the other builds, on push to nightly. We should perhaps run it on a schedule when we are on the nightly branch; I will look into that.
Can we create a dependency between these jobs?
Yes, this is our intent. Ultimately we want a workflow like this.
…g to repo (#128852)

Related to: #125879

Would check if we are compiled with CUDA before publishing the CUDA Docker nightly image.

Test:

```
#18 [conda-installs 5/5] RUN IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())'); echo "Is torch compiled with cuda: ${IS_CUDA}"; if test "${IS_CUDA}" != "True" -a ! -z "12.4.0"; then exit 1; fi
#18 1.656 Is torch compiled with cuda: False
#18 ERROR: process "/bin/sh -c IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())'); echo \"Is torch compiled with cuda: ${IS_CUDA}\"; if test \"${IS_CUDA}\" != \"True\" -a ! -z \"${CUDA_VERSION}\"; then \texit 1; fi" did not complete successfully: exit code: 1
------
 > [conda-installs 5/5] RUN IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())'); echo "Is torch compiled with cuda: ${IS_CUDA}"; if test "${IS_CUDA}" != "True" -a ! -z "12.4.0"; then exit 1; fi:
1.656 Is torch compiled with cuda: False
------
Dockerfile:80
--------------------
  79 |     RUN /opt/conda/bin/pip install torchelastic
  80 | >>> RUN IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())');\
  81 | >>>     echo "Is torch compiled with cuda: ${IS_CUDA}"; \
  82 | >>>     if test "${IS_CUDA}" != "True" -a ! -z "${CUDA_VERSION}"; then \
  83 | >>>         exit 1; \
  84 | >>>     fi
  85 |
--------------------
ERROR: failed to solve: process "/bin/sh -c IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())'); echo \"Is torch compiled with cuda: ${IS_CUDA}\"; if test \"${IS_CUDA}\" != \"True\" -a ! -z \"${CUDA_VERSION}\"; then \texit 1; fi" did not complete successfully: exit code: 1
(base) [ec2-user@ip-172-30-2-248 pytorch]$ docker buildx build --progress=plain --platform="linux/amd64" --target official -t ghcr.io/pytorch/pytorch:2.5.0.dev20240617-cuda12.4-cudnn9-devel --build-arg BASE_IMAGE=nvidia/cuda:12.4.0-devel-ubuntu22.04 --build-arg PYTHON_VERSION=3.11 --build-arg CUDA_VERSION= --build-arg CUDA_CHANNEL=nvidia --build-arg PYTORCH_VERSION=2.5.0.dev20240617 --build-arg INSTALL_CHANNEL=pytorch --build-arg TRITON_VERSION= --build-arg CMAKE_VARS="" .
#0 building with "default" instance using docker driver
```

Please note: it looks like we are installing from the pytorch rather than the nightly channel on the PR, hence cuda 12.4 is failing since it's not in the pytorch channel yet: https://github.com/pytorch/pytorch/actions/runs/9555354734/job/26338476741?pr=128852

Pull Request resolved: #128852. Approved by: https://github.com/malfet
Hi @bhack, let me know if you still see this issue. It looks like adding the conda fallback and adding a test before pushing to ghcr.io helped to improve the situation.
Thanks, I may have a bit of bandwidth to test this again next week.
🐛 Describe the bug
Versions
pytorch-nightly
cc @ezyang @gchanan @zou3519 @kadeng @msaroufim