CUDA nightly docker actually includes CPU build of torch #125879

Closed
bhack opened this issue May 9, 2024 · 42 comments
Assignees
Labels
high priority module: docker module: regression It used to work, and now it doesn't triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@bhack
Contributor

bhack commented May 9, 2024

🐛 Describe the bug

 File "/opt/conda/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1077, in CUDAExtension
 library_dirs += library_paths(cuda=True)
                ^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1204, in library_paths
if (not os.path.exists(_join_cuda_home(lib_dir)) and
                            ^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 2417, in _join_cuda_home
     raise OSError('CUDA_HOME environment variable is not set. '
OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.
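For reference, a minimal sketch (not part of the original report, using only standard torch attributes) to check whether the wheel inside the image is actually a CUDA build before trying to compile an extension:

```bash
# A CUDA build of torch should report True and a CUDA version here;
# a CPU-only build reports False/None.
python -c "import torch; print('compiled with CUDA:', torch.cuda._is_compiled())"
python -c "import torch; print('torch.version.cuda:', torch.version.cuda)"
# CUDAExtension needs a resolvable CUDA toolkit; check whether CUDA_HOME is set.
echo "CUDA_HOME=${CUDA_HOME:-<unset>}"
```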

Versions

pytorch-nightly

cc @ezyang @gchanan @zou3519 @kadeng @msaroufim

@bhack
Contributor Author

bhack commented May 9, 2024

torch.cuda._is_compiled() returns False on the PyTorch nightly Docker image.

/cc @atalman

@bhack bhack changed the title from "Compiling extension on pytorch nightly image is broken without GPU" to "Compiling extension on pytorch nightly image is broken" on May 9, 2024
@bhack
Contributor Author

bhack commented May 9, 2024

docker run --rm -it ghcr.io/pytorch/pytorch-nightly:2.4.0.dev20240509-cuda12.4-cudnn8-devel python -c "import torch ; print(torch.cuda._is_compiled())"

False

@malfet
Contributor

malfet commented May 10, 2024

@atalman can you please check what is going on here?

@bhack
Contributor Author

bhack commented May 14, 2024

Is this solved?
I was using ghcr.io/pytorch/pytorch-nightly:2.4.0.dev20240514-cuda12.4-cudnn8-devel today and we have the same problem.

@bhack
Contributor Author

bhack commented May 14, 2024

Same problem for CUDA 12.1: 2.4.0.dev20240514-cuda12.1-cudnn8-devel

@bhack
Contributor Author

bhack commented May 15, 2024

We need to reopen this ticket, as today's image has the same problem:

docker run --rm -it ghcr.io/pytorch/pytorch-nightly:2.4.0.dev20240515-cuda12.4-cudnn8-devel python -c "import torch ; print(torch.cuda._is_compiled())"

False

@ezyang ezyang reopened this May 15, 2024
@bhack
Contributor Author

bhack commented May 15, 2024

CUDA 12.1 is OK, though.

@bhack
Contributor Author

bhack commented May 31, 2024

We have regressed again, this time also on 12.1:
docker run --rm -it ghcr.io/pytorch/pytorch-nightly:2.4.0.dev20240531-cuda12.1-cudnn8-devel python -c "import torch ; print(torch.cuda._is_compiled())"

False

I think that without regular CI testing this will always be out of control.

@atalman
Contributor

atalman commented May 31, 2024

@bhack you're right, our nightly Docker builds are not very stable. I do see an issue here: https://github.com/pytorch/pytorch/actions/runs/9314625747/job/25670242262#step:12:6821
For some reason the CPU version is getting installed rather than the GPU one. If you want to consume a more stable version, use a release image.

We would need to add a test step on top of the docker image build.

@bhack
Contributor Author

bhack commented May 31, 2024

Yes, but I cannot use release images in this specific context, as I am regularly submitting compiler issues from a devel/testing environment. Given the rapid evolution of the compiler work in this repo, I need to follow nightly closely and quickly roll back on broken days.
Releases are too far apart to interact regularly with the compiler team.

@bhack
Contributor Author

bhack commented May 31, 2024

Also, isn't it the same Dockerfile recipe as the stable releases?
Because in that case it would be better to strictly monitor it, as proposed at:

pytorch/builder#1432

@bhack
Contributor Author

bhack commented Jun 1, 2024

It seems fixed with 06/01, but we need to wait for tomorrow's nightly container. In any case, because of these instabilities I think that pytorch/builder#1432 is required.

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    pytorch-2.4.0.dev20240601  |py3.11_cuda12.1_cudnn8.9.2_0        1.39 GB  pytorch-nightly
    pytorch-mutex-1.0          |             cuda           3 KB  pytorch-nightly
    torchaudio-2.2.0.dev20240601|      py311_cu121         6.4 MB  pytorch-nightly
    torchtriton-3.0.0+45fff310c8|            py311       250.5 MB  pytorch-nightly
    torchvision-0.19.0.dev20240601|      py311_cu121         8.7 MB  pytorch-nightly
    ------------------------------------------------------------
                                           Total:        1.65 GB

The following NEW packages will be INSTALLED:

  torchtriton        pytorch-nightly/linux-64::torchtriton-3.0.0+45fff310c8-py311 

The following packages will be UPDATED:

  pytorch                    2.4.0.dev20240531-py3.11_cpu_0 --> 2.4.0.dev20240601-py3.11_cuda12.1_cudnn8.9.2_0 
  pytorch-mutex                                     1.0-cpu --> 1.0-cuda 
  torchaudio                    2.2.0.dev20240531-py311_cpu --> 2.2.0.dev20240601-py311_cu121 
  torchvision                  0.19.0.dev20240531-py311_cpu --> 0.19.0.dev20240601-py311_cu121 

@bhack
Contributor Author

bhack commented Jun 3, 2024

Both docker run --rm -it ghcr.io/pytorch/pytorch-nightly:2.4.0.dev20240602-cuda12.1-cudnn8-runtime python -c "import torch ; print(torch.cuda._is_compiled())"

and docker run --rm -it ghcr.io/pytorch/pytorch-nightly:2.4.0.dev20240602-cuda12.4-cudnn8-runtime python -c "import torch ; print(torch.cuda._is_compiled())"

are OK today.

We could temporarily close this and focus on pytorch/builder#1432

@atalman
Contributor

atalman commented Jun 3, 2024

@bhack agreed. I think we need to add a call to https://github.com/pytorch/builder/blob/main/.github/workflows/validate_docker_images.yml from pytorch/pytorch after building the docker image.

@albanD albanD added triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module and removed triage review labels Jun 3, 2024
@malfet malfet changed the title from "Compiling extension on pytorch nightly image is broken" to "CUDA nightly docker actually includes CPU build of torch" on Jun 3, 2024
@malfet
Contributor

malfet commented Jun 3, 2024

@atalman can you please explain whether #125887 actually fixed the problem and we then regressed again, or whether it was never a proper fix? It would also be good to find/mention the PRs that caused this regression in the first place.

@bhack
Contributor Author

bhack commented Jun 3, 2024

As a temporary workaround, it would also be nicer not to push these nightlies to the registry at all, as it is a bit confusing to have a newly pushed nightly image that contains an old nightly wheel.

@bhack
Contributor Author

bhack commented Jun 4, 2024

It is happening again today:

The following packages will be UPDATED:

  pytorch            2.4.0.dev20240602-py3.11_cuda12.4_cud~ --> 2.4.0.dev20240604-py3.11_cpu_0 
  torchaudio                  2.2.0.dev20240602-py311_cu124 --> 2.2.0.dev20240604-py311_cpu 
  torchvision                0.19.0.dev20240602-py311_cu124 --> 0.19.0.dev20240604-py311_cpu 

@bhack
Contributor Author

bhack commented Jun 5, 2024

I think it is specific to conda, because if I do a pip upgrade over the conda base install it works:

pip3 install  --upgrade --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124

whereas conda is still broken, trying to replace the GPU build with a CPU build during the upgrade.
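A sketch of that workaround, as it could be applied inside the affected image (the cu124 index URL is the one quoted above; the final assertion is an illustrative addition):

```bash
# Replace the conda-installed CPU wheels with the CUDA nightly wheels from the
# pip index, then confirm that torch now reports a CUDA build.
pip3 install --upgrade --pre torch torchvision torchaudio \
    --index-url https://download.pytorch.org/whl/nightly/cu124
python -c "import torch; assert torch.cuda._is_compiled(), 'still a CPU-only build'"
```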

@bhack
Contributor Author

bhack commented Jun 6, 2024

Also today, conda upgrades to the CPU build, while pip is OK.

pytorchmergebot pushed a commit that referenced this issue Jun 6, 2024
@atalman
Contributor

atalman commented Jun 10, 2024

@bhack Nightly CI validation has been added for the Docker images: https://hud2.pytorch.org/hud/pytorch/pytorch/nightly/1?per_page=50&name_filter=Build%20Official%20Docker

I will now work on keeping two days of conda builds - this should mitigate installing the CPU binaries rather than the CUDA ones.

@bhack
Contributor Author

bhack commented Jun 10, 2024

It is quite strange that a pip upgrade over the conda install in the docker image is always OK. I've tested an upgrade every day over the last 5-6 nightlies.

@bhack
Contributor Author

bhack commented Jun 10, 2024

Do you prefer a separate ticket for the conda nightly?
Because today we again have:

The following packages will be UPDATED:

  pytorch            2.4.0.dev20240604-py3.11_cuda12.4_cud~ --> 2.4.0.dev20240610-py3.11_cpu_0 
  torchaudio                  2.2.0.dev20240605-py311_cu124 --> 2.4.0.dev20240610-py311_cpu 
  torchvision                0.19.0.dev20240605-py311_cu124 --> 0.19.0.dev20240610-py311_cpu 

The following packages will be SUPERSEDED by a higher-priority channel:

  certifi            pkgs/main/linux-64::certifi-2024.6.2-~ --> conda-forge/noarch::certifi-2024.6.2-pyhd8ed1ab_0 

The following packages will be DOWNGRADED:

  pytorch-mutex                                    1.0-cuda --> 1.0-cpu 

@atalman
Contributor

atalman commented Jun 12, 2024

@bhack yes, pip is always OK since we are not pruning download.pytorch.org. Keeping two nightly builds for conda should resolve the issue of installing the CPU instead of the GPU build. However, it may install the previous nightly build rather than the current one. But this is the case with pip builds as well.

@bhack
Contributor Author

bhack commented Jun 12, 2024

OK, thanks. Do you know whether the conda layer is invalidated even without a new release?
Because there is a risk of downloading another 5 GB layer from the registry, with no new release inside, just by tracking the nightly images.
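One way to check this without paying for the pull (a sketch; skopeo and jq are assumed to be available, the tag is just an example taken from this thread, and the image is assumed to have been pulled before):

```bash
# Compare the remote manifest digest with the locally cached one before pulling;
# a pull only re-downloads layers when the digests differ.
IMAGE=ghcr.io/pytorch/pytorch-nightly:2.5.0.dev20240613-cuda12.4-cudnn9-devel
REMOTE=$(skopeo inspect "docker://${IMAGE}" | jq -r '.Digest')
LOCAL=$(docker image inspect --format '{{index .RepoDigests 0}}' "${IMAGE}" | cut -d@ -f2)
echo "remote: ${REMOTE}"
echo "local:  ${LOCAL}"
```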

@bhack
Contributor Author

bhack commented Jun 13, 2024

@atalman Something is broken again. I cannot compile a PyTorch custom op on today's devel image ghcr.io/pytorch/pytorch-nightly:2.5.0.dev20240613-cuda12.4-cudnn9-devel

docker run --rm -it ghcr.io/pytorch/pytorch-nightly:2.5.0.dev20240613-cuda12.4-cudnn9-devel python -c "import torch ; print(torch.cuda._is_compiled())"

==========
== CUDA ==
==========

CUDA Version 12.4.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
   Use the NVIDIA Container Toolkit to start this container with GPU support; see
   https://docs.nvidia.com/datacenter/cloud-native/ .

False

@malfet
Contributor

malfet commented Jun 13, 2024

@atalman do we have any testing for the docker builds? Because it does indeed look like we've included a CPU build there:

$ docker run --rm -it ghcr.io/pytorch/pytorch-nightly:2.5.0.dev20240613-cuda12.4-cudnn9-devel   python -c "import torch ; print(torch.__config__.show())"

...

PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.4.2 (Git Hash 1137e04ec0b5251ca2b4400a4fd3c667ce843d67)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=2.4.0, USE_CUDA=0, USE_CUDNN=OFF, USE_CUSPARSELT=OFF, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, 
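A sketch of the kind of image smoke test being asked about here (an illustration only; the check later added in #128852 and quoted further down in this thread is similar but authoritative). CUDA_VERSION is assumed to be set by the NVIDIA CUDA base images and empty/unset for CPU images:

```bash
# Fail fast if a CUDA-tagged image ships a CPU-only torch build.
set -e
IS_CUDA=$(python -c "import torch; print(torch.cuda._is_compiled())")
echo "Is torch compiled with cuda: ${IS_CUDA}"
if [ -n "${CUDA_VERSION:-}" ] && [ "${IS_CUDA}" != "True" ]; then
    echo "ERROR: CUDA image (CUDA_VERSION=${CUDA_VERSION}) contains a CPU-only torch" >&2
    exit 1
fi
```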

@bhack
Contributor Author

bhack commented Jun 13, 2024

I think we have these: https://hud2.pytorch.org/hud/pytorch/pytorch/nightly/1?per_page=50&name_filter=Build%20Official%20Docker

But have we made pushing the images to the artifact registry depend on this?

@atalman wrote: This test is done post-push - it will show us if the build failed in the CI

@bhack
Contributor Author

bhack commented Jun 13, 2024

Also, it seems that the wheel publication and the docker build/push to the artifact registry are not coordinated: the 06/13 images contain the 06/12 CPU wheel.

I've added this layer to the latest nightly image:

RUN conda install -y pytorch torchvision torchaudio pytorc...

and conda proposes to upgrade from 06/12 to 06/13 on the 06/13 images.

#11 [builder 5/6] RUN conda install -y pytorch torchvision torchaudio pytorc...
#11 1.922 Channels:
#11 1.922  - pytorch-nightly
#11 1.922  - nvidia
#11 1.922  - defaults
#11 1.922 Platform: linux-64
#11 1.922 Collecting package metadata (repodata.json): ...working... done
#11 5.638 Solving environment: ...working... done
#11 220.9 
#11 220.9 ## Package Plan ##
#11 220.9 
#11 220.9   environment location: /opt/conda
#11 220.9 
#11 220.9   added / updated specs:
#11 220.9     - pytorch
#11 220.9     - pytorch-cuda=12.4
#11 220.9     - torchaudio
#11 220.9     - torchvision
#11 220.9 
#11 220.9 
#11 220.9 The following packages will be downloaded:
#11 220.9 
#11 220.9     package                    |            build
#11 220.9     ---------------------------|-----------------
#11 220.9     pytorch-2.5.0.dev20240613  |py3.11_cuda12.4_cudnn9.1.0_0        1.34 GB  pytorch-nightly
#11 220.9     pytorch-mutex-1.0          |             cuda           3 KB  pytorch-nightly
#11 220.9     torchaudio-2.4.0.dev20240613|      py311_cu124         6.3 MB  pytorch-nightly
#11 220.9     torchtriton-3.0.0+45fff310c8|            py311       250.6 MB  pytorch-nightly
#11 220.9     torchvision-0.19.0.dev20240613|      py311_cu124         8.7 MB  pytorch-nightly
#11 220.9     ------------------------------------------------------------
#11 220.9                                            Total:        1.60 GB
#11 220.9 
#11 220.9 The following NEW packages will be INSTALLED:
#11 220.9 
#11 220.9   torchtriton        pytorch-nightly/linux-64::torchtriton-3.0.0+45fff310c8-py311 
#11 220.9 
#11 220.9 The following packages will be UPDATED:
#11 220.9 
#11 220.9   pytorch                    2.4.0.dev20240612-py3.11_cpu_0 --> 2.5.0.dev20240613-py3.11_cuda12.4_cudnn9.1.0_0 
#11 220.9   pytorch-mutex                                     1.0-cpu --> 1.0-cuda 
#11 220.9   torchaudio                    2.4.0.dev20240612-py311_cpu --> 2.4.0.dev20240613-py311_cu124 
#11 220.9   torchvision                  0.19.0.dev20240612-py311_cpu --> 0.19.0.dev20240613-py311_cu124 
#11 220.9 

@atalman
Contributor

atalman commented Jun 13, 2024

This PR, pytorch/test-infra#5332, should mitigate the issue with the conda nightly CPU install by keeping two torch versions available at all times. I will monitor this job for the next few days.

Next, I will look into adding smoke tests before the push.

cc @bhack

atalman added a commit to pytorch/test-infra that referenced this issue Jun 13, 2024
While we don't have the vision/torch builds properly synced, adding this as a mitigation.

Helps mitigate this issue:
pytorch/pytorch#125879
@bhack
Contributor Author

bhack commented Jun 13, 2024

But it seems we had a valid GPU wheel for conda today, as you can see in my extra-layer log in #125879 (comment). Why didn't the image get that one, instead of the CPU wheel from the day before?

@atalman
Contributor

atalman commented Jun 14, 2024

@bhack the workflow is initiated at the same time as all the other builds, on push to nightly:
https://github.com/pytorch/pytorch/blob/main/.github/workflows/docker-release.yml#L13
Hence it picks up the previous nightly build.

We should perhaps run it on a schedule when we are on the nightly branch. I will look into that.

@bhack
Contributor Author

bhack commented Jun 14, 2024

Can we create a dependency between these jobs?

@atalman
Contributor

atalman commented Jun 14, 2024

Yes, that is our intent. Ultimately we want a workflow like this:
torch -> vision/audio -> validations/docker builds

TharinduRusira pushed a commit to TharinduRusira/pytorch that referenced this issue Jun 14, 2024
@atalman atalman assigned atalman and unassigned atalman Jun 17, 2024
pytorchmergebot pushed a commit that referenced this issue Jun 17, 2024
…g to repo (#128852)

Related to: #125879
This checks whether torch is compiled with CUDA before publishing a CUDA Docker nightly image.

Test
```
#18 [conda-installs 5/5] RUN IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())');    echo "Is torch compiled with cuda: ${IS_CUDA}";     if test "${IS_CUDA}" != "True" -a ! -z "12.4.0"; then 	exit 1;     fi
#18 1.656 Is torch compiled with cuda: False
#18 ERROR: process "/bin/sh -c IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())');    echo \"Is torch compiled with cuda: ${IS_CUDA}\";     if test \"${IS_CUDA}\" != \"True\" -a ! -z \"${CUDA_VERSION}\"; then \texit 1;     fi" did not complete successfully: exit code: 1
------
 > [conda-installs 5/5] RUN IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())');    echo "Is torch compiled with cuda: ${IS_CUDA}";     if test "${IS_CUDA}" != "True" -a ! -z "12.4.0"; then 	exit 1;     fi:
1.656 Is torch compiled with cuda: False
------
Dockerfile:80
--------------------
  79 |     RUN /opt/conda/bin/pip install torchelastic
  80 | >>> RUN IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())');\
  81 | >>>     echo "Is torch compiled with cuda: ${IS_CUDA}"; \
  82 | >>>     if test "${IS_CUDA}" != "True" -a ! -z "${CUDA_VERSION}"; then \
  83 | >>> 	exit 1; \
  84 | >>>     fi
  85 |
--------------------
ERROR: failed to solve: process "/bin/sh -c IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())');    echo \"Is torch compiled with cuda: ${IS_CUDA}\";     if test \"${IS_CUDA}\" != \"True\" -a ! -z \"${CUDA_VERSION}\"; then \texit 1;     fi" did not complete successfully: exit code: 1
(base) [ec2-user@ip-172-30-2-248 pytorch]$ docker buildx build --progress=plain  --platform="linux/amd64"  --target official -t ghcr.io/pytorch/pytorch:2.5.0.dev20240617-cuda12.4-cudnn9-devel --build-arg BASE_IMAGE=nvidia/cuda:12.4.0-devel-ubuntu22.04 --build-arg PYTHON_VERSION=3.11 --build-arg CUDA_VERSION= --build-arg CUDA_CHANNEL=nvidia --build-arg PYTORCH_VERSION=2.5.0.dev20240617 --build-arg INSTALL_CHANNEL=pytorch --build-arg TRITON_VERSION= --build-arg CMAKE_VARS="" .
#0 building with "default" instance using docker driver
```

Please note: it looks like we are installing from the pytorch channel rather than the nightly channel on the PR, hence CUDA 12.4 is failing since it's not in the pytorch channel yet:
https://github.com/pytorch/pytorch/actions/runs/9555354734/job/26338476741?pr=128852

Pull Request resolved: #128852
Approved by: https://github.com/malfet
@atalman
Contributor

atalman commented Jun 27, 2024

Hi @bhack, let me know if you still see this issue. It looks like adding the conda fallback and adding a test before pushing to ghcr.io helped improve the situation:

[Screenshot: 2024-06-27 at 1:39 PM]

@bhack
Contributor Author

bhack commented Jun 27, 2024

Thanks, I may have a bit of bandwidth to test this again next week.

@bhack bhack closed this as completed Aug 23, 2024
@atalman
Contributor

atalman commented Aug 23, 2024

@bhack Thank you for closing this. Please note that we switched from conda to pip in this workflow; here is the PR that makes the switch: #134274
Please let me know if you see any issues on your end because of this.
