Pytorch nightly docker image invalidated layers #125862

Open
bhack opened this issue May 9, 2024 · 10 comments
Labels
module: binaries (Anything related to official binaries that we release to users)
module: docker
triaged (This issue has been looked at a team member, and triaged and prioritized into an appropriate module)

Comments

@bhack
Contributor

bhack commented May 9, 2024

🐛 Describe the bug

Is there something that is invalidating Docker nightly image layers?

It seems that every time I pull a new nightly image, it re-downloads all the heavy layers.

Versions

pytorch nightly

cc @seemethere @malfet @osalpekar @atalman

@atalman
Contributor

atalman commented May 9, 2024

We recently introduced this workflow, which runs smoke tests: https://github.com/pytorch/builder/actions/runs/9023482505/job/24795381627. I see we had an issue last night; looking into that.

@bhack
Contributor Author

bhack commented May 9, 2024

Do we have a self-check to verify that layers are not invalidated on every build?
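
One possible shape for such a self-check, as a minimal sketch: compare the ordered layer digests of two consecutive nightly images and count how many leading layers they share. The digest lists are assumed to be collected separately (for example from the `RootFS.Layers` field reported by `docker image inspect`); the function name and the sample digests are hypothetical, and the prefix comparison is a deliberately conservative measure of reuse.

```python
from typing import Sequence


def shared_layer_prefix(old: Sequence[str], new: Sequence[str]) -> int:
    """Return how many leading layers (base layer first) are identical."""
    count = 0
    for old_digest, new_digest in zip(old, new):
        if old_digest != new_digest:
            break
        count += 1
    return count


# Hypothetical digests for two consecutive nightlies (not real values).
yesterday = ["sha256:base", "sha256:cuda", "sha256:conda-0513", "sha256:env-0513"]
today = ["sha256:base", "sha256:cuda", "sha256:conda-0514", "sha256:env-0514"]

shared = shared_layer_prefix(yesterday, today)
print(f"{shared} of {len(today)} layers are shared with the previous nightly")
```

A CI job could fail, or at least warn, when the shared count drops below an expected threshold.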

@bhack
Contributor Author

bhack commented May 9, 2024

I see we had an issue last night; looking into that.

That one is another bug: #125879

Here instead I am talking about invalidating common layers at every nightly rebuild/push/pull

@malfet malfet added module: binaries Anything related to official binaries that we release to users module: docker triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module labels May 10, 2024
@atalman atalman self-assigned this May 10, 2024
@atalman
Contributor

atalman commented May 10, 2024

@bhack can you please describe your idea in more detail, with links? Are you referring to https://docs.docker.com/build/cache/invalidation/? Or do you have a particular tool in mind that we could look into?

@bhack
Contributor Author

bhack commented May 10, 2024

Yes it is mainly:

Once one layer is invalidated, all following layers are also invalidated

https://docs.docker.com/guides/docker-concepts/building-images/using-the-build-cache/

So even if only a single layer in the Dockerfile differs every night, we invalidate all the following layers in the image (and also the cache, of course).
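
The cascading effect can be illustrated with a toy model in which each step's cache key is derived from its own instruction plus the key of the step before it. This is only a simplified sketch of the idea, not how Docker or BuildKit actually compute cache keys; the function name and the Dockerfile-like steps are hypothetical.

```python
import hashlib
from typing import List


def chained_cache_keys(instructions: List[str]) -> List[str]:
    """Derive one key per instruction; each key also hashes the parent key."""
    keys = []
    parent = ""
    for instruction in instructions:
        parent = hashlib.sha256((parent + "\n" + instruction).encode()).hexdigest()
        keys.append(parent)
    return keys


# Hypothetical build steps; only the third one changes between nightlies.
monday = [
    "FROM base",
    "RUN apt-get install tools",
    "RUN install pytorch-dev20240513",
    "ENV PATH=/opt/conda/bin",
]
tuesday = [
    "FROM base",
    "RUN apt-get install tools",
    "RUN install pytorch-dev20240514",
    "ENV PATH=/opt/conda/bin",
]

hits = [a == b for a, b in zip(chained_cache_keys(monday), chained_cache_keys(tuesday))]
print(hits)  # [True, True, False, False]
```

The last step is textually identical on both days, yet its key differs because the key of the step before it changed.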

@seemethere
Member

Yeah, unfortunately the cache-ability of Python packages within a Docker container doesn't really make this an easy task.

However I'd happily review a PR if you feel like there's an opportunity to make this better!

As well, I think this might be more of a specialized workflow that doesn't represent what our average user might see so I think it might be hard for us to prioritize this as of right now.

@bhack bhack changed the title from "Pytorch nightly docker image" to "Pytorch nightly docker image invalidated layers" May 10, 2024
@bhack
Contributor Author

bhack commented May 14, 2024

If we take, e.g., today's image, docker history ghcr.io/pytorch/pytorch-nightly:2.4.0.dev20240514-cuda12.1-cudnn8-devel shows:

IMAGE          CREATED        CREATED BY                                      SIZE      COMMENT
eec515b76ada   6 hours ago    WORKDIR /workspace                              0B        buildkit.dockerfile.v0
<missing>      6 hours ago    ENV PYTORCH_VERSION=2.4.0.dev20240514           0B        buildkit.dockerfile.v0
<missing>      6 hours ago    ENV PATH=/usr/local/nvidia/bin:/usr/local/cu…   0B        buildkit.dockerfile.v0
<missing>      6 hours ago    ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/u…   0B        buildkit.dockerfile.v0
<missing>      6 hours ago    ENV NVIDIA_DRIVER_CAPABILITIES=compute,utili…   0B        buildkit.dockerfile.v0
<missing>      6 hours ago    ENV NVIDIA_VISIBLE_DEVICES=all                  0B        buildkit.dockerfile.v0
<missing>      6 hours ago    ENV PATH=/opt/conda/bin:/usr/local/nvidia/bi…   0B        buildkit.dockerfile.v0
<missing>      6 hours ago    RUN |4 PYTORCH_VERSION=2.4.0.dev20240514 TRI…   6.54kB    buildkit.dockerfile.v0
<missing>      6 hours ago    COPY /opt/conda /opt/conda # buildkit           4.81GB    buildkit.dockerfile.v0
<missing>      6 hours ago    RUN |4 PYTORCH_VERSION=2.4.0.dev20240514 TRI…   3.31MB    buildkit.dockerfile.v0
<missing>      6 hours ago    LABEL com.nvidia.volumes.needed=nvidia_driver   0B        buildkit.dockerfile.v0
<missing>      6 hours ago    ARG CUDA_VERSION                                0B        buildkit.dockerfile.v0
<missing>      6 hours ago    ARG TARGETPLATFORM                              0B        buildkit.dockerfile.v0
<missing>      6 hours ago    ARG TRITON_VERSION                              0B        buildkit.dockerfile.v0
<missing>      6 hours ago    ARG PYTORCH_VERSION                             0B        buildkit.dockerfile.v0
<missing>      6 months ago   ENV LIBRARY_PATH=/usr/local/cuda/lib64/stubs    0B        buildkit.dockerfile.v0
<missing>      6 months ago   RUN |1 TARGETARCH=amd64 /bin/sh -c apt-mark …   386kB     buildkit.dockerfile.v0
<missing>      6 months ago   RUN |1 TARGETARCH=amd64 /bin/sh -c apt-get u…   4.79GB    buildkit.dockerfile.v0
<missing>      6 months ago   LABEL maintainer=NVIDIA CORPORATION <cudatoo…   0B        buildkit.dockerfile.v0
<missing>      6 months ago   ARG TARGETARCH                                  0B        buildkit.dockerfile.v0
<missing>      6 months ago   ENV NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.17.…   0B        buildkit.dockerfile.v0
<missing>      6 months ago   ENV NCCL_VERSION=2.17.1-1                       0B        buildkit.dockerfile.v0
<missing>      6 months ago   ENV NV_LIBNCCL_DEV_PACKAGE_VERSION=2.17.1-1     0B        buildkit.dockerfile.v0
<missing>      6 months ago   ENV NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev     0B        buildkit.dockerfile.v0
<missing>      6 months ago   ENV NV_NVPROF_DEV_PACKAGE=cuda-nvprof-12-1=1…   0B        buildkit.dockerfile.v0
<missing>      6 months ago   ENV NV_NVPROF_VERSION=12.1.105-1                0B        buildkit.dockerfile.v0
<missing>      6 months ago   ENV NV_CUDA_NSIGHT_COMPUTE_DEV_PACKAGE=cuda-…   0B        buildkit.dockerfile.v0
<missing>      6 months ago   ENV NV_CUDA_NSIGHT_COMPUTE_VERSION=12.1.1-1     0B        buildkit.dockerfile.v0
<missing>      6 months ago   ENV NV_LIBCUBLAS_DEV_PACKAGE=libcublas-dev-1…   0B        buildkit.dockerfile.v0
<missing>      6 months ago   ENV NV_LIBCUBLAS_DEV_PACKAGE_NAME=libcublas-…   0B        buildkit.dockerfile.v0
<missing>      6 months ago   ENV NV_LIBCUBLAS_DEV_VERSION=12.1.3.1-1         0B        buildkit.dockerfile.v0
<missing>      6 months ago   ENV NV_LIBNPP_DEV_PACKAGE=libnpp-dev-12-1=12…   0B        buildkit.dockerfile.v0
<missing>      6 months ago   ENV NV_LIBNPP_DEV_VERSION=12.1.0.40-1           0B        buildkit.dockerfile.v0
<missing>      6 months ago   ENV NV_LIBCUSPARSE_DEV_VERSION=12.1.0.106-1     0B        buildkit.dockerfile.v0
<missing>      6 months ago   ENV NV_NVML_DEV_VERSION=12.1.105-1              0B        buildkit.dockerfile.v0
<missing>      6 months ago   ENV NV_CUDA_CUDART_DEV_VERSION=12.1.105-1       0B        buildkit.dockerfile.v0
<missing>      6 months ago   ENV NV_CUDA_LIB_VERSION=12.1.1-1                0B        buildkit.dockerfile.v0
<missing>      6 months ago   ENTRYPOINT ["/opt/nvidia/nvidia_entrypoint.s…   0B        buildkit.dockerfile.v0
<missing>      6 months ago   ENV NVIDIA_PRODUCT_NAME=CUDA                    0B        buildkit.dockerfile.v0
<missing>      6 months ago   COPY nvidia_entrypoint.sh /opt/nvidia/ # bui…   2.53kB    buildkit.dockerfile.v0
<missing>      6 months ago   COPY entrypoint.d/ /opt/nvidia/entrypoint.d/…   3.06kB    buildkit.dockerfile.v0
<missing>      6 months ago   RUN |1 TARGETARCH=amd64 /bin/sh -c apt-mark …   261kB     buildkit.dockerfile.v0
<missing>      6 months ago   RUN |1 TARGETARCH=amd64 /bin/sh -c apt-get u…   2.01GB    buildkit.dockerfile.v0
<missing>      6 months ago   LABEL maintainer=NVIDIA CORPORATION <cudatoo…   0B        buildkit.dockerfile.v0
<missing>      6 months ago   ARG TARGETARCH                                  0B        buildkit.dockerfile.v0
<missing>      6 months ago   ENV NV_LIBNCCL_PACKAGE=libnccl2=2.17.1-1+cud…   0B        buildkit.dockerfile.v0
<missing>      6 months ago   ENV NCCL_VERSION=2.17.1-1                       0B        buildkit.dockerfile.v0
<missing>      6 months ago   ENV NV_LIBNCCL_PACKAGE_VERSION=2.17.1-1         0B        buildkit.dockerfile.v0
<missing>      6 months ago   ENV NV_LIBNCCL_PACKAGE_NAME=libnccl2            0B        buildkit.dockerfile.v0
<missing>      6 months ago   ENV NV_LIBCUBLAS_PACKAGE=libcublas-12-1=12.1…   0B        buildkit.dockerfile.v0
<missing>      6 months ago   ENV NV_LIBCUBLAS_VERSION=12.1.3.1-1             0B        buildkit.dockerfile.v0
<missing>      6 months ago   ENV NV_LIBCUBLAS_PACKAGE_NAME=libcublas-12-1    0B        buildkit.dockerfile.v0
<missing>      6 months ago   ENV NV_LIBCUSPARSE_VERSION=12.1.0.106-1         0B        buildkit.dockerfile.v0
<missing>      6 months ago   ENV NV_LIBNPP_PACKAGE=libnpp-12-1=12.1.0.40-1   0B        buildkit.dockerfile.v0
<missing>      6 months ago   ENV NV_LIBNPP_VERSION=12.1.0.40-1               0B        buildkit.dockerfile.v0
<missing>      6 months ago   ENV NV_NVTX_VERSION=12.1.105-1                  0B        buildkit.dockerfile.v0
<missing>      6 months ago   ENV NV_CUDA_LIB_VERSION=12.1.1-1                0B        buildkit.dockerfile.v0
<missing>      6 months ago   ENV NVIDIA_DRIVER_CAPABILITIES=compute,utili…   0B        buildkit.dockerfile.v0
<missing>      6 months ago   ENV NVIDIA_VISIBLE_DEVICES=all                  0B        buildkit.dockerfile.v0
<missing>      6 months ago   COPY NGC-DL-CONTAINER-LICENSE / # buildkit      17.3kB    buildkit.dockerfile.v0
<missing>      6 months ago   ENV LD_LIBRARY_PATH=/usr/local/nvidia/lib:/u…   0B        buildkit.dockerfile.v0
<missing>      6 months ago   ENV PATH=/usr/local/nvidia/bin:/usr/local/cu…   0B        buildkit.dockerfile.v0
<missing>      6 months ago   RUN |1 TARGETARCH=amd64 /bin/sh -c echo "/us…   46B       buildkit.dockerfile.v0
<missing>      6 months ago   RUN |1 TARGETARCH=amd64 /bin/sh -c apt-get u…   150MB     buildkit.dockerfile.v0
<missing>      6 months ago   ENV CUDA_VERSION=12.1.1                         0B        buildkit.dockerfile.v0
<missing>      6 months ago   RUN |1 TARGETARCH=amd64 /bin/sh -c apt-get u…   10.5MB    buildkit.dockerfile.v0
<missing>      6 months ago   LABEL maintainer=NVIDIA CORPORATION <cudatoo…   0B        buildkit.dockerfile.v0
<missing>      6 months ago   ARG TARGETARCH                                  0B        buildkit.dockerfile.v0
<missing>      6 months ago   ENV NV_CUDA_COMPAT_PACKAGE=cuda-compat-12-1     0B        buildkit.dockerfile.v0
<missing>      6 months ago   ENV NV_CUDA_CUDART_VERSION=12.1.105-1           0B        buildkit.dockerfile.v0
<missing>      6 months ago   ENV NVIDIA_REQUIRE_CUDA=cuda>=12.1 brand=tes…   0B        buildkit.dockerfile.v0
<missing>      6 months ago   ENV NVARCH=x86_64                               0B        buildkit.dockerfile.v0
<missing>      7 months ago   /bin/sh -c #(nop)  CMD ["/bin/bash"]            0B        
<missing>      7 months ago   /bin/sh -c #(nop) ADD file:63d5ab3ef0aab308c…   77.8MB    
<missing>      7 months ago   /bin/sh -c #(nop)  LABEL org.opencontainers.…   0B        
<missing>      7 months ago   /bin/sh -c #(nop)  LABEL org.opencontainers.…   0B        
<missing>      7 months ago   /bin/sh -c #(nop)  ARG LAUNCHPAD_BUILD_ARCH     0B        
<missing>      7 months ago   /bin/sh -c #(nop)  ARG RELEASE                  0B 

Are we going to produce a fresh new ~5GB layer every night with COPY /opt/conda /opt/conda?

Other than the download time and the bootstrap/provisioning lag, this will quickly grow any registry that stores derived images.
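
To quantify how much of an image is rebuilt each night, the `docker history` output can be summed per CREATED value. The following is a rough sketch that parses a few rows copied from the listing above; the parsing rules (columns separated by two or more spaces, decimal size units) are assumptions about the output format, and the helper name is hypothetical.

```python
import re
from typing import Dict, List, Tuple

UNITS = {"B": 1, "kB": 10**3, "MB": 10**6, "GB": 10**9}
SIZE_RE = re.compile(r"^(\d+(?:\.\d+)?)(B|kB|MB|GB)$")

# A few rows copied from the docker history output above.
SAMPLE = """
eec515b76ada   6 hours ago    WORKDIR /workspace                              0B        buildkit.dockerfile.v0
<missing>      6 hours ago    RUN |4 PYTORCH_VERSION=2.4.0.dev20240514 TRI…   6.54kB    buildkit.dockerfile.v0
<missing>      6 hours ago    COPY /opt/conda /opt/conda # buildkit           4.81GB    buildkit.dockerfile.v0
<missing>      6 hours ago    RUN |4 PYTORCH_VERSION=2.4.0.dev20240514 TRI…   3.31MB    buildkit.dockerfile.v0
<missing>      6 months ago   RUN |1 TARGETARCH=amd64 /bin/sh -c apt-get u…   4.79GB    buildkit.dockerfile.v0
<missing>      6 months ago   RUN |1 TARGETARCH=amd64 /bin/sh -c apt-get u…   2.01GB    buildkit.dockerfile.v0
"""


def parse_history(text: str) -> List[Tuple[str, float]]:
    """Return (created, size_in_bytes) for each row of the history text."""
    rows = []
    for line in text.strip().splitlines():
        fields = re.split(r"\s{2,}", line.strip())
        created = fields[1]
        # Take the first field after CREATED that looks like a size.
        for field in fields[2:]:
            match = SIZE_RE.match(field)
            if match:
                rows.append((created, float(match.group(1)) * UNITS[match.group(2)]))
                break
    return rows


totals: Dict[str, float] = {}
for created, size in parse_history(SAMPLE):
    totals[created] = totals.get(created, 0.0) + size

for created, size in totals.items():
    print(f"{created}: {size / 1e9:.2f} GB")
```

In this excerpt nearly all of the freshly created bytes come from the single COPY /opt/conda layer.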

@bhack
Contributor Author

bhack commented Jun 4, 2024

@atalman I think the main issue is the COPY --from=build /opt/conda /opt/conda and COPY --from=conda /opt/conda /opt/conda layers, which are close to 5GB and get invalidated every single day.

What do you think about splitting this monolithic layer, so that we no longer have a 5GB layer to re-download and to store in the build cache and in the artifact registry every day?

E.g. the following might be a little too granular, but I think we could find a balance. What do you think?

COPY --from=build /opt/conda/bin /opt/conda/bin
COPY --from=build /opt/conda/envs /opt/conda/envs
COPY --from=build /opt/conda/lib /opt/conda/lib
COPY --from=build /opt/conda/include /opt/conda/include
COPY --from=build /opt/conda/etc /opt/conda/etc
COPY --from=build /opt/conda/conda-meta /opt/conda/conda-meta
COPY --from=build /opt/conda/pkgs /opt/conda/pkgs
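
Splitting only pays off for subdirectories whose contents stay the same between builds. One way to check this, sketched here under stated assumptions, is to hash each top-level subdirectory of two builds and list the ones that differ. The helper names and the tiny sample trees are hypothetical, and the digest covers file paths and contents only, whereas the real build cache may also take file metadata into account.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List


def dir_digest(root: Path) -> str:
    """Hash relative file paths and contents under root in a stable order."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()


def changed_subdirs(old: Path, new: Path) -> List[str]:
    """Return names of top-level subdirectories whose contents differ."""
    old_names = {p.name for p in old.iterdir() if p.is_dir()}
    new_names = {p.name for p in new.iterdir() if p.is_dir()}
    changed = []
    for name in sorted(old_names | new_names):
        a, b = old / name, new / name
        if not (a.is_dir() and b.is_dir()) or dir_digest(a) != dir_digest(b):
            changed.append(name)
    return changed


def make_tree(root: Path, files: Dict[str, str]) -> None:
    """Create a small directory tree from a {relative_path: content} mapping."""
    for rel, content in files.items():
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)


with tempfile.TemporaryDirectory() as tmp:
    old, new = Path(tmp) / "old", Path(tmp) / "new"
    # Stand-ins for two nightly /opt/conda trees; only lib/ differs.
    make_tree(old, {"bin/python": "py", "lib/torch.so": "nightly-0513", "etc/condarc": "cfg"})
    make_tree(new, {"bin/python": "py", "lib/torch.so": "nightly-0514", "etc/condarc": "cfg"})
    result = changed_subdirs(old, new)
    print(result)  # ['lib']
```

Running such a comparison on two real nightly builds would show which of the proposed COPY lines could actually stay cached.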

@bhack
Contributor Author

bhack commented Jun 4, 2024

If you have a local setup to build PyTorch images, you can easily analyze the conda layer with https://github.com/wagoodman/dive

@bhack
Contributor Author

bhack commented Jun 12, 2024

An average conda nightly upgrade is 1.36GB, so I think that invalidating a 5GB layer every day is a lot of overhead.

    package                        | build
    -------------------------------|-----------------------------
    certifi-2024.6.2               | pyhd8ed1ab_0                   157 KB  conda-forge
    pytorch-2.4.0.dev20240611      | py3.11_cuda12.4_cudnn9.1.0_0  1.34 GB  pytorch-nightly
    torchaudio-2.4.0.dev20240611   | py311_cu124                    6.4 MB  pytorch-nightly
    torchvision-0.19.0.dev20240611 | py311_cu124                    8.6 MB  pytorch-nightly
    ------------------------------------------------------------------------------------
                                                           Total:  1.36 GB
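
As a rough back-of-the-envelope estimate, the two figures quoted in this thread can be compared directly. Note that they come from different nightlies (the 4.81GB layer from the May 14 image, the 1.36GB upgrade from June 11), so the result is only indicative; the variable names are illustrative.

```python
conda_layer_gb = 4.81    # size of the COPY /opt/conda layer in the docker history
nightly_delta_gb = 1.36  # total size of the conda nightly upgrade

# Data transferred and stored beyond what actually changed.
redundant_gb = conda_layer_gb - nightly_delta_gb
overhead_ratio = conda_layer_gb / nightly_delta_gb

print(f"redundant data per nightly: {redundant_gb:.2f} GB")
print(f"the layer is {overhead_ratio:.1f}x the size of the actual update")
print(f"redundant data over 30 nightlies: {30 * redundant_gb:.1f} GB")
```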

@atalman atalman removed their assignment Jun 17, 2024