This repository has been archived by the owner on Jan 22, 2024. It is now read-only.

Fedora installation procedure #706

Closed
escorciav opened this issue Apr 18, 2018 · 32 comments

Comments

@escorciav

Given that there is no direct support for Fedora, can we put together a precise guide here? The idea is to make nvidia-docker work on Fedora.

I volunteer to try out previous approaches on my system, Fedora 27. It was installed very recently and is almost brand new.

@escorciav
Author

escorciav commented Apr 18, 2018

I'm tackling nvidia-docker2. Things that didn't work for me:

  • Using the repositories from CentOS and Amazon.
    Following the instructions for CentOS and Amazon did not work for me (the error is here; see also the sketch just after this list).
    Summary: Failed to synchronize cache for repo X, for all three required repos.

  • Compiling the two or three components by myself.

  • Searching for how to do this.
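For context, the CentOS-style setup that produces the error below is roughly the following (a sketch of the standard CentOS 7 repo instructions; using the centos7 path on a Fedora host is exactly the assumption that seems to break):

# add the CentOS 7 repo definition on a Fedora host (this is the step whose repos then fail to sync)
curl -s -L https://nvidia.github.io/nvidia-docker/centos7/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
sudo dnf install nvidia-docker2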

@rbavery

rbavery commented Apr 18, 2018

+1 Fedora 27 user similarly stuck and looking for instructions

@flx42
Member

flx42 commented Apr 18, 2018

@escorciav Thanks for volunteering.
By the way, pastebin is blocked on our corporate network; can you paste the error here, or provide an attachment?

@escorciav
Author

escorciav commented Apr 18, 2018

@flx42, here is the error on Fedora 27 when using the repo from CentOS 7:

$ dnf install nvidia-docker
Failed to synchronize cache for repo 'libnvidia-container', disabling.
Failed to synchronize cache for repo 'nvidia-container-runtime', disabling.
Failed to synchronize cache for repo 'nvidia-docker', disabling.
Last metadata expiration check: 1:08:53 ago on Wed 18 Apr 2018 03:45:34 PM +03.
No match for argument: nvidia-docker
Error: Unable to find a match

The .repo file is:

$ cat /etc/yum.repos.d/nvidia-docker.repo
[libnvidia-container]
name=libnvidia-container
baseurl=https://nvidia.github.io/libnvidia-container/centos7/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=1
gpgkey=https://nvidia.github.io/libnvidia-container/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt

[nvidia-container-runtime]
name=nvidia-container-runtime
baseurl=https://nvidia.github.io/nvidia-container-runtime/centos7/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=1
gpgkey=https://nvidia.github.io/nvidia-container-runtime/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt

[nvidia-docker]
name=nvidia-docker
baseurl=https://nvidia.github.io/nvidia-docker/centos7/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=1
gpgkey=https://nvidia.github.io/nvidia-docker/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt

@escorciav
Author

escorciav commented Apr 18, 2018

Hi @flx42,
I took a monkey-typing approach and built nvidia-docker and nvidia-container-runtime via make. Apparently, everything ran without problems, and I ended up with the following images (output below).

  • Can I copy the binaries that I need out of those images?
  • Which files must I copy?
# docker images            
REPOSITORY              TAG                   IMAGE ID            CREATED              SIZE
nvidia-docker2          18.03.0.ce-fedora27   679a30fc3930        About a minute ago   473MB
nvidia/runtime/fedora   27-docker1.12.6       1aa46723e854        16 minutes ago       1.73GB
nvidia/runtime/fedora   27-docker1.13.1       2b1a29593f49        16 minutes ago       1.73GB
nvidia/runtime/fedora   27-docker17.03.2      f933689823b9        16 minutes ago       1.73GB
nvidia/runtime/fedora   27-docker17.06.2      0d3b5e338b44        17 minutes ago       1.74GB
nvidia/runtime/fedora   27-docker17.09.0      4d034d0a9dcb        17 minutes ago       1.74GB
nvidia/runtime/fedora   27-docker17.09.1      50f3191ebdc0        18 minutes ago       1.74GB
nvidia/runtime/fedora   27-docker17.12.0      7e1330b2307a        18 minutes ago       1.73GB
nvidia/runtime/fedora   27-docker17.12.1      95c7427f9a50        19 minutes ago       1.73GB
nvidia/runtime/fedora   27-docker18.03.0      fdb954f2ee40        19 minutes ago       1.73GB
nvidia/hook/fedora      27                    aeeaba24d9af        35 minutes ago       837MB
nvidia/base/fedora      27                    514c0326c663        38 minutes ago       835MB
fedora                  27                    9110ae7f579f        6 weeks ago          235MB
# docker --version 
Docker version 18.03.0-ce, build 0520e24

Update (April 20, after using the solution below)
Apparently, after you build with make, a new folder called dist containing the rpms appears 😆.
I guess those .rpm files may work as well (see the sketch just below).
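In case it helps, installing those locally built packages should just be a matter of pointing rpm at the dist output (untested sketch; the exact dist/ layout is an assumption, so adjust to whatever make produced on your machine):

# hypothetical: install every rpm produced by the make build
sudo rpm -ivh $(find dist -name '*.rpm')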

@flx42
Member

flx42 commented Apr 18, 2018

Did you also try running rpm -i directly on the packages we provide for CentOS 7?

@escorciav
Author

escorciav commented Apr 19, 2018

Where are those files?
In the case of nvidia-docker, you only provide an rpm file for nvidia-docker 1.0.

@flx42
Member

flx42 commented Apr 19, 2018

Look at what's suggested here: #635 (comment)

@escorciav
Author

escorciav commented Apr 19, 2018

Update: Oct 24, 2018

Please follow the strategy suggested here for Fedora 26; it may also work on newer versions.

Original message

Apparently, it works. Thanks!

The alternative that worked in my case was:

  1. Clone the repos as follows (executed as root)
LOCALDIR=/var/lib/nvidia-docker-repo
mkdir -p $LOCALDIR && cd $LOCALDIR
git clone -b gh-pages https://github.com/NVIDIA/libnvidia-container.git
git clone -b gh-pages https://github.com/NVIDIA/nvidia-container-runtime.git
git clone -b gh-pages https://github.com/NVIDIA/nvidia-docker.git
  2. Install the rpm files manually
    Note: do NOT copy-paste this if your docker version is not 18.03.0.ce; edit the last two lines accordingly.
rpm --import $LOCALDIR/nvidia-docker/gpgkey
rpm -i libnvidia-container/centos7/x86_64/libnvidia-container1-1.0.0-0.1.beta.1.x86_64.rpm
rpm -i libnvidia-container/centos7/x86_64/libnvidia-container-tools-1.0.0-0.1.beta.1.x86_64.rpm
rpm -i nvidia-container-runtime/centos7/x86_64/nvidia-container-runtime-hook-1.3.0-1.x86_64.rpm
rpm -i nvidia-container-runtime/centos7/x86_64/nvidia-container-runtime-2.0.0-1.docker18.03.0.x86_64.rpm
rpm -i nvidia-docker/centos7/x86_64/nvidia-docker2-2.0.3-1.docker18.03.0.ce.noarch.rpm

Notes:

  • According to the issue mentioned by @flx42, you can update the repos later by doing git pull.
  • I tried to set up the yum repo instead, but I kept receiving the repo-loading error; I guess I am not registering the .repo file properly (see the sketch after these notes for one possible way).
  • Tested by doing:
sudo pkill -SIGHUP dockerd
docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
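For whoever wants to retry the yum-repo route with the clones above, a local file:// repo definition might look like this (untested sketch; the baseurl assumes the $LOCALDIR layout used above, and you would need one such section per cloned repo):

# /etc/yum.repos.d/nvidia-docker-local.repo (hypothetical)
[nvidia-docker-local]
name=nvidia-docker-local
baseurl=file:///var/lib/nvidia-docker-repo/nvidia-docker/centos7/x86_64
repo_gpgcheck=0
gpgcheck=0
enabled=1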

@andys0975

@escorciav Thanks a lot! However, I encountered a weird error when running sudo docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused "process_linux.go:385: running prestart hook 1 caused \"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --compute --utility --require=cuda>=9.0 --pid=8702 /var/lib/docker/overlay2/f33c9f212b70e1069c28213f71d6a593c6a9e01eb2f4da9cfab15b0692578c6e/merged]\\nnvidia-container-cli: ldcache error: process /sbin/ldconfig failed with error code: 1\\n\""": unknown.

I think I have seccomp enabled:
cat /boot/config-$(uname -r) | grep -i seccomp
CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
CONFIG_SECCOMP_FILTER=y
CONFIG_SECCOMP=y

@escorciav
Author

Sorry, I was attending to an important issue.

I forgot to mention the version of docker that I used. Also, note that I installed (rpm -i []) the packages that match my docker version (see the example below).

Other than that, I don't know how to help you.
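For example, to figure out which package variant matches your setup (plain docker/rpm commands, nothing specific to this project):

docker --version                                       # pick the rpms whose suffix matches this, e.g. docker18.03.0
rpm -qa | grep -i -E 'nvidia-(docker|container)|libnvidia-container'   # what is already installed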

@pawelmarkowski

pawelmarkowski commented Jul 13, 2018

@andys0975 try updating the packages; I had the same issue.

(Updated version of @escorciav's manual.)

Clone the repos as follows (executed as root)

LOCALDIR=/var/lib/nvidia-docker-repo
mkdir -p $LOCALDIR && cd $LOCALDIR
git clone -b gh-pages https://github.com/NVIDIA/libnvidia-container.git
git clone -b gh-pages https://github.com/NVIDIA/nvidia-container-runtime.git
git clone -b gh-pages https://github.com/NVIDIA/nvidia-docker.git

Install rpm files manually
Note: do NOT copy-paste this if your docker version is not 18.03.1.ce. Check ALL(!) packages listed below, especially if you encounter the problem mentioned by @andys0975.

rpm -i libnvidia-container/centos7/x86_64/libnvidia-container1-1.0.0-0.1.rc.2.x86_64.rpm 
rpm -i libnvidia-container/centos7/x86_64/libnvidia-container-tools-1.0.0-0.1.rc.2.x86_64.rpm 
rpm -i nvidia-container-runtime/centos7/x86_64/nvidia-container-runtime-hook-1.4.0-1.x86_64.rpm 
rpm -i nvidia-container-runtime/centos7/x86_64/nvidia-container-runtime-2.0.0-1.docker18.03.1.x86_64.rpm 
rpm -i nvidia-docker/centos7/x86_64/nvidia-docker2-2.0.3-1.docker18.03.1.ce.noarch.rpm 
docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi

@rickyzhang82

I didn't install the nvidia-docker2 package, because I use devicemapper with direct-lvm.

So I modified /etc/docker/daemon.json manually. It works quite well on 4.17.12-200.fc28.x86_64. I confirmed that PyTorch from the NVIDIA cloud registry works, and I believe the installation and configuration work for the rest.

docker-ce-18.06.0.ce-3.el7.x86_64
nvidia-container-runtime-hook-1.4.0-1.x86_64
libnvidia-container-tools-1.0.0-0.1.rc.2.x86_64
nvidia-container-runtime-2.0.0-1.docker18.06.0.x86_64
libnvidia-container1-1.0.0-0.1.rc.2.x86_64

/etc/docker/daemon.json

{
    "storage-driver": "devicemapper",
    "storage-opts": [
    "dm.thinpooldev=/dev/mapper/docker-thinpool",
    "dm.basesize=100G",
    "dm.use_deferred_removal=true",
    "dm.use_deferred_deletion=true"
    ],    
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
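After changing /etc/docker/daemon.json, the daemon still needs to reload its configuration before the nvidia runtime is usable (earlier in the thread this was done with pkill -SIGHUP dockerd); a systemctl-based sketch plus the usual test:

sudo systemctl restart docker
docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi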

@escorciav
Author

Quick update: as @OleRoel mentioned here, I had made a silly mistake; his procedure works like a charm on a Fedora 26 machine. I highly recommend trying this approach first, as it's much easier.

@jamesdbrock

I did the procedure from #553 (comment)
and it succeeded. Thx @escorciav

  • Fedora 29
  • docker-ce version 18.09.3, build 774a1f4

@rickycorte

For people searching: the procedure in #553 (comment) works even on Fedora 34, with just a few more steps.

First, make sure that you have installed both the NVIDIA drivers and CUDA on your host system (install them from RPM Fusion).

After executing the commands in the linked comment, you have to edit the /etc/nvidia-container-runtime/config.toml config.
Make sure it contains this line: no-cgroups = true (by default it should be commented out and set to false; see the excerpt below).
Restart docker with systemctl.
Now you should be able to run your GPU containers in privileged mode (--privileged flag).

Leaving out privileged mode will likely lead to an "Unknown error", logs complaining about missing libraries, and a non-working container.
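The config.toml edit described above would look roughly like this (excerpt only; the section name and the commented-out default are reproduced from memory, so double-check your own file):

# /etc/nvidia-container-runtime/config.toml (excerpt)
[nvidia-container-cli]
# the default ships commented out as "#no-cgroups = false"; change it to:
no-cgroups = true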

@jamesdbrock

Here's what I just did, based on @rickycorte's instructions and #553 (comment), to get nvidia-docker working on Fedora 34:

sudo dnf remove docker \
                  docker-client \
                  docker-client-latest \
                  docker-common \
                  docker-latest \
                  docker-latest-logrotate \
                  docker-logrotate \
                  docker-selinux \
                  docker-engine-selinux \
                  docker-engine

Use the centos8 repo instead of centos7:

curl -s -L https://nvidia.github.io/nvidia-docker/centos8/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo

sudo dnf install nvidia-docker2

edit /etc/nvidia-container-runtime/config.toml: no-cgroups = true

sudo systemctl start docker

docker run --privileged --runtime=nvidia --rm nvidia/cuda:11.3.0-devel-ubuntu18.04 nvidia-smi
Tue Jun  1 05:17:46 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.31       Driver Version: 465.31       CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+

@raffaem

raffaem commented Sep 10, 2022

For Fedora Workstation 36:

  1. Run sudo dnf remove moby-engine
  2. Install Docker Engine following these instructions
  3. Follow jamesdbrock's instructions

@jamesdbrock

What I did for Fedora Workstation 36:

Uninstalled and reinstalled the NVIDIA driver through GNOME Software, and it worked.
https://www.reddit.com/r/Fedora/comments/unfbel/comment/i89qnwp/

then

sudo dnf install xorg-x11-drv-nvidia-cuda

@JohanAR

JohanAR commented Sep 13, 2022

Since instructions are spread all over the place, here's all the commands I ran on Fedora 36:

# Uninstall old docker engine
sudo dnf remove moby-engine

# Get latest docker engine
# https://docs.docker.com/engine/install/fedora/#install-using-the-repository
sudo dnf config-manager --add-repo https://download.docker.com/linux/fedora/docker-ce.repo
sudo dnf install docker-ce docker-ce-cli containerd.io docker-compose-plugin

# Get nvidia container toolkit, using the centos8 repo
# https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker
curl -s -L https://nvidia.github.io/libnvidia-container/centos8/libnvidia-container.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo dnf install nvidia-docker2

# Restart docker daemon and verify that it is working
sudo systemctl restart docker.service
docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi

Don't copy-paste commands into your terminal blindly, especially not with sudo involved! Double-check that all URLs point to the correct servers, or even better, copy them from the official instructions instead of trusting strangers on GitHub.

@elezar
Member

elezar commented Sep 14, 2022

@JohanAR just a note: You should be able to use moby-engine on Fedora as long as you:

  1. Install the nvidia-container-toolkit package and not nvidia-docker2
  2. Configure your /etc/docker/daemon.json file to include the nvidia runtime and then restart the docker service:
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

Note that when using docker run --gpus all, even this is not required, but it is recommended that the runtime be specified explicitly:

docker run --rm --gpus all --runtime nvidia nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
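Put together, the moby-engine route described above is roughly the following (sketch only; the toolkit package comes from the libnvidia-container repo added earlier in the thread):

sudo dnf install moby-engine nvidia-container-toolkit
# add the "nvidia" runtime entry shown above to /etc/docker/daemon.json, then:
sudo systemctl restart docker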

@JohanAR

JohanAR commented Sep 16, 2022

@elezar using GPU in docker suddenly stopped working yesterday after updating packages (included both some stuff from official docker repos and nvidia-firmwares from fedora repos, but I don't know exactly what caused it). So I thought I'd try uninstalling from docker-ce repo and try using moby-engine instead.

Now I'm getting Failed to initialize NVML: Insufficient Permissions though, due to some SELinux issue. I tried reinstalling container-selinux, but that didn't help either.

It seems to happen only when I try to run nvidia-smi in the nvidia/cuda container. I can run my Stable Diffusion webui just fine, but I don't know whether that has anything to do with the image having been created yesterday, before these problems started.

@elezar
Member

elezar commented Sep 16, 2022

@JohanAR please create a new ticket against https://github.com/NVIDIA/nvidia-container-toolkit with details of your setup (including installed versions of the *nvidia-container* packages) and the behaviour that you are seeing.

@PriamX

PriamX commented Sep 18, 2022

@elezar using GPU in docker suddenly stopped working yesterday after updating packages (included both some stuff from official docker repos and nvidia-firmwares from fedora repos, but I don't know exactly what caused it). So I thought I'd try uninstalling from docker-ce repo and try using moby-engine instead.

Had the same issue after an update, pretty sure it came from the official Fedora repo, but I wouldn't know which package.

As a workaround, I set the default runtime to nvidia in /etc/docker/daemon.json and commented out the "runtime" argument in my docker-compose files. That seems to do it for now, but I would like to get back to explicitly declaring the runtime per service (a sketch of the daemon.json change is below).
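For anyone trying the same workaround, setting the default runtime is a small addition to /etc/docker/daemon.json (sketch; merge it with whatever else is already in your file):

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}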

@PriamX

PriamX commented Sep 18, 2022

Had the same issue after an update, pretty sure it came from the official Fedora repo, but I wouldn't know which package.

As a workaround, I set the default runtime to nvidia in /etc/docker/daemon.json and commented out the "runtime" argument in my docker-compose files. That seems to do it for now, but I would like to get back to explicitly declaring the runtime per service.

Oh, nope, never mind, setting the default runtime to nvidia did not work. It seemed to work after systemctl restart docker, but after a reboot, all I get now is an error for any container I try to start, even a simple one:

[root@mediaserv yaml-test]# docker run hello-world
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: nvidia-container-runtime did not terminate successfully: exit status 1: unknown.
ERRO[0000] error waiting for container: context canceled
[root@mediaserv yaml-test]#

From what I recall, I'd used essentially the same setup that @JohanAR described 5 days ago (above); however, I didn't have moby installed beforehand. I've been operating that way since moving from Fedora 34 to 35, about 6 months ago.

@PriamX

PriamX commented Sep 18, 2022

Aha! Found these in my dnf history from 4 days ago:

    Upgrade  libnvidia-container-devel-1.11.0-1.x86_64            @libnvidia-container
    Upgraded libnvidia-container-devel-1.10.0-1.x86_64            @@System
    Upgrade  libnvidia-container-static-1.11.0-1.x86_64           @libnvidia-container
    Upgraded libnvidia-container-static-1.10.0-1.x86_64           @@System
    Upgrade  libnvidia-container-tools-1.11.0-1.x86_64            @libnvidia-container
    Upgraded libnvidia-container-tools-1.10.0-1.x86_64            @@System
    Upgrade  libnvidia-container1-1.11.0-1.x86_64                 @libnvidia-container
    Upgraded libnvidia-container1-1.10.0-1.x86_64                 @@System
    Upgrade  libnvidia-container1-debuginfo-1.11.0-1.x86_64       @libnvidia-container
    Upgraded libnvidia-container1-debuginfo-1.10.0-1.x86_64       @@System
    Upgrade  nvidia-container-toolkit-1.11.0-1.x86_64             @libnvidia-container
    Upgraded nvidia-container-toolkit-1.10.0-1.x86_64             @@System

I removed the 1.11 version and nvidia-docker2, installed the 1.10 version, reinstalled nvidia-docker2. And it works as it did before now.
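For reference, one way to perform that rollback with dnf (a sketch; it assumes the 1.10.0 packages are still available in the configured repo, otherwise pass explicit versions to dnf install):

sudo dnf remove nvidia-docker2
sudo dnf downgrade 'libnvidia-container*' nvidia-container-toolkit
sudo dnf install nvidia-docker2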

@elezar
Member

elezar commented Sep 19, 2022

@PriamX it seems as if there may be a regression in our 1.11.0 packages -- although we didn't see this behaviour in our testing.

Would you be able to reproduce the failures with debug logging enabled (uncomment the #debug = lines in /etc/nvidia-container-runtime/config.toml) and provide the /var/log/nvidia-container-runtime.log file? (Ideally as an issue under https://github.com/NVIDIA/nvidia-container-toolkit.)
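The lines in question look roughly like this in the default config (excerpt from memory; the exact path strings may differ between versions):

# /etc/nvidia-container-runtime/config.toml (excerpt)
[nvidia-container-runtime]
# uncomment to enable debug logging:
debug = "/var/log/nvidia-container-runtime.log"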

Update: I see that you have already created NVIDIA/nvidia-container-toolkit#34; let's continue the discussion there.

@PriamX

PriamX commented Sep 19, 2022

@elezar I did open issue #34 not long after I posted here. Saw you posted there. I'll move over to that conversation. Thanks!

@airtonix

airtonix commented Feb 25, 2023

(Quoting @JohanAR's Fedora 36 instructions from above.)

@elezar @JohanAR Does this work on Fedora 37?

@JohanAR

JohanAR commented Feb 25, 2023

@airtonix No idea, since I'm still using Fedora 36. However, everything started working again after a couple of months, though I don't know exactly which package version fixed it.

@elezar
Member

elezar commented Feb 27, 2023

@airtonix For recent rpm-based distributions, the first step is to install the centos8 packages. Also, our stack has changed quite a bit since the original post, and we no longer recommend that users install nvidia-docker2. Instead, our docs recommend (or should, if they have not yet been updated) installing the nvidia-container-toolkit package and using the nvidia-ctk runtime configure command to apply the necessary configuration changes to a container engine such as docker.

Running

sudo nvidia-ctk runtime configure --runtime docker --config /etc/docker/daemon.json

will update the config to include the nvidia runtime.

Restarting the docker daemon is still required for the change to take effect.

Note that there should be no technical reason for the stack not to work on newer Fedora distributions.
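Put together, the currently recommended sequence is roughly the following (sketch only; repo setup as earlier in the thread, and the CUDA image tag is just an example):

sudo dnf install nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime docker --config /etc/docker/daemon.json
sudo systemctl restart docker
docker run --rm --runtime nvidia --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi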

@jamesdbrock

jamesdbrock commented May 1, 2023

What I did for Fedora Workstation 38:

  1. Uninstalled and reinstalled the NVIDIA graphics driver through GNOME Software.
    https://www.reddit.com/r/Fedora/comments/unfbel/comment/i89qnwp/
  2. sudo dnf install xorg-x11-drv-nvidia-cuda
    

Test:

docker run --privileged --runtime=nvidia --rm nvidia/cuda:12.1.1-devel-ubuntu22.04 nvidia-smi
==========
== CUDA ==
==========

CUDA Version 12.1.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

Mon May  1 13:58:41 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.41.03              Driver Version: 530.41.03    CUDA Version: 12.1     |
