GPU not released on VM stop #30

Open
4 tasks done
itzsimpl opened this issue Aug 8, 2023 · 8 comments
Labels
Bug Confirmed to be a bug Incomplete Waiting on more information from reporter

Comments

itzsimpl commented Aug 8, 2023

Following lxc/lxc#4332 (comment) I'm opening the issue here.

Required information

  • Distribution: Ubuntu
  • Distribution version: 22.04
  • The output of
    • lxc-start --version
5.0.0~git2209-g5a7b9ce67
  • lxc-checkconfig
LXC version 5.0.0~git2209-g5a7b9ce67
Kernel configuration not found at /proc/config.gz; searching...
Kernel configuration found at /boot/config-6.2.0-26-generic
--- Namespaces ---
Namespaces: enabled
Utsname namespace: enabled
Ipc namespace: enabled
Pid namespace: enabled
User namespace: enabled
Network namespace: enabled

--- Control groups ---
Cgroups: enabled
Cgroup namespace: enabled

Cgroup v1 mount points: 


Cgroup v2 mount points: 
/sys/fs/cgroup

Cgroup v1 systemd controller: missing
Cgroup v1 freezer controller: missing
Cgroup ns_cgroup: required
Cgroup device: enabled
Cgroup sched: enabled
Cgroup cpu account: enabled
Cgroup memory controller: enabled
Cgroup cpuset: enabled

--- Misc ---
Veth pair device: enabled, not loaded
Macvlan: enabled, not loaded
Vlan: enabled, not loaded
Bridges: enabled, loaded
Advanced netfilter: enabled, loaded
CONFIG_IP_NF_TARGET_MASQUERADE: enabled, not loaded
CONFIG_IP6_NF_TARGET_MASQUERADE: enabled, not loaded
CONFIG_NETFILTER_XT_TARGET_CHECKSUM: enabled, not loaded
CONFIG_NETFILTER_XT_MATCH_COMMENT: enabled, not loaded
FUSE (for use with lxcfs): enabled, not loaded

--- Checkpoint/Restore ---
checkpoint restore: enabled
CONFIG_FHANDLE: enabled
CONFIG_EVENTFD: enabled
CONFIG_EPOLL: enabled
CONFIG_UNIX_DIAG: enabled
CONFIG_INET_DIAG: enabled
CONFIG_PACKET_DIAG: enabled
CONFIG_NETLINK_DIAG: enabled
File capabilities: 

Note : Before booting a new kernel, you can check its configuration
usage : CONFIG=/path/to/config /usr/bin/lxc-checkconfig
  • uname -a
Linux q1 6.2.0-26-generic #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Jul 13 16:27:29 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
  • cat /proc/self/cgroup
0::/user.slice/user-1000.slice/session-3.scope
  • cat /proc/1/mounts
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
udev /dev devtmpfs rw,nosuid,relatime,size=263874368k,nr_inodes=65968592,mode=755,inode64 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /run tmpfs rw,nosuid,nodev,noexec,relatime,size=52797032k,mode=755,inode64 0 0
efivarfs /sys/firmware/efi/efivars efivarfs rw,nosuid,nodev,noexec,relatime 0 0
/dev/nvme0n1p2 / ext4 rw,relatime,stripe=32 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev,inode64 0 0
tmpfs /run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k,inode64 0 0
cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime 0 0
pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
bpf /sys/fs/bpf bpf rw,nosuid,nodev,noexec,relatime,mode=700 0 0
systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=29,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=92734 0 0
hugetlbfs /dev/hugepages hugetlbfs rw,relatime,pagesize=2M 0 0
mqueue /dev/mqueue mqueue rw,nosuid,nodev,noexec,relatime 0 0
debugfs /sys/kernel/debug debugfs rw,nosuid,nodev,noexec,relatime 0 0
tracefs /sys/kernel/tracing tracefs rw,nosuid,nodev,noexec,relatime 0 0
fusectl /sys/fs/fuse/connections fusectl rw,nosuid,nodev,noexec,relatime 0 0
configfs /sys/kernel/config configfs rw,nosuid,nodev,noexec,relatime 0 0
ramfs /run/credentials/systemd-sysusers.service ramfs ro,nosuid,nodev,noexec,relatime,mode=700 0 0
/dev/nvme0n1p1 /boot/efi vfat rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro 0 0
/dev/loop0 /snap/core20/1852 squashfs ro,nodev,relatime,errors=continue,threads=single 0 0
/dev/loop1 /snap/core20/1974 squashfs ro,nodev,relatime,errors=continue,threads=single 0 0
/dev/loop2 /snap/core22/858 squashfs ro,nodev,relatime,errors=continue,threads=single 0 0
/dev/loop4 /snap/snapd/18596 squashfs ro,nodev,relatime,errors=continue,threads=single 0 0
/dev/loop3 /snap/lxd/25112 squashfs ro,nodev,relatime,errors=continue,threads=single 0 0
/dev/loop5 /snap/snapd/19457 squashfs ro,nodev,relatime,errors=continue,threads=single 0 0
binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /run/snapd/ns tmpfs rw,nosuid,nodev,noexec,relatime,size=52797032k,mode=755,inode64 0 0
nsfs /run/snapd/ns/lxd.mnt nsfs rw 0 0
tmpfs /var/snap/lxd/common/ns tmpfs rw,relatime,size=1024k,mode=700,inode64 0 0
nsfs /var/snap/lxd/common/ns/shmounts nsfs rw 0 0
nsfs /var/snap/lxd/common/ns/mntns nsfs rw 0 0
tmpfs /run/user/1000 tmpfs rw,nosuid,nodev,relatime,size=52797028k,nr_inodes=13199257,mode=700,uid=1000,gid=1000,inode64 0 0
lxcfs /var/lib/lxcfs fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0

Issue description

I have an Ubuntu 22.04 system with multiple GPUs, the NVIDIA 535 server drivers installed, and persistence mode off. I pass individual GPUs to a VM via PCI passthrough. When I pass a single GPU to the VM, start it and then stop it, the GPU is not returned to the host system (i.e. nvidia-smi no longer shows it). When I pass multiple GPUs to the VM, start it and then stop it, the GPU with the lowest PCI address on the host is not returned to the host system (i.e. nvidia-smi no longer shows it), but the other GPUs are returned just fine.

When I start the VMs again, the GPUs are visible inside the VM, but if I start a container with nvidia-driver passthrough, only the GPUs that are currently visible on the host (i.e. all installed minus those that were not returned from the VMs earlier) are visible in the container. The only information I can find is that syslog says "Failed to stop device".

Steps to reproduce

  1. run nvidia-smi -L on host
  2. create VM with single GPU via passthrough
  3. start VM
  4. stop VM
  5. run nvidia-smi -L on host (the GPU that was passed through to the VM will not be listed)
  6. create VM with multiple GPUs via passthrough
  7. start VM
  8. stop VM
  9. run nvidia-smi -L on host (of the GPUs that were passed through to the VM, the one with the lowest PCI address on the host will also not be listed)
  10. run container with nvidia-driver passthrough (same status as on the host)
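
To make steps 1, 5 and 9 easier to compare, the driver binding can also be read straight from sysfs rather than inferred from nvidia-smi alone. The following is only a minimal sketch and not part of the original report: `pci_driver_of` and the `SYSFS_ROOT` override are hypothetical names introduced here, and the addresses in the usage comment are the ones from the VM configuration in this report.

```shell
#!/bin/sh
# Hypothetical helper: print the kernel driver currently bound to a PCI device.
# SYSFS_ROOT is an assumption added so the function can be dry-run against a
# fake tree; it defaults to the real /sys.
pci_driver_of() {
    addr="$1"
    link="${SYSFS_ROOT:-/sys}/bus/pci/devices/${addr}/driver"
    if [ -L "$link" ]; then
        # The "driver" entry is a symlink into /sys/bus/pci/drivers/<name>.
        basename "$(readlink -f "$link")"
    else
        # No symlink means no driver is bound to the device right now.
        echo "none"
    fi
}

# Usage (run before starting and after stopping the VM):
#   for addr in 0000:17:00.0 0000:31:00.0; do
#       printf '%s -> %s\n' "$addr" "$(pci_driver_of "$addr")"
#   done
```

A device that is handed back correctly would be expected to move from vfio-pci while the VM runs back to the host driver once it stops; a device that stays on vfio-pci or reports none after the stop would match the symptom described here.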

Information to attach

  • VM log (lxc info --show-log vm2)
Name: vm2
Status: STOPPED
Type: virtual-machine
Architecture: x86_64
Created: 2023/08/07 22:14 UTC
Last Used: 2023/08/07 23:19 UTC

Log:

qemu-system-x86_64: Issue while setting TUNSETSTEERINGEBPF: Invalid argument with fd: 83, prog_fd: -1
  • any relevant kernel output (syslog), the single GPU case
Aug  7 23:18:53 q1 kernel: [  846.086912] vfio-pci 0000:ca:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=io+mem:owns=none
Aug  7 23:18:53 q1 snapd[2334]: udevmon.go:149: udev event error: Unable to parse uevent, err: cannot parse libudev event: invalid env data
Aug  7 23:18:54 q1 kernel: [  846.548326] xhci_hcd 0000:ca:00.2: remove, state 4
Aug  7 23:18:54 q1 kernel: [  846.548343] usb usb10: USB disconnect, device number 1
Aug  7 23:18:54 q1 kernel: [  846.549060] xhci_hcd 0000:ca:00.2: USB bus 10 deregistered
Aug  7 23:18:54 q1 kernel: [  846.549083] xhci_hcd 0000:ca:00.2: remove, state 4
Aug  7 23:18:54 q1 kernel: [  846.549091] usb usb9: USB disconnect, device number 1
Aug  7 23:18:54 q1 kernel: [  846.550896] xhci_hcd 0000:ca:00.2: USB bus 9 deregistered
Aug  7 23:18:54 q1 kernel: [  846.653021] kauditd_printk_skb: 9 callbacks suppressed
Aug  7 23:18:54 q1 kernel: [  846.653026] audit: type=1400 audit(1691450334.129:54): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxd-vm2_</var/snap/lxd/common/lxd>" pid=5316 comm="apparmor_parser"
Aug  7 23:18:53 q1 snapd[2334]: message repeated 3 times: [ udevmon.go:149: udev event error: Unable to parse uevent, err: cannot parse libudev event: invalid env data]
Aug  7 23:18:55 q1 systemd[3823]: Started snap.lxd.lxc.b9b13195-c7c3-46d4-842a-856565db2c99.scope.
Aug  7 23:19:13 q1 kernel: [  865.800363] vfio-pci 0000:ca:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258
Aug  7 23:19:13 q1 kernel: [  865.800386] vfio-pci 0000:ca:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
Aug  7 23:19:46 q1 systemd[3823]: Started snap.lxd.lxc.0a424bc8-95d2-4cb9-bdd0-468d3dbce737.scope.
Aug  7 23:19:51 q1 systemd[3823]: Started snap.lxd.lxc.63564057-7dd7-462c-9548-3a5153ddd1e7.scope.
Aug  7 23:19:51 q1 systemd[1]: Starting Cleanup of Temporary Directories...
Aug  7 23:19:51 q1 systemd[1]: systemd-tmpfiles-clean.service: Deactivated successfully.
Aug  7 23:19:51 q1 systemd[1]: Finished Cleanup of Temporary Directories.
Aug  7 23:19:54 q1 kernel: [  907.246377] vfio-pci 0000:ca:00.0: Relaying device request to user (#0)
Aug  7 23:20:01 q1 kernel: [  913.710624] vfio-pci 0000:ca:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
Aug  7 23:20:01 q1 kernel: [  913.711376] vfio-pci 0000:ca:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
Aug  7 23:20:01 q1 lxd.daemon[3076]: time="2023-08-07T23:20:01Z" level=error msg="Failed to stop device" device=gpu3 err="Failed probing device \"0000:ca:00.0\" via \"/sys/bus/pci/drivers_probe\": write /sys/bus/pci/drivers_probe: invalid argument" instance=vm2 instanceType=virtual-machine project=default
Aug  7 23:20:01 q1 systemd-networkd[2222]: mac6293c2ac: Link DOWN
Aug  7 23:20:01 q1 systemd-networkd[2222]: mac6293c2ac: Lost carrier
Aug  7 23:20:01 q1 kernel: [  913.898141] audit: type=1400 audit(1691450401.373:55): apparmor="STATUS" operation="profile_remove" profile="unconfined" name="lxd-vm2_</var/snap/lxd/common/lxd>" pid=10366 comm="apparmor_parser"
Aug  7 23:32:42 q1 systemd[3823]: Started snap.lxd.lxc.44a0582a-97eb-4f56-9149-a7b6f2afec5b.scope.
  • any relevant kernel output (syslog), two GPU case
Aug  7 23:45:38 q1 kernel: [ 2450.861745] vfio-pci 0000:17:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=io+mem:owns=none
Aug  7 23:45:38 q1 snapd[2334]: udevmon.go:149: udev event error: Unable to parse uevent, err: cannot parse libudev event: invalid env data
Aug  7 23:45:38 q1 kernel: [ 2451.339448] xhci_hcd 0000:17:00.2: remove, state 4
Aug  7 23:45:38 q1 kernel: [ 2451.339464] usb usb4: USB disconnect, device number 1
Aug  7 23:45:38 q1 kernel: [ 2451.340164] xhci_hcd 0000:17:00.2: USB bus 4 deregistered
Aug  7 23:45:38 q1 kernel: [ 2451.340188] xhci_hcd 0000:17:00.2: remove, state 4
Aug  7 23:45:38 q1 kernel: [ 2451.340197] usb usb3: USB disconnect, device number 1
Aug  7 23:45:38 q1 kernel: [ 2451.341944] xhci_hcd 0000:17:00.2: USB bus 3 deregistered
Aug  7 23:45:40 q1 kernel: [ 2453.384621] vfio-pci 0000:31:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=io+mem:owns=none
Aug  7 23:45:41 q1 kernel: [ 2453.867449] xhci_hcd 0000:31:00.2: remove, state 4
Aug  7 23:45:41 q1 kernel: [ 2453.867464] usb usb6: USB disconnect, device number 1
Aug  7 23:45:41 q1 kernel: [ 2453.868123] xhci_hcd 0000:31:00.2: USB bus 6 deregistered
Aug  7 23:45:41 q1 kernel: [ 2453.868144] xhci_hcd 0000:31:00.2: remove, state 4
Aug  7 23:45:41 q1 kernel: [ 2453.868151] usb usb5: USB disconnect, device number 1
Aug  7 23:45:41 q1 kernel: [ 2453.869683] xhci_hcd 0000:31:00.2: USB bus 5 deregistered
Aug  7 23:45:41 q1 kernel: [ 2453.966981] audit: type=1400 audit(1691451941.446:56): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lxd-vm2_</var/snap/lxd/common/lxd>" pid=11010 comm="apparmor_parser"
Aug  7 23:46:00 q1 kernel: [ 2472.883434] vfio-pci 0000:17:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258
Aug  7 23:46:00 q1 kernel: [ 2472.883457] vfio-pci 0000:17:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
Aug  7 23:46:00 q1 kernel: [ 2473.055433] vfio-pci 0000:31:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258
Aug  7 23:46:00 q1 kernel: [ 2473.055455] vfio-pci 0000:31:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
Aug  7 23:45:40 q1 snapd[2334]: message repeated 7 times: [ udevmon.go:149: udev event error: Unable to parse uevent, err: cannot parse libudev event: invalid env data]
Aug  7 23:47:16 q1 systemd[3823]: Started snap.lxd.lxc.c918eda7-03e8-4d84-9cb2-c9e1b4d6bfa2.scope.
Aug  7 23:49:01 q1 kernel: [ 2653.889634] vfio-pci 0000:31:00.0: Relaying device request to user (#0)
Aug  7 23:49:08 q1 kernel: [ 2660.602855] vfio-pci 0000:31:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
Aug  7 23:49:08 q1 kernel: [ 2660.603292] nvidia 0000:31:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
Aug  7 23:49:08 q1 kernel: [ 2660.690297] snd_hda_intel 0000:31:00.1: Disabling MSI
Aug  7 23:49:08 q1 kernel: [ 2660.690325] snd_hda_intel 0000:31:00.1: Handle vga_switcheroo audio client
Aug  7 23:49:08 q1 kernel: [ 2660.714786] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:30/0000:30:02.0/0000:31:00.1/sound/card0/input19
Aug  7 23:49:08 q1 kernel: [ 2660.714916] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:30/0000:30:02.0/0000:31:00.1/sound/card0/input20
Aug  7 23:49:08 q1 kernel: [ 2660.715088] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:30/0000:30:02.0/0000:31:00.1/sound/card0/input21
Aug  7 23:49:08 q1 kernel: [ 2660.715283] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:30/0000:30:02.0/0000:31:00.1/sound/card0/input22
Aug  7 23:49:08 q1 snapd[2334]: udevmon.go:149: udev event error: Unable to parse uevent, err: cannot parse libudev event: invalid env data
Aug  7 23:49:08 q1 kernel: [ 2660.726602] xhci_hcd 0000:31:00.2: xHCI Host Controller
Aug  7 23:49:08 q1 kernel: [ 2660.726615] xhci_hcd 0000:31:00.2: new USB bus registered, assigned bus number 3
Aug  7 23:49:08 q1 kernel: [ 2660.727221] xhci_hcd 0000:31:00.2: hcc params 0x0180ff05 hci version 0x110 quirks 0x0000000000000010
Aug  7 23:49:08 q1 kernel: [ 2660.727606] xhci_hcd 0000:31:00.2: xHCI Host Controller
Aug  7 23:49:08 q1 kernel: [ 2660.727610] xhci_hcd 0000:31:00.2: new USB bus registered, assigned bus number 4
Aug  7 23:49:08 q1 kernel: [ 2660.727613] xhci_hcd 0000:31:00.2: Host supports USB 3.1 Enhanced SuperSpeed
Aug  7 23:49:08 q1 kernel: [ 2660.727661] usb usb3: New USB device found, idVendor=1d6b, idProduct=0002, bcdDevice= 6.02
Aug  7 23:49:08 q1 kernel: [ 2660.727664] usb usb3: New USB device strings: Mfr=3, Product=2, SerialNumber=1
Aug  7 23:49:08 q1 kernel: [ 2660.727666] usb usb3: Product: xHCI Host Controller
Aug  7 23:49:08 q1 kernel: [ 2660.727668] usb usb3: Manufacturer: Linux 6.2.0-26-generic xhci-hcd
Aug  7 23:49:08 q1 kernel: [ 2660.727669] usb usb3: SerialNumber: 0000:31:00.2
Aug  7 23:49:08 q1 kernel: [ 2660.727830] hub 3-0:1.0: USB hub found
Aug  7 23:49:08 q1 kernel: [ 2660.727837] hub 3-0:1.0: 2 ports detected
Aug  7 23:49:08 q1 kernel: [ 2660.727975] usb usb4: We don't know the algorithms for LPM for this host, disabling LPM.
Aug  7 23:49:08 q1 kernel: [ 2660.727993] usb usb4: New USB device found, idVendor=1d6b, idProduct=0003, bcdDevice= 6.02
Aug  7 23:49:08 q1 kernel: [ 2660.727995] usb usb4: New USB device strings: Mfr=3, Product=2, SerialNumber=1
Aug  7 23:49:08 q1 kernel: [ 2660.727997] usb usb4: Product: xHCI Host Controller
Aug  7 23:49:08 q1 kernel: [ 2660.727999] usb usb4: Manufacturer: Linux 6.2.0-26-generic xhci-hcd
Aug  7 23:49:08 q1 kernel: [ 2660.728000] usb usb4: SerialNumber: 0000:31:00.2
Aug  7 23:49:08 q1 kernel: [ 2660.728175] hub 4-0:1.0: USB hub found
Aug  7 23:49:08 q1 kernel: [ 2660.728184] hub 4-0:1.0: 4 ports detected
Aug  7 23:49:08 q1 snapd[2334]: message repeated 3 times: [ udevmon.go:149: udev event error: Unable to parse uevent, err: cannot parse libudev event: invalid env data]
Aug  7 23:49:08 q1 systemd[3823]: Reached target Sound Card.
Aug  7 23:49:08 q1 kernel: [ 2660.807453] vfio-pci 0000:17:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
Aug  7 23:49:08 q1 kernel: [ 2660.807674] vfio-pci 0000:17:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
Aug  7 23:49:08 q1 lxd.daemon[3076]: time="2023-08-07T23:49:08Z" level=error msg="Failed to stop device" device=gpu0 err="Failed probing device \"0000:17:00.0\" via \"/sys/bus/pci/drivers_probe\": write /sys/bus/pci/drivers_probe: invalid argument" instance=vm2 instanceType=virtual-machine project=default
Aug  7 23:49:08 q1 systemd-networkd[2222]: mac43379c64: Link DOWN
Aug  7 23:49:08 q1 systemd-networkd[2222]: mac43379c64: Lost carrier
Aug  7 23:49:08 q1 kernel: [ 2661.011901] audit: type=1400 audit(1691452148.495:57): apparmor="STATUS" operation="profile_remove" profile="unconfined" name="lxd-vm2_</var/snap/lxd/common/lxd>" pid=13584 comm="apparmor_parser"
  • the VM configuration file
architecture: x86_64
config:
  agent.nic_config: "true"
  cloud-init.network-config: |
    version: 1
    config:
      - type: physical
        name: eth0
        subnets:
          - type: static
            ipv4: true
            address: 10.10.10.10/25
            gateway: 10.10.10.1
            control: auto
      - type: nameserver
        address:
          - 1.1.1.1
          - 1.0.0.1
  cloud-init.user-data: |
    #cloud-config
    ssh_import_id: [gh:itzsimpl]
  image.architecture: amd64
  image.description: ubuntu 22.04 LTS amd64 (release) (20230729)
  image.label: release
  image.os: ubuntu
  image.release: jammy
  image.serial: "20230729"
  image.type: disk-kvm.img
  image.version: "22.04"
  limits.cpu: "20"
  limits.memory: 64GiB
  security.secureboot: "false"
  volatile.base_image: c3a32ce371819c4fb845867e8e602ad6a636e211cfaeca448e767de4b415c038
  volatile.cloud-init.instance-id: f6fa9720-3024-4574-bbd7-e29a10e14ca0
  volatile.eth0.hwaddr: 00:16:3e:73:46:f3
  volatile.last_state.power: STOPPED
  volatile.last_state.ready: "false"
  volatile.uuid: 114bc8ad-0afb-4732-9911-f2583a3330c4
  volatile.uuid.generation: 114bc8ad-0afb-4732-9911-f2583a3330c4
  volatile.vsock_id: "1262936222"
devices:
  eth0:
    name: eth0
    nictype: macvlan
    parent: ens97f0np0
    type: nic
  gpu0:
    gputype: physical
    pci: "0000:17:00.0"
    type: gpu
  gpu1:
    gputype: physical
    pci: "0000:31:00.0"
    type: gpu
  root:
    path: /
    pool: default
    size: 128GB
    type: disk
ephemeral: false
profiles:
- default
- pub-macvlan
- gpu0
- gpu1
stateful: false
description: vm2
@itzsimpl itzsimpl changed the title LXD fails to release GPU on VM stop LXD/LXC fails to release GPU on VM stop Aug 8, 2023
@adamcstephens
Contributor

I think you want to post this at https://github.com/canonical/lxd instead.

@stgraber
Member

I'm happy to still keep this one open as Incus is very likely to have this exact same issue given where we're at with the fork.
But indeed if you're looking for reasonably quick resolution and for that fix to be available in LXD, you're better off reporting the issue against LXD.

@itzsimpl
Author

Just to let you know, I've also opened the issue on canonical/lxd, where there is a little more information (additional tests that I made on different GPUs and with vGPU drivers); see canonical/lxd#12128.

@stgraber stgraber changed the title LXD/LXC fails to release GPU on VM stop GPU not released on VM stop Aug 30, 2023
@stgraber stgraber added the Bug Confirmed to be a bug label Nov 29, 2023
@stgraber stgraber added this to the incus-0.6 milestone Feb 11, 2024
@stgraber
Member

Going to poke at that one tomorrow. Sadly the only system I have with multiple NVIDIA GPUs is a box where I have no intention to ever install the binary NVIDIA driver :)

But I do have our other test system which has a single NVIDIA GPU and where I don't mind installing the NVIDIA drivers on the host, so I'm hoping I can reproduce what you're seeing on that one.

@stgraber stgraber removed this from the incus-0.6 milestone Feb 22, 2024
@stgraber stgraber added the Incomplete Waiting on more information from reporter label Feb 22, 2024
@stgraber
Member

I'm unable to reproduce the described issue with current Incus:

root@argos:~# nvidia-smi
Thu Feb 22 15:41:16 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB          Off | 00000000:07:00.0 Off |                  Off |
| N/A   93C    P0              67W / 250W |      0MiB / 40960MiB |     41%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
root@argos:~# incus config show v1
architecture: x86_64
config:
  image.architecture: amd64
  image.description: Ubuntu jammy amd64 (20240221_07:42)
  image.os: Ubuntu
  image.release: jammy
  image.serial: "20240221_07:42"
  image.type: disk-kvm.img
  image.variant: default
  limits.cpu: "8"
  limits.memory: 8GiB
  volatile.base_image: 22ab00c001e2a464dabf7c813bb448797900ca922bd96a8104a8089584c07e95
  volatile.cloud-init.instance-id: 77946807-0039-4423-b30f-2cba99b265a9
  volatile.eth0.hwaddr: 00:16:3e:30:f2:10
  volatile.last_state.power: STOPPED
  volatile.last_state.ready: "false"
  volatile.uuid: f27532d7-eadb-4487-9d07-15dcd1dde1ce
  volatile.uuid.generation: f27532d7-eadb-4487-9d07-15dcd1dde1ce
  volatile.vsock_id: "1338125073"
devices:
  gpu:
    gputype: physical
    pci: "07:00.0"
    type: gpu
ephemeral: false
profiles:
- default
stateful: false
description: ""
root@argos:~# incus start v1
root@argos:~# readlink -f /sys/bus/pci/devices/0000\:07\:00.0/driver
/sys/bus/pci/drivers/vfio-pci
root@argos:~# incus exec v1 bash
Error: VM agent isn't currently running
root@argos:~# incus exec v1 bash
root@v1:~# apt install pciutils
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  libpci3 pci.ids
Suggested packages:
  bzip2 wget | curl | lynx-cur
The following NEW packages will be installed:
  libpci3 pci.ids pciutils
0 upgraded, 3 newly installed, 0 to remove and 0 not upgraded.
Need to get 343 kB of archives.
After this operation, 1581 kB of additional disk space will be used.
Do you want to continue? [Y/n] 
Get:1 http://archive.ubuntu.com/ubuntu jammy/main amd64 pci.ids all 0.0~2022.01.22-1 [251 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/main amd64 libpci3 amd64 1:3.7.0-6 [28.9 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy/main amd64 pciutils amd64 1:3.7.0-6 [63.6 kB]
Fetched 343 kB in 0s (841 kB/s)    
Selecting previously unselected package pci.ids.
(Reading database ... 18356 files and directories currently installed.)
Preparing to unpack .../pci.ids_0.0~2022.01.22-1_all.deb ...
Unpacking pci.ids (0.0~2022.01.22-1) ...
Selecting previously unselected package libpci3:amd64.
Preparing to unpack .../libpci3_1%3a3.7.0-6_amd64.deb ...
Unpacking libpci3:amd64 (1:3.7.0-6) ...
Selecting previously unselected package pciutils.
Preparing to unpack .../pciutils_1%3a3.7.0-6_amd64.deb ...
Unpacking pciutils (1:3.7.0-6) ...
Setting up pci.ids (0.0~2022.01.22-1) ...
Setting up libpci3:amd64 (1:3.7.0-6) ...
Setting up pciutils (1:3.7.0-6) ...
Processing triggers for libc-bin (2.35-0ubuntu3.6) ...
root@v1:~# lspci -nnn | grep -i nvidia
06:00.0 3D controller [0302]: NVIDIA Corporation GA100 [A100 PCIe 40GB] [10de:20f1] (rev a1)
root@v1:~# 
exit
root@argos:~# incus stop v1
root@argos:~# readlink -f /sys/bus/pci/devices/0000\:07\:00.0/driver
/sys/bus/pci/drivers/nvidia
root@argos:~# nvidia-smi
Thu Feb 22 15:42:28 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB          Off | 00000000:07:00.0 Off |                  Off |
| N/A   94C    P0              68W / 250W |      0MiB / 40960MiB |     48%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
root@argos:~# 

@itzsimpl can you confirm that this is still an issue on current Incus?
If so, I may need to find a system with a similar set of GPUs and drivers as you're running since our test system has no such problems.

@itzsimpl
Author

@stgraber thank you for starting to look into this. Unfortunately, I do not have a system with the same setup available at the moment. Based on the experiments from when we first saw the issue, it may be limited to "older" and "non-datacenter" GPUs, as these load/unload more devices (e.g. Quadro RTX 6000 in our case, see canonical/lxd#12128 (comment)).

FWIW, we also noticed issues with unloading of vGPU drivers. The only workaround that we managed to set up was to remove the devices and rescan the PCI bus once the VM shuts down, but that does not work with vGPU drivers (see canonical/lxd#12128 (comment)).
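
For readers who want to try the remove-and-rescan workaround mentioned above, a minimal sketch follows. It is not the exact script we used: `pci_remove_and_rescan` and the `SYSFS_ROOT` override are hypothetical names introduced for illustration, and which PCI functions of the card need to be removed is an assumption to verify on the affected host. The sysfs `remove` and `rescan` files are standard kernel interfaces and need root on a real system.

```shell
#!/bin/sh
# Sketch of a "remove and rescan" step to run after the VM has stopped.
# SYSFS_ROOT is a hypothetical override for dry runs; it defaults to /sys.
pci_remove_and_rescan() {
    root="${SYSFS_ROOT:-/sys}"
    for addr in "$@"; do
        # Writing 1 to "remove" detaches this device from the PCI tree.
        echo 1 > "${root}/bus/pci/devices/${addr}/remove" || return 1
    done
    # Writing 1 to "rescan" asks the kernel to re-enumerate the PCI buses,
    # which gives the host drivers a fresh chance to bind.
    echo 1 > "${root}/bus/pci/rescan"
}

# Usage (as root; address taken from the single-GPU log in this issue):
#   pci_remove_and_rescan 0000:ca:00.0
```

As noted above, this did not help with vGPU drivers, so it should be treated as a stopgap for the plain passthrough case only.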

@stgraber
Member

Okay, so we're going to need to get access to a system with such a GPU to be able to reproduce the issue and look for a fix.

Having multiple devices in the group definitely sounds like it may be the problem but we don't have anything in our lab that behaves that way.
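
Assuming "group" here means the IOMMU group of the GPU, anyone with an affected card could check how many devices share it using a sketch like the one below. `pci_iommu_group_members` and the `SYSFS_ROOT` override are hypothetical names used only for illustration; the `iommu_group/devices` directory is the standard sysfs location.

```shell
#!/bin/sh
# Hypothetical helper: list the PCI addresses that share an IOMMU group with
# the given device. SYSFS_ROOT is an assumption for dry runs; default is /sys.
pci_iommu_group_members() {
    dir="${SYSFS_ROOT:-/sys}/bus/pci/devices/$1/iommu_group/devices"
    # Without an IOMMU (or with it disabled) the directory does not exist.
    [ -d "$dir" ] || return 1
    ls "$dir"
}

# Usage (address taken from the single-GPU log in this issue):
#   pci_iommu_group_members 0000:ca:00.0
```

Output listing more than one address would mean that several devices in the group have to be rebound together when the VM stops.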

Similarly for vGPU, we only have the A100 for that and it uses mdev which doesn't have any such issues.

@itzsimpl
Author

FWIW, regarding vGPU: we had mdev as well, and the Quadro RTX 6000 is on the list of supported GPUs (https://docs.nvidia.com/grid/gpus-supported-by-vgpu.html). However, on VM shutdown some vGPUs did not get released properly, so repeated VM shutdown and startup eventually drained the GPU memory, and only a reboot helped (canonical/lxd#12128 (comment)). The drivers were 535.54.03; this is all I can remember or have on file from then, sorry.
