Running CUDA samples with multiple GPUs is failed #7852

Open
1 of 2 tasks
yes89929 opened this issue Dec 21, 2021 · 18 comments

Comments

yes89929 commented Dec 21, 2021

Version

Microsoft Windows [Version 10.0.22000.376]

WSL Version

  • WSL 2
  • WSL 1

Kernel Version

5.10.60.1

Distro Version

Ubuntu 20.04 and Ubuntu 18.04

Other Software

CPU: Intel(R) Core(TM) i9-9900X
GPU: Nvidia Titan RTX * 4 (driver 510.06)
RAM: 128GB

Repro Steps

Install CUDA on WSL

sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub

sudo sh -c 'echo "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 /" > /etc/apt/sources.list.d/cuda.list'

sudo apt-get update

sudo apt-get install -y cuda-toolkit-11-0

Run samples

cd /usr/local/cuda-11.0/samples/4_Finance/BlackScholes

sudo make

./BlackScholes

cd /usr/local/cuda-11.0/samples/1_Utilities/deviceQuery

sudo make

./deviceQuery

Expected Behavior

Return success

Actual Behavior

[./BlackScholes] - Starting...
CUDA error at ../../common/inc/helper_cuda.h:777 code=2(cudaErrorMemoryAllocation) "cudaGetDeviceCount(&device_count)"
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 2
-> out of memory
Result = FAIL

Diagnostic Logs

No response

elsaco commented Dec 25, 2021

@yes89929 your CUDA apps are not seeing the GPU. What is the output of nvidia-smi -q?

This is sample deviceQuery output on WSL Ubuntu-20.04 test run:

tux@ubuntu:/usr/local/cuda/samples/1_Utilities/deviceQuery$ ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA GeForce RTX 3060 Ti"
  CUDA Driver Version / Runtime Version          11.4 / 11.0
  CUDA Capability Major/Minor version number:    8.6
---cut---

@yes89929
Author

Thanks for the reply.

nvidia-smi -q

==============NVSMI LOG==============

Timestamp                                 : Wed Dec 29 01:16:10 2021
Driver Version                            : 510.06
CUDA Version                              : 11.6

Attached GPUs                             : 4
GPU 00000000:19:00.0
    Product Name                          : NVIDIA TITAN RTX
    Product Brand                         : Titan
    Product Architecture                  : Turing
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : WDDM
        Pending                           : WDDM
    Serial Number                         : 0320419026962
    GPU UUID                              : GPU-f9825843-2134-3ac5-55be-460ae94d1cb5
    Minor Number                          : N/A
    VBIOS Version                         : 90.02.23.00.01
    MultiGPU Board                        : No
    Board ID                              : 0x1900
    GPU Part Number                       : 900-1G150-2500-000
    Module ID                             : 0
    Inforom Version
        Image Version                     : G001.0000.02.04
        OEM Object                        : 1.1
        ECC Object                        : N/A
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x19
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x1E0210DE
        Bus Id                            : 00000000:19:00.0
        Sub System Id                     : 0x12A310DE
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 3
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : 0 %
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 24576 MiB
        Used                              : 300 MiB
        Free                              : 24276 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 2 MiB
        Free                              : 254 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : N/A
        Memory                            : N/A
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 28 C
        GPU Shutdown Temp                 : 94 C
        GPU Slowdown Temp                 : 91 C
        GPU Max Operating Temp            : 89 C
        GPU Target Temperature            : 84 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 15.26 W
        Power Limit                       : 280.00 W
        Default Power Limit               : 280.00 W
        Enforced Power Limit              : 280.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 320.00 W
    Clocks
        Graphics                          : 8 MHz
        SM                                : 8 MHz
        Memory                            : 20 MHz
        Video                             : 540 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2100 MHz
        SM                                : 2100 MHz
        Memory                            : 7001 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : N/A
    Processes                             : None

GPU 00000000:1A:00.0
    Product Name                          : NVIDIA TITAN RTX
    Product Brand                         : Titan
    Product Architecture                  : Turing
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : WDDM
        Pending                           : WDDM
    Serial Number                         : 0320419025605
    GPU UUID                              : GPU-cfa65c4f-346e-8905-3988-4285c8d6274e
    Minor Number                          : N/A
    VBIOS Version                         : 90.02.23.00.01
    MultiGPU Board                        : No
    Board ID                              : 0x1a00
    GPU Part Number                       : 900-1G150-2500-000
    Module ID                             : 0
    Inforom Version
        Image Version                     : G001.0000.02.04
        OEM Object                        : 1.1
        ECC Object                        : N/A
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x1A
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x1E0210DE
        Bus Id                            : 00000000:1A:00.0
        Sub System Id                     : 0x12A310DE
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 3
            Link Width
                Max                       : 16x
                Current                   : 8x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : 0 %
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 24576 MiB
        Used                              : 300 MiB
        Free                              : 24276 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 2 MiB
        Free                              : 254 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : N/A
        Memory                            : N/A
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 29 C
        GPU Shutdown Temp                 : 94 C
        GPU Slowdown Temp                 : 91 C
        GPU Max Operating Temp            : 89 C
        GPU Target Temperature            : 84 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 1.00 W
        Power Limit                       : 280.00 W
        Default Power Limit               : 280.00 W
        Enforced Power Limit              : 280.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 320.00 W
    Clocks
        Graphics                          : 8 MHz
        SM                                : 8 MHz
        Memory                            : 19 MHz
        Video                             : 539 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2100 MHz
        SM                                : 2100 MHz
        Memory                            : 7001 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : N/A
    Processes                             : None

GPU 00000000:67:00.0
    Product Name                          : NVIDIA TITAN RTX
    Product Brand                         : Titan
    Product Architecture                  : Turing
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : WDDM
        Pending                           : WDDM
    Serial Number                         : 0325218082988
    GPU UUID                              : GPU-e8f40e7a-8ad8-9007-f573-f0a0ae2eab33
    Minor Number                          : N/A
    VBIOS Version                         : 90.02.23.00.01
    MultiGPU Board                        : No
    Board ID                              : 0x6700
    GPU Part Number                       : 900-1G150-2500-000
    Module ID                             : 0
    Inforom Version
        Image Version                     : G001.0000.02.04
        OEM Object                        : 1.1
        ECC Object                        : N/A
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x67
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x1E0210DE
        Bus Id                            : 00000000:67:00.0
        Sub System Id                     : 0x12A310DE
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 3
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : 0 %
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 24576 MiB
        Used                              : 300 MiB
        Free                              : 24276 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 2 MiB
        Free                              : 254 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : N/A
        Memory                            : N/A
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 29 C
        GPU Shutdown Temp                 : 94 C
        GPU Slowdown Temp                 : 91 C
        GPU Max Operating Temp            : 89 C
        GPU Target Temperature            : 84 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 11.14 W
        Power Limit                       : 280.00 W
        Default Power Limit               : 280.00 W
        Enforced Power Limit              : 280.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 320.00 W
    Clocks
        Graphics                          : 8 MHz
        SM                                : 8 MHz
        Memory                            : 19 MHz
        Video                             : 540 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2100 MHz
        SM                                : 2100 MHz
        Memory                            : 7001 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : N/A
    Processes                             : None

GPU 00000000:68:00.0
    Product Name                          : NVIDIA TITAN RTX
    Product Brand                         : Titan
    Product Architecture                  : Turing
    Display Mode                          : Enabled
    Display Active                        : Enabled
    Persistence Mode                      : Enabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : WDDM
        Pending                           : WDDM
    Serial Number                         : 0320419044339
    GPU UUID                              : GPU-1a386410-a438-6060-79a3-81cfe5537096
    Minor Number                          : N/A
    VBIOS Version                         : 90.02.23.00.01
    MultiGPU Board                        : No
    Board ID                              : 0x6800
    GPU Part Number                       : 900-1G150-2500-000
    Module ID                             : 0
    Inforom Version
        Image Version                     : G001.0000.02.04
        OEM Object                        : 1.1
        ECC Object                        : N/A
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x68
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x1E0210DE
        Bus Id                            : 00000000:68:00.0
        Sub System Id                     : 0x12A310DE
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 3
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 769000 KB/s
        Rx Throughput                     : 2000 KB/s
    Fan Speed                             : 0 %
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 24576 MiB
        Used                              : 825 MiB
        Free                              : 23751 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 2 MiB
        Free                              : 254 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : N/A
        Memory                            : N/A
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 28 C
        GPU Shutdown Temp                 : 94 C
        GPU Slowdown Temp                 : 91 C
        GPU Max Operating Temp            : 89 C
        GPU Target Temperature            : 84 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 10.53 W
        Power Limit                       : 280.00 W
        Default Power Limit               : 280.00 W
        Enforced Power Limit              : 280.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 320.00 W
    Clocks
        Graphics                          : 138 MHz
        SM                                : 138 MHz
        Memory                            : 183 MHz
        Video                             : 540 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2100 MHz
        SM                                : 2100 MHz
        Memory                            : 7001 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : N/A
    Processes                             : None

elsaco commented Dec 29, 2021

@yes89929 you're using multiple GPUs, so it might help troubleshooting to isolate a specific GPU.

Please try CUDA_VISIBLE_DEVICES=0 ./deviceQuery to run the test on the first available GPU only.

Also CUDA_VISIBLE_DEVICES="0,1,2,3" to use all GPUs.
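
The isolation test suggested here extends naturally to a sweep over every subset of GPU indices. The following is a minimal sketch, not something posted in the thread: the function names gpu_subsets and sweep are hypothetical, and it assumes the command under test (for example ./deviceQuery) exits with a non-zero status when it fails.

```shell
# Print every non-empty subset of GPU indices 0..N-1, one comma-separated
# list per line (N is the first argument).
gpu_subsets() {
  n=$1
  mask=1
  while [ "$mask" -lt $((1 << n)) ]; do
    list=""
    i=0
    while [ "$i" -lt "$n" ]; do
      # Bit i of mask decides whether GPU i is in this subset.
      if [ $(( (mask >> i) & 1 )) -eq 1 ]; then
        list="${list:+$list,}$i"
      fi
      i=$((i + 1))
    done
    echo "$list"
    mask=$((mask + 1))
  done
}

# sweep N CMD [ARGS...]: run CMD once per subset with CUDA_VISIBLE_DEVICES
# set to that subset, and print "SUBSET PASS" or "SUBSET FAIL".
# Assumption: CMD exits non-zero when it fails.
sweep() {
  count=$1
  shift
  for subset in $(gpu_subsets "$count"); do
    if CUDA_VISIBLE_DEVICES="$subset" "$@" >/dev/null 2>&1; then
      echo "$subset PASS"
    else
      echo "$subset FAIL"
    fi
  done
}
```

With four GPUs, running `sweep 4 ./deviceQuery` from the deviceQuery sample directory would try all 15 subsets and print one PASS or FAIL line for each.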

yes89929 commented Dec 29, 2021

@elsaco
Thanks for the help, it works.
However, it still fails whenever GPUs 2 and 3 are used together.
Is this normal?

CUDA_VISIBLE_DEVICES="{GPU}" ./deviceQuery

GPU Result
0 PASS
1 PASS
2 PASS
3 PASS
0,1 PASS
0,1,2 PASS
0,1,2,3 FAIL
0,3 PASS
1,3 PASS
2,3 FAIL
0,1,3 PASS
1,2,3 FAIL

CUDA_VISIBLE_DEVICES=0 ./deviceQuery

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA TITAN RTX"
  CUDA Driver Version / Runtime Version          11.6 / 11.0
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 24576 MBytes (25769476096 bytes)

CUDA_VISIBLE_DEVICES="0,1,2,3" ./deviceQuery

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 2
-> out of memory
Result = FAIL
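
The table above suggests the failure is tied to one specific pair of GPUs rather than to how many GPUs are visible. One way to check this is to look for index pairs that appear in every failing subset and in no passing subset. A minimal sketch follows; the function names contains and suspect_pairs are hypothetical, and the input is assumed to be "SUBSET RESULT" lines like the rows of the table.

```shell
# contains LIST X: succeed if the comma-separated LIST includes index X.
contains() {
  case ",$1," in
    *",$2,"*) return 0 ;;
    *) return 1 ;;
  esac
}

# suspect_pairs N: read "SUBSET RESULT" lines (RESULT is PASS or FAIL) on
# stdin and print each pair "a,b" of indices below N that is present in
# every FAIL subset and in no PASS subset.
suspect_pairs() {
  n=$1
  results=$(cat)
  a=0
  while [ "$a" -lt "$n" ]; do
    b=$((a + 1))
    while [ "$b" -lt "$n" ]; do
      consistent=1
      fails=0
      while read -r subset result; do
        # Skip blank or incomplete lines.
        [ -n "$result" ] || continue
        if contains "$subset" "$a" && contains "$subset" "$b"; then
          both=1
        else
          both=0
        fi
        if [ "$result" = FAIL ]; then
          fails=$((fails + 1))
          # A failing subset without this pair rules the pair out.
          [ "$both" -eq 1 ] || consistent=0
        elif [ "$both" -eq 1 ]; then
          # A non-failing subset that contains the pair also rules it out.
          consistent=0
        fi
      done <<EOF
$results
EOF
      if [ "$consistent" -eq 1 ] && [ "$fails" -gt 0 ]; then
        echo "$a,$b"
      fi
      b=$((b + 1))
    done
    a=$((a + 1))
  done
}
```

Fed the twelve rows of the table above with N set to 4, this sketch prints only 2,3, which matches the observation that the failures occur exactly when GPUs 2 and 3 are visible together.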

yes89929 changed the title from "Running CUDA samples is failed" to "Running CUDA samples with multiple GPUs is failed" on Jan 6, 2022

@travisjayday

Same issue here

@JamesPerlman

I am also having this issue. I'm glad I found this thread, I thought I was going crazy. I'm using 4x A6000, WSL Ubuntu 20.04, CUDA 11.7.1. Any combination of CUDA_VISIBLE_DEVICES works except those that contain both device 1 and 2.

XuPlusC commented Dec 7, 2022

Same issue here. I'm using 4x 2080 Ti, WSL Ubuntu 20.04, kernel version 5.10.102.1-microsoft-standard-WSL2, and CUDA 11.3. Combining devices 0 and 1 results in the "cudaGetDeviceCount returned 2 -> out of memory" failure.
This problem was reported nearly a year ago and is still not solved 😥

FueiH commented Feb 7, 2023

Same issue here: A10 * 4, WSL Ubuntu 20.04, Linux version 5.10.16.3-microsoft-standard-WSL2 (oe-user@oe-host) (x86_64-msft-linux-gcc (GCC) 9.3.0, CUDA 12.0. The "cudaGetDeviceCount returned 2 -> out of memory" failure occurs when I set CUDA_VISIBLE_DEVICES="0,3".

hicotton02 commented Aug 25, 2023

I have the same issue with 4x 3090s on Windows 11 and Ubuntu 22.04.2 with CUDA 12.2. I can run with CUDA_VISIBLE_DEVICES=0,2,3, but if I put 1 in there, I get an out of memory exception.

Linux Version: 5.15.90.1-microsoft-standard-WSL2

@WarrenSchultz

Running the latest on WSL2, Ubuntu 22.04, CUDA 12.2, with 4x RTX 6000 Ada. Similar to the above, any combination that includes both 1 and 3 fails.
0,1,2,3 fail
0,1,2 pass
0,1,3 fail
0,2,3 pass
1,2,3 fail
0,1 pass
0,2 pass
0,3 pass
1,2 pass
1,3 fail
2,3 pass

@YoungjaeDev

My Linux kernel:

Linux AICADS 5.15.90.1-microsoft-standard-WSL2 #1 SMP Fri Jan 27 02:56:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

0,1,2 succeeds, but 0,1,2,3 fails.

@MilesQLi

@yes89929 you're using multiple GPUs, therefore it might help troubleshooting by isolating a specific GPU.

Please try CUDA_VISIBLE_DEVICES=0 ./deviceQuery to run the test on the first available GPU only.

Also CUDA_VISIBLE_DEVICES="0,1,2,3" to use all GPUs.

Oh my god! Dude, you solved a long-standing problem of mine with not being able to use GPUs in Docker! Thank you so much! After isolating a single GPU, I can use the GPU in Docker. Why is that? What is the reason?

@andrewaf1

I am having this issue now, almost exactly as described. Any combination of devices that doesn't include both 1 and 2 works.

velicm commented Feb 7, 2024

Had the same issue with a multi-GPU setup in WSL. I guess the issue was NVLink. I set the SLI configuration in the NVIDIA Control Panel (in Windows) to 'Maximize 3D performance'. Now it finally works! Hope this helps someone. This thread helped me figure it out.

@BUJIDAOVS

Same problem here: 4 GPUs on WSL2 cannot work together.

@HELLONVIDIA

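Same problem: 4x 2080 Ti, the latest Windows 11, the latest WSL2 version, the latest NVIDIA driver on the Windows side, and correctly installed cudatoolkit and cudnn. A similar problem still occurs: the sample tests pass on Windows, but they fail both in WSL2 and in Docker.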

@Zephyr69

Same issue: 3x 3090 cannot work together reliably and hit this same error. "0,2" and "0,1,2" are fine. "0,1" is not fine. How does this make sense?
Are there any working solutions to this? I have tried those above to no avail.

insujeon commented Aug 25, 2024

Setting the SLI configuration in the NVIDIA Control Panel (in Windows) to 'Maximize 3D performance' worked for me!!!

However, it only partially solves the problem, because the problem with monitoring memory usage in nvidia-smi still remains.
