
(3.0.0-3.7.2) Nvidia Driver and EFA incompatibility issue when using P4 and P5 instances with updated Kernel drivers #6015

Closed
enrico-usai opened this issue Jan 16, 2024 · 1 comment

@enrico-usai
Contributor

Bug description

The Linux Kernel community introduced a change that is incompatible with the EFA and Nvidia drivers. This change has propagated to recent releases of Linux distributions, including Amazon Linux. When using instance types with GPUDirect RDMA (the ability to read/write GPU memory directly from the EFA device), the EFA kernel module is unable to retrieve GPU memory information.

Nvidia introduced an open-source (OSS) version of their drivers, known as OpenRM, that is compatible with this kernel change. EFA released a new version, 1.29.0, that is compatible with recent kernels and with the OSS Nvidia driver.

Using P4 or P5 instance types with a recently released Linux kernel RPM, in combination with EFA and the non-OSS Nvidia drivers (ParallelCluster < 3.8.0), will cause the EFA communication between your workload nodes to stop working.

If you’re using ParallelCluster 3.7.2 or an earlier version with official ParallelCluster AMIs, you are affected only if you update the kernel to a newer version and use an instance type with GPUDirect RDMA and EFA (such as P4 and P5).

In the logs you can find an error like the following:

`kernel: failing symbol_get of non-GPLONLY symbol nvidia_p2p_get_pages.`
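As a quick check, you can search the kernel log on a compute node for that message. This is a minimal sketch; whether `dmesg` or the systemd journal holds the message depends on the OS configuration:

```bash
# Look for the symbol_get failure that indicates the EFA / Nvidia
# driver incompatibility, in the kernel ring buffer and in the journal.
dmesg | grep -i "nvidia_p2p_get_pages" \
  || sudo journalctl -k | grep -i "nvidia_p2p_get_pages"
```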

Affected versions

P4 or P5 instance types in combination with EFA and the non-OSS Nvidia drivers (ParallelCluster <= 3.7.2) won’t work after updating the Linux kernel to or beyond the following versions: 4.14.326, 5.4.257, 5.10.195, 5.15.131, 6.1.52.
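If you want to script this check, the sketch below extracts the base kernel version from `uname -r` and compares it against the first affected release of the matching series with `sort -V`. The thresholds are the ones listed above; everything else is illustrative:

```bash
#!/bin/bash
# Report whether the running kernel is at or beyond the first affected
# version of its series (see the list of affected versions above).
current=$(uname -r | grep -E -o '^[0-9]+\.[0-9]+\.[0-9]+')
case "$current" in
  4.14.*) threshold=4.14.326 ;;
  5.4.*)  threshold=5.4.257  ;;
  5.10.*) threshold=5.10.195 ;;
  5.15.*) threshold=5.15.131 ;;
  6.1.*)  threshold=6.1.52   ;;
  *) echo "Kernel series of $current is not in the affected list"; exit 0 ;;
esac

# sort -V orders version strings numerically; if the threshold sorts first
# (or is equal), the running kernel is at or beyond the affected version.
if [ "$(printf '%s\n' "$threshold" "$current" | sort -V | head -n 1)" = "$threshold" ]; then
  echo "Kernel $current is >= $threshold: affected"
else
  echo "Kernel $current is < $threshold: not affected"
fi
```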

How to check affected components

  • Kernel version: `uname -r`
  • Installed Nvidia driver version: `nvidia-smi`
  • License of the Nvidia kernel module: `modinfo -F license nvidia`; it returns `Dual MIT/GPL` for the open-source driver or `NVIDIA` for the closed-source driver
  • Installed EFA version: `cat /opt/amazon/efa_installed_packages | grep -E -o "EFA installer version: [0-9.]+"`
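For convenience, the checks above can be combined into a single report run on a P4/P5 node. This is only a sketch; the `nvidia-smi` query flags are standard, and the EFA package file path is the default written by the EFA installer:

```bash
#!/bin/bash
# Print the components relevant to the EFA / Nvidia driver incompatibility.
echo "Kernel version:        $(uname -r)"
echo "Nvidia driver version: $(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
# "Dual MIT/GPL" means the open-source (OpenRM) driver, "NVIDIA" the closed one.
echo "Nvidia module license: $(modinfo -F license nvidia)"
# EFA version recorded by the EFA installer.
grep -E -o "EFA installer version: [0-9.]+" /opt/amazon/efa_installed_packages
```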

Mitigation

Starting from ParallelCluster 3.8.0, we install the OSS Nvidia drivers and EFA 1.29.0 by default in all official ParallelCluster AMIs, so that customers can use recent kernels and safely ingest security fixes.

To build a custom AMI for ParallelCluster <= 3.7.2 with an updated kernel and the OSS Nvidia drivers, please follow How to create a custom AMI with Open Source Nvidia drivers for P4 and P5.
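If you go the custom AMI route, the image is built with the standard `pcluster build-image` workflow; a minimal sketch of the command is below (the image id, configuration file name and region are placeholders, and the configuration content comes from the linked guide):

```bash
# Build a custom ParallelCluster AMI whose image configuration installs
# the OSS Nvidia driver and EFA 1.29.0 (see the linked guide for details).
pcluster build-image \
  --image-id custom-oss-nvidia-ami \
  --image-configuration image-config.yaml \
  --region us-east-1
```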

@enrico-usai
Contributor Author

This issue can be closed since we already released ParallelCluster 3.8.0 with the Open Source Nvidia drivers and EFA 1.29.0.
It was created to track this known issue and to provide official instructions to cope with it.
