
(3.0.0-3.7.2) Nvidia Driver and EFA incompatibility issue when using P4 and P5 instances with updated Kernel drivers #6015

Closed
enrico-usai opened this issue Jan 16, 2024 · 1 comment

@enrico-usai
Contributor

Bug description

The Linux Kernel community introduced a change that is incompatible with the EFA and Nvidia drivers. This change has propagated to recent releases of Linux distributions, including Amazon Linux. When using instance types with GPUDirect RDMA (the ability to read/write GPU memory directly from the EFA device), the EFA kernel module is unable to retrieve GPU memory information.

Nvidia introduced an open-source (OSS) version of their drivers, known as OpenRM, that is compatible with this kernel change. EFA released a new version, 1.29.0, that is compatible with recent kernels and with the OSS Nvidia driver.

Using P4 or P5 instance types with a recently released Linux kernel RPM, in combination with EFA and the non-OSS Nvidia drivers (ParallelCluster < 3.8.0), will cause the EFA communication between your workload nodes to stop working.

If you’re using ParallelCluster 3.7.2 or an earlier version with official ParallelCluster AMIs, you are affected only if you update the kernel to a newer version and use an instance type with GPUDirect RDMA and EFA (such as P4 and P5).

In the logs you can find an error like the following:

`kernel: failing symbol_get of non-GPLONLY symbol nvidia_p2p_get_pages.`
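As a quick check, you can search the kernel log on a compute node for that message. This is a minimal sketch; whether `dmesg` or the systemd journal holds the message depends on the OS configuration:

```bash
# Look for the symbol_get failure that indicates the EFA / Nvidia
# driver incompatibility, in the kernel ring buffer and in the journal.
dmesg | grep -i "nvidia_p2p_get_pages" \
  || sudo journalctl -k | grep -i "nvidia_p2p_get_pages"
```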

Affected versions

P4 or P5 instance types in combination with EFA and the non-OSS Nvidia drivers (ParallelCluster <= 3.7.2) won’t work after updating the Linux kernel to or beyond the following versions: 4.14.326, 5.4.257, 5.10.195, 5.15.131, 6.1.52.
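If you want to script this check, the sketch below extracts the base kernel version from `uname -r` and compares it against the first affected release of the matching series with `sort -V`. The thresholds are the ones listed above; everything else is illustrative:

```bash
#!/bin/bash
# Report whether the running kernel is at or beyond the first affected
# version of its series (see the list of affected versions above).
current=$(uname -r | grep -E -o '^[0-9]+\.[0-9]+\.[0-9]+')
case "$current" in
  4.14.*) threshold=4.14.326 ;;
  5.4.*)  threshold=5.4.257  ;;
  5.10.*) threshold=5.10.195 ;;
  5.15.*) threshold=5.15.131 ;;
  6.1.*)  threshold=6.1.52   ;;
  *) echo "Kernel series of $current is not in the affected list"; exit 0 ;;
esac

# sort -V orders version strings numerically; if the threshold sorts first
# (or is equal), the running kernel is at or beyond the affected version.
if [ "$(printf '%s\n' "$threshold" "$current" | sort -V | head -n 1)" = "$threshold" ]; then
  echo "Kernel $current is >= $threshold: affected"
else
  echo "Kernel $current is < $threshold: not affected"
fi
```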

How to check affected components

  • Kernel version: `uname -r`
  • Installed Nvidia driver version: `nvidia-smi`
  • License of the Nvidia kernel module: `modinfo -F license nvidia`; it returns `Dual MIT/GPL` for the open-source driver or `NVIDIA` for the closed-source driver
  • Installed EFA version: `cat /opt/amazon/efa_installed_packages | grep -E -o "EFA installer version: [0-9.]+"`
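For convenience, the checks above can be combined into a single report run on a P4/P5 node. This is only a sketch; the `nvidia-smi` query flags are standard, and the EFA package file path is the default written by the EFA installer:

```bash
#!/bin/bash
# Print the components relevant to the EFA / Nvidia driver incompatibility.
echo "Kernel version:        $(uname -r)"
echo "Nvidia driver version: $(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
# "Dual MIT/GPL" means the open-source (OpenRM) driver, "NVIDIA" the closed one.
echo "Nvidia module license: $(modinfo -F license nvidia)"
# EFA version recorded by the EFA installer.
grep -E -o "EFA installer version: [0-9.]+" /opt/amazon/efa_installed_packages
```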

Mitigation

Starting from ParallelCluster 3.8.0, we install the OSS Nvidia drivers and EFA 1.29.0 by default in all official ParallelCluster AMIs, so that customers can use recent kernels and safely ingest security fixes.

To build a custom AMI for ParallelCluster <= 3.7.2 with an updated kernel and the OSS Nvidia drivers, please follow How to create a custom AMI with Open Source Nvidia drivers for P4 and P5.
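If you go the custom AMI route, the image is built with the standard `pcluster build-image` workflow; a minimal sketch of the command is below (the image id, configuration file name and region are placeholders, and the configuration content comes from the linked guide):

```bash
# Build a custom ParallelCluster AMI whose image configuration installs
# the OSS Nvidia driver and EFA 1.29.0 (see the linked guide for details).
pcluster build-image \
  --image-id custom-oss-nvidia-ami \
  --image-configuration image-config.yaml \
  --region us-east-1
```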

@enrico-usai
Contributor Author

This issue can be closed since we already released ParallelCluster 3.8.0 with the Open Source Nvidia drivers and EFA 1.29.0.
It was created to track this known issue and to provide official instructions to cope with it.
