[Release Test] Update cuda version in gpu docker cluster launcher image to 12.1 (#42246)

After the Ray 2.9 release, the release test for the GPU Docker example cluster YAML file started failing with:

2023-12-23 03:00:43,078 VINFO command_runner.py:371 -- Running `docker run --rm --name ray_nvidia_docker -d -it  -e LC_ALL=C.UTF-8 -e LANG=C.UTF-8 --shm-size='2301055426.56b' --runtime=nvidia --net=host rayproject/ray:latest-gpu bash`
24897079968c098daccf1ed65a0bea5d3d9e3df84de201ea20f1a34b0363975c
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: requirement error: unsatisfied condition: cuda>=11.8, please update your driver to a newer version, or use an earlier cuda container: unknown.
The likely cause is that Ray 2.9 increased the required CUDA version to 11.8. This PR updates the CUDA version of the GCP VM source image in the example cluster YAML file from 11.3 to 12.1. The test passes after this change.
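
For context, a quick way to diagnose this class of failure is to compare the host driver's supported CUDA version against the image's requirement (a diagnostic sketch, not part of this change; the image tag is the one from the failing command above):

# On the GPU node: print the installed driver version; plain `nvidia-smi`
# also shows the highest CUDA version the driver supports in its header,
# which must be >= the image's requirement (>= 11.8 for the Ray 2.9 GPU image).
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# Running nvidia-smi inside the container surfaces the same
# "unsatisfied condition: cuda>=11.8" error if the host driver is too old.
docker run --rm --runtime=nvidia rayproject/ray:latest-gpu nvidia-smi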

Related issue number
Closes #42134

---------

Signed-off-by: Archit Kulkarni <[email protected]>
architkulkarni committed Jan 10, 2024
1 parent f3efbe2 commit 4d0f6dd
Showing 1 changed file with 2 additions and 2 deletions.
python/ray/autoscaler/gcp/example-gpu-docker.yaml (2 additions, 2 deletions)
@@ -64,7 +64,7 @@ available_node_types:
       initializeParams:
         diskSizeGb: 50
         # See https://cloud.google.com/compute/docs/images for more images
-        sourceImage: projects/deeplearning-platform-release/global/images/family/common-cu113
+        sourceImage: projects/ml-images/global/images/c0-deeplearning-common-cu121-v20231209-debian-11
     # Make sure to set scheduling->onHostMaintenance to TERMINATE when GPUs are present
     guestAccelerators:
       - acceleratorType: nvidia-tesla-t4
@@ -98,7 +98,7 @@ available_node_types:
       initializeParams:
         diskSizeGb: 50
         # See https://cloud.google.com/compute/docs/images for more images
-        sourceImage: projects/deeplearning-platform-release/global/images/family/common-cu113
+        sourceImage: projects/ml-images/global/images/c0-deeplearning-common-cu121-v20231209-debian-11
     # Make sure to set scheduling->onHostMaintenance to TERMINATE when GPUs are present
     guestAccelerators:
       - acceleratorType: nvidia-tesla-t4
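With the updated source image, the example cluster launches cleanly (a minimal usage sketch; assumes configured GCP credentials and a checkout of the Ray repo):

# Launch the GPU Docker example cluster from the repo root...
ray up python/ray/autoscaler/gcp/example-gpu-docker.yaml
# ...and tear it down when finished.
ray down python/ray/autoscaler/gcp/example-gpu-docker.yaml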
