[Release Test] Update cuda version in gpu docker cluster launcher image to 12.1 (#42246)

After the Ray 2.9 release, the release test for the GPU Docker example cluster YAML file started failing with:

2023-12-23 03:00:43,078 VINFO command_runner.py:371 -- Running `docker run --rm --name ray_nvidia_docker -d -it  -e LC_ALL=C.UTF-8 -e LANG=C.UTF-8 --shm-size='2301055426.56b' --runtime=nvidia --net=host rayproject/ray:latest-gpu bash`
24897079968c098daccf1ed65a0bea5d3d9e3df84de201ea20f1a34b0363975c
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: requirement error: unsatisfied condition: cuda>=11.8, please update your driver to a newer version, or use an earlier cuda container: unknown.
The likely cause is that Ray 2.9 increased the required CUDA version to 11.8. This PR updates the CUDA version of the GCP VM source image in the example cluster YAML file from 11.3 to 12.1. The test passes after this change.
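
For context, a quick way to diagnose this class of failure is to compare the host driver's supported CUDA version against the image's requirement (a diagnostic sketch, not part of this change; the image tag is the one from the failing command above):

# On the GPU node: print the installed driver version; plain `nvidia-smi`
# also shows the highest CUDA version the driver supports in its header,
# which must be >= the image's requirement (>= 11.8 for the Ray 2.9 GPU image).
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# Running nvidia-smi inside the container surfaces the same
# "unsatisfied condition: cuda>=11.8" error if the host driver is too old.
docker run --rm --runtime=nvidia rayproject/ray:latest-gpu nvidia-smi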

Related issue number
Closes #42134

---------

Signed-off-by: Archit Kulkarni <[email protected]>
architkulkarni committed Jan 10, 2024
1 parent f3efbe2 commit 4d0f6dd
Showing 1 changed file with 2 additions and 2 deletions.
python/ray/autoscaler/gcp/example-gpu-docker.yaml (2 additions, 2 deletions)
@@ -64,7 +64,7 @@ available_node_types:
       initializeParams:
         diskSizeGb: 50
         # See https://cloud.google.com/compute/docs/images for more images
-        sourceImage: projects/deeplearning-platform-release/global/images/family/common-cu113
+        sourceImage: projects/ml-images/global/images/c0-deeplearning-common-cu121-v20231209-debian-11
     # Make sure to set scheduling->onHostMaintenance to TERMINATE when GPUs are present
     guestAccelerators:
       - acceleratorType: nvidia-tesla-t4
@@ -98,7 +98,7 @@ available_node_types:
       initializeParams:
         diskSizeGb: 50
         # See https://cloud.google.com/compute/docs/images for more images
-        sourceImage: projects/deeplearning-platform-release/global/images/family/common-cu113
+        sourceImage: projects/ml-images/global/images/c0-deeplearning-common-cu121-v20231209-debian-11
     # Make sure to set scheduling->onHostMaintenance to TERMINATE when GPUs are present
     guestAccelerators:
       - acceleratorType: nvidia-tesla-t4
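With the updated source image, the example cluster launches cleanly (a minimal usage sketch; assumes configured GCP credentials and a checkout of the Ray repo):

# Launch the GPU Docker example cluster from the repo root...
ray up python/ray/autoscaler/gcp/example-gpu-docker.yaml
# ...and tear it down when finished.
ray down python/ray/autoscaler/gcp/example-gpu-docker.yaml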
