[Release Test] Update cuda version in gpu docker cluster launcher image to 12.1 #42246

architkulkarni · 2024-01-08T23:17:45Z

Why are these changes needed?

After the Ray 2.9 release, the release test for the GPU Docker example cluster YAML file started failing with the following error:

2023-12-23 03:00:43,078 VINFO command_runner.py:371 -- Running `docker run --rm --name ray_nvidia_docker -d -it  -e LC_ALL=C.UTF-8 -e LANG=C.UTF-8 --shm-size='2301055426.56b' --runtime=nvidia --net=host rayproject/ray:latest-gpu bash`
24897079968c098daccf1ed65a0bea5d3d9e3df84de201ea20f1a34b0363975c
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: requirement error: unsatisfied condition: cuda>=11.8, please update your driver to a newer version, or use an earlier cuda container: unknown.

The likely cause is that Ray 2.9 raised the required CUDA version to 11.8, which the CUDA 11.3 VM image used by the example cannot satisfy. This PR updates the CUDA version used in the GCP VM image in the example cluster YAML file from 11.3 to 12.1. The test passes after this change.
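
To make the version reasoning above concrete, here is a minimal sketch of the comparison implied by the `cuda>=11.8` condition in the error. It is not how nvidia-container-cli performs the check; the helper names (`parse_version`, `extract_cuda_requirement`, `driver_satisfies`) are hypothetical and exist only for illustration.

```python
import re
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '11.8' into a comparable tuple (11, 8)."""
    return tuple(int(part) for part in text.split("."))


def extract_cuda_requirement(error_text: str) -> Optional[str]:
    """Return the minimum CUDA version from a 'cuda>=X.Y' condition, if present."""
    match = re.search(r"cuda>=(\d+(?:\.\d+)*)", error_text)
    return match.group(1) if match else None


def driver_satisfies(host_cuda: str, required_cuda: str) -> bool:
    """Check whether the host CUDA version meets the required minimum."""
    return parse_version(host_cuda) >= parse_version(required_cuda)


# Error text copied from the failing release test log above.
error = (
    "nvidia-container-cli: requirement error: unsatisfied condition: "
    "cuda>=11.8, please update your driver to a newer version, "
    "or use an earlier cuda container: unknown."
)

required = extract_cuda_requirement(error)
if required is not None:
    # 11.3 is the old VM image version, 12.1 is the version this PR switches to.
    for host in ("11.3", "12.1"):
        print(host, "satisfies", required, "->", driver_satisfies(host, required))
```

Comparing versions as integer tuples rather than strings avoids lexicographic mistakes such as treating "11.10" as older than "11.8".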

Related issue number

Closes #42134

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: Archit Kulkarni <[email protected]>

architkulkarni · 2024-01-09T00:27:41Z

Release test running here: https://buildkite.com/ray-project/release/builds/5418

Assigning @stephanie-wang as core-oncall (codeowner) as Hongchao is out.

architkulkarni · 2024-01-10T21:52:43Z

The test passed. https://buildkite.com/ray-project/release/builds/5418#018cebb4-548a-4f3f-9401-df263c3cf5be

architkulkarni added 2 commits January 8, 2024 15:15

Update cuda version in gpu docker cluster launcher image to 12.1

03972ac

Signed-off-by: Archit Kulkarni <[email protected]>

Fix image name

b48e78e

Signed-off-by: Archit Kulkarni <[email protected]>

architkulkarni marked this pull request as ready for review January 9, 2024 00:27

architkulkarni requested review from ericl, hongchaodeng and a team as code owners January 9, 2024 00:27

architkulkarni assigned stephanie-wang Jan 9, 2024

architkulkarni added v2.9.1-pick and removed v2.9.1-pick labels Jan 9, 2024

architkulkarni assigned rickyyx Jan 10, 2024

rickyyx approved these changes Jan 10, 2024

architkulkarni merged commit 4d0f6dd into ray-project:master Jan 10, 2024
9 checks passed

architkulkarni mentioned this pull request Jan 10, 2024

[Release Test] Update cuda version in gpu docker cluster launcher image to 12.1 (#42246) #42309

Closed

8 tasks
