CUDA_VISIBLE_DEVICES is not set properly when using placement groups with GPUs #14542
Comments
cc @wuisawesome |
@ANarayan could you also post the output of |
Yup, here it is: |
Here's a minimal example:

    import ray
    import ray.cloudpickle as cp
    import numpy as np
    from ray.util.placement_group import placement_group

    ray.init(num_gpus=2)
    X = ray.put(np.random.rand(300, 300, 10))

    @ray.remote(num_gpus=1)
    class Test:
        def test(self, config):
            import os
            print(ray.worker.global_worker.core_worker.resource_ids())
            print(os.environ.get("CUDA_VISIBLE_DEVICES"))

    pg = placement_group([{"CPU": 1, "GPU": 1}], strategy="PACK")
    ray.wait([pg.ready()])
    t = Test.options(placement_group=pg, placement_group_bundle_index=0).remote()
    ray.get(t.test.remote(0))
@richardliaw can you find an assignee for this? |
System Information:
CUDA Version: 10.1
Tensorflow Version: 2.3.1
Ludwig Version: 0.4-dev0 (most recent commit on master)
Ray Version: 2.0.0.dev0
Python Version: 3.7.7
Error:
I am experiencing the following error when running Ray Tune:
2021-03-08 09:12:39.656449: E tensorflow/stream_executor/cuda/cuda_driver.cc:314] failed call to cuInit: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
It appears as though Ray is not properly setting the CUDA_VISIBLE_DEVICES environment variable. Calling os.environ['CUDA_VISIBLE_DEVICES'] returns the string "0,0". Moreover, calling ray.get_gpu_ids() on the worker returned the list ['0','0'].
I was able to fix the issue by explicitly setting CUDA_VISIBLE_DEVICES as follows:
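The reporter's own snippet did not survive in this copy of the thread. What follows is not that fix. It is only a rough sketch of how the "0,0" symptom could be detected and worked around inside a worker, using nothing but the standard library. The helper names `find_duplicate_devices` and `dedupe_devices` are hypothetical, and whether deduplicating is the right workaround for a given setup is an assumption; the underlying ID assignment is Ray's responsibility.

```python
import os


def _parse_ids(visible_devices):
    """Split a CUDA_VISIBLE_DEVICES-style string into a list of ID strings."""
    return [d.strip() for d in visible_devices.split(",") if d.strip()]


def find_duplicate_devices(visible_devices):
    """Return the device IDs that appear more than once (hypothetical helper)."""
    seen, dupes = set(), []
    for dev in _parse_ids(visible_devices):
        if dev in seen and dev not in dupes:
            dupes.append(dev)
        seen.add(dev)
    return dupes


def dedupe_devices(visible_devices):
    """Drop repeated IDs while keeping first-seen order (hypothetical helper)."""
    return ",".join(dict.fromkeys(_parse_ids(visible_devices)))


if __name__ == "__main__":
    current = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    dupes = find_duplicate_devices(current)
    if dupes:
        # Assumption: collapsing duplicates is acceptable for this worker.
        os.environ["CUDA_VISIBLE_DEVICES"] = dedupe_devices(current)
        print("duplicate device IDs found:", dupes)
    print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
```

Any change to CUDA_VISIBLE_DEVICES has to happen before TensorFlow (or any other CUDA client) initializes the driver in that process, since the variable is read at CUDA initialization.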
cc: @richardliaw