
CUDA_VISIBLE_DEVICES is not set properly when using placement groups with GPUs #14542

Closed
ANarayan opened this issue Mar 8, 2021 · 5 comments · Fixed by #14574
Labels: bug (Something that is supposed to be working; but isn't), P1 (Issue that should be fixed within a few weeks)
Milestone: Core Bugs


ANarayan commented Mar 8, 2021

System Information:
CUDA Version: 10.1
Tensorflow Version: 2.3.1
Ludwig Version: 0.4-dev0 (most recent commit on master)
Ray Version: 2.0.0.dev0
Python Version: 3.7.7

Error:
I am experiencing the following error when running Ray Tune:
2021-03-08 09:12:39.656449: E tensorflow/stream_executor/cuda/cuda_driver.cc:314] failed call to cuInit: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal

It appears that Ray is not setting the CUDA_VISIBLE_DEVICES environment variable properly: os.environ['CUDA_VISIBLE_DEVICES'] returns the string "0,0", and ray.get_gpu_ids() called on the worker returns the list ['0', '0'].
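
For reference, these values were read from inside the trial worker; a minimal check along these lines (a sketch, not the exact Ludwig/Tune code) is:

import os
import ray

def check_gpu_env():
    # Run inside the trial worker: compare the GPU ids Ray assigned to this
    # worker with what TensorFlow will see when it initializes CUDA.
    print("ray.get_gpu_ids():", ray.get_gpu_ids())                          # observed: ['0', '0']
    print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))  # observed: "0,0"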

I was able to fix the issue by explicitly setting CUDA_VISIBLE_DEVICES as follows:

gpu_ids = set(ray.get_gpu_ids())
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(gid) for gid in gpu_ids)

cc: @richardliaw

@ANarayan ANarayan added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Mar 8, 2021
@richardliaw richardliaw changed the title [TUNE] Getting CUDA_ERROR_INVALID_DEVICE: invalid device ordinal w/Tune [tune] Getting CUDA_ERROR_INVALID_DEVICE: invalid device ordinal w/Tune Mar 8, 2021
@richardliaw richardliaw changed the title [tune] Getting CUDA_ERROR_INVALID_DEVICE: invalid device ordinal w/Tune CUDA_VISIBLE_DEVICES is not set properly when using placement groups with GPUs Mar 8, 2021
rkooo567 (Contributor) commented Mar 8, 2021

cc @wuisawesome

richardliaw (Contributor) commented:

@ANarayan could you also post the output of ray.worker.global_worker.core_worker.resource_ids()?

ANarayan (Author) commented Mar 8, 2021

> @ANarayan could you also post the output of ray.worker.global_worker.core_worker.resource_ids()?

Yup, here it is:
{'GPU_group_2918a11aef386cd9d4d5c8fcbba0c409': [(0, 1.0)], 'CPU_group_2918a11aef386cd9d4d5c8fcbba0c409': [(0, 1.0)], 'CPU_group_0_2918a11aef386cd9d4d5c8fcbba0c409': [(0, 1.0)], 'GPU_group_0_2918a11aef386cd9d4d5c8fcbba0c409': [(0, 1.0)]}

@wuisawesome wuisawesome added this to the Core Bugs milestone Mar 8, 2021
richardliaw (Contributor) commented:

Here's a minimal example:

import ray

import ray.cloudpickle as cp
import numpy as np
from ray.util.placement_group import placement_group

ray.init(num_gpus=2)
X = ray.put(np.random.rand(300, 300, 10))

@ray.remote(num_gpus=1)
class Test:
    def test(self, config):
        import os
        print(ray.worker.global_worker.core_worker.resource_ids())
        print(os.environ.get("CUDA_VISIBLE_DEVICES"))


pg = placement_group([{"CPU": 1, "GPU": 1}], strategy="PACK")
ray.wait([pg.ready()])
t = Test.options(placement_group=pg, placement_group_bundle_index=0).remote()

ray.get(t.test.remote(0))

@richardliaw richardliaw added P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Mar 9, 2021
ericl (Contributor) commented Mar 9, 2021

@richardliaw can you find an assignee for this?
