
CUDA_VISIBLE_DEVICES is not set properly when using placement groups with GPUs #14542

Closed
ANarayan opened this issue Mar 8, 2021 · 5 comments · Fixed by #14574
Labels: bug (Something that is supposed to be working; but isn't), P1 (Issue that should be fixed within a few weeks)
Milestone: Core Bugs


ANarayan commented Mar 8, 2021

System Information:
CUDA Version: 10.1
Tensorflow Version: 2.3.1
Ludwig Version: 0.4-dev0 (most recent commit on master)
Ray Version: 2.0.0.dev0
Python Version: 3.7.7

Error:
I am experiencing the following error when running Ray Tune:
2021-03-08 09:12:39.656449: E tensorflow/stream_executor/cuda/cuda_driver.cc:314] failed call to cuInit: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal

It appears that Ray is not setting the CUDA_VISIBLE_DEVICES environment variable properly: os.environ['CUDA_VISIBLE_DEVICES'] returns the string "0,0", and ray.get_gpu_ids() called on the worker returns the list ['0', '0'].
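
For reference, these values were read from inside the trial worker; a minimal check along these lines (a sketch, not the exact Ludwig/Tune code) is:

import os
import ray

def check_gpu_env():
    # Run inside the trial worker: compare the GPU ids Ray assigned to this
    # worker with what TensorFlow will see when it initializes CUDA.
    print("ray.get_gpu_ids():", ray.get_gpu_ids())                          # observed: ['0', '0']
    print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))  # observed: "0,0"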

I was able to fix the issue by explicitly setting CUDA_VISIBLE_DEVICES as follows:

gpu_ids = set(ray.get_gpu_ids())
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(gid) for gid in gpu_ids)

cc: @richardliaw

@ANarayan ANarayan added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Mar 8, 2021
@richardliaw richardliaw changed the title [TUNE] Getting CUDA_ERROR_INVALID_DEVICE: invalid device ordinal w/Tune [tune] Getting CUDA_ERROR_INVALID_DEVICE: invalid device ordinal w/Tune Mar 8, 2021
@richardliaw richardliaw changed the title [tune] Getting CUDA_ERROR_INVALID_DEVICE: invalid device ordinal w/Tune CUDA_VISIBLE_DEVICES is not set properly when using placement groups with GPUs Mar 8, 2021
rkooo567 (Contributor) commented Mar 8, 2021

cc @wuisawesome

richardliaw (Contributor) commented:

@ANarayan could you also post the output of ray.worker.global_worker.core_worker.resource_ids()?

ANarayan (Author) commented Mar 8, 2021

> @ANarayan could you also post the output of ray.worker.global_worker.core_worker.resource_ids()?

Yup, here it is:
{'GPU_group_2918a11aef386cd9d4d5c8fcbba0c409': [(0, 1.0)], 'CPU_group_2918a11aef386cd9d4d5c8fcbba0c409': [(0, 1.0)], 'CPU_group_0_2918a11aef386cd9d4d5c8fcbba0c409': [(0, 1.0)], 'GPU_group_0_2918a11aef386cd9d4d5c8fcbba0c409': [(0, 1.0)]}

@wuisawesome wuisawesome added this to the Core Bugs milestone Mar 8, 2021
richardliaw (Contributor) commented:

Here's a minimal example:

import ray

import ray.cloudpickle as cp
import numpy as np
from ray.util.placement_group import placement_group

ray.init(num_gpus=2)
X = ray.put(np.random.rand(300, 300, 10))

@ray.remote(num_gpus=1)
class Test:
    def test(self, config):
        import os
        print(ray.worker.global_worker.core_worker.resource_ids())
        print(os.environ.get("CUDA_VISIBLE_DEVICES"))


pg = placement_group([{"CPU": 1, "GPU": 1}], strategy="PACK")
ray.wait([pg.ready()])
t = Test.options(placement_group=pg, placement_group_bundle_index=0).remote()

ray.get(t.test.remote(0))

@richardliaw richardliaw added P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Mar 9, 2021
ericl (Contributor) commented Mar 9, 2021

@richardliaw can you find an assignee for this?
