[RLlib] Don't add a cpu to bundle for learner when using gpu #35529

avnishn · 2023-05-18T22:51:55Z

Prevent resource fragmentation by not placing CPUs together with
GPUs in the bundles for the learner workers, so that an actor
that requires only a CPU cannot take a bundle that holds both
a CPU and a GPU.

The long-term fix will be to allow specifying the placement
group bundle index via Tune and Ray Train.
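
As a rough illustration of the bundle change described above (a minimal sketch, not the code in this PR): the helper name `learner_bundles` and its signature are hypothetical, and the bundles are plain dicts in the `{"CPU": ..., "GPU": ...}` shape that Ray placement groups accept.

```python
from typing import Dict, List


def learner_bundles(
    num_learner_workers: int,
    num_cpus_per_learner_worker: float,
    num_gpus_per_learner_worker: float,
) -> List[Dict[str, float]]:
    """Build one resource bundle per learner worker (hypothetical helper)."""
    if num_gpus_per_learner_worker:
        # GPU learners get GPU-only bundles, so an actor that needs only a
        # CPU can never be placed into a bundle that also holds a GPU.
        bundle = {"GPU": num_gpus_per_learner_worker}
    else:
        # CPU-only learners keep their CPU request unchanged.
        bundle = {"CPU": num_cpus_per_learner_worker}
    # Return an independent copy of the bundle for each learner worker.
    return [dict(bundle) for _ in range(num_learner_workers)]


gpu_bundles = learner_bundles(2, 1, 1)
cpu_bundles = learner_bundles(2, 1, 0)
```

With two GPU learners, no bundle in `gpu_bundles` contains a `"CPU"` key, which is the property the description is after.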

Signed-off-by: avnishn [email protected]

Why are these changes needed?

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

solves ray-project#35409

ArturNiederfahrenhorst · 2023-05-19T22:01:22Z

@avnishn Thanks man! Can you also make sure that a pick lands?

kouroshHakha

Looks good. One nit: can we run the relevant release test on this PR and link the Buildkite run?

kouroshHakha · 2023-05-19T22:03:53Z

rllib/algorithms/algorithm_config.py

+     num_cpus_per_learner_worker=(
+         self.num_cpus_per_learner_worker
+         if not self.num_gpus_per_learner_worker
+         else 0
+     ),


kouroshHakha · 2023-05-19

qq: this is not needed, right? self.num_cpus_per_learner_worker will be zero if self.num_gpus_per_learner_worker > 0. If that is not the case, we already raise an error.

avnishn · 2023-05-19

Yeah, I think I added this before I added the error. We don't technically need it. I'll remove it if I need to do anything else to get the release tests to run.
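
The two pieces discussed in this thread can be sketched in isolation. This is an illustration under stated assumptions rather than the code in `algorithm_config.py`: both function names are hypothetical, and the exact condition and message of the validation error are assumed from the comment above.

```python
def validate_learner_resources(
    num_cpus_per_learner_worker: float,
    num_gpus_per_learner_worker: float,
) -> None:
    """Reject configs that request both CPUs and GPUs per learner worker.

    Assumption: the real check may use a different threshold or message.
    """
    if num_gpus_per_learner_worker > 0 and num_cpus_per_learner_worker > 0:
        raise ValueError(
            "Set either num_cpus_per_learner_worker or "
            "num_gpus_per_learner_worker, not both."
        )


def resolve_learner_cpus(
    num_cpus_per_learner_worker: float,
    num_gpus_per_learner_worker: float,
) -> float:
    """Mirror the conditional from the diff: request no CPU when a GPU is set."""
    return (
        num_cpus_per_learner_worker
        if not num_gpus_per_learner_worker
        else 0
    )
```

If the validation always runs first, the conditional is redundant, which is the point of the question above. On its own, the conditional silently drops the CPU request instead of raising.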

avnishn · 2023-05-19T22:07:50Z

Don't merge just yet.

For whatever reason, I haven't gotten the release tests to run; it's not finding the name of the test. I suspect I did something wrong while launching the test.

kouroshHakha · 2023-05-19T22:10:22Z

ok let me know when. 👍

avnishn · 2023-05-22T17:01:48Z

https://buildkite.com/ray-project/release-tests-pr/builds/39400#0188443c-35b2-4d89-aaa3-9b8f8b442826

Release tests are passing; this is ready to merge, @kouroshHakha.

kouroshHakha

There are some non-catchable conflicts between this PR and #35573. Please merge master and use the new Learner API. Also, I'm not sure the changes to the learners are needed in this PR.

avnishn · 2023-05-22T20:33:32Z

https://buildkite.com/ray-project/release-tests-pr/builds/39400#_

The release tests are passing, but one of the multi-GPU test examples was taking a really long time to run because of bad resource utilization. I changed its resource usage: different batch and minibatch sizes, and more rollout workers, since we have 48 on the machine.

avnishn · 2023-05-23T19:31:51Z

The RLlib multi-GPU test that failed is flaky. I have verified that it passes by running repro CI and by running the test myself.
This PR is ready to merge.

…#35676) solves #35409 Prevent fragmentation of resources by not placing gpus with cpus in bundles for the learner workers, making it so that an actor that requires only cpu does not potentially take a bundle that has both a cpu and gpu. The long term fix will be to allow the specification of placement group bundle index via tune and ray train. Signed-off-by: avnishn <[email protected]>

avnishn added 2 commits May 18, 2023 13:37

Initial commit

bce003e

Signed-off-by: Avnish <[email protected]>

avnishn requested review from sven1977, gjoliver, ArturNiederfahrenhorst, smorad, maxpumperla, kouroshHakha and krfricke as code owners May 18, 2023 22:51

avnishn added 2 commits May 18, 2023 16:14

Fix indentation

0b75bf6

Signed-off-by: avnishn <[email protected]>

Fix indentation

d7ab6e4

Signed-off-by: avnishn <[email protected]>

avnishn assigned gjoliver, ArturNiederfahrenhorst and kouroshHakha May 19, 2023

avnishn added the release-blocker (P0 Issue that blocks the release) and v2.5.0-pick labels May 19, 2023

kouroshHakha approved these changes May 19, 2023

kouroshHakha added the do-not-merge (Do not merge this PR!) label May 19, 2023

gjoliver reviewed May 19, 2023

rllib/algorithms/algorithm_config.py (resolved)

avnishn added 4 commits May 20, 2023 20:20

Move ppo kl infinite check to avoid tensorflow autograph errors

9ba46de

Signed-off-by: Avnish <[email protected]>

Release test cluster only has 2 gpus not 4, reduce number of gpus to avoid hanging experiment

162470a

Signed-off-by: Avnish <[email protected]>

Address comments

668899d

Signed-off-by: Avnish <[email protected]>

Add resource resolution back in

517f5c1

Signed-off-by: Avnish <[email protected]>

avnishn removed the do-not-merge (Do not merge this PR!) label May 22, 2023

kouroshHakha reviewed May 22, 2023

rllib/algorithms/ppo/ppo_learner.py (outdated, resolved)

Merge branch 'master' of https://github.com/ray-project/ray into modify_resource_request_multi_gpu_learner

8a7954d

avnishn added 2 commits May 22, 2023 11:51

Make level with master, fix docstring

1a7820e

Signed-off-by: Avnish <[email protected]>

Configure the mulit agent cartpole ppo example to use resources more efficiently and finish training faster

fa97a46

Signed-off-by: avnishn <[email protected]>

avnishn added 6 commits May 22, 2023 13:54

Address lint

a81170f

Signed-off-by: avnishn <[email protected]>

Reduce num cpus used for ci

b2804d0

Signed-off-by: avnishn <[email protected]>

Move kl check out of traced update, torch, tf

c496c34

Signed-off-by: Avnish <[email protected]>

Reduce num rollout workers

b3b7c8f

Signed-off-by: Avnish <[email protected]>

Merge branch 'master' of https://github.com/ray-project/ray into modify_resource_request_multi_gpu_learner

b2c947f

Remove inf kl changes in favor of svens

d199063

Signed-off-by: Avnish <[email protected]>

gjoliver merged commit 5073be7 into ray-project:master May 23, 2023
2 checks passed

avnishn mentioned this pull request May 23, 2023

[RLlib] Make resource requests for multi gpu learners not request cpu IMPALA #35679 (merged)
