Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RLlib] Don't add a cpu to bundle for learner when using gpu #35529

Merged

Conversation

avnishn
Copy link
Member

@avnishn avnishn commented May 18, 2023

solves #35409

Prevent fragmentation of resources by not placing gpus
with cpus in bundles for the learner workers, making it
so that an actor that requires only cpu does not
potentially take a bundle that has both a cpu and gpu.

The long term fix will be to allow the specification
of placement group bundle index via tune and ray train.

Signed-off-by: avnishn [email protected]

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Avnish <[email protected]>
solves ray-project#35409

Prevent fragmentation of resources by not placing gpus
with cpus in bundles for the learner workers, making it
so that an actor that requires only cpu does not
potentially take a bundle that has both a cpu and gpu.

The long term fix will be to allow the specification
of placement group bundle index via tune and ray train.

Signed-off-by: avnishn <[email protected]>
Signed-off-by: avnishn <[email protected]>
Signed-off-by: avnishn <[email protected]>
@ArturNiederfahrenhorst
Copy link
Contributor

@avnishn Thanks man! Can you also make sure that a pick lands?

@avnishn avnishn added release-blocker P0 Issue that blocks the release v2.5.0-pick labels May 19, 2023
Copy link
Contributor

@kouroshHakha kouroshHakha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. One nit. Can we run the relevant release test on this here? and link the buildkite?

Comment on lines +3243 to +3247
num_cpus_per_learner_worker=(
self.num_cpus_per_learner_worker
if not self.num_gpus_per_learner_worker
else 0
),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

qq: this is not needed right? cause self.num_cpus_per_learner_worker will be zero if self.num_gpus_per_learner_worker > 0. If this is not the case, we already raise errors.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I think I just added this in here before I added the error. We don't technically need this Ill remove it if I need to do anything to get the release tests to run.

@avnishn
Copy link
Member Author

avnishn commented May 19, 2023

Don't merge just yet

For whatever reason, I haven't gotten the release tests to run, Like its not finding the name of the test.... I suspect I did something wrong while launching the test.

@kouroshHakha kouroshHakha added the do-not-merge Do not merge this PR! label May 19, 2023
@kouroshHakha
Copy link
Contributor

ok let me know when. 👍

@avnishn
Copy link
Member Author

avnishn commented May 22, 2023

@avnishn avnishn removed the do-not-merge Do not merge this PR! label May 22, 2023
Copy link
Contributor

@kouroshHakha kouroshHakha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some non-catchable conflicts with this PR #35573. Please merge master and use the new Learner API. Also not sure if the changes to learners are needed in this PR.

rllib/algorithms/ppo/ppo_learner.py Outdated Show resolved Hide resolved
@avnishn
Copy link
Member Author

avnishn commented May 22, 2023

https://buildkite.com/ray-project/release-tests-pr/builds/39400#_

The release tests are passing, but one of the multi gpu test examples is taking a really long time to run because of bad resource utilization, so I changed its resource utilization (changed batch and minibatch sizes, and increased number of rollout workers since we have 48 on the machine)

@avnishn
Copy link
Member Author

avnishn commented May 23, 2023

the rllib multi-gpu test that failed is flakey. I have verified by running repro ci and running the test myself that it passes.
This PR is ready to merge

@gjoliver gjoliver merged commit 5073be7 into ray-project:master May 23, 2023
2 checks passed
avnishn added a commit to avnishn/ray that referenced this pull request May 23, 2023
…ject#35529)

solves ray-project#35409

Prevent fragmentation of resources by not placing gpus
with cpus in bundles for the learner workers, making it
so that an actor that requires only cpu does not
potentially take a bundle that has both a cpu and gpu.

The long term fix will be to allow the specification
of placement group bundle index via tune and ray train.

Signed-off-by: avnishn <[email protected]>
ArturNiederfahrenhorst pushed a commit that referenced this pull request May 24, 2023
…#35676)

solves #35409

Prevent fragmentation of resources by not placing gpus
with cpus in bundles for the learner workers, making it
so that an actor that requires only cpu does not
potentially take a bundle that has both a cpu and gpu.

The long term fix will be to allow the specification
of placement group bundle index via tune and ray train.

Signed-off-by: avnishn <[email protected]>
scv119 pushed a commit to scv119/ray that referenced this pull request Jun 16, 2023
…ject#35529)

solves ray-project#35409

Prevent fragmentation of resources by not placing gpus
with cpus in bundles for the learner workers, making it
so that an actor that requires only cpu does not
potentially take a bundle that has both a cpu and gpu.

The long term fix will be to allow the specification
of placement group bundle index via tune and ray train.

Signed-off-by: avnishn <[email protected]>
arvind-chandra pushed a commit to lmco/ray that referenced this pull request Aug 31, 2023
…ject#35529)

solves ray-project#35409

Prevent fragmentation of resources by not placing gpus
with cpus in bundles for the learner workers, making it
so that an actor that requires only cpu does not
potentially take a bundle that has both a cpu and gpu.

The long term fix will be to allow the specification
of placement group bundle index via tune and ray train.

Signed-off-by: avnishn <[email protected]>
Signed-off-by: e428265 <[email protected]>
glennmoy pushed a commit to beacon-biosignals/ray that referenced this pull request Sep 26, 2023
…ject#35529) (ray-project#35676)

solves ray-project#35409

Prevent fragmentation of resources by not placing gpus
with cpus in bundles for the learner workers, making it
so that an actor that requires only cpu does not
potentially take a bundle that has both a cpu and gpu.

The long term fix will be to allow the specification
of placement group bundle index via tune and ray train.

Signed-off-by: avnishn <[email protected]>
glennmoy pushed a commit to beacon-biosignals/ray that referenced this pull request Sep 26, 2023
…ject#35529) (ray-project#35676)

solves ray-project#35409

Prevent fragmentation of resources by not placing gpus
with cpus in bundles for the learner workers, making it
so that an actor that requires only cpu does not
potentially take a bundle that has both a cpu and gpu.

The long term fix will be to allow the specification
of placement group bundle index via tune and ray train.

Signed-off-by: avnishn <[email protected]>
glennmoy pushed a commit to beacon-biosignals/ray that referenced this pull request Sep 26, 2023
…ject#35529) (ray-project#35676)

solves ray-project#35409

Prevent fragmentation of resources by not placing gpus
with cpus in bundles for the learner workers, making it
so that an actor that requires only cpu does not
potentially take a bundle that has both a cpu and gpu.

The long term fix will be to allow the specification
of placement group bundle index via tune and ray train.

Signed-off-by: avnishn <[email protected]>
glennmoy pushed a commit to beacon-biosignals/ray that referenced this pull request Sep 26, 2023
…ject#35529) (ray-project#35676)

solves ray-project#35409

Prevent fragmentation of resources by not placing gpus
with cpus in bundles for the learner workers, making it
so that an actor that requires only cpu does not
potentially take a bundle that has both a cpu and gpu.

The long term fix will be to allow the specification
of placement group bundle index via tune and ray train.

Signed-off-by: avnishn <[email protected]>
glennmoy pushed a commit to beacon-biosignals/ray that referenced this pull request Sep 26, 2023
…ject#35529) (ray-project#35676)

solves ray-project#35409

Prevent fragmentation of resources by not placing gpus
with cpus in bundles for the learner workers, making it
so that an actor that requires only cpu does not
potentially take a bundle that has both a cpu and gpu.

The long term fix will be to allow the specification
of placement group bundle index via tune and ray train.

Signed-off-by: avnishn <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-blocker P0 Issue that blocks the release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants