[RLlib] Don't add a cpu to bundle for learner when using gpu #35529
Conversation
Signed-off-by: Avnish <[email protected]>
Solves ray-project#35409. Prevents fragmentation of resources by not placing GPUs with CPUs in bundles for the learner workers, so that an actor that requires only a CPU does not take a bundle that has both a CPU and a GPU. The long-term fix will be to allow specifying the placement group bundle index via Tune and Ray Train. Signed-off-by: avnishn <[email protected]>
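The fragmentation this change avoids can be sketched with a toy scheduler (plain Python, not the actual Ray scheduler; `pick_bundle` is a hypothetical helper for illustration): when a learner bundle mixes CPU and GPU, a CPU-only actor can land on it and strand the GPU.

```python
# Toy model of placement-group bundle selection (NOT the real Ray
# scheduler): an actor takes the first bundle satisfying its request.
def pick_bundle(bundles, request):
    for i, bundle in enumerate(bundles):
        if all(bundle.get(k, 0) >= v for k, v in request.items()):
            return i
    return None

# Before the fix: the learner bundle mixes CPU and GPU.
mixed = [{"CPU": 1, "GPU": 1}, {"CPU": 1}]
# A CPU-only actor may take the mixed bundle, stranding its GPU.
assert pick_bundle(mixed, {"CPU": 1}) == 0

# After the fix: GPU learner bundles request no CPU, so a CPU-only
# actor cannot occupy them; the GPU stays available for the learner.
split = [{"GPU": 1}, {"CPU": 1}]
assert pick_bundle(split, {"CPU": 1}) == 1
```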
Signed-off-by: avnishn <[email protected]>
Signed-off-by: avnishn <[email protected]>
@avnishn Thanks! Can you also make sure that a cherry-pick lands?
Looks good. One nit: can we run the relevant release test on this, and link the Buildkite?
num_cpus_per_learner_worker=(
    self.num_cpus_per_learner_worker
    if not self.num_gpus_per_learner_worker
    else 0
),
qq: this is not needed, right? Because self.num_cpus_per_learner_worker will be zero if self.num_gpus_per_learner_worker > 0. If this is not the case, we already raise errors.
Yeah, I think I just added this before I added the error. We don't technically need it; I'll remove it if I need to do anything to get the release tests to run.
Don't merge just yet. For whatever reason, I haven't gotten the release tests to run; it's not finding the name of the test. I suspect I did something wrong while launching it.
Ok, let me know when. 👍
Signed-off-by: Avnish <[email protected]>
…avoid hanging experiment Signed-off-by: Avnish <[email protected]>
Signed-off-by: Avnish <[email protected]>
Signed-off-by: Avnish <[email protected]>
https://buildkite.com/ray-project/release-tests-pr/builds/39400#0188443c-35b2-4d89-aaa3-9b8f8b442826
The release tests are passing; this is ready to merge. @kouroshHakha
There are some non-catchable conflicts with PR #35573. Please merge master and use the new Learner API. Also, I'm not sure the changes to the learners are needed in this PR.
…fy_resource_request_multi_gpu_learner
Signed-off-by: Avnish <[email protected]>
…efficiently and finish training faster Signed-off-by: avnishn <[email protected]>
https://buildkite.com/ray-project/release-tests-pr/builds/39400#_
The release tests are passing, but one of the multi-GPU test examples is taking a very long time to run because of bad resource utilization, so I changed its configuration (changed the batch and minibatch sizes, and increased the number of rollout workers, since we have 48 on the machine).
Signed-off-by: avnishn <[email protected]>
Signed-off-by: avnishn <[email protected]>
Signed-off-by: Avnish <[email protected]>
Signed-off-by: Avnish <[email protected]>
…fy_resource_request_multi_gpu_learner
Signed-off-by: Avnish <[email protected]>
The RLlib multi-GPU test that failed is flaky. I have verified, by running repro-ci and running the test myself, that it passes.
…ject#35529) solves ray-project#35409 Prevent fragmentation of resources by not placing gpus with cpus in bundles for the learner workers, making it so that an actor that requires only cpu does not potentially take a bundle that has both a cpu and gpu. The long term fix will be to allow the specification of placement group bundle index via tune and ray train. Signed-off-by: avnishn <[email protected]>
Solves #35409

Prevents fragmentation of resources by not placing GPUs with CPUs in bundles for the learner workers, so that an actor that requires only a CPU does not take a bundle that has both a CPU and a GPU. The long-term fix will be to allow specifying the placement group bundle index via Tune and Ray Train.

Signed-off-by: avnishn [email protected]
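A minimal sketch of the bundle-construction rule described above (`learner_bundles` is a hypothetical helper written for illustration, not RLlib's actual code; the parameter names mirror the config fields in the PR): when the learner workers use GPUs, their bundles request only GPUs.

```python
def learner_bundles(num_learner_workers,
                    num_cpus_per_learner_worker,
                    num_gpus_per_learner_worker):
    """Build one placement-group bundle per learner worker.

    Per the fix: if the learner uses GPUs, request only GPUs, so a
    CPU-only actor cannot occupy (and fragment) a learner bundle.
    """
    if num_gpus_per_learner_worker:
        bundle = {"GPU": num_gpus_per_learner_worker}
    else:
        bundle = {"CPU": num_cpus_per_learner_worker}
    return [dict(bundle) for _ in range(num_learner_workers)]

print(learner_bundles(2, 1, 1))  # [{'GPU': 1}, {'GPU': 1}]
print(learner_bundles(2, 4, 0))  # [{'CPU': 4}, {'CPU': 4}]
```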
Why are these changes needed?
Related issue number
Checks
- I've signed off every commit (using git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.