[Core, RLlib] Multi-GPU RLlib experiment cannot be scheduled #35409
Comments
I think the hang is due to placement group (PG) fragmentation: the PG has 4 bundles, and we already scheduled two. The short-term fix should be on the Tune side: always specify bundle_index during scheduling to avoid fragmentation. In the long term, core can probably do a better job of reducing fragmentation automatically.
Also, as suggested by @cadedaniel: core should provide some message explaining why scheduling is pending, so that people are not left assuming it's a bug in Ray Core.
@jjyao Why does that happen? The two tasks/actors that specify a one-CPU requirement and no GPU requirement should not be assigned to the bundles that have GPU requirements. Isn't that the case?
Please tag me and @avnishn in the follow-up conversations.
@kouroshHakha, for this particular case, yes, core could be smarter and use only the CPU-only bundles to avoid fragmentation. But in general, core doesn't have the complete view needed to fully solve the fragmentation issue, since it doesn't know what requests will come in later and what resources they will need. Quoting https://docs.ray.io/en/latest/ray-core/scheduling/placement-group.html:
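The fragmentation described above can be illustrated with a toy model. This is a pure-Python sketch, not Ray's actual scheduler; the first-fit policy and the bundle ordering are assumptions chosen to show the unlucky case where CPU-only requests land on CPU+GPU bundles first:

```python
# Toy model of placement-group bundle assignment, illustrating fragmentation.
# The first-fit policy and bundle ordering below are illustrative assumptions,
# not Ray's actual scheduling behavior.

def first_fit(bundles, used, request):
    """Return the index of the first free bundle that can hold `request`."""
    for i, bundle in enumerate(bundles):
        if i in used:
            continue
        if all(bundle.get(k, 0) >= v for k, v in request.items()):
            return i
    return None  # request stays pending

# The 4-bundle placement group from the issue, ordered so that the
# CPU+GPU bundles are considered first (the worst case).
bundles = [
    {"CPU": 1, "GPU": 1},
    {"CPU": 1, "GPU": 1},
    {"CPU": 1},
    {"CPU": 1},
]

# Two CPU-only requests arrive first and, with no bundle_index pinning,
# land on the CPU+GPU bundles...
used = set()
for req in [{"CPU": 1}, {"CPU": 1}]:
    used.add(first_fit(bundles, used, req))

# ...so a later GPU request finds no free GPU bundle and hangs forever.
gpu_slot = first_fit(bundles, used, {"CPU": 1, "GPU": 1})
print(gpu_slot)  # None: the GPU actor is never scheduled
```

With only two requests left to place and both GPU bundles already consumed by CPU-only work, the model reproduces the pending GPU actor from the report.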
Currently, if you don't specify bundle_index, you cannot expect core to use any particular bundle.
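As a concrete sketch of the pinning being recommended, the bundle index can be passed through `PlacementGroupSchedulingStrategy`. This assumes a recent Ray version, and the `CpuActor` class is a hypothetical placeholder, not from the original report:

```python
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()

# One CPU-only bundle and one CPU+GPU bundle.
pg = placement_group([{"CPU": 1}, {"CPU": 1, "GPU": 1}])
ray.get(pg.ready())

@ray.remote(num_cpus=1)
class CpuActor:  # hypothetical CPU-only worker
    pass

# Pin the CPU-only actor to bundle 0 so it cannot occupy the GPU bundle.
actor = CpuActor.options(
    scheduling_strategy=PlacementGroupSchedulingStrategy(
        placement_group=pg,
        placement_group_bundle_index=0,
    )
).remote()
```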
Solves ray-project#35409. Prevent fragmentation of resources by not placing GPUs with CPUs in bundles for the learner workers, so that an actor that requires only a CPU cannot take a bundle that has both a CPU and a GPU. The long-term fix will be to allow specifying the placement group bundle index via Tune and Ray Train. Signed-off-by: avnishn <[email protected]>
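The effect of the fix can be checked with the same toy model: once the learner GPU bundles no longer carry CPUs, a CPU-only request simply cannot fit in them. The bundle shapes here are assumptions based on the PR description, not the exact bundles RLlib builds:

```python
# Sketch of the fix's effect: GPU bundles no longer carry CPUs, so
# CPU-only requests cannot occupy them and every request finds a bundle.

def fits(bundle, request):
    """True if `bundle` has enough of every resource in `request`."""
    return all(bundle.get(k, 0) >= v for k, v in request.items())

# Assumed post-fix layout: CPU-only bundles and GPU-only bundles.
bundles = [{"CPU": 1}, {"CPU": 1}, {"GPU": 1}, {"GPU": 1}]
requests = [{"CPU": 1}, {"CPU": 1}, {"GPU": 1}, {"GPU": 1}]

used, placed = set(), []
for req in requests:
    idx = next(i for i, b in enumerate(bundles) if i not in used and fits(b, req))
    used.add(idx)
    placed.append(idx)

print(placed)  # [0, 1, 2, 3]: nothing is left pending
```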
Since this is a release-blocker issue, please close it only after the cherry-pick fix is merged into the 2.5 release branch. Please add @ArturNiederfahrenhorst as one of the reviewers of the fix as well, for tracking purposes. Thanks!
Also need to merge this.
(#35676) Same fix as above, cherry-picked for the release branch (solves #35409). Signed-off-by: avnishn <[email protected]>
@avnishn Merged on the release branch.
@avnishn: is this already done? thanks |
What happened + What you expected to happen
Above is a script to reproduce the problem. I am running on the following cluster:
https://console.anyscale.com/o/anyscale-internal/workspaces/expwrk_rexsdhckwvn3wltbtxwce57a77/ses_qstkpd5ej9qjmle94esjcl6nyr
which has the following cluster compute layout:
I'm trying to run a script that creates a placement group that looks like the following:
[{"CPU": 1, "GPU": 0}, {"CPU": 1, "GPU": 0}, {"CPU": 1, "GPU": 1}, {"CPU": 1, "GPU": 1}]
and when I run this, one of my GPU actors is never created. When I run `ray status` I see the following:
If I run the same script, but remove the need for 1 of the actors, then it runs without hanging.
The placement group for that script has one fewer bundle:
[{"CPU": 1, "GPU": 0}, {"CPU": 1, "GPU": 1}, {"CPU": 1, "GPU": 1}]
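The reproduction script itself is not included in this capture. A minimal hedged sketch of the setup it describes might look like the following; the actor classes are hypothetical placeholders (the real script uses RLlib/Tune), and the zero-valued GPU entries are dropped from the bundle specs:

```python
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init()

# Two CPU-only bundles and two CPU+GPU bundles, as in the issue.
pg = placement_group(
    [{"CPU": 1}, {"CPU": 1}, {"CPU": 1, "GPU": 1}, {"CPU": 1, "GPU": 1}]
)
ray.get(pg.ready())

@ray.remote(num_cpus=1)
class CpuActor:  # hypothetical stand-in for the CPU-only workers
    pass

@ray.remote(num_cpus=1, num_gpus=1)
class GpuActor:  # hypothetical stand-in for the GPU learner workers
    pass

# Without bundle_index pinning, the two CpuActors may land on the
# CPU+GPU bundles, leaving one GpuActor pending indefinitely.
actors = [
    CpuActor.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg)
    ).remote()
    for _ in range(2)
] + [
    GpuActor.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg)
    ).remote()
    for _ in range(2)
]
```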
This issue blocks me from being able to run experiments for a blog post on multi-GPU training with RLlib in Ray 2.5. I cannot train across multiple nodes without this issue appearing.
Versions / Dependencies
ray 5197da2
Reproduction script
Issue Severity
High: It blocks me from completing my task.