[AIR] Fix ResourceChangingScheduler not working with AIR #26307
Conversation
@sumanthratna FYI this PR will conflict with yours
Thanks @Yard1! Changes to Trainer lgtm.
Can we make sure to create an issue to track the remaining follow up items from the PR description and comments in the code?
# Same check as in TrialRunner
if not (isinstance(self.fail_fast, bool) or self.fail_fast.upper() != "RAISE"):
    raise ValueError(
        "fail_fast must be one of {bool, 'raise'}. " f"Got {self.fail_fast}."
    )
Suggested change:
-    "fail_fast must be one of {bool, 'raise'}. " f"Got {self.fail_fast}."
+    "fail_fast must be one of {bool, 'RAISE'}. " f"Got {self.fail_fast}."
We use lowercase in the docstring, so I figured we may as well be consistent here
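To make the accepted values concrete, here is a minimal standalone sketch of the case-insensitive check being discussed. The function name `validate_fail_fast` is illustrative only, not Ray's actual API:

```python
# Hedged sketch: accept a bool, or the string "raise" in any casing
# (matching the lowercase docstring and the uppercase comparison above);
# anything else is rejected.
def validate_fail_fast(fail_fast):
    if isinstance(fail_fast, bool):
        return fail_fast
    if isinstance(fail_fast, str) and fail_fast.upper() == "RAISE":
        return fail_fast
    raise ValueError(
        "fail_fast must be one of {bool, 'raise'}. " f"Got {fail_fast}."
    )
```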
@@ -41,7 +42,7 @@ def trial_id(self) -> str:
         return self._session.trial_info.id

     @property
-    def trial_resources(self) -> Dict[str, float]:
+    def trial_resources(self) -> "PlacementGroupFactory":
For my own understanding, is this a behavior change, or is this correcting the type hint?
Behavior change. We agreed on this with @xwjiang2010: just returning the dictionary of resources is not sufficient, because we lose the bundle information. In the future we can make it return scaling configs instead, once we transition PGF into a fully developer-facing API.
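A toy example (not Ray code) of why the dict form is lossy: flattening per-worker bundles into a single resource dict discards how the resources were grouped.

```python
# Two worker bundles, each requesting 1 CPU and 1 GPU.
bundles = [{"CPU": 1, "GPU": 1}, {"CPU": 1, "GPU": 1}]

# Flatten into a single resource dict, as the old return value did.
flat = {}
for bundle in bundles:
    for resource, amount in bundle.items():
        flat[resource] = flat.get(resource, 0) + amount

# flat is now {"CPU": 2, "GPU": 2}: indistinguishable from a single
# 2-CPU/2-GPU bundle, so the per-worker bundle structure is lost.
```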
@@ -178,7 +179,7 @@ def _validate_attributes(self):
         if not isinstance(self.scaling_config, dict):
             raise ValueError(
                 f"`scaling_config` should be an instance of `dict`, "
-                f"found {type(self.run_config)} with value `{self.run_config}`."
+                f"found {type(self.scaling_config)} with value `{self.scaling_config}`."
Nice, good catch!
Will make issues once this is merged!
Looks good, left minor comments
…ct#26307)

This PR ensures that the new trial resources set by `ResourceChangingScheduler` are respected by the train loop logic by modifying the scaling config to match. Previously, even though trials had their resources updated, the scaling config was not modified, which led to, e.g., new workers not being spawned in the `DataParallelTrainer` even though resources were available.

In order to accomplish this, `ScalingConfigDataClass` is updated to allow equality comparisons with other `ScalingConfigDataClass`es (using the underlying PGF) and to create a `ScalingConfigDataClass` from a PGF.

Please note that this is an internal-only change intended to actually make `ResourceChangingScheduler` work. In the future, `ResourceChangingScheduler` should be updated to operate on `ScalingConfigDataClass`es instead of PGFs as it is now. That will require a deprecation cycle.

Signed-off-by: Stefan van der Kleij <[email protected]>
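The equality-via-PGF and construct-from-PGF mechanism described above can be sketched roughly as follows. All names here (`ScalingConfigSketch`, `as_bundles`, `from_bundles`) are illustrative stand-ins, not Ray's actual `ScalingConfigDataClass` or `PlacementGroupFactory` API:

```python
# Hedged sketch: two scaling configs compare equal iff the placement-group
# bundles they resolve to are equal, and a config can be rebuilt from bundles.
from dataclasses import dataclass


@dataclass
class ScalingConfigSketch:
    num_workers: int
    resources_per_worker: dict

    def as_bundles(self):
        # One bundle per worker; a real PGF would also carry the placement
        # strategy, which this toy omits.
        return [dict(self.resources_per_worker) for _ in range(self.num_workers)]

    def __eq__(self, other):
        # Compare via the underlying bundles, mirroring comparison via PGF.
        return self.as_bundles() == other.as_bundles()

    @classmethod
    def from_bundles(cls, bundles):
        # Assumes homogeneous bundles, as produced by a resource update.
        return cls(num_workers=len(bundles), resources_per_worker=dict(bundles[0]))
```

With this shape, a config rebuilt from the bundles a scheduler assigned compares equal to an equivalent hand-written config, which is the property the train loop relies on.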
Why are these changes needed?

This PR ensures that the new trial resources set by `ResourceChangingScheduler` are respected by the train loop logic by modifying the scaling config to match. Previously, even though trials had their resources updated, the scaling config was not modified, which led to, e.g., new workers not being spawned in the `DataParallelTrainer` even though resources were available.

In order to accomplish this, `ScalingConfigDataClass` is updated to allow equality comparisons with other `ScalingConfigDataClass`es (using the underlying PGF) and to create a `ScalingConfigDataClass` from a PGF.

Please note that this is an internal-only change intended to actually make `ResourceChangingScheduler` work. In the future, `ResourceChangingScheduler` should be updated to operate on `ScalingConfigDataClass`es instead of PGFs as it is now. That will require a deprecation cycle.

Related issue number
Closes #26130
Checks

- I've run scripts/format.sh to lint the changes in this PR.