
[AIR] Fix ResourceChangingScheduler not working with AIR #26307

Merged (22 commits) on Jul 12, 2022

Conversation

Yard1 (Member) commented Jul 5, 2022

Why are these changes needed?

This PR ensures that the new trial resources set by ResourceChangingScheduler are respected by the train loop logic by modifying the scaling config to match. Previously, even though trials had their resources updated, the scaling config was not modified, which led to, e.g., new workers not being spawned in the DataParallelTrainer even though resources were available.
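
For illustration, below is a minimal sketch (not taken from this PR) of the kind of setup the fix targets: a DataParallelTrainer tuned under ResourceChangingScheduler. The import paths, the dict-style scaling_config, and the empty train loop are assumptions reflecting the AIR API around this time and may differ in other Ray versions.

from ray import tune
from ray.train.torch import TorchTrainer  # a DataParallelTrainer subclass
from ray.tune.schedulers.resource_changing_scheduler import ResourceChangingScheduler

def train_loop_per_worker(config):
    ...  # standard AIR training loop (model setup, session.report, etc.)

trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config={"num_workers": 2},
)

tuner = tune.Tuner(
    trainer,
    tune_config=tune.TuneConfig(
        num_samples=4,
        # Periodically reallocates resources between running trials.
        scheduler=ResourceChangingScheduler(),
    ),
)

# Before this fix, a trial whose resources were increased by the scheduler would
# not spawn additional workers, because the trainer's scaling config was never
# updated to match the new trial resources.
results = tuner.fit()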

In order to accomplish this, ScalingConfigDataClass is updated to allow equality comparisons with other ScalingConfigDataClasses (using the underlying PGF) and to create a ScalingConfigDataClass from a PGF.

Please note that this is an internal-only change intended to actually make ResourceChangingScheduler work. In the future, ResourceChangingScheduler should be updated to operate on ScalingConfigDataClasses instead of PGFs, as it does now. That will require a deprecation cycle.
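
As an illustrative sketch of that round trip (the method names from_placement_group_factory and as_placement_group_factory, and the import paths, are assumptions based on this description rather than copied from the diff):

from ray.air.config import ScalingConfigDataClass
from ray.tune import PlacementGroupFactory

# A PGF like the one ResourceChangingScheduler assigns to a trial:
# one bundle for the trainer itself plus two worker bundles.
pgf = PlacementGroupFactory([{"CPU": 1}] + [{"CPU": 2, "GPU": 1}] * 2)

# Build a scaling config from the scheduler-provided PGF...
scaling_config = ScalingConfigDataClass.from_placement_group_factory(pgf)

# ...and compare configs via their underlying PGFs, so two configs that resolve
# to the same placement group bundles compare equal.
assert scaling_config.as_placement_group_factory() == pgf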

Related issue number

Closes #26130

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Yard1 (Member Author) commented Jul 7, 2022

@sumanthratna FYI this PR will conflict with yours

amogkam (Contributor) left a comment

Thanks @Yard1! Changes to Trainer lgtm.

Can we make sure to create an issue to track the remaining follow-up items from the PR description and the comments in the code?

# Same check as in TrialRunner
if not (isinstance(self.fail_fast, bool) or self.fail_fast.upper() != "RAISE"):
    raise ValueError(
        "fail_fast must be one of {bool, 'raise'}. " f"Got {self.fail_fast}."
    )
Contributor:

Suggested change
"fail_fast must be one of {bool, 'raise'}. " f"Got {self.fail_fast}."
"fail_fast" must be one of {bool, 'RAISE'}. " f"Got {self.fail_fast}."

Yard1 (Member Author):

We use lowercase in the docstring, so I figured we may as well be consistent here.

@@ -41,7 +42,7 @@ def trial_id(self) -> str:
return self._session.trial_info.id

@property
def trial_resources(self) -> Dict[str, float]:
def trial_resources(self) -> "PlacementGroupFactory":
Contributor:

For my own understanding, is this a behavior change, or is this correcting the type hint?

Yard1 (Member Author):

Behavior change. We agreed on this with @xwjiang2010: basically, just returning the dictionary of resources is not sufficient, because we lose the bundle information. In the future we can make it return scaling configs instead, once we transition PGF into a full developer API.
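
As a toy illustration of the bundle information a flat dict drops (a sketch, not part of the PR; the import path may differ by Ray version):

from ray.tune import PlacementGroupFactory

pgf = PlacementGroupFactory([{"CPU": 1}, {"CPU": 2, "GPU": 1}, {"CPU": 2, "GPU": 1}])

# The PGF preserves the per-worker bundle structure the trainer needs:
print(pgf.bundles)             # [{'CPU': 1}, {'CPU': 2, 'GPU': 1}, {'CPU': 2, 'GPU': 1}]

# A flat resource dict only keeps the totals, so the bundle layout is lost:
print(pgf.required_resources)  # {'CPU': 5, 'GPU': 2}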

@@ -178,7 +179,7 @@ def _validate_attributes(self):
if not isinstance(self.scaling_config, dict):
raise ValueError(
f"`scaling_config` should be an instance of `dict`, "
f"found {type(self.run_config)} with value `{self.run_config}`."
f"found {type(self.scaling_config)} with value `{self.scaling_config}`."
Contributor:

Nice, good catch!

Yard1 (Member Author) commented Jul 11, 2022

Will make issues once this is merged!

krfricke (Contributor) left a comment

Looks good, left minor comments

python/ray/air/config.py (review comments outdated, resolved)
python/ray/air/tests/test_resource_changing.py (review comments outdated, resolved)
@krfricke krfricke merged commit b3878e2 into ray-project:master Jul 12, 2022
@Yard1 Yard1 deleted the resource_changing_scheduler_air branch July 12, 2022 20:15
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022