[AIR] Fix ResourceChangingScheduler not working with AIR #26307
Conversation
@sumanthratna FYI this PR will conflict with yours
Thanks @Yard1! Changes to Trainer lgtm.
Can we make sure to create an issue to track the remaining follow up items from the PR description and comments in the code?
# Same check as in TrialRunner
if not (isinstance(self.fail_fast, bool) or self.fail_fast.upper() != "RAISE"):
    raise ValueError(
        "fail_fast must be one of {bool, 'raise'}. " f"Got {self.fail_fast}."
    )
Suggested change:
-    "fail_fast must be one of {bool, 'raise'}. " f"Got {self.fail_fast}."
+    "fail_fast must be one of {bool, 'RAISE'}. " f"Got {self.fail_fast}."
We use lowercase in the docstring, so I figured we may as well be consistent here
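To make the accepted values concrete, here is a minimal standalone sketch of the case-insensitive check being discussed. The function name `validate_fail_fast` is illustrative only, not Ray's actual API:

```python
# Hedged sketch: accept a bool, or the string "raise" in any casing
# (matching the lowercase docstring and the uppercase comparison above);
# anything else is rejected.
def validate_fail_fast(fail_fast):
    if isinstance(fail_fast, bool):
        return fail_fast
    if isinstance(fail_fast, str) and fail_fast.upper() == "RAISE":
        return fail_fast
    raise ValueError(
        "fail_fast must be one of {bool, 'raise'}. " f"Got {fail_fast}."
    )
```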
@@ -41,7 +42,7 @@ def trial_id(self) -> str:
         return self._session.trial_info.id

     @property
-    def trial_resources(self) -> Dict[str, float]:
+    def trial_resources(self) -> "PlacementGroupFactory":
For my own understanding, is this a behavior change, or is this correcting the type hint?
Behavior change. We agreed on this with @xwjiang2010: just returning the dictionary of resources is not sufficient, because we lose the bundle information. In the future we can make it return scaling configs instead, once we transition PGF into a fully developer-facing API.
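A toy example (not Ray code) of why the dict form is lossy: flattening per-worker bundles into a single resource dict discards how the resources were grouped.

```python
# Two worker bundles, each requesting 1 CPU and 1 GPU.
bundles = [{"CPU": 1, "GPU": 1}, {"CPU": 1, "GPU": 1}]

# Flatten into a single resource dict, as the old return value did.
flat = {}
for bundle in bundles:
    for resource, amount in bundle.items():
        flat[resource] = flat.get(resource, 0) + amount

# flat is now {"CPU": 2, "GPU": 2}: indistinguishable from a single
# 2-CPU/2-GPU bundle, so the per-worker bundle structure is lost.
```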
@@ -178,7 +179,7 @@ def _validate_attributes(self):
         if not isinstance(self.scaling_config, dict):
             raise ValueError(
                 f"`scaling_config` should be an instance of `dict`, "
-                f"found {type(self.run_config)} with value `{self.run_config}`."
+                f"found {type(self.scaling_config)} with value `{self.scaling_config}`."
Nice, good catch!
Will make issues once this is merged!
Looks good, left minor comments
…ct#26307)

This PR ensures that the new trial resources set by `ResourceChangingScheduler` are respected by the train loop logic by modifying the scaling config to match. Previously, even though trials had their resources updated, the scaling config was not modified, which led to, e.g., new workers not being spawned in the `DataParallelTrainer` even though resources were available.

In order to accomplish this, `ScalingConfigDataClass` is updated to allow equality comparisons with other `ScalingConfigDataClass`es (using the underlying PGF) and to create a `ScalingConfigDataClass` from a PGF.

Please note that this is an internal-only change intended to actually make `ResourceChangingScheduler` work. In the future, `ResourceChangingScheduler` should be updated to operate on `ScalingConfigDataClass`es instead of PGFs as it is now. That will require a deprecation cycle.

Signed-off-by: Stefan van der Kleij <[email protected]>
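The equality-via-PGF and construct-from-PGF mechanism described above can be sketched roughly as follows. All names here (`ScalingConfigSketch`, `as_bundles`, `from_bundles`) are illustrative stand-ins, not Ray's actual `ScalingConfigDataClass` or `PlacementGroupFactory` API:

```python
# Hedged sketch: two scaling configs compare equal iff the placement-group
# bundles they resolve to are equal, and a config can be rebuilt from bundles.
from dataclasses import dataclass


@dataclass
class ScalingConfigSketch:
    num_workers: int
    resources_per_worker: dict

    def as_bundles(self):
        # One bundle per worker; a real PGF would also carry the placement
        # strategy, which this toy omits.
        return [dict(self.resources_per_worker) for _ in range(self.num_workers)]

    def __eq__(self, other):
        # Compare via the underlying bundles, mirroring comparison via PGF.
        return self.as_bundles() == other.as_bundles()

    @classmethod
    def from_bundles(cls, bundles):
        # Assumes homogeneous bundles, as produced by a resource update.
        return cls(num_workers=len(bundles), resources_per_worker=dict(bundles[0]))
```

With this shape, a config rebuilt from the bundles a scheduler assigned compares equal to an equivalent hand-written config, which is the property the train loop relies on.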
Why are these changes needed?

This PR ensures that the new trial resources set by `ResourceChangingScheduler` are respected by the train loop logic by modifying the scaling config to match. Previously, even though trials had their resources updated, the scaling config was not modified, which led to, e.g., new workers not being spawned in the `DataParallelTrainer` even though resources were available.

In order to accomplish this, `ScalingConfigDataClass` is updated to allow equality comparisons with other `ScalingConfigDataClass`es (using the underlying PGF) and to create a `ScalingConfigDataClass` from a PGF.

Please note that this is an internal-only change intended to actually make `ResourceChangingScheduler` work. In the future, `ResourceChangingScheduler` should be updated to operate on `ScalingConfigDataClass`es instead of PGFs as it is now. That will require a deprecation cycle.

Related issue number
Closes #26130
Checks

- I've run scripts/format.sh to lint the changes in this PR.