[train+tune] Refactor restoration configuration to be centered around storage_path
#42853
Conversation
Signed-off-by: Justin Yu <[email protected]>
Overall looks good to me, thanks for the much cleaner API. The main consideration is whether we expose this publicly or keep it internal.
Members:
    RESUME: Resume from the latest checkpoint.
    RESTART: Restart from the beginning (with no checkpoint).
    IGNORE: Ignore this trial.
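The quoted docstring describes an enum of per-trial resume behaviors. A minimal self-contained sketch of what such an enum could look like, using the member names from the quote (an illustration, not Ray's actual class):

```python
from enum import Enum

class ResumeType(Enum):
    """How to treat an individual trial when restoring an experiment."""
    RESUME = "RESUME"    # Resume from the latest checkpoint.
    RESTART = "RESTART"  # Restart from the beginning (with no checkpoint).
    IGNORE = "IGNORE"    # Ignore this trial (later renamed to SKIP in this thread).
```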
nit: Upon first read I wasn't sure what "ignore" meant, but not sure off the top of my head if there's a clearer word we can use here.
Same feeling for me. `SKIP` might be better?
Updated to `SKIP`.
Great simplification with the `_ResumeConfig` API! Left some comments.
python/ray/tune/tune.py (outdated)
@@ -259,7 +304,7 @@ def run(
     max_failures: int = 0,
     fail_fast: bool = False,
     restore: Optional[str] = None,
-    resume: Union[bool, str] = False,
+    resume_config: Optional[_ResumeConfig] = None,
Should we just make it a public API? Users will always use this config but cannot find it in our Ray API docs.
Decision: keep it as a `DeveloperAPI` for now, and expose it as a "hidden" experimental argument of `Tuner.restore`. Ideally, we wouldn't have this, but there is the dependency of `Trainer.restore` on `Tuner.restore` for now 😢
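The dependency described above can be sketched with stubs: `Trainer.restore` calls into `Tuner.restore`, which is why the private `_resume_config` argument has to exist for now. Class and argument names follow the thread; the method bodies are hypothetical illustrations, not Ray's internals:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class _ResumeConfig:
    """Kept as a DeveloperAPI rather than a public class (per the decision above)."""
    errored: str = "RESUME"  # placeholder for a ResumeType enum member

class Tuner:
    @classmethod
    def restore(cls, path: str,
                _resume_config: Optional[_ResumeConfig] = None) -> "Tuner":
        # "_resume_config" is the hidden, experimental argument mentioned above.
        tuner = cls()
        tuner.resume_config = _resume_config or _ResumeConfig()
        return tuner

class Trainer:
    @classmethod
    def restore(cls, path: str) -> Tuner:
        # Trainer.restore relies on Tuner.restore internally, forwarding a
        # resume config of its own choosing.
        return Tuner.restore(path, _resume_config=_ResumeConfig(errored="RESTART"))
```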
Nice work!
Merge commit: [train+tune] Refactor restoration configuration to be centered around `storage_path` (ray-project#42853). Simplify the restoration logic by using the `ResumeConfig` internally to determine how to treat finished, errored, and unfinished trials. There are no more "LOCAL", "REMOTE", or "PROMPT" modes of resuming.
Why are these changes needed?
This PR cleans up restoration logic in Tune.
Context

Previously, we had `local_dir` and `upload_dir` as two separate configurations for where to store experiment results. Results would first go to `local_dir`, then to `upload_dir`. (Checkpoints were the exception, which had a special codepath to upload directly to cloud.)

Now, the `storage_path` is the only location that users can expect all their run data to go to. The local directory just serves as a staging ground to eventually copy to storage. This is an implementation detail that ideally doesn't need to exist.

At the moment, it's not actually possible to recover successfully from the "local directory" (set by the `RAY_AIR_LOCAL_CACHE_DIR` environment variable) when a storage path is set. Checkpoints are uploaded directly to the storage path, so restoring from the local staging directory will result in trials starting from scratch.

Therefore, the `resume="REMOTE"` and `resume="LOCAL"` configurations are outdated artifacts of the old (<2.7) checkpointing/storage implementation. `resume="AUTO"` is now the only codepath, which greatly simplifies the restoration code.

Change Summary

Given this context, these are the main changes from this PR:
- Resume string configurations joined by `"+"` are replaced by a structured config. (Ex: `AUTO+RESTART_ERRORED` --> `ResumeConfig(errored=ResumeConfig.ResumeType.RESTART)`.)
- Added a `ResumeType.IGNORE` setting that can be used to only restart/resume subsets of trials based on their statuses.
- (Using `resume_from_checkpoint` to start a new experiment is what we recommend), but it is a common user ask to be able to continue the run, while still tracking the top K checkpoints.

API Change Summary
- `Tuner.restore(..., _resume_config)` is now possible, but this change is backwards compatible, since `Tuner.restore(resume_errored, ...)` etc. are still possible.
- `tune.run(resume_config)` is now the simpler alternative to `tune.run(resume)`, and some legacy resume strings have been hard-deprecated. There is no hard API change here though.
- `TuneController(resume)` is now `TuneController(resume_config)`.
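The replacement of "+"-joined resume strings can be illustrated with a small translation sketch. The field names `unfinished`/`errored`/`finished` are assumptions based on the merge message's "finished, errored, and unfinished trials", and only one legacy string is handled; this is not Ray's actual implementation:

```python
from dataclasses import dataclass
from enum import Enum

class ResumeType(Enum):
    RESUME = "RESUME"
    RESTART = "RESTART"
    IGNORE = "IGNORE"

@dataclass
class ResumeConfig:
    # One policy per trial status found in the restored experiment.
    unfinished: ResumeType = ResumeType.RESUME
    errored: ResumeType = ResumeType.RESUME
    finished: ResumeType = ResumeType.RESUME

def from_legacy_string(resume: str) -> ResumeConfig:
    """Translate a legacy string like "AUTO+RESTART_ERRORED" (illustrative only)."""
    config = ResumeConfig()
    if "RESTART_ERRORED" in resume:
        config.errored = ResumeType.RESTART
    return config
```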
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- …method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.