[train+tune] Local directory refactor (1/n): Write launcher state files (`tuner.pkl`, `trainer.pkl`) directly to storage #43369
Conversation
Signed-off-by: Justin Yu <[email protected]>
```python
        self._run_config.name
        or StorageContext.get_experiment_dir_name(self.converted_trainable)
    )
    storage = StorageContext(
```
Should we be trying to instantiate this `StorageContext` only in one place and then pass it around?
This is hard to do since we'd need to pass the context through from `BaseTrainer.fit` to the public `Tuner` interface.

One alternative we discussed is having a global storage context, but that'd also require `get_or_create_storage_context` logic to account for users coming in through any of the 3 entrypoints (tuner, trainer, `tune.run`).

I will clarify that the usage here is just to access the path and the filesystem, so that we don't re-implement that logic.
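For reference, the "global storage context" alternative discussed here could look roughly like the following. This is a hypothetical sketch, not Ray's actual API: the class shape, the `get_or_create_storage_context` helper, and its parameters are all illustrative.

```python
import threading

class StorageContext:
    """Illustrative stand-in for Ray's storage context (not the real class)."""

    def __init__(self, storage_path: str, experiment_dir_name: str):
        self.storage_path = storage_path
        self.experiment_dir_name = experiment_dir_name

# Process-wide singleton shared by all entrypoints (tuner, trainer, tune.run).
_storage_context = None
_lock = threading.Lock()

def get_or_create_storage_context(
    storage_path: str, experiment_dir_name: str
) -> StorageContext:
    """Return the existing context if one was already created, else create it.

    Whichever entrypoint runs first wins; later callers get the same object,
    so the experiment directory is only ever resolved once per process.
    """
    global _storage_context
    with _lock:
        if _storage_context is None:
            _storage_context = StorageContext(storage_path, experiment_dir_name)
        return _storage_context
```

The trade-off this comment thread lands on is that a singleton like this avoids threading the context through every call signature, but it introduces hidden global state, which is why the PR keeps explicit instantiation instead.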
```diff
@@ -713,7 +713,7 @@ def create_trainable_with_params():
         )
         return trainable_with_params

-    exp_name = "restore_with_params"
+    exp_name = f"restore_with_params-{use_function_trainable=}"
```
This change was required to avoid a bug where:

- Both tests would use the same experiment name and save things to the same driver staging dir.
- `tuner.pkl` gets downloaded to the staging dir when the test runs the first time with `use_function_trainable=True`.
- Then, when the second run happens, it writes the correct `tuner.pkl` to storage, but then does another driver sync where the old `tuner.pkl` that was downloaded gets uploaded again.
- This caused an error when restoring the 2nd run.

This problem gets fixed in the follow-up PR by using unique staging directories for each new Ray Train experiment.
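The fix in the diff above leans on Python's f-string `=` specifier (Python 3.8+), which embeds both the expression text and its value, so each test parametrization gets a distinct experiment name. A minimal illustration (the helper function is just for demonstration):

```python
def make_exp_name(use_function_trainable: bool) -> str:
    # The {expr=} form expands to "expr=<value>", so the two
    # parametrizations of the test produce different names.
    return f"restore_with_params-{use_function_trainable=}"

print(make_exp_name(True))   # restore_with_params-use_function_trainable=True
print(make_exp_name(False))  # restore_with_params-use_function_trainable=False
```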
Users should never run Tune twice in the same job? I think this will no longer be a problem after your second PR, which separates driver staging directories with a timestamp.
Looks good to me. Left some minor comments.
```diff
@@ -42,14 +42,6 @@
 _SELF = "self"

-_TUNER_FAILED_MSG = (
```
Why do we no longer send this failure message?
This only caught `TuneError`, which only gets raised from `tune.run(raise_on_failed_trial=True)`. However, the Tuner always passes `False` for this flag, so this error message never actually gets printed.

See `ray/python/ray/tune/impl/tuner_internal.py`, line 594 in 93d8d96: `raise_on_failed_trial=False,`
```diff
@@ -302,7 +302,6 @@ def test_run_config_in_trainer_and_tuner(
         assert not (tmp_path / "trainer").exists()
         assert both_msg not in caplog.text
     else:
-        assert tuner._local_tuner.get_run_config() == RunConfig()
```
Is it because we inject a default storage context into the run config if not specified?
This is because I set the `run_config.name` in the trainer, then pass that along to the tuner. So, the tuner's run config is no longer the default `RunConfig`. This is so that we don't generate the experiment name multiple times (which could lead to different folders being used by the trainer vs. the tuner).
…es (`tuner.pkl`, `trainer.pkl`) directly to storage (ray-project#43369)

This PR updates `Trainer`s and the `Tuner` to upload their state directly to `storage_path`, rather than dumping it in a local directory and relying on driver syncing to upload.

Signed-off-by: Justin Yu <[email protected]>
Why are these changes needed?

This PR updates `Trainer`s and the `Tuner` to upload their state directly to `storage_path`, rather than dumping it in a local directory and relying on driver syncing to upload.

This removes the dependency of these pkl files on `_get_defaults_results_dir`, which is the folder that defaults to `~/ray_results` and is set by environment variables.

TODOs for follow-up PRs

There's currently an issue where multiple runs with the same `RunConfig(name)` can produce conflicting versions of `tuner.pkl`. This can lead to a bug where the files needed to restore a run can be overwritten by a mismatched version. See #43369 (comment) for more details.

Related issue number
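The core change described in this PR can be sketched as follows: instead of pickling the launcher state into a local results directory and waiting for a driver sync to upload it, the state is serialized and written under the storage path in one step. This is a stdlib-only illustration of the pattern; the function name and signature are hypothetical (Ray's actual implementation goes through the `StorageContext` and its filesystem abstraction, which also supports remote storage).

```python
import pickle
from pathlib import Path

def save_launcher_state(
    state: dict,
    storage_path: str,
    experiment_dir_name: str,
    filename: str = "tuner.pkl",
) -> Path:
    """Write the pickled launcher state directly under the storage path,
    skipping any intermediate local staging directory + driver sync."""
    experiment_dir = Path(storage_path) / experiment_dir_name
    experiment_dir.mkdir(parents=True, exist_ok=True)
    target = experiment_dir / filename
    target.write_bytes(pickle.dumps(state))
    return target
```

Writing directly like this is what removes the dependency on the default local results directory: the only path that matters is `storage_path` itself.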
Checks

- I've signed off every commit (by using the `-s` flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- …method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.