[tune] Fix BOHB example for new storage #38983

krfricke · 2023-08-28T13:25:35Z

Why are these changes needed?

The new storage path no longer creates "empty" checkpoints by default. Previously, when no checkpoint had been saved, PAUSEing a trial would create a dummy checkpoint containing only trial metadata (such as the iteration number). This is no longer the case.

Examples now have to implement checkpointing to properly restore previous state. This was also true previously, but some of our simple examples (e.g. the one in this PR) didn't implement it and still "worked".

I think it's fine to keep the functionality as is and require our examples to show checkpointing implementations. This will ensure that users don't shoot themselves in the foot when trying to use e.g. BOHB.
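
As a rough illustration of what "implementing checkpointing" means for a class-based example, the sketch below saves and restores an iteration counter. It is a plain Python stand-in, not the example changed in this PR: `CounterTrainable`, `pause_and_resume` and `state.json` are made-up names, and in real use the class would subclass `ray.tune.Trainable`, whose `save_checkpoint`/`load_checkpoint` hooks it mirrors.

```python
import json
import os
import tempfile


class CounterTrainable:
    """Stand-in for a class trainable (hypothetical; does not import Ray)."""

    def __init__(self):
        self.iteration = 0

    def step(self):
        # One unit of "training": here we only advance a counter.
        self.iteration += 1
        return {"iteration": self.iteration}

    def save_checkpoint(self, checkpoint_dir: str):
        # Persist everything needed to resume after a PAUSE.
        with open(os.path.join(checkpoint_dir, "state.json"), "w") as f:
            json.dump({"iteration": self.iteration}, f)

    def load_checkpoint(self, checkpoint_dir: str):
        # Restore the state written by save_checkpoint.
        with open(os.path.join(checkpoint_dir, "state.json")) as f:
            self.iteration = json.load(f)["iteration"]


def pause_and_resume(steps_before: int, steps_after: int) -> int:
    """Run some steps, checkpoint, restore into a fresh object, run more."""
    first = CounterTrainable()
    for _ in range(steps_before):
        first.step()
    with tempfile.TemporaryDirectory() as checkpoint_dir:
        first.save_checkpoint(checkpoint_dir)
        resumed = CounterTrainable()
        resumed.load_checkpoint(checkpoint_dir)
    for _ in range(steps_after):
        resumed.step()
    return resumed.iteration
```

In this sketch, dropping the save/load pair would make `resumed` start counting from 0 again after the pause.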

Separately, BOHB was malfunctioning: trials were repeatedly PAUSED and restarted because they were never removed from bracket.trials_to_unpause. @justinvyu mentioned this in the review where it was introduced, and I believed at the time it wasn't necessary. It turns out it is, as we can end up in a situation where a bracket never finishes because trials are constantly running. This was not caught by any tests. We should add one in a follow-up; for now we can proceed with this PR so it can be picked onto Ray 2.7.
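
The pause/restart loop can be reproduced with a toy model. The sketch below is not the scheduler code from this PR: `ToyTrial`, `ToyBracket`, `choose_trial_to_run` and `count_resumes` are hypothetical names, and only the `trials_to_unpause` attribute is taken from the description above. It shows how a trial that stays in the set keeps being resumed, while removing it on unpause stops the loop.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # eq=False keeps identity-based hashing, so trials fit in a set
class ToyTrial:
    name: str
    status: str = "PAUSED"


@dataclass
class ToyBracket:
    trials_to_unpause: set = field(default_factory=set)


def choose_trial_to_run(bracket, trials, remove_on_unpause=True):
    """Return the next paused trial that the bracket wants resumed, or None."""
    for trial in trials:
        if trial.status == "PAUSED" and trial in bracket.trials_to_unpause:
            if remove_on_unpause:
                # The fix: forget the trial once it has been unpaused.
                bracket.trials_to_unpause.discard(trial)
            trial.status = "RUNNING"
            return trial
    return None


def count_resumes(remove_on_unpause, rounds=5):
    """Count how often one trial is resumed over several scheduling rounds."""
    trial = ToyTrial("t0")
    bracket = ToyBracket({trial})
    resumes = 0
    for _ in range(rounds):
        if choose_trial_to_run(bracket, [trial], remove_on_unpause):
            resumes += 1
        # The scheduler pauses the trial again at its next milestone.
        trial.status = "PAUSED"
    return resumes
```

With `remove_on_unpause=False` the same trial is resumed in every round, which corresponds to the bracket never finishing.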

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: Kai Fricke <[email protected]>

justinvyu

Thanks! 1 suggestion

python/ray/tune/execution/tune_controller.py

justinvyu · 2023-08-28T18:49:37Z

python/ray/tune/examples/tf_mnist_example.py

+     def save_checkpoint(self, checkpoint_dir: str):
+         return None
+
+     def load_checkpoint(self, checkpoint):
+         return None


I see -- is this the only place that needs to be updated?

I'm actually surprised this is not happening in more cases (and that it didn't break before): class trainables default to checkpoint_at_end=True, and this can raise a NotImplementedError.

I think there are two options to solve this: 1) default to False even for class trainables, or 2) only set True if checkpointing is implemented (i.e. the method is overridden).

Looked into this, and it turns out the default checkpoint_at_end is actually None.

Only in the Tuner do we set it to True for class trainables. In tune.run it just stays as None and no checkpoint happens at the end.

from ray import tune

class Test(tune.Trainable):
    def step(self):
        return {"done": True}

tune.run(Test)  # works
tune.Tuner(Test).fit()  # errors
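
Option 2 from the discussion above (only enable `checkpoint_at_end` when checkpointing is implemented) could be sketched roughly as follows. This is a guess at the shape of such a check, not code from Ray: `BaseTrainable`, `implements_checkpointing` and `resolve_checkpoint_at_end` are hypothetical names, and the base class only stands in for `tune.Trainable`.

```python
class BaseTrainable:
    """Stand-in base class (hypothetical; not Ray's Trainable)."""

    def save_checkpoint(self, checkpoint_dir):
        raise NotImplementedError


def implements_checkpointing(trainable_cls) -> bool:
    # A subclass that overrides the method exposes a different function object.
    return trainable_cls.save_checkpoint is not BaseTrainable.save_checkpoint


def resolve_checkpoint_at_end(trainable_cls, checkpoint_at_end=None) -> bool:
    # Respect an explicit user setting; otherwise enable only if implemented.
    if checkpoint_at_end is not None:
        return checkpoint_at_end
    return implements_checkpointing(trainable_cls)


class NoCheckpoint(BaseTrainable):
    pass


class WithCheckpoint(BaseTrainable):
    def save_checkpoint(self, checkpoint_dir):
        return None
```

Here `resolve_checkpoint_at_end(NoCheckpoint)` resolves to False, so a trainable like `Test` above would not attempt a final checkpoint, while an explicit `checkpoint_at_end=True` is still passed through unchanged.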


justinvyu

LGTM, let's fix that checkpoint_at_end issue 🤯

[tune] Fix BOHB example for new storage

770d4ad

Signed-off-by: Kai Fricke <[email protected]>

krfricke requested review from richardliaw, xwjiang2010, amogkam, matthewdeng, Yard1, maxpumperla and a team as code owners August 28, 2023 13:25

krfricke assigned matthewdeng Aug 28, 2023

krfricke added the Ray 2.7 label Aug 28, 2023

Remove from trials_to_unpause

78d4e66

Signed-off-by: Kai Fricke <[email protected]>

krfricke assigned justinvyu Aug 28, 2023

fix

f973f84

Signed-off-by: Kai Fricke <[email protected]>

justinvyu reviewed Aug 28, 2023

matthewdeng added the v2.7.0-pick label Aug 28, 2023

Kai Fricke added 2 commits August 29, 2023 10:10

Merge remote-tracking branch 'upstream/master' into tune/bohb-example-fix

cf5f61d

trial.is_saving

ce3f477

Signed-off-by: Kai Fricke <[email protected]>

krfricke added the tests-ok label ("The tagger certifies test failures are unrelated and assumes personal liability.") Aug 29, 2023

justinvyu approved these changes Aug 29, 2023

matthewdeng approved these changes Aug 29, 2023

krfricke merged commit 2c5d354 into ray-project:master Aug 29, 2023
2 checks passed

krfricke deleted the tune/bohb-example-fix branch August 29, 2023 17:06
