This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Checkpoint recovery refactoring #439

Merged
merged 39 commits into from
Apr 21, 2021

Conversation

melanibe
Contributor

@melanibe melanibe commented Apr 19, 2021

Changes the way we handle checkpoints for run recoveries, for the following use cases:

  1. If AML fails for an unknown reason and we want to (manually) restart the run: provide the --run_recovery_id=old_run_id argument -> this downloads all checkpoints available in the old run's checkpoint folder and copies them into the new run's checkpoint folder.
  2. If an AML job gets preempted, the job automatically restarts with the file system in the state it was in when interrupted -> we can find the last recovery checkpoint in the local checkpoint folder and pass it to PL as the recovery checkpoint. This avoids restarting training from scratch after a preemption.
  3. For some use cases, we may want to keep more than one recovery checkpoint -> allow the user to specify the number of recovery checkpoints to keep.
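
The three use cases above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code from this PR: the `recovery_epoch=N.ckpt` file naming and all function names are hypothetical, chosen only to make the example self-contained.

```python
import re
import shutil
from pathlib import Path
from typing import List, Optional

# Hypothetical naming scheme, for illustration only; the file names used in the PR may differ.
RECOVERY_PATTERN = re.compile(r"^recovery_epoch=(\d+)\.ckpt$")


def list_recovery_checkpoints(folder: Path) -> List[Path]:
    """Return the recovery checkpoints in `folder`, sorted from oldest to newest epoch."""
    if not folder.is_dir():
        return []
    found = []
    for path in folder.iterdir():
        match = RECOVERY_PATTERN.match(path.name)
        if match:
            found.append((int(match.group(1)), path))
    # Sort numerically by epoch so that epoch 10 comes after epoch 4.
    return [path for _, path in sorted(found, key=lambda item: item[0])]


def latest_recovery_checkpoint(folder: Path) -> Optional[Path]:
    """Use case 2: after a restart, pick the newest recovery checkpoint left on disk, if any."""
    checkpoints = list_recovery_checkpoints(folder)
    return checkpoints[-1] if checkpoints else None


def prune_recovery_checkpoints(folder: Path, keep: int) -> List[Path]:
    """Use case 3: delete all but the `keep` newest recovery checkpoints; return the deleted paths."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    stale = list_recovery_checkpoints(folder)[:-keep]
    for path in stale:
        path.unlink()
    return stale


def copy_checkpoints(downloaded: Path, target: Path) -> None:
    """Use case 1: copy every checkpoint downloaded from the old run into the new run's folder."""
    target.mkdir(parents=True, exist_ok=True)
    for path in downloaded.glob("*.ckpt"):
        shutil.copy2(path, target / path.name)
```

With helpers like these, a restarted job would first call `latest_recovery_checkpoint` on its local checkpoint folder and only start from scratch when it returns `None`.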

This PR also removes the deprecated start_epoch config argument, as PL automatically restores the epoch, optimizer, and LR scheduler state from the loaded recovery checkpoint.
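
To illustrate why a separate start_epoch argument becomes redundant, the sketch below inspects a checkpoint dictionary. It assumes the top-level keys that PL checkpoint files commonly contain (`epoch`, `global_step`, `optimizer_states`, `lr_schedulers`); exact keys can vary between PL versions. The function name is hypothetical, and a plain dict stands in for a checkpoint that would normally be loaded from disk.

```python
from typing import Any, Dict


def summarize_resume_state(checkpoint: Dict[str, Any]) -> Dict[str, Any]:
    """Summarize the training state stored in a checkpoint dictionary.

    The key names are an assumption about the PL checkpoint layout, not taken from this PR.
    """
    return {
        "epoch": checkpoint.get("epoch"),
        "global_step": checkpoint.get("global_step"),
        "num_optimizers": len(checkpoint.get("optimizer_states", [])),
        "num_lr_schedulers": len(checkpoint.get("lr_schedulers", [])),
    }


# Stand-in for a loaded recovery checkpoint (values are made up for the example).
example_checkpoint = {
    "epoch": 7,
    "global_step": 1400,
    "state_dict": {},
    "optimizer_states": [{"state": {}, "param_groups": []}],
    "lr_schedulers": [{"last_epoch": 7}],
}
summary = summarize_resume_state(example_checkpoint)
print(summary)
```

Because the epoch is stored alongside the optimizer and scheduler state, the checkpoint alone is enough to resume where training stopped.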

@melanibe melanibe changed the title Melanibe/checkpoint saving Checkpoint recovery refractoring Apr 19, 2021
@melanibe melanibe marked this pull request as ready for review April 20, 2021 10:11
@melanibe melanibe requested review from ant0nsc and Shruthi42 and removed request for ant0nsc April 20, 2021 10:16
@melanibe
Contributor Author

Closes #311

InnerEye/ML/deep_learning_config.py (outdated, resolved)
InnerEye/ML/common.py (outdated, resolved)
InnerEye/ML/common.py (outdated, resolved)
InnerEye/ML/deep_learning_config.py (outdated, resolved)
InnerEye/ML/model_training.py (resolved)
InnerEye/ML/utils/checkpoint_handling.py (resolved)
InnerEye/ML/utils/run_recovery.py (resolved)
Tests/ML/models/test_scalar_model.py (outdated, resolved)
Tests/ML/models/test_scalar_model.py (resolved)
Tests/ML/test_model_training.py (resolved)
@ant0nsc ant0nsc changed the title Checkpoint recovery refractoring Checkpoint recovery refactoring Apr 20, 2021
@melanibe melanibe enabled auto-merge (squash) April 21, 2021 09:44
@melanibe melanibe merged commit adffa95 into main Apr 21, 2021
@melanibe melanibe deleted the melanibe/checkpoint-saving branch April 21, 2021 14:40

3 participants