Checkpoint recovery refactoring (#439)
* Add auto-restart

* Change handling of checkpoints and clean-up

* Save last k recovery checkpoints

* Log epoch for keeping last ckpt

* Keeping k last checkpoints

* Add possibility to recover from particular checkpoint

* Update tests

* Check k recovery

* Re-add skipif

* Correct pick up of recovery runs and add test

* Correct pick up of recovery runs and add test

* Remove all start epochs

* Remove all start epochs

* Simplify run recovery logic

* Fix it

* Fix merge-conflict import errors

* Fix it

* Fix tests in test_scalar_model.py

* Fix tests in test_model_util.py

* Fix tests in test_scalar_model.py

* Fix tests in test_model_training.py

* Avoid forcing the user to log epoch

* Fix test_get_checkpoints

* Fix test_checkpoint_handling.py

* Fix callback

* Update CHANGELOG.md

* Self PR review comments

* Fix more tests

* Fix argument in test

* Mypy

* Update InnerEye-DeepLearning.iml

* Update InnerEye-DeepLearning.iml

* Fix mypy errors

* Address PR comment

* Typo

* mypy fix

* just style
melanibe committed Apr 21, 2021
1 parent f421234 commit adffa95
Showing 24 changed files with 265 additions and 257 deletions.
2 changes: 1 addition & 1 deletion .idea/InnerEye-DeepLearning.iml

Some generated files are not rendered by default.

10 changes: 9 additions & 1 deletion CHANGELOG.md
@@ -16,7 +16,6 @@ created.
- ([#417](https://github.com/microsoft/InnerEye-DeepLearning/pull/417)) Added a generic way of adding PyTorch Lightning
models to the toolbox. It is now possible to train almost any Lightning model with the InnerEye toolbox in AzureML,
with only minimum code changes required. See [the MD documentation](docs/bring_your_own_model.md) for details.
- ([#438](https://github.com/microsoft/InnerEye-DeepLearning/pull/438)) Add links and small docs to InnerEye-Gateway and InnerEye-Inference
- ([#430](https://github.com/microsoft/InnerEye-DeepLearning/pull/430)) Update conversion to 1.0.1 InnerEye-DICOM-RT to
add: manufacturer, SoftwareVersions, Interpreter and ROIInterpretedTypes.
- ([#385](https://github.com/microsoft/InnerEye-DeepLearning/pull/385)) Add the ability to train a model on multiple
@@ -48,6 +47,10 @@ with only minimum code changes required. See [the MD documentation](docs/bring_y
- ([#405](https://github.com/microsoft/InnerEye-DeepLearning/pull/405)) Cross-validation runs for classification models
now also generate a report notebook summarising the metrics from the individual splits. Also includes minor formatting
improvements for standard classification reports.
- ([#438](https://github.com/microsoft/InnerEye-DeepLearning/pull/438)) Add links and small docs to InnerEye-Gateway and InnerEye-Inference
- ([#439](https://github.com/microsoft/InnerEye-DeepLearning/pull/439)) Enable automatic job recovery from the last recovery
checkpoint in case of job pre-emption on AML. Users can also choose to keep more than one recovery
checkpoint.
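  A minimal usage sketch (illustrative values; the new parameter belongs to `TrainerParams`, so any InnerEye
  model config accepts it):

  ```python
  from InnerEye.ML.configs.segmentation.HelloWorld import HelloWorld

  # Save a recovery checkpoint every epoch and keep the last three of them:
  config = HelloWorld(recovery_checkpoint_save_interval=1,
                      recovery_checkpoints_save_last_k=3)
  ```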

### Changed

@@ -62,8 +65,11 @@ with only minimum code changes required. See [the MD documentation](docs/bring_y
end-to-end test for classification cross-validation. WARNING: upgrading the PL version causes multi-node
training to hang.
- ([#437](https://github.com/microsoft/InnerEye-DeepLearning/pull/437)) Upgrade to PyTorch-Lightning 1.2.8.
- ([#439](https://github.com/microsoft/InnerEye-DeepLearning/pull/439)) Recovery checkpoints are now
named `recovery_epoch=x.ckpt` instead of `recovery.ckpt` or `recovery-v0.ckpt`.
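  For example, a run that saved recovery checkpoints at epochs 10 and 20 now produces
  `recovery_epoch=10.ckpt` and `recovery_epoch=20.ckpt`.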

### Fixed

- ([#422](https://github.com/microsoft/InnerEye-DeepLearning/pull/422)) Documentation - clarified `setting_up_aml.md`
datastore creation instructions and fixed small typos in `hello_world_model.md`
- ([#432](https://github.com/microsoft/InnerEye-DeepLearning/pull/432)) Fixed cross-validation for classification
@@ -73,7 +79,9 @@ with only minimum code changes required. See [the MD documentation](docs/bring_y
set, display an error message and terminate the run.
- ([#437](https://github.com/microsoft/InnerEye-DeepLearning/pull/437)) Fixed multi-node DDP bug in PL v1.2.8. Re-add
end-to-end test for multi-node.

### Removed
- ([#439](https://github.com/microsoft/InnerEye-DeepLearning/pull/439)) Deprecated `start_epoch` config argument.

### Deprecated

95 changes: 51 additions & 44 deletions InnerEye/ML/common.py
@@ -4,10 +4,13 @@
# ------------------------------------------------------------------------------------------
import abc
import logging
import re
from datetime import datetime
from enum import Enum, unique
from pathlib import Path
from typing import Any, Dict, List, Optional
from typing import Any, Dict, List, Optional, Tuple

import numpy as np

DATASET_CSV_FILE_NAME = "dataset.csv"
CHECKPOINT_SUFFIX = ".ckpt"
@@ -61,18 +64,16 @@ def get_feature_length(self, column: str) -> int:
raise NotImplementedError("get_feature_length must be implemented by sub classes")


def create_recovery_checkpoint_path(path: Path) -> Path:
def get_recovery_checkpoint_path(path: Path) -> Path:
"""
Returns the file name of a recovery checkpoint in the given folder. Raises a FileNotFoundError if no
Returns the path to the most recent recovery checkpoint in the given folder. Raises a FileNotFoundError if no
recovery checkpoint file is present.
:param path: Path to checkpoint folder
"""
# Recovery checkpoints are written alternately as recovery.ckpt and recovery-v0.ckpt.
best_checkpoint1 = path / f"{RECOVERY_CHECKPOINT_FILE_NAME_WITH_SUFFIX}"
best_checkpoint2 = path / f"{RECOVERY_CHECKPOINT_FILE_NAME}-v0{CHECKPOINT_SUFFIX}"
for p in [best_checkpoint1, best_checkpoint2]:
if p.is_file():
return p
recovery_ckpt_and_epoch = find_recovery_checkpoint_and_epoch(path)
if recovery_ckpt_and_epoch is not None:
return recovery_ckpt_and_epoch[0]
files = list(path.glob("*"))
raise FileNotFoundError(f"No checkpoint files found in {path}. Existing files: {' '.join(p.name for p in files)}")

@@ -85,34 +86,55 @@ def get_best_checkpoint_path(path: Path) -> Path:
return path / BEST_CHECKPOINT_FILE_NAME_WITH_SUFFIX


def keep_latest(path: Path, search_pattern: str) -> Optional[Path]:
def find_all_recovery_checkpoints(path: Path) -> Optional[List[Path]]:
"""
Extracts all files starting with RECOVERY_CHECKPOINT_FILE_NAME in the given folder.
:param path: The folder to search in.
:return: A list of all recovery checkpoint files, or None if there are none.
"""
all_recovery_files = [f for f in path.glob(RECOVERY_CHECKPOINT_FILE_NAME + "*")]
if len(all_recovery_files) == 0:
return None
return all_recovery_files


PathAndEpoch = Tuple[Path, int]


def extract_latest_checkpoint_and_epoch(available_files: List[Path]) -> PathAndEpoch:
"""
Checkpoints are saved as recovery_epoch={epoch}.ckpt; find the latest checkpoint and its epoch number.
:param available_files: all available recovery checkpoints
:return: path to the checkpoint from the latest epoch, and that epoch number
"""
recovery_epochs = [int(re.findall(r"[\d]+", f.stem)[0]) for f in available_files]
idx_max_epoch = int(np.argmax(recovery_epochs))
return available_files[idx_max_epoch], recovery_epochs[idx_max_epoch]
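# Illustrative behaviour (hypothetical file names): given the files
# [Path("recovery_epoch=4.ckpt"), Path("recovery_epoch=12.ckpt")], the regex
# extracts the epochs [4, 12], and the function returns
# (Path("recovery_epoch=12.ckpt"), 12).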


def find_recovery_checkpoint_and_epoch(path: Path) -> Optional[PathAndEpoch]:
"""
Looks at all files that match the given pattern via "glob", and deletes all of them apart from the most
recent file. The surviving file is returned. If there is no single file that matches the search pattern, then
return None.
Looks at all recovery files, extracts the epoch number for each, and returns the checkpoint path with the
highest epoch number along with that epoch. If no recovery checkpoints are found, returns None.
:param path: The folder to start searching in.
:param search_pattern: The glob pattern that specifies the files that should be searched.
:return: None if there is no file matching the search pattern, or a Path object that has the latest file matching
the pattern.
"""
files_and_mod_time = [(f, f.stat().st_mtime) for f in path.glob(search_pattern)]
files_and_mod_time.sort(key=lambda f: f[1], reverse=True)
for (f, _) in files_and_mod_time[1:]:
logging.info(f"Removing file: {f}")
f.unlink()
if files_and_mod_time:
return files_and_mod_time[0][0]
:return: None if there are no recovery checkpoints, or a tuple of the recovery checkpoint path and the
corresponding epoch number.
"""
available_checkpoints = find_all_recovery_checkpoints(path)
if available_checkpoints is not None:
return extract_latest_checkpoint_and_epoch(available_checkpoints)
return None
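# A minimal usage sketch (illustrative; the folder path is hypothetical):
#
#     recovery = find_recovery_checkpoint_and_epoch(Path("outputs/checkpoints"))
#     if recovery is not None:
#         checkpoint_path, epoch = recovery
#         # resume training from checkpoint_path; epochs up to `epoch` are complete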


def keep_best_checkpoint(path: Path) -> Path:
def create_best_checkpoint(path: Path) -> Path:
"""
Clean up all checkpoints that are found in the given folder, and keep only the "best" one. "Best" is at the moment
defined as being the last checkpoint, but could be based on some defined policy. The best checkpoint will be
renamed to `best_checkpoint.ckpt`. All checkpoint files other
than the best will be removed (or an existing checkpoint renamed to be the best checkpoint).
Creates the best checkpoint file. "Best" is at the moment defined as being the last checkpoint, but could be
based on some defined policy.
The best checkpoint will be renamed to `best_checkpoint.ckpt`.
:param path: The folder that contains all checkpoint files.
"""
logging.debug(f"Files in checkpoint folder: {' '.join(p.name for p in path.glob('*'))}")
last_ckpt = path / LAST_CHECKPOINT_FILE_NAME_WITH_SUFFIX
all_files = f"Existing files: {' '.join(p.name for p in path.glob('*'))}"
if not last_ckpt.is_file():
@@ -124,21 +146,6 @@ def keep_best_checkpoint(path: Path) -> Path:
return best


def cleanup_checkpoint_folder(path: Path) -> None:
"""
Removes surplus files from the checkpoint folder, and unifies the names of the files that are kept:
1) Keep only the most recent recovery checkpoint file
2) Chooses the best checkpoint file according to keep_best_checkpoint, and rename it to
BEST_CHECKPOINT_FILE_NAME_WITH_SUFFIX
:param path: The folder containing all model checkpoints.
"""
logging.info(f"Files in checkpoint folder: {' '.join(p.name for p in path.glob('*'))}")
recovery = keep_latest(path, RECOVERY_CHECKPOINT_FILE_NAME + "*")
if recovery:
recovery.rename(path / RECOVERY_CHECKPOINT_FILE_NAME_WITH_SUFFIX)
keep_best_checkpoint(path)


def create_unique_timestamp_id() -> str:
"""
Creates a unique string using the current time in UTC, up to seconds precision, with characters that
1 change: 0 additions & 1 deletion InnerEye/ML/configs/segmentation/BasicModel2Epochs.py
@@ -35,7 +35,6 @@ def __init__(self, **kwargs: Any) -> None:
class_weights=equally_weighted_classes(fg_classes),
num_dataload_workers=1,
train_batch_size=8,
start_epoch=0,
num_epochs=2,
recovery_checkpoint_save_interval=1,
use_mixed_precision=True,
1 change: 0 additions & 1 deletion InnerEye/ML/configs/segmentation/GbmBase.py
@@ -42,7 +42,6 @@ def __init__(self, **kwargs: Any) -> None:
tail=[1.0],
class_weights=equally_weighted_classes(fg_classes),
train_batch_size=8,
start_epoch=0,
num_epochs=200,
l_rate=1e-3,
l_rate_polynomial_gamma=0.9,
1 change: 0 additions & 1 deletion InnerEye/ML/configs/segmentation/HeadAndNeckBase.py
@@ -99,7 +99,6 @@ def __init__(self,
norm_method=PhotometricNormalizationMethod.CtWindow,
level=50,
window=600,
start_epoch=0,
l_rate=1e-3,
min_l_rate=1e-5,
l_rate_polynomial_gamma=0.9,
1 change: 0 additions & 1 deletion InnerEye/ML/configs/segmentation/HelloWorld.py
@@ -58,7 +58,6 @@ def __init__(self, **kwargs: Any) -> None:
# and testing (ie: how many epochs to test)
num_dataload_workers=0,
train_batch_size=2,
start_epoch=0,
num_epochs=2,
recovery_checkpoint_save_interval=1,
use_mixed_precision=True,
1 change: 0 additions & 1 deletion InnerEye/ML/configs/segmentation/Lung.py
@@ -47,7 +47,6 @@ def __init__(self, **kwargs: Any) -> None:
train_batch_size=8,
inference_batch_size=1,
inference_stride_size=(64, 256, 256),
start_epoch=0,
num_epochs=140,
l_rate=1e-3,
min_l_rate=1e-5,
1 change: 0 additions & 1 deletion InnerEye/ML/configs/segmentation/ProstateBase.py
@@ -76,7 +76,6 @@ def __init__(self,
num_epochs=120,
opt_eps=1e-4,
optimizer_type=OptimizerType.Adam,
start_epoch=0,
test_crop_size=(128, 512, 512),
train_batch_size=2,
use_mixed_precision=True,
5 changes: 2 additions & 3 deletions InnerEye/ML/configs/unit_testing/passthrough_model.py
@@ -4,20 +4,20 @@
# ------------------------------------------------------------------------------------------
import random
from typing import Any, List

import numpy as np
import pandas as pd
import torch
from torch.nn.parameter import Parameter

from InnerEye.Common.type_annotations import TupleInt3
from InnerEye.ML.config import equally_weighted_classes, ModelArchitectureConfig, SegmentationModelBase
from InnerEye.ML.config import ModelArchitectureConfig, SegmentationModelBase, equally_weighted_classes
from InnerEye.ML.configs.segmentation.Lung import AZURE_DATASET_ID
from InnerEye.ML.models.architectures.base_model import BaseSegmentationModel
from InnerEye.ML.models.parallel.model_parallel import get_device_from_parameters, move_to_device
from InnerEye.ML.utils.model_metadata_util import generate_random_colours_list
from InnerEye.ML.utils.split_dataset import DatasetSplits


RANDOM_COLOUR_GENERATOR = random.Random(0)
RECTANGLE_STROKE_THICKNESS = 3

@@ -48,7 +48,6 @@ def __init__(self, **kwargs: Any) -> None:
inference_batch_size=1,
class_weights=equally_weighted_classes(fg_classes, background_weight=0.02),
feature_channels=[1],
start_epoch=0,
num_epochs=1,
# Necessary to avoid https://github.com/pytorch/pytorch/issues/45324
max_num_gpus=1,
18 changes: 10 additions & 8 deletions InnerEye/ML/deep_learning_config.py
@@ -19,9 +19,8 @@
from InnerEye.Common.fixed_paths import DEFAULT_AML_UPLOAD_DIR, DEFAULT_LOGS_DIR_NAME
from InnerEye.Common.generic_parsing import CudaAwareConfig, GenericConfig
from InnerEye.Common.type_annotations import PathOrString, TupleFloat2
from InnerEye.ML.common import DATASET_CSV_FILE_NAME, ModelExecutionMode, \
create_recovery_checkpoint_path, create_unique_timestamp_id, \
get_best_checkpoint_path
from InnerEye.ML.common import DATASET_CSV_FILE_NAME, ModelExecutionMode, create_unique_timestamp_id, \
get_best_checkpoint_path, get_recovery_checkpoint_path

# A folder inside of the outputs folder that will contain all information for running the model in inference mode
FINAL_MODEL_FOLDER = "final_model"
@@ -352,7 +351,7 @@ def get_path_to_checkpoint(self) -> Path:
"""
Returns the full path to a recovery checkpoint.
"""
return create_recovery_checkpoint_path(self.checkpoint_folder)
return get_recovery_checkpoint_path(self.checkpoint_folder)

def get_path_to_best_checkpoint(self) -> Path:
"""
@@ -435,6 +434,11 @@ class TrainerParams(CudaAwareConfig):
doc="Save epoch checkpoints when epoch number is a multiple "
"of recovery_checkpoint_save_interval. The intended use "
"is to allow restore training from failed runs.")
recovery_checkpoints_save_last_k: int = param.Integer(default=1, bounds=(-1, None),
doc="Number of recovery checkpoints to keep. Recovery "
"checkpoints will be stored as recovery_epoch:{"
"epoch}.ckpt. If set to -1 keep all recovery "
"checkpoints.")
detect_anomaly: bool = param.Boolean(False, doc="If true, test gradients for anomalies (NaN or Inf) during "
"training.")
use_mixed_precision: bool = param.Boolean(False, doc="If true, mixed precision training is activated during "
@@ -454,9 +458,6 @@
doc="Controls the PyTorch Lightning trainer flags 'deterministic' and 'benchmark'. If "
"'pl_deterministic' is True, results are perfectly reproducible. If False, they are not, but "
"you may see training speed increases.")
start_epoch: int = param.Integer(0, bounds=(0, None), doc="The first epoch to train. Set to 0 to start a new "
"training. Set to a value larger than zero for starting"
" from a checkpoint.")


class DeepLearningConfig(WorkflowParams,
Expand Down Expand Up @@ -546,6 +547,7 @@ def __init__(self, **params: Any) -> None:
# This should be annotated as torch.utils.data.Dataset, but we don't want to import torch here.
self._datasets_for_training: Optional[Dict[ModelExecutionMode, Any]] = None
self._datasets_for_inference: Optional[Dict[ModelExecutionMode, Any]] = None
self.recovery_start_epoch = 0
super().__init__(throw_if_unknown_param=True, **params)
logging.info("Creating the default output folder structure.")
self.create_filesystem(fixed_paths.repository_root_directory())
Expand Down Expand Up @@ -609,7 +611,7 @@ def get_train_epochs(self) -> List[int]:
Returns the epochs for which training will be performed.
:return:
"""
return list(range(self.start_epoch + 1, self.num_epochs + 1))
return list(range(self.recovery_start_epoch + 1, self.num_epochs + 1))
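# Illustrative: with recovery_start_epoch=10 and num_epochs=20, training resumes
# with epochs [11, 12, ..., 20]; a fresh run (recovery_start_epoch=0) trains
# epochs [1, ..., num_epochs].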

def get_total_number_of_training_epochs(self) -> int:
"""
4 changes: 2 additions & 2 deletions InnerEye/ML/lightning_base.py
@@ -20,8 +20,8 @@
from InnerEye.Common.type_annotations import DictStrFloat
from InnerEye.ML.common import ModelExecutionMode
from InnerEye.ML.config import SegmentationModelBase
from InnerEye.ML.deep_learning_config import DatasetParams, DeepLearningConfig, WorkflowParams, OutputParams, \
TrainerParams
from InnerEye.ML.deep_learning_config import DatasetParams, DeepLearningConfig, OutputParams, TrainerParams, \
WorkflowParams
from InnerEye.ML.lightning_container import LightningContainer
from InnerEye.ML.lightning_loggers import StoringLogger
from InnerEye.ML.metrics import EpochTimers, MAX_ITEM_LOAD_TIME_SEC, store_epoch_metrics
