Checkpoint recovery refactoring (#439)
* Add auto-restart

* Change handling of checkpoints and clean-up

* Save last k recovery checkpoints

* Log epoch for keeping last ckpt

* Keeping k last checkpoints

* Add possibility to recover from particular checkpoint

* Update tests

* Check k recovery

* Re-add skipif

* Correct pick up of recovery runs and add test

* Correct pick up of recovery runs and add test

* Remove all start epochs

* Remove all start epochs

* Simplify run recovery logic

* Fix it

* Fix merge-conflict import errors

* Fix it

* Fix tests in test_scalar_model.py

* Fix tests in test_model_util.py

* Fix tests in test_scalar_model.py

* Fix tests in test_model_training.py

* Avoid forcing the user to log epoch

* Fix test_get_checkpoints

* Fix test_checkpoint_handling.py

* Fix callback

* Update CHANGELOG.md

* Self PR review comments

* Fix more tests

* Fix argument in test

* Mypy

* Update InnerEye-DeepLearning.iml

* Update InnerEye-DeepLearning.iml

* Fix mypy errors

* Address PR comment

* Typo

* mypy fix

* just style
melanibe committed Apr 21, 2021
1 parent f421234 commit adffa95
Showing 24 changed files with 265 additions and 257 deletions.
2 changes: 1 addition & 1 deletion .idea/InnerEye-DeepLearning.iml

Some generated files are not rendered by default.

10 changes: 9 additions & 1 deletion CHANGELOG.md
@@ -16,7 +16,6 @@ created.
- ([#417](https://github.com/microsoft/InnerEye-DeepLearning/pull/417)) Added a generic way of adding PyTorch Lightning
models to the toolbox. It is now possible to train almost any Lightning model with the InnerEye toolbox in AzureML,
with only minimum code changes required. See [the MD documentation](docs/bring_your_own_model.md) for details.
- ([#438](https://github.com/microsoft/InnerEye-DeepLearning/pull/438)) Add links and small docs to InnerEye-Gateway and InnerEye-Inference
- ([#430](https://github.com/microsoft/InnerEye-DeepLearning/pull/430)) Update conversion to 1.0.1 InnerEye-DICOM-RT to
add: manufacturer, SoftwareVersions, Interpreter and ROIInterpretedTypes.
- ([#385](https://github.com/microsoft/InnerEye-DeepLearning/pull/385)) Add the ability to train a model on multiple
@@ -48,6 +47,10 @@ with only minimum code changes required. See [the MD documentation](docs/bring_y
- ([#405](https://github.com/microsoft/InnerEye-DeepLearning/pull/405)) Cross-validation runs for classification models
now also generate a report notebook summarising the metrics from the individual splits. Also includes minor formatting
improvements for standard classification reports.
- ([#438](https://github.com/microsoft/InnerEye-DeepLearning/pull/438)) Add links and small docs to InnerEye-Gateway and InnerEye-Inference
- ([#439](https://github.com/microsoft/InnerEye-DeepLearning/pull/439)) Enable automatic job recovery from the last recovery
checkpoint in case of job pre-emption on AML. Users can also choose to keep more than one recovery
checkpoint.
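  A minimal usage sketch (illustrative values; the new parameter belongs to `TrainerParams`, so any InnerEye
  model config accepts it):

  ```python
  from InnerEye.ML.configs.segmentation.HelloWorld import HelloWorld

  # Save a recovery checkpoint every epoch and keep the last three of them:
  config = HelloWorld(recovery_checkpoint_save_interval=1,
                      recovery_checkpoints_save_last_k=3)
  ```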

### Changed

@@ -62,8 +65,11 @@ with only minimum code changes required. See [the MD documentation](docs/bring_y
end-to-end test for classification cross-validation. WARNING: upgrading the PL version causes multi-node
training to hang.
- ([#437](https://github.com/microsoft/InnerEye-DeepLearning/pull/437)) Upgrade to PyTorch-Lightning 1.2.8.
- ([#439](https://github.com/microsoft/InnerEye-DeepLearning/pull/439)) Recovery checkpoints are now
named `recovery_epoch=x.ckpt` instead of `recovery.ckpt` or `recovery-v0.ckpt`.
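  For example, a run that saved recovery checkpoints at epochs 10 and 20 now produces
  `recovery_epoch=10.ckpt` and `recovery_epoch=20.ckpt`.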

### Fixed

- ([#422](https://github.com/microsoft/InnerEye-DeepLearning/pull/422)) Documentation - clarified `setting_up_aml.md`
datastore creation instructions and fixed small typos in `hello_world_model.md`
- ([#432](https://github.com/microsoft/InnerEye-DeepLearning/pull/432)) Fixed cross-validation for classification
@@ -73,7 +79,9 @@ with only minimum code changes required. See [the MD documentation](docs/bring_y
set, display an error message and terminate the run.
- ([#437](https://github.com/microsoft/InnerEye-DeepLearning/pull/437)) Fixed multi-node DDP bug in PL v1.2.8. Re-add
end-to-end test for multi-node.

### Removed
- ([#439](https://github.com/microsoft/InnerEye-DeepLearning/pull/439)) Deprecated `start_epoch` config argument.

### Deprecated

95 changes: 51 additions & 44 deletions InnerEye/ML/common.py
@@ -4,10 +4,13 @@
# ------------------------------------------------------------------------------------------
import abc
import logging
import re
from datetime import datetime
from enum import Enum, unique
from pathlib import Path
from typing import Any, Dict, List, Optional
from typing import Any, Dict, List, Optional, Tuple

import numpy as np

DATASET_CSV_FILE_NAME = "dataset.csv"
CHECKPOINT_SUFFIX = ".ckpt"
@@ -61,18 +64,16 @@ def get_feature_length(self, column: str) -> int:
raise NotImplementedError("get_feature_length must be implemented by sub classes")


def create_recovery_checkpoint_path(path: Path) -> Path:
def get_recovery_checkpoint_path(path: Path) -> Path:
"""
Returns the file name of a recovery checkpoint in the given folder. Raises a FileNotFoundError if no
Returns the path to the most recent recovery checkpoint in the given folder. Raises a FileNotFoundError if no
recovery checkpoint file is present.
:param path: Path to checkpoint folder
"""
# Recovery checkpoints are written alternately as recovery.ckpt and recovery-v0.ckpt.
best_checkpoint1 = path / f"{RECOVERY_CHECKPOINT_FILE_NAME_WITH_SUFFIX}"
best_checkpoint2 = path / f"{RECOVERY_CHECKPOINT_FILE_NAME}-v0{CHECKPOINT_SUFFIX}"
for p in [best_checkpoint1, best_checkpoint2]:
if p.is_file():
return p
recovery_ckpt_and_epoch = find_recovery_checkpoint_and_epoch(path)
if recovery_ckpt_and_epoch is not None:
return recovery_ckpt_and_epoch[0]
files = list(path.glob("*"))
raise FileNotFoundError(f"No checkpoint files found in {path}. Existing files: {' '.join(p.name for p in files)}")

@@ -85,34 +86,55 @@ def get_best_checkpoint_path(path: Path) -> Path:
return path / BEST_CHECKPOINT_FILE_NAME_WITH_SUFFIX


def keep_latest(path: Path, search_pattern: str) -> Optional[Path]:
def find_all_recovery_checkpoints(path: Path) -> Optional[List[Path]]:
"""
Extracts all files starting with RECOVERY_CHECKPOINT_FILE_NAME in the given folder.
:param path: The folder to search in.
:return: A list of all recovery checkpoint files, or None if there are none.
"""
all_recovery_files = [f for f in path.glob(RECOVERY_CHECKPOINT_FILE_NAME + "*")]
if len(all_recovery_files) == 0:
return None
return all_recovery_files


PathAndEpoch = Tuple[Path, int]


def extract_latest_checkpoint_and_epoch(available_files: List[Path]) -> PathAndEpoch:
"""
Checkpoints are saved as recovery_epoch={epoch}.ckpt; find the latest checkpoint and its epoch number.
:param available_files: all available recovery checkpoints
:return: path to the checkpoint from the latest epoch, and that epoch number
"""
recovery_epochs = [int(re.findall(r"[\d]+", f.stem)[0]) for f in available_files]
idx_max_epoch = int(np.argmax(recovery_epochs))
return available_files[idx_max_epoch], recovery_epochs[idx_max_epoch]
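# Illustrative behaviour (hypothetical file names): given the files
# [Path("recovery_epoch=4.ckpt"), Path("recovery_epoch=12.ckpt")], the regex
# extracts the epochs [4, 12], and the function returns
# (Path("recovery_epoch=12.ckpt"), 12).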


def find_recovery_checkpoint_and_epoch(path: Path) -> Optional[PathAndEpoch]:
"""
Looks at all files that match the given pattern via "glob", and deletes all of them apart from the most
recent file. The surviving file is returned. If there is no single file that matches the search pattern, then
return None.
Looks at all recovery files, extracts the epoch number for each, and returns the checkpoint path with the
highest epoch number along with that epoch. If no recovery checkpoints are found, returns None.
:param path: The folder to start searching in.
:param search_pattern: The glob pattern that specifies the files that should be searched.
:return: None if there is no file matching the search pattern, or a Path object that has the latest file matching
the pattern.
"""
files_and_mod_time = [(f, f.stat().st_mtime) for f in path.glob(search_pattern)]
files_and_mod_time.sort(key=lambda f: f[1], reverse=True)
for (f, _) in files_and_mod_time[1:]:
logging.info(f"Removing file: {f}")
f.unlink()
if files_and_mod_time:
return files_and_mod_time[0][0]
:return: None if there are no recovery checkpoints, or a tuple of the recovery checkpoint path and the
corresponding epoch number.
"""
available_checkpoints = find_all_recovery_checkpoints(path)
if available_checkpoints is not None:
return extract_latest_checkpoint_and_epoch(available_checkpoints)
return None
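# A minimal usage sketch (illustrative; the folder path is hypothetical):
#
#     recovery = find_recovery_checkpoint_and_epoch(Path("outputs/checkpoints"))
#     if recovery is not None:
#         checkpoint_path, epoch = recovery
#         # resume training from checkpoint_path; epochs up to `epoch` are complete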


def keep_best_checkpoint(path: Path) -> Path:
def create_best_checkpoint(path: Path) -> Path:
"""
Clean up all checkpoints that are found in the given folder, and keep only the "best" one. "Best" is at the moment
defined as being the last checkpoint, but could be based on some defined policy. The best checkpoint will be
renamed to `best_checkpoint.ckpt`. All checkpoint files other
than the best will be removed (or an existing checkpoint renamed to be the best checkpoint).
Creates the best checkpoint file. "Best" is at the moment defined as being the last checkpoint, but could be
based on some defined policy.
The best checkpoint will be renamed to `best_checkpoint.ckpt`.
:param path: The folder that contains all checkpoint files.
"""
logging.debug(f"Files in checkpoint folder: {' '.join(p.name for p in path.glob('*'))}")
last_ckpt = path / LAST_CHECKPOINT_FILE_NAME_WITH_SUFFIX
all_files = f"Existing files: {' '.join(p.name for p in path.glob('*'))}"
if not last_ckpt.is_file():
@@ -124,21 +146,6 @@ def keep_best_checkpoint(path: Path) -> Path:
return best


def cleanup_checkpoint_folder(path: Path) -> None:
"""
Removes surplus files from the checkpoint folder, and unifies the names of the files that are kept:
1) Keep only the most recent recovery checkpoint file
2) Chooses the best checkpoint file according to keep_best_checkpoint, and rename it to
BEST_CHECKPOINT_FILE_NAME_WITH_SUFFIX
:param path: The folder containing all model checkpoints.
"""
logging.info(f"Files in checkpoint folder: {' '.join(p.name for p in path.glob('*'))}")
recovery = keep_latest(path, RECOVERY_CHECKPOINT_FILE_NAME + "*")
if recovery:
recovery.rename(path / RECOVERY_CHECKPOINT_FILE_NAME_WITH_SUFFIX)
keep_best_checkpoint(path)


def create_unique_timestamp_id() -> str:
"""
Creates a unique string using the current time in UTC, up to seconds precision, with characters that
1 change: 0 additions & 1 deletion InnerEye/ML/configs/segmentation/BasicModel2Epochs.py
@@ -35,7 +35,6 @@ def __init__(self, **kwargs: Any) -> None:
class_weights=equally_weighted_classes(fg_classes),
num_dataload_workers=1,
train_batch_size=8,
start_epoch=0,
num_epochs=2,
recovery_checkpoint_save_interval=1,
use_mixed_precision=True,
1 change: 0 additions & 1 deletion InnerEye/ML/configs/segmentation/GbmBase.py
@@ -42,7 +42,6 @@ def __init__(self, **kwargs: Any) -> None:
tail=[1.0],
class_weights=equally_weighted_classes(fg_classes),
train_batch_size=8,
start_epoch=0,
num_epochs=200,
l_rate=1e-3,
l_rate_polynomial_gamma=0.9,
1 change: 0 additions & 1 deletion InnerEye/ML/configs/segmentation/HeadAndNeckBase.py
@@ -99,7 +99,6 @@ def __init__(self,
norm_method=PhotometricNormalizationMethod.CtWindow,
level=50,
window=600,
start_epoch=0,
l_rate=1e-3,
min_l_rate=1e-5,
l_rate_polynomial_gamma=0.9,
1 change: 0 additions & 1 deletion InnerEye/ML/configs/segmentation/HelloWorld.py
@@ -58,7 +58,6 @@ def __init__(self, **kwargs: Any) -> None:
# and testing (ie: how many epochs to test)
num_dataload_workers=0,
train_batch_size=2,
start_epoch=0,
num_epochs=2,
recovery_checkpoint_save_interval=1,
use_mixed_precision=True,
1 change: 0 additions & 1 deletion InnerEye/ML/configs/segmentation/Lung.py
@@ -47,7 +47,6 @@ def __init__(self, **kwargs: Any) -> None:
train_batch_size=8,
inference_batch_size=1,
inference_stride_size=(64, 256, 256),
start_epoch=0,
num_epochs=140,
l_rate=1e-3,
min_l_rate=1e-5,
1 change: 0 additions & 1 deletion InnerEye/ML/configs/segmentation/ProstateBase.py
@@ -76,7 +76,6 @@ def __init__(self,
num_epochs=120,
opt_eps=1e-4,
optimizer_type=OptimizerType.Adam,
start_epoch=0,
test_crop_size=(128, 512, 512),
train_batch_size=2,
use_mixed_precision=True,
5 changes: 2 additions & 3 deletions InnerEye/ML/configs/unit_testing/passthrough_model.py
@@ -4,20 +4,20 @@
# ------------------------------------------------------------------------------------------
import random
from typing import Any, List

import numpy as np
import pandas as pd
import torch
from torch.nn.parameter import Parameter

from InnerEye.Common.type_annotations import TupleInt3
from InnerEye.ML.config import equally_weighted_classes, ModelArchitectureConfig, SegmentationModelBase
from InnerEye.ML.config import ModelArchitectureConfig, SegmentationModelBase, equally_weighted_classes
from InnerEye.ML.configs.segmentation.Lung import AZURE_DATASET_ID
from InnerEye.ML.models.architectures.base_model import BaseSegmentationModel
from InnerEye.ML.models.parallel.model_parallel import get_device_from_parameters, move_to_device
from InnerEye.ML.utils.model_metadata_util import generate_random_colours_list
from InnerEye.ML.utils.split_dataset import DatasetSplits


RANDOM_COLOUR_GENERATOR = random.Random(0)
RECTANGLE_STROKE_THICKNESS = 3

@@ -48,7 +48,6 @@ def __init__(self, **kwargs: Any) -> None:
inference_batch_size=1,
class_weights=equally_weighted_classes(fg_classes, background_weight=0.02),
feature_channels=[1],
start_epoch=0,
num_epochs=1,
# Necessary to avoid https://github.com/pytorch/pytorch/issues/45324
max_num_gpus=1,
18 changes: 10 additions & 8 deletions InnerEye/ML/deep_learning_config.py
@@ -19,9 +19,8 @@
from InnerEye.Common.fixed_paths import DEFAULT_AML_UPLOAD_DIR, DEFAULT_LOGS_DIR_NAME
from InnerEye.Common.generic_parsing import CudaAwareConfig, GenericConfig
from InnerEye.Common.type_annotations import PathOrString, TupleFloat2
from InnerEye.ML.common import DATASET_CSV_FILE_NAME, ModelExecutionMode, \
create_recovery_checkpoint_path, create_unique_timestamp_id, \
get_best_checkpoint_path
from InnerEye.ML.common import DATASET_CSV_FILE_NAME, ModelExecutionMode, create_unique_timestamp_id, \
get_best_checkpoint_path, get_recovery_checkpoint_path

# A folder inside of the outputs folder that will contain all information for running the model in inference mode
FINAL_MODEL_FOLDER = "final_model"
@@ -352,7 +351,7 @@ def get_path_to_checkpoint(self) -> Path:
"""
Returns the full path to a recovery checkpoint.
"""
return create_recovery_checkpoint_path(self.checkpoint_folder)
return get_recovery_checkpoint_path(self.checkpoint_folder)

def get_path_to_best_checkpoint(self) -> Path:
"""
@@ -435,6 +434,11 @@ class TrainerParams(CudaAwareConfig):
doc="Save epoch checkpoints when epoch number is a multiple "
"of recovery_checkpoint_save_interval. The intended use "
"is to allow restore training from failed runs.")
recovery_checkpoints_save_last_k: int = param.Integer(default=1, bounds=(-1, None),
doc="Number of recovery checkpoints to keep. Recovery "
"checkpoints will be stored as recovery_epoch:{"
"epoch}.ckpt. If set to -1 keep all recovery "
"checkpoints.")
detect_anomaly: bool = param.Boolean(False, doc="If true, test gradients for anomalies (NaN or Inf) during "
"training.")
use_mixed_precision: bool = param.Boolean(False, doc="If true, mixed precision training is activated during "
@@ -454,9 +458,6 @@
doc="Controls the PyTorch Lightning trainer flags 'deterministic' and 'benchmark'. If "
"'pl_deterministic' is True, results are perfectly reproducible. If False, they are not, but "
"you may see training speed increases.")
start_epoch: int = param.Integer(0, bounds=(0, None), doc="The first epoch to train. Set to 0 to start a new "
"training. Set to a value larger than zero for starting"
" from a checkpoint.")


class DeepLearningConfig(WorkflowParams,
Expand Down Expand Up @@ -546,6 +547,7 @@ def __init__(self, **params: Any) -> None:
# This should be annotated as torch.utils.data.Dataset, but we don't want to import torch here.
self._datasets_for_training: Optional[Dict[ModelExecutionMode, Any]] = None
self._datasets_for_inference: Optional[Dict[ModelExecutionMode, Any]] = None
self.recovery_start_epoch = 0
super().__init__(throw_if_unknown_param=True, **params)
logging.info("Creating the default output folder structure.")
self.create_filesystem(fixed_paths.repository_root_directory())
Expand Down Expand Up @@ -609,7 +611,7 @@ def get_train_epochs(self) -> List[int]:
Returns the epochs for which training will be performed.
:return:
"""
return list(range(self.start_epoch + 1, self.num_epochs + 1))
return list(range(self.recovery_start_epoch + 1, self.num_epochs + 1))
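# Illustrative: with recovery_start_epoch=10 and num_epochs=20, training resumes
# with epochs [11, 12, ..., 20]; a fresh run (recovery_start_epoch=0) trains
# epochs [1, ..., num_epochs].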

def get_total_number_of_training_epochs(self) -> int:
"""
4 changes: 2 additions & 2 deletions InnerEye/ML/lightning_base.py
@@ -20,8 +20,8 @@
from InnerEye.Common.type_annotations import DictStrFloat
from InnerEye.ML.common import ModelExecutionMode
from InnerEye.ML.config import SegmentationModelBase
from InnerEye.ML.deep_learning_config import DatasetParams, DeepLearningConfig, WorkflowParams, OutputParams, \
TrainerParams
from InnerEye.ML.deep_learning_config import DatasetParams, DeepLearningConfig, OutputParams, TrainerParams, \
WorkflowParams
from InnerEye.ML.lightning_container import LightningContainer
from InnerEye.ML.lightning_loggers import StoringLogger
from InnerEye.ML.metrics import EpochTimers, MAX_ITEM_LOAD_TIME_SEC, store_epoch_metrics
