This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Enable building an ensemble model from the cross validation checkpoints of a BYO Lightning model #529

Closed
wants to merge 101 commits

Commits (changes shown are from 33 of the 101 commits)
1250de0
Use registered model for inference
Shruthi42 Jun 23, 2021
efc34fb
Merge branch 'main' into shbannur/load_registered_models
Shruthi42 Jun 23, 2021
f94b2e9
Bug fix
Shruthi42 Jun 23, 2021
7430785
Fix tests
Shruthi42 Jun 23, 2021
229b7ea
Fix tests
Shruthi42 Jun 23, 2021
16a09c3
mypy
Shruthi42 Jun 23, 2021
ee957d8
Fix tests
Shruthi42 Jun 23, 2021
596efee
Add tests
Shruthi42 Jun 24, 2021
2c2160f
Fix tests
Shruthi42 Jun 24, 2021
fe6fa93
Fix tests
Shruthi42 Jun 24, 2021
45895e7
Add test
Shruthi42 Jun 24, 2021
41f1b48
Fix tests
Shruthi42 Jun 24, 2021
f6bb7a2
Fix test
Shruthi42 Jul 5, 2021
3e3b069
Remove unnecessary function
Shruthi42 Jul 5, 2021
c0de1e6
Update tests
Shruthi42 Jul 5, 2021
0889a88
Flake8
Shruthi42 Jul 5, 2021
f98b22e
Fix tests
Shruthi42 Jul 5, 2021
d73f769
mypy
Shruthi42 Jul 5, 2021
f4dfbe7
Merge branch 'main' into shbannur/load_registered_models
Shruthi42 Jul 5, 2021
2767e18
Loosening multiple checkpoint check in run_inference_for_lightning_mo…
Jul 6, 2021
8a63e0a
WiP very scrappy!
Jul 7, 2021
116e566
WiP more mess
Jul 8, 2021
7107e27
WiP: bones of test class
Jul 8, 2021
d0c7724
Refactoring run_inference_for_lightning_models
Jul 9, 2021
6735d2c
mypy fixes
Jul 9, 2021
585511b
Merge branch 'main' into timregan/527-ensembles-for-BYOL-xval
Jul 9, 2021
4df8c09
WiP annotations and test
Jul 10, 2021
05a67dc
WiP: saving mid task for lunch
dumbledad Jul 10, 2021
8d93907
Correcting GPU -> CPU typo in comment
dumbledad Jul 11, 2021
e7ba7f1
Method can be static
dumbledad Jul 11, 2021
e4ebfcb
WiP fiddling
dumbledad Jul 11, 2021
7625a48
Example ensemble from InnerEyeInference
dumbledad Jul 11, 2021
f5b288c
flake8 and mypy fixes
dumbledad Jul 11, 2021
a79673f
mypy fixes
dumbledad Jul 11, 2021
3e9ec7e
tidying unused parameters
Jul 11, 2021
559a717
WiP simple temp test for train/test ensemble
Jul 11, 2021
f7be221
Renaming InnerEyeInference methods
Jul 12, 2021
a3179a8
tidy up
Jul 12, 2021
f7bdc7f
Matching new naming
Jul 12, 2021
939ec5a
renaming params
Jul 12, 2021
f206057
naming
Jul 12, 2021
d71f498
Unit test WiP
Jul 12, 2021
6987f99
don't be strict with state_dict
dumbledad Jul 12, 2021
f6e025f
first unit test takes shape
dumbledad Jul 12, 2021
500bf11
Unit test works, but doesn't check much
dumbledad Jul 12, 2021
0d97fe9
renaming unit test
Jul 13, 2021
b204f08
Merge branch 'main' into shbannur/load_registered_models
Shruthi42 Jul 13, 2021
dd17d78
Change docstring
Shruthi42 Jul 13, 2021
388e0a8
Update CHANGELOG.md
Shruthi42 Jul 13, 2021
d727fe1
Rename
Shruthi42 Jul 13, 2021
fa7a6e4
Fix test
Shruthi42 Jul 13, 2021
581c6a9
WiP ensemble unit test with value check
Jul 13, 2021
1134c0f
Merge branch 'main' into timregan/527-ensembles-for-BYOL-xval
Jul 14, 2021
e35db5b
Address PR comments
Shruthi42 Jul 14, 2021
9093b7e
Use list of pytest markers
Shruthi42 Jul 14, 2021
bf072c0
Move model_id to WorkflowParams
Shruthi42 Jul 14, 2021
5a76cc1
missed some name changes
Jul 14, 2021
2d75d24
WiP swapping back to checkpoints not accruing child runs
Jul 14, 2021
e064483
Refactor extra_downloaded_run_id
Shruthi42 Jul 14, 2021
168eb29
unit test working
Jul 14, 2021
838bb48
Update documentation and argparser
Shruthi42 Jul 14, 2021
0f3690c
flake & mypy
Jul 14, 2021
9c7f6b4
Revert changes to generic_parsing
Shruthi42 Jul 14, 2021
d612c6e
Update documentation
Shruthi42 Jul 14, 2021
6861178
Flake8 and mypy
Shruthi42 Jul 14, 2021
98c7683
Shruthi's changes to run_ml
Jul 14, 2021
54c845e
Merge branch 'shbannur/load_registered_models' into timregan/527-ense…
Jul 14, 2021
48aca37
WiP abstracting ensemble inference
Jul 14, 2021
d3d8477
WiP
dumbledad Jul 15, 2021
a4ea25a
Ensemble inference base
dumbledad Jul 15, 2021
5e7f0b1
Merge branch 'main' into timregan/527-ensembles-for-BYOL-xval
dumbledad Jul 15, 2021
a71fa90
Ended up with changes to 2 files I did not touch!
dumbledad Jul 15, 2021
0874a74
Restoring (and fixing) run_ml changes
dumbledad Jul 16, 2021
88bb96d
mypy
dumbledad Jul 16, 2021
424b2e7
WiP
dumbledad Jul 16, 2021
4903b96
run_ml unit test v1
Jul 16, 2021
547ed7d
WiP pre pruning ensemble stuff
Jul 16, 2021
538a7a4
refactored to avoid recursion blow-up
Jul 16, 2021
002886c
Merge branch 'main' into timregan/527-ensembles-for-BYOL-xval
dumbledad Jul 17, 2021
c6ca034
additional comments and remove inheritance
dumbledad Jul 17, 2021
484fe4a
removing duplicated unit test
dumbledad Jul 17, 2021
b771137
more comments
dumbledad Jul 17, 2021
6d57d08
test tidy
dumbledad Jul 17, 2021
6a9fd75
flake fixes
dumbledad Jul 17, 2021
cea51e4
WiP
dumbledad Jul 17, 2021
9fc696a
on_ensemble_inference_start needn't call down
dumbledad Jul 17, 2021
b947452
Old WiP changes
dumbledad Jul 17, 2021
f7d22f4
Adding HelloEnsembleInference
dumbledad Jul 17, 2021
af079ae
run_ml changes with parameter
dumbledad Jul 17, 2021
38b12b7
WiP on mypy and tidy pre unit test fix
dumbledad Jul 18, 2021
c255774
mypy fixes
dumbledad Jul 18, 2021
2db1857
import fix so test discovery works
dumbledad Jul 18, 2021
21044e6
Fixing test discovery
dumbledad Jul 18, 2021
4f10169
WiP fixing unit test
dumbledad Jul 18, 2021
e0e31e6
Unit test works
dumbledad Jul 18, 2021
1d23735
file system test
dumbledad Jul 18, 2021
1486bff
Adding register and actually building ensemble
Jul 19, 2021
ff8fef1
Removed call to innereye_config
Jul 19, 2021
ddb3c9f
unit test fix
dumbledad Jul 22, 2021
e71d49b
fix for the reference error on AzureML
dumbledad Jul 23, 2021
db18fdd
Merge branch 'main' into timregan/527-ensembles-for-BYOL-xval
dumbledad Jul 23, 2021
5 changes: 4 additions & 1 deletion CHANGELOG.md
@@ -18,7 +18,8 @@ module on test data with partial ground truth files. (Also [522](https://github.
 - ([#502](https://github.com/microsoft/InnerEye-DeepLearning/pull/502)) More flags for fine control of when to run inference.
 - ([#492](https://github.com/microsoft/InnerEye-DeepLearning/pull/492)) Adding capability for regression tests for test
   jobs that run in AzureML.
-
+- ([#509](https://github.com/microsoft/InnerEye-DeepLearning/pull/509)) Run inference on registered models (single and
+  ensemble) using the parameter `model_id`.
 ### Changed
 - ([#531](https://github.com/microsoft/InnerEye-DeepLearning/pull/531)) Updated PL to 1.3.8, torchmetrics and pl-bolts and changed relevant metrics and SSL code API.
 - ([#533](https://github.com/microsoft/InnerEye-DeepLearning/pull/533)) Better defaults for inference on ensemble children.
@@ -42,6 +43,8 @@ multiple large checkpoints can time out.
 ### Removed
 
 - ([#520](https://github.com/microsoft/InnerEye-DeepLearning/pull/520)) Disable glaucoma job from Azure pipeline.
+- ([#509](https://github.com/microsoft/InnerEye-DeepLearning/pull/509)) Parameters `local_weights_path` and
+  `weights_url` can no longer be used to initialize a training run, only inference runs.
 
 ### Deprecated
 
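A note on the new `model_id` parameter referenced above: it must name a registered AzureML model in the form 'model name:version'. A minimal sketch of that format check, mirroring the validation this PR adds to `WorkflowParams.validate()` (the helper `parse_model_id` is hypothetical, not part of the PR):

```python
from typing import Tuple


def parse_model_id(model_id: str) -> Tuple[str, str]:
    # Hypothetical helper: same "model name:version" check as WorkflowParams.validate().
    parts = model_id.split(":")
    if len(parts) != 2:
        raise ValueError(f"model_id should be in the form 'model_name:version', got {model_id}")
    name, version = parts
    return name, version


print(parse_model_id("MyEnsembleModel:2"))  # -> ('MyEnsembleModel', '2')
```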
11 changes: 3 additions & 8 deletions InnerEye/Azure/azure_config.py
@@ -77,14 +77,9 @@ class AzureConfig(GenericConfig):
     pytest_mark: str = param.String(doc="If provided, run pytest instead of model training. pytest will only "
                                         "run the tests that have the mark given in this argument "
                                         "('--pytest_mark gpu' will run all tests marked with 'pytest.mark.gpu')")
-    run_recovery_id: str = param.String(doc="A run recovery id string in the form 'experiment name:run id'"
-                                            " to use for inference or recovering a model training run.")
-    pretraining_run_recovery_id: str = param.String(default=None,
-                                                    allow_None=True,
-                                                    doc="Extra run recovery id to download checkpoints from,"
-                                                        "for custom modules (e.g. for loading pretrained weights)."
-                                                        "Warning: this argument will be ignored for InnerEyeContainer"
-                                                        "models.")
+    run_recovery_id: str = param.String(doc="A run recovery id string in the form 'experiment name:run id' "
+                                            "to use for inference, recovering a model training run or to register "
+                                            "a model.")
     experiment_name: str = param.String(doc="If provided, use this string as the name of the AzureML experiment. "
                                             "If not provided, create the experiment off the git branch name.")
     build_number: int = param.Integer(0, doc="The numeric ID of the Azure pipeline that triggered this training run.")
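For context, `run_recovery_id` stays a plain string; a sketch with hypothetical values showing the expected 'experiment name:run id' shape:

```python
# Hypothetical values; only the "experiment name:run id" shape matters.
run_recovery_id = "ensemble_experiment:HD_a1b2c3d4_0"
experiment_name, run_id = run_recovery_id.split(":")
assert experiment_name == "ensemble_experiment" and run_id == "HD_a1b2c3d4_0"
```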
3 changes: 0 additions & 3 deletions InnerEye/Common/fixed_paths.py
@@ -40,9 +40,6 @@ def repository_root_directory(path: Optional[PathOrString] = None) -> Path:
 # The folder at the project root directory that holds datasets for local execution.
 DATASETS_DIR_NAME = "datasets"
 
-# Points to a folder at the project root directory that holds model weights downloaded from URLs.
-MODEL_WEIGHTS_DIR_NAME = "modelweights"
-
 ML_RELATIVE_SOURCE_PATH = os.path.join("ML")
 ML_RELATIVE_RUNNER_PATH = os.path.join(ML_RELATIVE_SOURCE_PATH, "runner.py")
 ML_FULL_SOURCE_FOLDER_PATH = str(repository_root_directory() / ML_RELATIVE_SOURCE_PATH)
12 changes: 6 additions & 6 deletions InnerEye/ML/configs/classification/CovidHierarchicalModel.py
@@ -170,21 +170,21 @@ def create_model(self) -> LightningModule:

     def _get_ssl_checkpoint_path(self) -> Path:
         # Get the SSL weights from the AML run provided via "pretraining_run_recovery_id" command line argument.
-        # Accessible via extra_downloaded_run_id field of the config.
-        assert self.extra_downloaded_run_id is not None
-        assert isinstance(self.extra_downloaded_run_id, RunRecovery)
+        # Accessible via pretraining_run_checkpoints field of the config.
+        assert self.pretraining_run_checkpoints is not None
+        assert isinstance(self.pretraining_run_checkpoints, RunRecovery)
         ssl_path = self.checkpoint_folder / "ssl_checkpoint.ckpt"
 
         if not ssl_path.exists():  # for test (when it is already present) we don't need to redo this.
             if self.name_of_checkpoint is not None:
                 logging.info(f"Using checkpoint: {self.name_of_checkpoint} as starting point.")
-                path_to_checkpoint = self.extra_downloaded_run_id.checkpoints_roots[0] / self.name_of_checkpoint
+                path_to_checkpoint = self.pretraining_run_checkpoints.checkpoints_roots[0] / self.name_of_checkpoint
             else:
-                path_to_checkpoint = self.extra_downloaded_run_id.get_best_checkpoint_paths()[0]
+                path_to_checkpoint = self.pretraining_run_checkpoints.get_best_checkpoint_paths()[0]
                 if not path_to_checkpoint.exists():
                     logging.info("No best checkpoint found for this model. Getting the latest recovery "
                                  "checkpoint instead.")
-                    path_to_checkpoint = self.extra_downloaded_run_id.get_recovery_checkpoint_paths()[0]
+                    path_to_checkpoint = self.pretraining_run_checkpoints.get_recovery_checkpoint_paths()[0]
             assert path_to_checkpoint.exists()
             path_to_checkpoint.rename(ssl_path)
         return ssl_path
2 changes: 1 addition & 1 deletion InnerEye/ML/configs/other/HelloContainer.py
@@ -230,7 +230,7 @@ def on_test_epoch_end(self) -> None:
"""
average_mse = torch.mean(torch.stack(self.test_mse))
Path("test_mse.txt").write_text(str(average_mse.item()))
Path("test_mae.txt").write_text(str(self.test_mae.compute()))
Path("test_mae.txt").write_text(str(self.test_mae.compute().item()))


class HelloContainer(LightningContainer):
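The `.item()` fix above matters because `str()` of a zero-dimensional tensor yields the tensor repr, not the bare number. A standalone illustration (not part of the PR):

```python
import torch

mae = torch.tensor(0.1234)
print(str(mae))         # 'tensor(0.1234)' - tensor repr, awkward in a plain metrics file
print(str(mae.item()))  # '0.1234' - plain Python float, which the fixed line now writes
```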
35 changes: 24 additions & 11 deletions InnerEye/ML/deep_learning_config.py
@@ -33,7 +33,6 @@
EXTRA_RUN_SUBFOLDER = "extra_run_id"

ARGS_TXT = "args.txt"
WEIGHTS_FILE = "weights.pth"


@unique
@@ -216,16 +215,25 @@ class WorkflowParams(param.Parameterized):
     ensemble_inference_on_test_set: Optional[bool] = \
         param.Boolean(None,
                       doc="If set, enable/disable full image inference on test set after ensemble training.")
-    weights_url: str = param.String(doc="If provided, a url from which weights will be downloaded and used for model "
-                                        "initialization.")
-    local_weights_path: Optional[Path] = param.ClassSelector(class_=Path,
-                                                             default=None,
-                                                             allow_None=True,
-                                                             doc="The path to the weights to use for model "
-                                                                 "initialization, when training outside AzureML.")
+    weights_url: List[str] = param.List(default=[], class_=str,
+                                        doc="If provided, a set of urls from which checkpoints will be downloaded"
+                                            "and used for training/inference.")
+    local_weights_path: List[Path] = param.List(default=[], class_=Path,
+                                                doc="A list of checkpoints paths to use for training/inference, "
+                                                    "when training is running outside Azure.")
+    model_id: str = param.String(default="",
+                                 doc="A model id string in the form 'model name:version' "
+                                     "to use a registered model for inference.")
     generate_report: bool = param.Boolean(default=True,
                                           doc="If True (default), write a modelling report in HTML format. If False,"
                                               "do not write that report.")
+    pretraining_run_recovery_id: str = param.String(default=None,
+                                                    allow_None=True,
+                                                    doc="Extra run recovery id to download checkpoints from,"
+                                                        "for custom modules (e.g. for loading pretrained weights)."
+                                                        "The downloaded RunRecovery object will be available in"
+                                                        "pretraining_run_checkpoints.")
 
     # The default multiprocessing start_method in both PyTorch and the Python standard library is "fork" for Linux and
     # "spawn" (the only available method) for Windows. There is some evidence that using "forkserver" on Linux
     # can reduce the chance of stuck jobs.
@@ -248,8 +256,13 @@
"be relative to the repository root directory.")

def validate(self) -> None:
if self.weights_url and self.local_weights_path:
raise ValueError("Cannot specify both local_weights_path and weights_url.")
if sum([bool(param) for param in [self.weights_url, self.local_weights_path, self.model_id]]) > 1:
raise ValueError("Cannot specify more than one of local_weights_path, weights_url or model_id.")

if self.model_id:
if len(self.model_id.split(":")) != 2:
raise ValueError(
f"model_id should be in the form 'model_name:version', got {self.model_id}")

if self.number_of_cross_validation_splits == 1:
raise ValueError("At least two splits required to perform cross validation, but got "
@@ -713,7 +726,7 @@ def __init__(self, **params: Any) -> None:
         self.create_filesystem(fixed_paths.repository_root_directory())
         # Disable the PL progress bar because all InnerEye models have their own console output
         self.pl_progress_bar_refresh_rate = 0
-        self.extra_downloaded_run_id: Optional[Any] = None
+        self.pretraining_run_checkpoints: Optional[Any] = None
 
     def validate(self) -> None:
         """
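With `weights_url` and `local_weights_path` now list-valued, several checkpoints (for example, one per cross-validation fold) can seed a single container, and the three initialization routes are mutually exclusive. A hedged sketch of the new validation behaviour, assuming a bare `WorkflowParams` instance whose other fields keep defaults that pass validation (the URLs are illustrative):

```python
params = WorkflowParams()
params.weights_url = ["https://example.com/fold0.ckpt",
                      "https://example.com/fold1.ckpt"]
params.validate()  # passes: only one of the three options is set

params.model_id = "MyEnsembleModel:1"
params.validate()  # raises ValueError: weights_url and model_id are both set
```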
45 changes: 12 additions & 33 deletions InnerEye/ML/lightning_container.py
@@ -3,8 +3,8 @@
 # Licensed under the MIT License (MIT). See LICENSE in the repo root for license information.
 # ------------------------------------------------------------------------------------------
 import abc
-from pathlib import Path
 from typing import Any, Dict, Iterator, List, Optional, Tuple
+from pathlib import Path
 
 import param
 import torch
@@ -19,7 +19,7 @@
 from InnerEye.Common.metrics_constants import TrackedMetrics
 from InnerEye.ML.common import ModelExecutionMode
 from InnerEye.ML.deep_learning_config import DatasetParams, OptimizerParams, OutputParams, TrainerParams, \
-    WorkflowParams, load_checkpoint
+    WorkflowParams
 from InnerEye.ML.utils import model_util
 from InnerEye.ML.utils.lr_scheduler import SchedulerWithWarmUp
 from InnerEye.ML.utils.run_recovery import RunRecovery
@@ -151,7 +151,7 @@ def __init__(self, **kwargs: Any) -> None:
         super().__init__(**kwargs)
         self._model: Optional[LightningModule] = None
         self._model_name = type(self).__name__
-        self.extra_downloaded_run_id: Optional[RunRecovery] = None
+        self.pretraining_run_checkpoints: Optional[RunRecovery] = None
         self.num_nodes = 1
 
     def validate(self) -> None:
@@ -250,36 +250,6 @@ def before_training_on_all_ranks(self) -> None:
"""
pass

def load_checkpoint_and_modify(self, path_to_checkpoint: Path) -> Dict[str, Any]:
"""
This method is called when a file with weights for network initialization is supplied at container level,
in the self.weights_url or self.local_weights_path fields. It can load that file as a Torch checkpoint,
and rename parameters.

By default, uses torch.load to read and return the state dict from the checkpoint file, and does no modification
of the checkpoint file.

Overloading this function:
When weights_url or local_weights_path is set, the file downloaded may not be in the exact
format expected by the model's load_state_dict() - for example, pretrained Imagenet weights for networks
may have mismatched layer names in different implementations.
In such cases, you can overload this function to extract the state dict from the checkpoint.

NOTE: The model checkpoint will be loaded using the torch function load_state_dict() with argument strict=False,
so extra care needs to be taken to check that the state dict is valid.
Check the logs for warnings related to missing and unexpected keys.
See https://pytorch.org/tutorials/beginner/saving_loading_models.html#warmstarting-model-using-parameters
-from-a-different-model
for an explanation on why strict=False is useful when loading parameters from other models.
:param path_to_checkpoint: Path to the checkpoint file.
:return: Dictionary with model and optimizer state dicts. The dict should have at least the following keys:
1. Key ModelAndInfo.MODEL_STATE_DICT_KEY and value set to the model state dict.
2. Key ModelAndInfo.EPOCH_KEY and value set to the checkpoint epoch.
Other (optional) entries corresponding to keys ModelAndInfo.OPTIMIZER_STATE_DICT_KEY and
ModelAndInfo.MEAN_TEACHER_STATE_DICT_KEY are also supported.
"""
return load_checkpoint(path_to_checkpoint=path_to_checkpoint, use_gpu=self.use_gpu)

# The code from here on does not need to be modified.

@property
@@ -334,6 +304,15 @@ def get_hyperdrive_config(self, run_config: ScriptRunConfig) -> HyperDriveConfig:
         else:
             return self.get_parameter_search_hyperdrive_config(run_config)
 
+    def load_model_checkpoint(self, checkpoint_path: Path) -> None:
+        """
+        Load a checkpoint from the given path. We need to define a separate method since pytorch lightning cannot
+        access the _model attribute to modify it.
+        """
+        if self._model is None:
+            raise ValueError("No Lightning module has been set yet.")
+        self._model = type(self._model).load_from_checkpoint(checkpoint_path=str(checkpoint_path))
+
     def __str__(self) -> str:
         """Returns a string describing the present object, as a list of key: value strings."""
         arguments_str = "\nContainer:\n"
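The new `load_model_checkpoint` hook is what lets the ensemble workflow swap each cross-validation checkpoint into a container in turn. A minimal usage sketch, assuming `container` is a `LightningContainer` whose `create_model()` has already run (the checkpoint paths are hypothetical):

```python
from pathlib import Path

# Hypothetical per-fold checkpoints from a cross-validation run.
checkpoints = [Path("outputs/fold_0/best.ckpt"), Path("outputs/fold_1/best.ckpt")]

restored = []
for ckpt in checkpoints:
    container.load_model_checkpoint(ckpt)  # replaces the container's module in place
    restored.append(container.model)       # load_from_checkpoint returns a fresh instance each time
# `restored` now holds one Lightning module per fold, ready for ensemble inference.
```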