# Building Models

## Setting up training

To train new models, you can either work within the `InnerEye/` directory hierarchy or create a local hierarchy beside it
with the same internal organization (although with far fewer files).
We recommend the latter as it offers more flexibility and better separation of concerns. Here we will assume you
create a directory `InnerEyeLocal` beside `InnerEye`.

As well as your configurations (dealt with below) you will need these files:

* `settings.yml`: A file similar to `InnerEye/settings.yml` containing all your Azure settings.
The value of `extra_code_directory` should (in our example) be `'InnerEyeLocal'`,
and `model_configs_namespace` should be `'InnerEyeLocal.ML.configs'`.
* A folder like `InnerEyeLocal` that contains your additional code and model configurations.
* A file `InnerEyeLocal/ML/runner.py` that invokes the InnerEye training runner, but that points the code to your environment and Azure
settings.

```python
from pathlib import Path
import os

if __name__ == '__main__':
    main()
```
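
Here, `main` resolves the repository root and hands control to the InnerEye runner, pointing it at your own settings file. A minimal sketch, assuming the runner module exposes a `run` function that accepts the project root and the path to your `settings.yml` (the exact module path and argument names are assumptions; check `InnerEye/ML/runner.py` in your version):

```python
from pathlib import Path
import os

# Assumption: the InnerEye training runner lives in InnerEye.ML.runner and
# exposes a `run` entry point; verify the exact signature in your checkout.
from InnerEye.ML import runner


def main() -> None:
    # This file lives in InnerEyeLocal/ML/, so the repository root is two levels up.
    current = os.path.dirname(os.path.realpath(__file__))
    project_root = Path(os.path.realpath(os.path.join(current, "..", "..")))
    # Point the runner at your Azure settings file; adjust the path to wherever
    # your settings.yml lives.
    runner.run(project_root=project_root,
               yaml_config_file=project_root / "InnerEyeLocal" / "settings.yml")


if __name__ == '__main__':
    main()
```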

## Creating the model configuration

You will find a variety of model configurations [here](/InnerEye/ML/configs/segmentation). Those not ending
in `Base.py` reference open-sourced data and can be used as they are. Those ending in `Base.py`
are partially specified, and can be used by having other model configurations inherit from them and supply the missing
parameter values: a dataset ID at least, and optionally other values. For example, a `Prostate` model might inherit
very simply from `ProstateBase` by creating `Prostate.py` in the directory `InnerEyeLocal/ML/configs/segmentation`
with the following contents:

```python
from InnerEye.ML.configs.segmentation.ProstateBase import ProstateBase


class Prostate(ProstateBase):
    def __init__(self) -> None:
        super().__init__(
            ground_truth_ids=["femur_r", "femur_l", "rectum", "prostate"],
            azure_dataset_id="name-of-your-AML-dataset-with-prostate-data")
```

The allowed parameters and their meanings are defined in [`SegmentationModelBase`](/InnerEye/ML/config.py).
The class name must be the same as the basename of the file containing it, so `Prostate.py` must contain `Prostate`.
In `settings.yml`, set `model_configs_namespace` to `InnerEyeLocal.ML.configs` so this config
is found by the runner.

A `Head and Neck` model might inherit from `HeadAndNeckBase` by creating `HeadAndNeck.py` with the following contents:

```python
from InnerEye.ML.configs.segmentation.HeadAndNeckBase import HeadAndNeckBase


class HeadAndNeck(HeadAndNeckBase):
    def __init__(self) -> None:
        super().__init__(
            azure_dataset_id="name-of-your-AML-dataset-with-head-and-neck-data")
```

## Training a new model

* Set up your model configuration as above and update `azure_dataset_id` to the name of your Dataset in the AML workspace.
It is enough to put your dataset into blob storage. The dataset should be contained in a folder at the root of the datasets container.
The InnerEye runner will check if there is a dataset in the AzureML workspace already, and if not, generate it directly from blob storage.

* Train a new model, for example `Prostate`:

```shell
python InnerEyeLocal/ML/runner.py --azureml --model=Prostate
```

You can also train the model on your local machine. In this case, you would simply omit the `--azureml` flag, and instead of specifying
`azure_dataset_id` in the class constructor, use `local_dataset="my/data/folder"`,
where the folder `my/data/folder` contains a `dataset.csv` file and all the files that are referenced therein.
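
For example, a local-only variant of the `Prostate` config could set `local_dataset` instead of `azure_dataset_id`. This is only a sketch: the class name, the folder path, and the assumption that the base class forwards the `local_dataset` keyword argument are all illustrative.

```python
from pathlib import Path

from InnerEye.ML.configs.segmentation.ProstateBase import ProstateBase


class ProstateLocal(ProstateBase):
    def __init__(self) -> None:
        super().__init__(
            ground_truth_ids=["femur_r", "femur_l", "rectum", "prostate"],
            # Folder on the local disk that contains dataset.csv and the files it references.
            local_dataset=Path("my/data/folder"))
```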

## Boolean Options

Note that for command line options that take a boolean argument, and that are `False` by default, there are multiple ways of setting the option. For example, alternatives to `--azureml` include:

* `--azureml=True`, or `--azureml=true`, or `--azureml=T`, or `--azureml=t`
* `--azureml=Yes`, or `--azureml=yes`, or `--azureml=Y`, or `--azureml=y`
* `--azureml=On`, or `--azureml=on`
* `--azureml=1`

Conversely, for command line options that take a boolean argument, and that are `True` by default, there are multiple ways of un-setting the option. For example, alternatives to `--no-train` include:

* `--train=False`, or `--train=false`, or `--train=F`, or `--train=f`
* `--train=No`, or `--train=no`, or `--train=N`, or `--train=n`
* `--train=Off`, or `--train=off`
* `--train=0`

## Training using multiple machines

To speed up training in AzureML, you can use multiple machines by specifying the additional
`--num_nodes` argument. For example, to use 2 machines to train, specify:

```shell
python InnerEyeLocal/ML/runner.py --azureml --model=Prostate --num_nodes=2
```

On each of the 2 machines, all available GPUs will be used. Model inference will always use only one machine.

For the Prostate model, we observed a 2.8x speedup for model training when using 4 nodes, and a 1.65x speedup
when using 2 nodes.

## AzureML Run Hierarchy

AzureML structures all jobs in a hierarchical fashion:

* The top-level concept is a workspace
* Inside of a workspace, there are multiple experiments. Upon starting a training run, the name of the experiment
needs to be supplied. The InnerEye toolbox is designed to work with git repositories, and it automatically
sets the experiment name to match the name of the current git branch.
* Inside of an experiment, there are multiple runs. When starting the InnerEye toolbox as above, a run will be created.
* A run can have child runs - see below in the discussion about cross validation.


## K-Fold Model Cross Validation

For running K-fold cross validation, the InnerEye toolbox schedules multiple training runs in the cloud that run
at the same time (provided that the cluster has capacity). This means that a complete cross validation run usually
takes about as long as a single training run. To run K-fold cross validation with `N` splits, supply the additional
command line argument `--number_of_cross_validation_splits=N`; this schedules a HyperDrive run with `N` child runs.
The dataset splits for those `N` child runs will be
computed from the union of the Training and Validation sets. The Test set is unchanged. Note that the Test set can be
empty, in which case the union of all validation sets for the `N` child runs will be the full dataset.
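
Command line options like this generally mirror fields of the model configuration, so the number of splits can also be set there. A minimal sketch, assuming (as in the earlier examples) that the base class forwards the `number_of_cross_validation_splits` keyword argument; the class name is hypothetical:

```python
from InnerEye.ML.configs.segmentation.ProstateBase import ProstateBase


class ProstateCrossVal(ProstateBase):
    def __init__(self) -> None:
        super().__init__(
            ground_truth_ids=["femur_r", "femur_l", "rectum", "prostate"],
            azure_dataset_id="name-of-your-AML-dataset-with-prostate-data",
            # Schedules 5 child runs, one per cross-validation fold.
            number_of_cross_validation_splits=5)
```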

## Recovering failed runs and continuing training

To train further with an already-created model, run the above training command with the additional `run_recovery_id` argument:

```shell
--run_recovery_id=foo_bar:foo_bar_12345_abcd
```

The run recovery ID is of the form "experiment_id:run_id". When you trained your original model, it will have been
queued as a "Run" inside of an "Experiment". The experiment will be given a name derived from the branch name - for
example, branch `foo/bar` will queue a run in experiment `foo_bar`. Inside the "Tags" section of your run, you should
see an element `run_recovery_id`. It will look something like `foo_bar:foo_bar_12345_abcd`.
If you are recovering a HyperDrive run, the value of `--run_recovery_id` should be that of the parent run,
and `--number_of_cross_validation_splits` should have the same value as in the recovered run.
For example:

```shell
--run_recovery_id=foo_bar:HD_55d4beef-7be9-45d7-89a5-1acf1f99078a --start_epoch=120 --number_of_cross_validation_splits=5
```

The run recovery ID of a parent HyperDrive run is currently not displayed in the "Details" section
of the AzureML UI. The easiest way to get it is to go to any of the child runs and use its
run recovery ID without the final underscore and digit.

## Testing an existing model

To evaluate an existing model on a test set, you can use registered models from previous runs in AzureML, a set of
local checkpoints or a set of URLs pointing to model checkpoints. For all these options, you will need to set the
flag `--no-train` along with additional command line arguments to specify the checkpoints.

### From a registered model on AzureML

You will need to specify the registered model to run on using the `model_id` argument. You can find the model name and
version by clicking on `Registered Models` on the Details tab of a run in the AzureML UI.
The model id is of the form "model_name:model_version". Thus your command should look like this:

```shell
python InnerEyeLocal/ML/runner.py --azureml --model=Prostate --cluster=my_cluster_name \
  --no-train --model_id=Prostate:1
```

### From local checkpoints

To evaluate a model using one or more local checkpoints, use the `local_weights_path` argument to specify the path(s) to the
model checkpoint(s) on the local disk.

```shell
python InnerEyeLocal/ML/runner.py --model=Prostate --no-train --local_weights_path=path_to_your_checkpoint
```

To run on multiple checkpoints (if you have trained an ensemble model), pass the checkpoints as a comma-separated list to the
`local_weights_path` argument.

```shell
python InnerEyeLocal/ML/runner.py --model=Prostate --no-train --local_weights_path=path_to_first_checkpoint,path_to_second_checkpoint
```

### From URLs

To evaluate a model using one or more checkpoints each specified by a URL, use the `weights_url` argument to specify the
URL(s) from which the model checkpoint(s) should be downloaded.

```shell
python InnerEyeLocal/ML/runner.py --model=Prostate --no-train --weights_url=url_for_your_checkpoint
```

To run on multiple checkpoints (if you have trained an ensemble model), pass the URLs as a comma-separated list to the
`weights_url` argument.

```shell
python InnerEyeLocal/ML/runner.py --model=Prostate --no-train --weights_url=url_for_first_checkpoint,url_for_second_checkpoint
```

### Running a registered AzureML model on a single image on the local disk

To submit an AzureML run to apply a model to a single image on your local disk,
you can use the script `submit_for_inference.py`, with a command of this form:

```shell
python InnerEye/Scripts/submit_for_inference.py --image_file ~/somewhere/ct.nii.gz --model_id Prostate:555 \
--settings ../somewhere_else/settings.yml --download_folder ~/my_existing_folder
```

## Model Ensembles

An ensemble model will be created automatically and registered in the AzureML model registry whenever cross-validation
models are trained. The ensemble model creation is done by the child run whose `cross_validation_split_index` is 0.
As well as registering the model, child run 0 runs the ensemble model on the validation and test sets; the outputs of the individual child models are
aggregated based on the `ensemble_aggregation_type` value in the model config,
and the generated posteriors are passed to the usual model testing downstream pipelines, e.g. metrics computation.


### Interpreting results

Once your HyperDrive AzureML runs are completed, you can visualize the results by running the
[`plot_cross_validation.py`](/InnerEye/ML/visualizers/plot_cross_validation.py) script locally:

```shell
python InnerEye/ML/visualizers/plot_cross_validation.py --run_recovery_id ... --epoch ...
```

filling in the run recovery ID of the parent run and the epoch number (one of the test epochs, e.g. the last epoch)
for which you want results plotted. The script will also output several `..._outliers.txt` files with all of the outliers
across the splits, together with a portal query for retrieving them.

When baseline runs are provided for comparison, further outputs are computed from
the `metrics.csv` files of the current run and the comparison run(s), among them scatterplots for each pairing of the current run
with one of the baselines. Each one is named `AAA_vs_BBB.png`, where `AAA` and `BBB` are the run IDs
of the two models. Each plot shows the Dice scores on the test set for the models.
* For both segmentation and classification models, an IPython Notebook `report.ipynb` will be generated in the
`outputs` directory. This report will be based on the checkpoint that was written in the last training
epoch (stored in `checkpoints/last.ckpt`).
* For segmentation models, this report will contain detailed metrics per structure, and outliers
(test set images that had a particularly high error rate for one or more structures). The information
about outliers can be used to double-check the existing annotations for errors.
* For classification models, the report shows metrics on the validation and test sets, ROC and PR Curves,
and a list of the best and worst performing images from the test set.

Ensemble models are created by the zeroth child (with `cross_validation_split_index=0`) in each
cross-validation run. Results from inference on the test and validation sets are uploaded to the
parent run, and can be found in `epoch_NNN` directories as above.
In addition, various scores and plots from the ensemble and from individual child
runs are uploaded to the parent run, in the `CrossValResults` directory. This contains:

* Subdirectories named 0, 1, 2, ... for all the child runs including the zeroth one, as well
as `ENSEMBLE`, containing their respective `epoch_NNN` directories.
* Files `Dice_Test_Splits.png` and `Dice_Val_Splits.png`, containing box plots of the Dice scores
on the test and validation sets across the splits.
There is also a directory `BaselineComparisons`, containing the Wilcoxon test results and
scatterplots for the ensemble, as described above for single runs.

## Augmentations for classification models

For classification models, you can define an augmentation pipeline to apply to your image (resp. segmentation) inputs at
training, validation and test time. In order to define such a series of transformations, you will need to overload the
`get_image_transform` method of your model configuration so that it returns
a `ModelTransformsPerExecutionMode`, which maps each execution mode to one transform pipeline. Wrapping the transforms in an `ImageTransformationPipeline`
ensures the correct conversion of 2D or 3D PIL.Image or tensor inputs to the obtained pipeline.

`ImageTransformationPipeline` takes two arguments for its constructor:

* `transforms`: a list of image transforms, in particular you can feed in standard [torchvision transforms](https://pytorch.org/vision/0.8/transforms.html) or
any other transforms, as long as they support an input of shape `[Z, C, H, W]`, where `Z` is the third dimension (1 for
2D images), `C` the number of channels, and `H` and `W` the height and width of each 2D slice; standard torchvision
transforms support this. You can also define your own transforms, as long as they expect such a `[Z, C, H, W]` input.

For example:

```python
def get_image_transform(self) -> ModelTransformsPerExecutionMode:
    # Illustrative only: in practice the train pipeline would typically add
    # augmentations (e.g. random flips) on top of the resize.
    return ModelTransformsPerExecutionMode(
        train=ImageTransformationPipeline(transforms=[Resize(256)]),
        val=ImageTransformationPipeline(transforms=[Resize(256)]),
        test=ImageTransformationPipeline(transforms=[Resize(256)]))
```
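
To illustrate the `[Z, C, H, W]` contract, here is a sketch of a hand-written transform (the class is purely illustrative and not part of InnerEye); such a transform can be placed in the `transforms` list of an `ImageTransformationPipeline` alongside torchvision transforms:

```python
import torch


class AddGaussianNoise:
    """Adds zero-mean Gaussian noise to a [Z, C, H, W] image tensor."""

    def __init__(self, std: float = 0.01) -> None:
        self.std = std

    def __call__(self, image: torch.Tensor) -> torch.Tensor:
        # image has shape [Z, C, H, W]; Z is 1 for 2D images.
        return image + self.std * torch.randn_like(image)
```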

## Segmentation Models and Inference

By default, when building a segmentation model, a full image inference will be performed on the validation and test data sets;
and when building an ensemble model, a full image inference will be performed on the test data set only (because the
validation data has already been used by the individual cross-validation runs).
There are a total of six command line options for controlling this in more detail.

For non-ensemble models, use any of the following command line options to enable or disable inference on training, test, or validation data sets:

```shell
--inference_on_train_set=True or False
--inference_on_test_set=True or False
--inference_on_val_set=True or False
```

For ensemble models, use any of the following corresponding command line options:

```shell
--ensemble_inference_on_train_set=True or False
--ensemble_inference_on_test_set=True or False
--ensemble_inference_on_val_set=True or False
```
