# Building Models

## Setting up training

To train new models, you can either work within the `InnerEye/` directory hierarchy or create a local hierarchy beside it
with the same internal organization (although with far fewer files).
We recommend the latter as it offers more flexibility and better separation of concerns. Here we will assume you
create a directory `InnerEyeLocal` beside `InnerEye`.

As well as your configurations (dealt with below) you will need these files:

* `settings.yml`: A file similar to `InnerEye/settings.yml` containing all your Azure settings.
The value of `extra_code_directory` should (in our example) be `'InnerEyeLocal'`,
and `model_configs_namespace` should be `'InnerEyeLocal.ML.configs'`.
* A folder like `InnerEyeLocal` that contains your additional code and model configurations.
* A file `InnerEyeLocal/ML/runner.py` that invokes the InnerEye training runner, but that points the code to your environment and Azure
settings.

```python
from pathlib import Path
import os

if __name__ == '__main__':
    main()
```
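
Here, `main` resolves the repository root and hands control to the InnerEye runner, pointing it at your own settings file. A minimal sketch, assuming the runner module exposes a `run` function that accepts the project root and the path to your `settings.yml` (the exact module path and argument names are assumptions; check `InnerEye/ML/runner.py` in your version):

```python
from pathlib import Path
import os

# Assumption: the InnerEye training runner lives in InnerEye.ML.runner and
# exposes a `run` entry point; verify the exact signature in your checkout.
from InnerEye.ML import runner


def main() -> None:
    # This file lives in InnerEyeLocal/ML/, so the repository root is two levels up.
    current = os.path.dirname(os.path.realpath(__file__))
    project_root = Path(os.path.realpath(os.path.join(current, "..", "..")))
    # Point the runner at your Azure settings file; adjust the path to wherever
    # your settings.yml lives.
    runner.run(project_root=project_root,
               yaml_config_file=project_root / "InnerEyeLocal" / "settings.yml")


if __name__ == '__main__':
    main()
```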

## Creating the model configuration

You will find a variety of model configurations [here](/InnerEye/ML/configs/segmentation). Those not ending
in `Base.py` reference open-sourced data and can be used as they are. Those ending in `Base.py`
are partially specified, and can be used by having other model configurations inherit from them and supply the missing
parameter values: a dataset ID at least, and optionally other values. For example, a `Prostate` model might inherit
very simply from `ProstateBase` by creating `Prostate.py` in the directory `InnerEyeLocal/ML/configs/segmentation`
with the following contents:

```python
from InnerEye.ML.configs.segmentation.ProstateBase import ProstateBase


class Prostate(ProstateBase):
    def __init__(self) -> None:
        super().__init__(
            ground_truth_ids=["femur_r", "femur_l", "rectum", "prostate"],
            azure_dataset_id="name-of-your-AML-dataset-with-prostate-data")
```

The allowed parameters and their meanings are defined in [`SegmentationModelBase`](/InnerEye/ML/config.py).
The class name must be the same as the basename of the file containing it, so `Prostate.py` must contain `Prostate`.
In `settings.yml`, set `model_configs_namespace` to `InnerEyeLocal.ML.configs` so this config
is found by the runner.

A `Head and Neck` model might inherit from `HeadAndNeckBase` by creating `HeadAndNeck.py` with the following contents:

```python
from InnerEye.ML.configs.segmentation.HeadAndNeckBase import HeadAndNeckBase


class HeadAndNeck(HeadAndNeckBase):
    def __init__(self) -> None:
        super().__init__(
            azure_dataset_id="name-of-your-AML-dataset-with-head-and-neck-data")
```

## Training a new model

* Set up your model configuration as above and update `azure_dataset_id` to the name of your Dataset in the AML workspace.
It is enough to put your dataset into blob storage. The dataset should be contained in a folder at the root of the datasets container.
The InnerEye runner will check if there is a dataset in the AzureML workspace already, and if not, generate it directly from blob storage.

* Train a new model, for example `Prostate`:

```shell
python InnerEyeLocal/ML/runner.py --azureml --model=Prostate
```

You can also train the model on your local machine. In this case, you would simply omit the `--azureml` flag, and instead of specifying
`azure_dataset_id` in the class constructor, use `local_dataset="my/data/folder"`,
where the folder `my/data/folder` contains a `dataset.csv` file and all the files that are referenced therein.
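
For example, a local-only variant of the `Prostate` config could set `local_dataset` instead of `azure_dataset_id`. This is only a sketch: the class name, the folder path, and the assumption that the base class forwards the `local_dataset` keyword argument are all illustrative.

```python
from pathlib import Path

from InnerEye.ML.configs.segmentation.ProstateBase import ProstateBase


class ProstateLocal(ProstateBase):
    def __init__(self) -> None:
        super().__init__(
            ground_truth_ids=["femur_r", "femur_l", "rectum", "prostate"],
            # Folder on the local disk that contains dataset.csv and the files it references.
            local_dataset=Path("my/data/folder"))
```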

## Boolean Options

Note that for command line options that take a boolean argument, and that are `False` by default, there are multiple ways of setting the option. For example, alternatives to `--azureml` include:

* `--azureml=True`, or `--azureml=true`, or `--azureml=T`, or `--azureml=t`
* `--azureml=Yes`, or `--azureml=yes`, or `--azureml=Y`, or `--azureml=y`
* `--azureml=On`, or `--azureml=on`
* `--azureml=1`

Conversely, for command line options that take a boolean argument, and that are `True` by default, there are multiple ways of un-setting the option. For example, alternatives to `--no-train` include:

* `--train=False`, or `--train=false`, or `--train=F`, or `--train=f`
* `--train=No`, or `--train=no`, or `--train=N`, or `--train=n`
* `--train=Off`, or `--train=off`
* `--train=0`

## Training using multiple machines

To speed up training in AzureML, you can use multiple machines by specifying the additional
`--num_nodes` argument. For example, to use 2 machines to train, specify:

```shell
python InnerEyeLocal/ML/runner.py --azureml --model=Prostate --num_nodes=2
```

On each of the 2 machines, all available GPUs will be used. Model inference will always use only one machine.

For the Prostate model, we observed a 2.8x speedup for model training when using 4 nodes, and a 1.65x speedup
when using 2 nodes.

## AzureML Run Hierarchy

AzureML structures all jobs in a hierarchical fashion:

* The top-level concept is a workspace
* Inside of a workspace, there are multiple experiments. Upon starting a training run, the name of the experiment
needs to be supplied. The InnerEye toolbox is designed to work with git repositories, and it automatically
sets the experiment name to match the name of the current git branch.
* Inside of an experiment, there are multiple runs. When starting the InnerEye toolbox as above, a run will be created.
* A run can have child runs - see below in the discussion about cross validation.


## K-Fold Model Cross Validation

For running K-fold cross validation, the InnerEye toolbox schedules multiple training runs in the cloud that run
at the same time (provided that the cluster has capacity). This means that a complete cross validation run usually
takes about as long as a single training run. To run K-fold cross validation with `N` splits, supply the additional
command line argument `--number_of_cross_validation_splits=N`; this schedules a HyperDrive run with `N` child runs.
The dataset splits for those `N` child runs will be
computed from the union of the Training and Validation sets. The Test set is unchanged. Note that the Test set can be
empty, in which case the union of all validation sets for the `N` child runs will be the full dataset.
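
Command line options like this generally mirror fields of the model configuration, so the number of splits can also be set there. A minimal sketch, assuming (as in the earlier examples) that the base class forwards the `number_of_cross_validation_splits` keyword argument; the class name is hypothetical:

```python
from InnerEye.ML.configs.segmentation.ProstateBase import ProstateBase


class ProstateCrossVal(ProstateBase):
    def __init__(self) -> None:
        super().__init__(
            ground_truth_ids=["femur_r", "femur_l", "rectum", "prostate"],
            azure_dataset_id="name-of-your-AML-dataset-with-prostate-data",
            # Schedules 5 child runs, one per cross-validation fold.
            number_of_cross_validation_splits=5)
```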

## Recovering failed runs and continuing training

To train further with an already-created model, run the above training command with the additional `run_recovery_id` argument:

```shell
--run_recovery_id=foo_bar:foo_bar_12345_abcd
```

The run recovery ID is of the form "experiment_id:run_id". When you trained your original model, it will have been
queued as a "Run" inside of an "Experiment". The experiment will be given a name derived from the branch name - for
example, branch `foo/bar` will queue a run in experiment `foo_bar`. Inside the "Tags" section of your run, you should
see an element `run_recovery_id`. It will look something like `foo_bar:foo_bar_12345_abcd`.
If you are recovering a HyperDrive run, the value of `--run_recovery_id` should be that of the parent run,
and `--number_of_cross_validation_splits` should have the same value as in the recovered run.
For example:

```shell
--run_recovery_id=foo_bar:HD_55d4beef-7be9-45d7-89a5-1acf1f99078a --start_epoch=120 --number_of_cross_validation_splits=5
```

The run recovery ID of a parent HyperDrive run is currently not displayed in the "Details" section
of the AzureML UI. The easiest way to get it is to go to any of the child runs and use its
run recovery ID without the final underscore and digit.

## Testing an existing model

To evaluate an existing model on a test set, you can use registered models from previous runs in AzureML, a set of
local checkpoints or a set of URLs pointing to model checkpoints. For all these options, you will need to set the
flag `--no-train` along with additional command line arguments to specify the checkpoints.

### From a registered model on AzureML

You will need to specify the registered model to run on using the `model_id` argument. You can find the model name and
version by clicking on `Registered Models` on the Details tab of a run in the AzureML UI.
The model id is of the form "model_name:model_version". Thus your command should look like this:

```shell
python InnerEyeLocal/ML/runner.py --azureml --model=Prostate --cluster=my_cluster_name \
  --no-train --model_id=Prostate:1
```

### From local checkpoints

To evaluate a model using one or more local checkpoints, use the `local_weights_path` argument to specify the path(s) to the
model checkpoint(s) on the local disk.

```shell
python InnerEyeLocal/ML/runner.py --model=Prostate --no-train --local_weights_path=path_to_your_checkpoint
```

To run on multiple checkpoints (if you have trained an ensemble model), pass the checkpoints as a comma-separated list to the
`local_weights_path` argument.

```shell
python InnerEyeLocal/ML/runner.py --model=Prostate --no-train --local_weights_path=path_to_first_checkpoint,path_to_second_checkpoint
```

### From URLs

To evaluate a model using one or more checkpoints each specified by a URL, use the `weights_url` argument to specify the
URL(s) from which the model checkpoint(s) should be downloaded.

```shell
python InnerEyeLocal/ML/runner.py --model=Prostate --no-train --weights_url=url_for_your_checkpoint
```

To run on multiple checkpoints (if you have trained an ensemble model), pass the URLs as a comma-separated list to the
`weights_url` argument.

```shell
python InnerEyeLocal/ML/runner.py --model=Prostate --no-train --weights_url=url_for_first_checkpoint,url_for_second_checkpoint
```

### Running a registered AzureML model on a single image on the local disk

To submit an AzureML run to apply a model to a single image on your local disk,
you can use the script `submit_for_inference.py`, with a command of this form:

```shell
python InnerEye/Scripts/submit_for_inference.py --image_file ~/somewhere/ct.nii.gz --model_id Prostate:555 \
--settings ../somewhere_else/settings.yml --download_folder ~/my_existing_folder
```

## Model Ensembles

An ensemble model will be created automatically and registered in the AzureML model registry whenever cross-validation
models are trained. The ensemble model creation is done by the child run whose `cross_validation_split_index` is 0.
As well as registering the model, child run 0 runs the ensemble model on the validation and test sets; the outputs of the individual child models are
aggregated based on the `ensemble_aggregation_type` value in the model config,
and the generated posteriors are passed to the usual model testing downstream pipelines, e.g. metrics computation.


### Interpreting results

Once your HyperDrive AzureML runs are completed, you can visualize the results by running the
[`plot_cross_validation.py`](/InnerEye/ML/visualizers/plot_cross_validation.py) script locally:

```shell
python InnerEye/ML/visualizers/plot_cross_validation.py --run_recovery_id ... --epoch ...
```

filling in the run recovery ID of the parent run and the epoch number (one of the test epochs, e.g. the last epoch)
for which you want results plotted. The script will also output several `..._outliers.txt` files with all of the outliers
across the splits, together with a portal query for retrieving them.

When baseline runs are provided for comparison, further outputs are computed from
the `metrics.csv` files of the current run and the comparison run(s), among them scatterplots for each pairing of the current run
with one of the baselines. Each one is named `AAA_vs_BBB.png`, where `AAA` and `BBB` are the run IDs
of the two models. Each plot shows the Dice scores on the test set for the models.
* For both segmentation and classification models, an IPython Notebook `report.ipynb` will be generated in the
`outputs` directory. This report will be based on the checkpoint that was written in the last training
epoch (stored in `checkpoints/last.ckpt`).
* For segmentation models, this report will contain detailed metrics per structure, and outliers
(test set images that had a particularly high error rate for one or more structures). The information
about outliers can be used to double-check the existing annotations for errors.
* For classification models, the report shows metrics on the validation and test sets, ROC and PR Curves,
and a list of the best and worst performing images from the test set.

Ensemble models are created by the zeroth child (with `cross_validation_split_index=0`) in each
cross-validation run. Results from inference on the test and validation sets are uploaded to the
parent run, and can be found in `epoch_NNN` directories as above.
In addition, various scores and plots from the ensemble and from individual child
runs are uploaded to the parent run, in the `CrossValResults` directory. This contains:

* Subdirectories named 0, 1, 2, ... for all the child runs including the zeroth one, as well
as `ENSEMBLE`, containing their respective `epoch_NNN` directories.
* Files `Dice_Test_Splits.png` and `Dice_Val_Splits.png`, containing box plots of the Dice scores
on the test and validation sets across the splits.
There is also a directory `BaselineComparisons`, containing the Wilcoxon test results and
scatterplots for the ensemble, as described above for single runs.

## Augmentations for classification models

For classification models, you can define an augmentation pipeline to apply to your image (resp. segmentation) inputs at
training, validation and test time. In order to define such a series of transformations, you will need to overload the
`get_image_transform` method of your model configuration so that it returns
a `ModelTransformsPerExecutionMode`, which maps each execution mode to one transform pipeline. Wrapping the transforms in an `ImageTransformationPipeline`
ensures the correct conversion of 2D or 3D PIL.Image or tensor inputs to the obtained pipeline.

`ImageTransformationPipeline` takes two arguments for its constructor:

* `transforms`: a list of image transforms, in particular you can feed in standard [torchvision transforms](https://pytorch.org/vision/0.8/transforms.html) or
any other transforms, as long as they support an input of shape `[Z, C, H, W]`, where `Z` is the third dimension (1 for
2D images), `C` the number of channels, and `H` and `W` the height and width of each 2D slice; standard torchvision
transforms support this. You can also define your own transforms, as long as they expect such a `[Z, C, H, W]` input.

For example:

```python
def get_image_transform(self) -> ModelTransformsPerExecutionMode:
    # Illustrative only: in practice the train pipeline would typically add
    # augmentations (e.g. random flips) on top of the resize.
    return ModelTransformsPerExecutionMode(
        train=ImageTransformationPipeline(transforms=[Resize(256)]),
        val=ImageTransformationPipeline(transforms=[Resize(256)]),
        test=ImageTransformationPipeline(transforms=[Resize(256)]))
```
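
To illustrate the `[Z, C, H, W]` contract, here is a sketch of a hand-written transform (the class is purely illustrative and not part of InnerEye); such a transform can be placed in the `transforms` list of an `ImageTransformationPipeline` alongside torchvision transforms:

```python
import torch


class AddGaussianNoise:
    """Adds zero-mean Gaussian noise to a [Z, C, H, W] image tensor."""

    def __init__(self, std: float = 0.01) -> None:
        self.std = std

    def __call__(self, image: torch.Tensor) -> torch.Tensor:
        # image has shape [Z, C, H, W]; Z is 1 for 2D images.
        return image + self.std * torch.randn_like(image)
```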

## Segmentation Models and Inference

By default, when building a segmentation model, a full image inference will be performed on the validation and test data sets;
and when building an ensemble model, a full image inference will be performed on the test data set only (because the
validation data has already been used by the individual cross-validation runs).
There are a total of six command line options for controlling this in more detail.

For non-ensemble models, use any of the following command line options to enable or disable inference on training, test, or validation data sets:

```shell
--inference_on_train_set=True or False
--inference_on_test_set=True or False
--inference_on_val_set=True or False
```

For ensemble models, use any of the following corresponding command line options:

```shell
--ensemble_inference_on_train_set=True or False
--ensemble_inference_on_test_set=True or False
--ensemble_inference_on_val_set=True or False
```
