This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Fix some Sphinx warnings (#699)
* Fix shell lexer name

* Update CHANGELOG

* Fix CHANGELOG

* Fix "html_static_path entry '_static' does not exist"

* Clean up preprocess script

* Fix link to InnerEye-DataQuality

* Use shutil.copy to copy files

* Remove extra info from CHANGELOG

* Fix broken link to LICENSE

* Fix lexer name for YAML

* Remove colons from headers

* Fix InnerEye module not being found
fepegar committed Mar 22, 2022
1 parent 95d8b72 commit 45e7d5f
Showing 10 changed files with 142 additions and 143 deletions.
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -37,7 +37,7 @@ jobs that run in AzureML.
`NIH_COVID_BYOL` to specify the name of the SSL training dataset.
- ([#560](https://github.com/microsoft/InnerEye-DeepLearning/pull/560)) Added pre-commit hooks.
- ([#619](https://github.com/microsoft/InnerEye-DeepLearning/pull/619)) Add DeepMIL PANDA
- ([#559](https://github.com/microsoft/InnerEye-DeepLearning/pull/559)) Adding the accompanying code for the ["Active label cleaning: Improving dataset quality under resource constraints"](https://arxiv.org/abs/2109.00574) paper. The code can be found in the [InnerEye-DataQuality](InnerEye-DataQuality/README.md) subfolder. It provides tools for training noise robust models, running label cleaning simulation and loading our label cleaning benchmark datasets.
- ([#559](https://github.com/microsoft/InnerEye-DeepLearning/pull/559)) Adding the accompanying code for the ["Active label cleaning: Improving dataset quality under resource constraints"](https://arxiv.org/abs/2109.00574) paper.
- ([#589](https://github.com/microsoft/InnerEye-DeepLearning/pull/589)) Add `LightningContainer.update_azure_config()`
hook to enable overriding `AzureConfig` parameters from a container (e.g. `experiment_name`, `cluster`, `num_nodes`).
- ([#617](https://github.com/microsoft/InnerEye-DeepLearning/pull/617)) Commandline flag `pl_check_val_every_n_epoch` to control how often validation is happening
@@ -97,6 +97,7 @@ gets uploaded to AzureML, by skipping all test folders.

### Fixed

- ([#699](https://github.com/microsoft/InnerEye-DeepLearning/pull/699)) Fix Sphinx warnings.
- ([#682](https://github.com/microsoft/InnerEye-DeepLearning/pull/682)) Ensure the shape of input patches is compatible with model constraints.
- ([#681](https://github.com/microsoft/InnerEye-DeepLearning/pull/681)) Pad model outputs if they are smaller than the inputs.
- ([#683](https://github.com/microsoft/InnerEye-DeepLearning/pull/683)) Fix missing separator error in docs Makefile.
8 changes: 3 additions & 5 deletions README.md
@@ -106,7 +106,7 @@ Further detailed instructions, including setup in Azure, are here:
1. [Model diagnostics](docs/model_diagnostics.md)
1. [Move a model to a different workspace](docs/move_model.md)
1. [Working with FastMRI models](docs/fastmri.md)
1. [Active label cleaning and noise robust learning toolbox](InnerEye-DataQuality/README.md)
1. [Active label cleaning and noise robust learning toolbox](https://github.com/microsoft/InnerEye-DeepLearning/blob/1606729c7a16e1bfeb269694314212b6e2737939/InnerEye-DataQuality/README.md)

## Deployment

@@ -133,7 +133,7 @@ Details can be found [here](docs/deploy_on_aml.md).

## Licensing

[MIT License](LICENSE)
[MIT License](/LICENSE)

**You are responsible for the performance, the necessary testing, and if needed any regulatory clearance for
any of the models produced by this toolbox.**
@@ -157,7 +157,7 @@ Oktay O., Nanavati J., Schwaighofer A., Carter D., Bristow M., Tanno R., Jena R.

Bannur S., Oktay O., Bernhardt M, Schwaighofer A., Jena R., Nushi B., Wadhwani S., Nori A., Natarajan K., Ashraf S., Alvarez-Valle J., Castro D. C.: Hierarchical Analysis of Visual COVID-19 Features from Chest Radiographs. ICML 2021 Workshop on Interpretable Machine Learning in Healthcare. [https://arxiv.org/abs/2107.06618](https://arxiv.org/abs/2107.06618)

Bernhardt M., Castro D. C., Tanno R., Schwaighofer A., Tezcan K. C., Monteiro M., Bannur S., Lungren M., Nori S., Glocker B., Alvarez-Valle J., Oktay O.: Active label cleaning for improved dataset quality under resource constraints. [https://www.nature.com/articles/s41467-022-28818-3](https://www.nature.com/articles/s41467-022-28818-3). Accompanying code [InnerEye-DataQuality](InnerEye-DataQuality/README.md)
Bernhardt M., Castro D. C., Tanno R., Schwaighofer A., Tezcan K. C., Monteiro M., Bannur S., Lungren M., Nori S., Glocker B., Alvarez-Valle J., Oktay O.: Active label cleaning for improved dataset quality under resource constraints. [https://www.nature.com/articles/s41467-022-28818-3](https://www.nature.com/articles/s41467-022-28818-3). Accompanying code [InnerEye-DataQuality](https://github.com/microsoft/InnerEye-DeepLearning/blob/1606729c7a16e1bfeb269694314212b6e2737939/InnerEye-DataQuality/README.md)

## Contributing

@@ -175,5 +175,3 @@ contact [[email protected]](mailto:[email protected]) with any additio

## This toolbox is maintained by the
[Microsoft Medical Image Analysis team](https://www.microsoft.com/en-us/research/project/medical-image-analysis/).


92 changes: 46 additions & 46 deletions docs/building_models.md

Large diffs are not rendered by default.

36 changes: 18 additions & 18 deletions docs/debugging_and_monitoring.md
@@ -2,21 +2,21 @@

### Using TensorBoard to monitor AzureML jobs

* **Existing jobs**: execute [`InnerEye/Azure/tensorboard_monitor.py`](/InnerEye/Azure/tensorboard_monitor.py)
with either an experiment id `--experiment_name` or a list of run ids `--run_ids job1,job2,job3`.
If an experiment id is provided, all of the runs in that experiment will be monitored. Additionally, you can
filter runs by the run's status, setting the `--filters Running,Completed` parameter to a subset of
`[Running, Completed, Failed, Canceled]`. By default, Failed and Canceled runs are excluded.
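The status filter amounts to simple set membership over run statuses. A minimal sketch of that logic, using hypothetical dictionaries in place of the AzureML run objects the real script operates on:

```python
# Hypothetical stand-ins for AzureML run objects; the actual script fetches
# these from the workspace. Only the filtering logic is sketched here.
runs = [
    {"id": "job1", "status": "Running"},
    {"id": "job2", "status": "Failed"},
    {"id": "job3", "status": "Completed"},
]

def filter_runs(runs, filters=("Running", "Completed")):
    """Keep only runs whose status is in `filters`.

    Mirrors the documented default: Failed and Canceled runs are excluded.
    """
    allowed = set(filters)
    return [run for run in runs if run["status"] in allowed]

monitored = filter_runs(runs)  # job1 and job3 survive the default filter
```

Passing a different tuple for `filters` reproduces the effect of the `--filters` commandline parameter.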

To quickly access this script from PyCharm, there is a template PyCharm run configuration
`Template: Tensorboard monitoring` in the repository. Create a copy of that, and modify the commandline
arguments with your jobs to monitor.

* **New jobs**: when queuing a new AzureML job, pass `--tensorboard`, which will automatically start a new TensorBoard
session, monitoring the newly queued job.

### Resource Monitor
GPU and CPU usage can be monitored throughout the execution of a run (local and AML) by setting the monitoring interval
for the resource monitor, e.g. `--monitoring_interval_seconds=5`. This will spawn a separate process at the start of the
run which will log both GPU and CPU utilization and memory consumption. These metrics will be written to AzureML as
well as a separate TensorBoard logs file under `Diagnostics`.
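A monitor like this is essentially a periodic sampler running alongside the training process. A stdlib-only sketch of the pattern (the real monitor samples GPU/CPU counters and writes to AzureML and TensorBoard, which is omitted here):

```python
import threading
import time

class ResourceMonitor:
    """Periodically call `sample_fn` in a background thread and collect results.

    In the real monitor, `sample_fn` would read GPU/CPU utilization and memory
    consumption; here it is a pluggable callable so the pattern stays generic.
    """

    def __init__(self, interval_seconds, sample_fn):
        self.interval_seconds = interval_seconds
        self.sample_fn = sample_fn
        self.samples = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            self.samples.append((time.time(), self.sample_fn()))
            # wait() doubles as an interruptible sleep between samples
            self._stop.wait(self.interval_seconds)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

# Tiny demo with a constant metric and a short interval:
monitor = ResourceMonitor(interval_seconds=0.01, sample_fn=lambda: 42.0)
monitor.start()
time.sleep(0.05)
monitor.stop()
```

Running the sampler in a separate thread (or, as in the real monitor, a separate process) keeps metric collection from blocking the training loop.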
@@ -26,12 +26,12 @@ well as a separate TensorBoard logs file under `Diagnostics`.
For full debugging of any non-trivial model, you will need a GPU. Some basic debugging can also be carried out on
standard Linux or Windows machines.

The main entry point into the code is [`InnerEye/ML/runner.py`](/InnerEye/ML/runner.py). The code takes its
configuration elements from commandline arguments and a settings file,
[`InnerEye/settings.yml`](/InnerEye/settings.yml).

A password for the (optional) Azure Service
Principal is read from `InnerEyeTestVariables.txt` in the repository root directory. The file
is expected to contain a line of the form
```
APPLICATION_KEY=<app key for your AML workspace>
```
@@ -48,7 +48,7 @@ create a copy of the template run configuration, and change the arguments to suit
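Reading such a variables file boils down to parsing `KEY=value` lines. A hypothetical helper (not the repository's actual implementation) might look like:

```python
from pathlib import Path

def read_variables(path):
    """Parse KEY=value lines from a variables file.

    Blank lines, comments, and lines without '=' are ignored; values may
    contain '=' characters, so only the first one is treated as the separator.
    """
    variables = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        variables[key.strip()] = value.strip()
    return variables

# Usage (assuming the file exists in the repository root):
# app_key = read_variables("InnerEyeTestVariables.txt")["APPLICATION_KEY"]
```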

Here are a few hints on how you can reduce the complexity of training if you need to debug an issue. In most cases,
you should then be able to rely on a CPU machine.
* Reduce the number of feature channels in your model. If you run a UNet, for example, you can set
`feature_channels = [1]` in your model definition file.
* Train only for a single epoch. You can set `--num_epochs=1` via the commandline or the `more_switches` variable
if you start your training via a build definition. This will only create a model checkpoint at epoch 1, and ignore
@@ -63,16 +63,16 @@ With the above settings, you should be able to get a model training run to comple
### Verify your changes using a simplified fast model

If you made any changes to the code that submits experiments (either `azure_runner.py` or `runner.py` or code
imported by those), validate them using a model training run in Azure. You can queue a model training run for the
simplified `BasicModel2Epochs` model.


# Debugging on an AzureML node

It is sometimes possible to get a Python debugging (pdb) session on the main process for a model
training run on an AzureML compute cluster, for example if a run produces unexpected output,
or is silent for what seems like an unreasonably long time. For this to work, you will need to
have created the cluster with ssh access enabled; it is not currently possible to add this
after the cluster is created. The steps are as follows.

* From the "Details" tab in the run's page, note the Run ID, then click on the target name under
@@ -82,13 +82,13 @@ after the cluster is created.
supply the password chosen when the cluster was created.
* Type "bash" for a nicer command shell (optional).
* Identify the main python process with a command such as
```shell
ps aux | grep 'python.*runner.py' | egrep -wv 'bash|grep'
```
You may need to vary this if it does not yield exactly one line of output.
* Note the process identifier (the value in the PID column, generally the second one).
* Issue the commands
```shell
kill -TRAP nnnn
nc 127.0.0.1 4444
```
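For `kill -TRAP` followed by `nc 127.0.0.1 4444` to yield a debugger prompt, the training process must have installed a SIGTRAP handler that serves a pdb session on that port. A rough sketch of how such a hook could be implemented (assumed behavior, not the repository's exact code; the port number 4444 matches the command above):

```python
import pdb
import signal
import socket

def remote_pdb_handler(signum, frame):
    # On SIGTRAP, wait for one connection on localhost:4444 (e.g. from `nc`)
    # and attach a pdb session to it, paused at the interrupted frame.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("127.0.0.1", 4444))
    server.listen(1)
    conn, _ = server.accept()
    stream = conn.makefile("rw")
    pdb.Pdb(stdin=stream, stdout=stream).set_trace(frame)

# Installed once at startup; `kill -TRAP <pid>` then triggers the handler.
signal.signal(signal.SIGTRAP, remote_pdb_handler)
```

Note that signal handlers can only be installed from the main thread, and the handler blocks in `accept()` until someone connects, which is why the run appears paused after sending the signal.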
