This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

FastMRI dataset onboarding script and detailed examples #444

Merged: 64 commits from antonsc/fastmri into main on May 19, 2021

Commits
ee10d2c
better logging
ant0nsc Apr 20, 2021
60ebb7c
increase upload timeout
ant0nsc Apr 21, 2021
f9ea89d
onboarding script
ant0nsc Apr 22, 2021
85af616
project file
ant0nsc Apr 22, 2021
14cb742
Merge remote-tracking branch 'origin/main' into antonsc/fastmri
ant0nsc Apr 22, 2021
b112445
changelog
ant0nsc Apr 22, 2021
a1ba5fb
flake8
ant0nsc Apr 22, 2021
4892cee
doc
ant0nsc Apr 22, 2021
e2e9252
fix auth problems
ant0nsc Apr 23, 2021
b20cdb6
fix AWS problem
ant0nsc Apr 23, 2021
a513d8d
Merge remote-tracking branch 'origin/main' into antonsc/fastmri
ant0nsc Apr 23, 2021
c4dbeb6
style fix
ant0nsc Apr 23, 2021
3a15484
fix .tar.gz problem
ant0nsc Apr 26, 2021
3fcf966
fix multi-node problem on HelloContainer
ant0nsc Apr 26, 2021
ecab8f2
mypy
ant0nsc Apr 26, 2021
35dac4c
docu
ant0nsc Apr 26, 2021
f63d460
docu
ant0nsc Apr 26, 2021
d24a7dd
running fastmri on knee_singlecoil
ant0nsc Apr 26, 2021
1a6463e
logging noise
ant0nsc Apr 27, 2021
7e5edd2
downgrade azure-mgmt-resource because it leads to loads of warnings
ant0nsc Apr 27, 2021
8ad9354
bug fix
ant0nsc Apr 27, 2021
bc3c37c
cleanup
ant0nsc Apr 27, 2021
7a5014d
flake
ant0nsc Apr 27, 2021
25f8147
docu
ant0nsc Apr 27, 2021
d51679a
docu
ant0nsc Apr 27, 2021
aba61d7
docu
ant0nsc Apr 28, 2021
24a56c4
docu
ant0nsc Apr 28, 2021
3d6251a
progress bar
ant0nsc May 4, 2021
22342c7
rename func
ant0nsc May 4, 2021
b64ee72
PR doc
ant0nsc May 11, 2021
acb51f9
adding more models
ant0nsc May 11, 2021
10fede1
Merge remote-tracking branch 'origin/main' into antonsc/fastmri
ant0nsc May 11, 2021
8d281a0
docu
ant0nsc May 11, 2021
8826890
docu
ant0nsc May 11, 2021
dfdb900
mypy
ant0nsc May 11, 2021
ef52486
test fix
ant0nsc May 12, 2021
8d452ae
Adding more hooks
ant0nsc May 12, 2021
4741fe3
Merge remote-tracking branch 'origin/main' into antonsc/fastmri
ant0nsc May 12, 2021
18d2c63
merge
ant0nsc May 12, 2021
16d5c9c
adding fixed mountpoints
ant0nsc May 12, 2021
cfb32e6
mypy
ant0nsc May 12, 2021
0e2a28d
mypy
ant0nsc May 12, 2021
4da0544
PR doc
ant0nsc May 12, 2021
b1aeba2
doc
ant0nsc May 12, 2021
f88b253
test fix
ant0nsc May 12, 2021
307366f
test fix
ant0nsc May 12, 2021
799a531
docu
ant0nsc May 12, 2021
ab450c3
fallback
ant0nsc May 12, 2021
7ea6cba
removing "unused params" warning
ant0nsc May 12, 2021
040aebc
docker warning
ant0nsc May 12, 2021
0bfaea7
mypy
ant0nsc May 12, 2021
87726ab
docu
ant0nsc May 12, 2021
c683cb1
test fix
ant0nsc May 12, 2021
03065a9
docu
ant0nsc May 12, 2021
da26cd7
docu
ant0nsc May 12, 2021
932dfa8
accidental changes
ant0nsc May 14, 2021
6ef12ab
PR comments
ant0nsc May 14, 2021
b163757
Update InnerEye/Scripts/prepare_fastmri.py
ant0nsc May 14, 2021
d5e2520
fix stuck HelloContainer problem
ant0nsc May 18, 2021
fe16627
diagnostics
ant0nsc May 18, 2021
2bd0eba
Merge remote-tracking branch 'origin/main' into antonsc/fastmri
ant0nsc May 18, 2021
6ada342
remove accidental exit(1)
ant0nsc May 18, 2021
eb303eb
unique name
ant0nsc May 19, 2021
1fb63dc
Merge branch 'main' into antonsc/fastmri
ant0nsc May 19, 2021
1 change: 1 addition & 0 deletions .idea/InnerEye-DeepLearning.iml


7 changes: 4 additions & 3 deletions .idea/runConfigurations/Template__Run_ML_on_AzureML.xml


5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -58,6 +58,8 @@ with only minimum code changes required. See [the MD documentation](docs/bring_y
(`ScalarLoss.CustomClassification` and `CustomRegression`), prediction targets (`ScalarModelBase.target_names`),
and reporting (`ModelConfigBase.generate_custom_report()`) in scalar configs, providing more flexibility for defining
model configs with custom behaviour while leveraging the existing InnerEye workflows.
- ([#444](https://github.com/microsoft/InnerEye-DeepLearning/pull/444)) Added setup scripts and documentation to work
with the FastMRI challenge datasets.
- ([#445](https://github.com/microsoft/InnerEye-DeepLearning/pull/445)) Adding test coverage for the `HelloContainer`
model with multiple GPUs.
- ([#450](https://github.com/microsoft/InnerEye-DeepLearning/pull/450)) Adds the metric "Accuracy at threshold 0.5" to the classification report (`classification_crossval_report.ipynb`).
@@ -94,6 +96,9 @@ with only minimum code changes required. See [the MD documentation](docs/bring_y
named `recovery_epoch=x.ckpt` instead of `recovery.ckpt` or `recovery-v0.ckpt`.
- ([#451](https://github.com/microsoft/InnerEye-DeepLearning/pull/451)) Change the signature for function `generate_custom_report`
in `ModelConfigBase` to take only the path to the reports folder and a `ModelProcessing` object.
- ([#444](https://github.com/microsoft/InnerEye-DeepLearning/pull/444)) The method `before_training_on_rank_zero` of
the `LightningContainer` class has been renamed to `before_training_on_global_rank_zero`. The order in which the
hooks are called has been changed.
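
As an illustration of the renamed hook, here is a hedged sketch (not code from this changelog; the container class name and body are placeholders, and the import path is assumed from this repository's layout):

from InnerEye.ML.lightning_container import LightningContainer

class MyContainer(LightningContainer):
    def before_training_on_global_rank_zero(self) -> None:
        # Runs once on the process with global rank 0, before training starts.
        # This hook was previously called before_training_on_rank_zero.
        ...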

### Fixed

67 changes: 66 additions & 1 deletion InnerEye/Azure/azure_config.py
@@ -12,8 +12,10 @@
from typing import Any, Callable, Dict, List, Optional, Union

import param
from azureml.core import Run, ScriptRunConfig, Workspace
from azureml.core import Dataset, Datastore, Run, ScriptRunConfig, Workspace
from azureml.core.authentication import InteractiveLoginAuthentication, ServicePrincipalAuthentication
from azureml.data import FileDataset
from azureml.data.dataset_consumption_config import DatasetConsumptionConfig
from azureml.train.hyperdrive import HyperDriveConfig
from git import Repo

@@ -25,6 +27,8 @@
# The name of the "azureml" property of AzureConfig
AZURECONFIG_SUBMIT_TO_AZUREML = "azureml"

INPUT_DATA_KEY = "input_data"


@dataclass(frozen=True)
class GitInformation:
@@ -242,6 +246,67 @@ def fetch_run(self, run_recovery_id: str) -> Run:
        """
        return fetch_run(workspace=self.get_workspace(), run_recovery_id=run_recovery_id)

    def get_or_create_dataset(self, azure_dataset_id: str) -> FileDataset:
        """
        Looks in the AzureML datastore for a dataset of the given name. If there is no such dataset, a dataset is
        created and registered, assuming that the files are in a folder that has the same name as the dataset.
        For example, if azure_dataset_id is 'foo', then the 'foo' dataset should point to the folder
        <container_root>/datasets/foo

        WARNING: the behaviour of Dataset.File.from_files, used below, is idiosyncratic. For example,
        if "mydataset" storage has two "foo..." subdirectories each containing
        a file dataset.csv and a directory ABC,

        datastore = Datastore.get(workspace, "mydataset")
        # This dataset has the file(s) in foo-bar01 at top level, e.g. dataset.csv
        ds1 = Dataset.File.from_files([(datastore, "foo-bar01/*")])
        # This dataset has two directories at top level, each with a name matching foo-bar*, and each
        # containing dataset.csv.
        ds2 = Dataset.File.from_files([(datastore, "foo-bar*/*")])
        # This dataset contains a single directory "mydataset" at top level, containing a subdirectory
        # foo-bar01, containing dataset.csv and (part of) ABC.
        ds3 = Dataset.File.from_files([(datastore, "foo-bar01/*"),
                                       (datastore, "foo-bar01/ABC/abc_files/*/*.nii.gz")])

        These behaviours can be verified by calling "ds.download()" on each dataset ds.
        """
        if not self.azureml_datastore:
            raise ValueError("No value set for 'azureml_datastore' (name of the datastore in the AzureML workspace)")
        logging.info(f"Retrieving datastore '{self.azureml_datastore}' from AzureML workspace")
        workspace = self.get_workspace()
        datastore = Datastore.get(workspace, self.azureml_datastore)
        try:
            logging.info(f"Trying to retrieve AzureML Dataset '{azure_dataset_id}'")
            azureml_dataset = Dataset.get_by_name(workspace, name=azure_dataset_id)
            logging.info("Dataset found.")
        except Exception:
            logging.info(f"Dataset does not yet exist, creating a new one from data in folder '{azure_dataset_id}'")
            # See WARNING above before changing the from_files call!
            azureml_dataset = Dataset.File.from_files([(datastore, azure_dataset_id)])
            logging.info("Registering the dataset for future use.")
            azureml_dataset.register(workspace, name=azure_dataset_id)
        return azureml_dataset

    def get_dataset_consumption(self,
                                azure_dataset_id: str,
                                dataset_index: int,
                                mountpoint: str) -> DatasetConsumptionConfig:
        """
        Creates a configuration for using an AzureML dataset inside of an AzureML run. This will make the AzureML
        dataset with the given name available as a named input, keyed on INPUT_DATA_KEY with the dataset index as
        a suffix.
        :param azure_dataset_id: The name of the dataset in blob storage to be used for this run. This can be an
            empty string to not use any datasets.
        :param dataset_index: Suffix for the input name; the named input will be INPUT_DATA_KEY_<dataset_index>.
        :param mountpoint: The path at which the dataset should be mounted, if using dataset mounting rather
            than downloading.
        """
        azureml_dataset = self.get_or_create_dataset(azure_dataset_id=azure_dataset_id)
        if not azureml_dataset:
            raise ValueError(f"AzureML dataset {azure_dataset_id} could not be found or created.")
        named_input = azureml_dataset.as_named_input(f"{INPUT_DATA_KEY}_{dataset_index}")
        path_on_compute = mountpoint or None
        return named_input.as_mount(path_on_compute) if self.use_dataset_mount else named_input.as_download()


@dataclass
class SourceConfig:
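
To see how the two new AzureConfig methods fit together, here is a minimal usage sketch. This is not code from the PR: the datastore name, dataset name, and training script are hypothetical placeholders, and in real use the AzureConfig fields are read from the project settings file rather than set inline.

from azureml.core import ScriptRunConfig

from InnerEye.Azure.azure_config import AzureConfig

# Hypothetical configuration; real runs populate AzureConfig from settings.
azure_config = AzureConfig(azureml_datastore="innereyedatasets",
                           use_dataset_mount=True)

# Returns a DatasetConsumptionConfig that mounts the (possibly newly registered)
# dataset "fastmri_knee_singlecoil" under the named input "input_data_0".
dataset_consumption = azure_config.get_dataset_consumption(
    azure_dataset_id="fastmri_knee_singlecoil",
    dataset_index=0,
    mountpoint="")

# Attach the dataset to an AzureML run configuration before submission.
script_run_config = ScriptRunConfig(source_directory=".", script="train.py")
script_run_config.run_config.data = {dataset_consumption.name: dataset_consumption}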