Added a sanity check to ensure there are no missing channels/files in csv prior training/inference. #447

asantamariapang · 2021-04-28T01:59:44Z

Fixed bug #428: ensure that dataset.csv is checked right at the start of training. This applies to segmentation models.

Added method validated_channel_ids in file dataset_util.py.
The method returns the full path for files specified in the training, validation and testing datasets.

Given any channel type (e.g., image_channels, ground_truth_ids or mask_id) defined in the config file, there must be an associated row value for every subject.
Exits if there is no corresponding mapping from subject to channel.

Added validation for the segmentation pipeline to ensure that all data (mask, region, channels) are provided and that the file names exist. The core method is validated_channel_ids (InnerEye/ML/utils/dataset_util.py), called from InnerEyeContainer->setup (lightning_base.py) and load_dataset_sources->get_paths_for_channel_ids (full_image_dataset.py).
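As a rough illustration of the check described above, the sketch below walks the training, validation and test splits and collects every missing channel or missing file before raising a single error. It is not the code from this PR: the names `check_dataset_rows` and `validate_splits`, the column names `subject` and `channel`, and the use of plain dictionaries instead of a pandas DataFrame are assumptions made to keep the example dependency-free.

```python
import os
from pathlib import Path
from typing import Dict, List

Row = Dict[str, str]  # one CSV row; the column names used below are assumptions


def check_dataset_rows(rows: List[Row], channels: List[str], root: Path) -> str:
    """Return a description of all missing channels/files, or "" if none."""
    errors = ""
    subjects = sorted({row["subject"] for row in rows})
    for subject in subjects:
        subject_rows = [r for r in rows if r["subject"] == subject]
        for channel in channels:
            matches = [r for r in subject_rows if r["channel"] == channel]
            if len(matches) == 0:
                errors += f"Patient {subject} does not have channel '{channel}'" + os.linesep
            elif not (root / matches[0]["filePath"]).is_file():
                errors += f"Patient {subject}: file for channel '{channel}' does not exist" + os.linesep
    return errors


def validate_splits(splits: Dict[str, List[Row]], channels: List[str], root: Path) -> None:
    """Fail before training starts if any split has missing channels or files."""
    report = "".join(check_dataset_rows(rows, channels, root) for rows in splits.values())
    if report:
        raise ValueError("Dataset check failed:" + os.linesep + report)
```

Collecting all problems into one report, rather than stopping at the first, lets a user fix the whole dataset.csv in one pass.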

Updated changelog for bug: 428

ant0nsc

There is no test for this code yet.

CHANGELOG.md

InnerEye/ML/dataset/full_image_dataset.py

InnerEye/ML/lightning_base.py

InnerEye/ML/utils/dataset_util.py

InnerEye/ML/lightning_base.py

1) Addressed #447, except get_dataset_splits
2) Added unit test 'test_converts_channels_to_file_paths'
3) Fixed issue when running in Ubuntu 20.04: https://github.com/microsoft/InnerEye-DeepLearning/runs/2453669197?check_suite_focus=true

InnerEye/ML/lightning_base.py

Tests/ML/datasets/test_dataset.py

InnerEye/ML/dataset/full_image_dataset.py

Tests/ML/datasets/test_dataset.py

.idea/InnerEye-DeepLearning.iml

ant0nsc · 2021-05-11T13:11:25Z

.idea/runConfigurations/Template__Run_ML_on_local_machine.xml

@@ -6,11 +6,12 @@
 <envs>
 <env name="PYTHONUNBUFFERED" value="1" />
 </envs>
- <option name="SDK_HOME" value="" />
+ <option name="SDK_HOME" value="wsl://Ubuntu-20.04/home/alberto/miniconda3/envs/InnerEye/bin/python" />


please undo

Restored file version.

ant0nsc · 2021-05-11T13:14:26Z

InnerEye/ML/dataset/full_image_dataset.py

+
+ for channel_id in channels:
+ row = rows.loc[rows[CSV_CHANNEL_HEADER] == channel_id]
+ channel_failure_flag: bool = False


I think you can remove this flag, and replace it with a check for failed_channel_info being empty

Removed the flag and added an "else" condition. In general, we can't check for failed_channel_info being empty, since it accumulates errors from multiple channels for the same subject.
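The point about accumulation can be sketched as follows. This is a simplified, hypothetical stand-in for the function under review: `resolve_channels` and the `channel_to_file` dictionary are assumptions, and only the error message format is taken from the diff.

```python
import os
from typing import Dict, List, Tuple


def resolve_channels(channel_to_file: Dict[str, str], channels: List[str],
                     patient_id: str) -> Tuple[List[str], str]:
    """Return the file paths that were found plus one accumulated error message."""
    paths: List[str] = []
    failed_channel_info = ""
    for channel_id in channels:
        if channel_id not in channel_to_file:
            failed_channel_info += f"Patient {patient_id} does not have channel '{channel_id}'" + os.linesep
        else:
            # Branching on `failed_channel_info == ""` here would be wrong: after the first
            # missing channel the string stays non-empty, so later valid channels would be skipped.
            paths.append(channel_to_file[channel_id])
    return paths, failed_channel_info
```

With an if/else per channel, no separate flag is needed, and a failure on one channel does not affect how the next channel is handled.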

ant0nsc · 2021-05-11T13:16:56Z

Tests/ML/datasets/test_dataset.py

+ # ["1", "train_and_test_data/id1_channel1.nii.gz", "channel1", "1"], 
+ # ["1", "train_and_test_data/id1_channel1.nii.gz", "channel2", "1"],
+ # ["1", "train_and_test_data/id1_mask.nii.gz", "mask", "1"],


can you add a comment saying why this is commented out?

Added a comment explaining why it is commented out.

Tests/ML/datasets/test_dataset.py

ant0nsc · 2021-05-12T13:07:10Z

InnerEye/ML/dataset/full_image_dataset.py

@@ -282,15 +282,13 @@ def converts_channels_to_file_paths(channels: List[str],

 for channel_id in channels:
 row = rows.loc[rows[CSV_CHANNEL_HEADER] == channel_id]
- channel_failure_flag: bool = False
 if len(row) == 0:
 failed_channel_info += f"Patient {patient_id} does not have channel '{channel_id}'" + os.linesep
 channel_failure_flag = True


this flag is no longer used?

Removed unused flag.

ant0nsc · 2021-05-12T13:08:59Z

.idea/InnerEye-DeepLearning.iml

@@ -4,7 +4,7 @@
 <content url="file://$MODULE_DIR$">
 <sourceFolder url="file://$MODULE_DIR$" isTestSource="false" />
 </content>
- <orderEntry type="jdk" jdkName="3.7 @ Ubuntu 20.04" jdkType="Python SDK" />
+ <orderEntry type="jdk" jdkName="3.7 @ Ubuntu-20.04" jdkType="Python SDK" />


what if you manually set back to "3.7 @ Ubuntu 20.04"?

Manually set back to "3.7 @ Ubuntu 20.04".

ant0nsc · 2021-05-12T13:10:56Z

InnerEye/ML/dataset/full_image_dataset.py

@@ -262,6 +264,40 @@ def _load_dataset_sources(self) -> Dict[str, PatientDatasetSource]:
 )


+def converts_channels_to_file_paths(channels: List[str],


naming conventions would suggest to call that "convert..." rather than "converts...."

Changed naming conventions from "converts..." to "convert..."

Shruthi42 · 2021-05-12T14:17:51Z

/azp run


ant0nsc · 2021-05-14T13:18:22Z

Tests/ML/test_download_upload.py

+ df = pandas.read_csv(str(full_ml_test_data_path(DATASET_CSV_FILE_NAME)), usecols=['filePath'])
+
+ path = test_output_dirs.root_dir / "mounted"
+ path.mkdir(exist_ok=True)
+
+ train_and_test_data_path = path / "train_and_test_data"
+ train_and_test_data_path.mkdir(exist_ok=True)
+
+ for filePath in set(df['filePath'].values):
+ shutil.copy(full_ml_test_data_path() / filePath, path / filePath)
+


as discussed, it would be simpler to make a change in _test_mount_for_lightning_container. There, the dataset.csv file is already copied to both "mounted" and "downloaded" folders.

Implemented in _test_mount_for_lightning_container.

ant0nsc · 2021-05-14T13:19:03Z

Tests/ML/test_download_upload.py

+ if not is_lightning_model:
+ # Consider three directories, i) "downloaded", ii) "mounted", iii) "train_and_test_data" Path for ""
+ # represents "train_and_test_data" and is empty since the string "train_and_test_data" is in data.csv
+ # file and corresponds to the case container.local_dataset == root
+ for myPath in ["downloaded", "mounted", ""]:
+
+ path = test_output_dirs.root_dir / myPath
+ path.mkdir(exist_ok=True)
+
+ train_and_test_data_path = path / "train_and_test_data"
+ train_and_test_data_path.mkdir(exist_ok=True)
+
+ for filePath in set(df['filePath'].values):
+ shutil.copy(full_ml_test_data_path() / filePath, path / filePath)


as discussed, it would be simpler to make a change in _test_mount_for_lightning_container. There, the dataset.csv file is already copied to both "mounted" and "downloaded" folders.
shutil.copytree(full_ml_test_data_path("train_and_test_data"), path) or shutil.copytree(full_ml_test_data_path("train_and_test_data"), path / "train_and_test_data")

Implemented changes to recursively copy full_ml_test_data_path("train_and_test_data") to path / "train_and_test_data"

ant0nsc · 2021-05-14T14:09:58Z

.gitignore

+*.iml
+.idea/InnerEye-DeepLearning.iml


Please don't

the project file should remain checked in.

ant0nsc · 2021-05-14T14:10:08Z

.idea/InnerEye-DeepLearning.iml

@@ -4,7 +4,7 @@
 <content url="file://$MODULE_DIR$">
 <sourceFolder url="file://$MODULE_DIR$" isTestSource="false" />
 </content>
- <orderEntry type="jdk" jdkName="3.7 @ Ubuntu 20.04" jdkType="Python SDK" />
+ <orderEntry type="jdk" jdkName="Python 3.7 (InnerEye)" jdkType="Python SDK" />


please undo

ant0nsc · 2021-05-14T14:12:14Z

Tests/ML/test_download_upload.py

+ if (path / "train_and_test_data").is_dir():
+ shutil.rmtree(path / "train_and_test_data")
+
+ # Creates directory structure and copy data
+ shutil.copytree(full_ml_test_data_path("train_and_test_data"), path / "train_and_test_data")


I have this pet peeve about repeated code, and in particular repeated constants. Would be good to store "train_and_test_data" in a variable.
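A minimal sketch of that suggestion, assuming a hypothetical helper `copy_test_data` (not part of the test file) and a module-level constant for the folder name:

```python
import shutil
from pathlib import Path
from typing import List

TRAIN_AND_TEST_DATA = "train_and_test_data"  # folder name defined in one place


def copy_test_data(source_root: Path, target_roots: List[Path]) -> None:
    """Recursively copy the test data folder into each target root."""
    for target_root in target_roots:
        target_root.mkdir(parents=True, exist_ok=True)
        destination = target_root / TRAIN_AND_TEST_DATA
        # By default, shutil.copytree raises if the destination already exists,
        # so any copy left over from a previous run is removed first.
        if destination.is_dir():
            shutil.rmtree(destination)
        shutil.copytree(source_root / TRAIN_AND_TEST_DATA, destination)
```

In the test this could be called with the "downloaded" and "mounted" folders and the root directory as targets, so the literal appears only once.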

ant0nsc · 2021-05-14T14:13:43Z

Tests/ML/test_download_upload.py

+ # With runs outside of AzureML, a local dataset should be used as-is. Azure dataset ID is ignored here.
+ shutil.copy(full_ml_test_data_path(DATASET_CSV_FILE_NAME), root / DATASET_CSV_FILE_NAME)


why is this still needed? I thought your code is already copying everything in the test... method?


ant0nsc · 2021-05-14T16:10:41Z

Tests/ML/test_download_upload.py

@@ -143,13 +143,14 @@ def _test_mount_for_lightning_container(test_output_dirs: OutputFolderForTests,
 download_path = test_output_dirs.root_dir / "downloaded"
 mount_path = test_output_dirs.root_dir / "mounted"
 if not is_lightning_model:
+ train_and_test_data : str = "train_and_test_data"


you don't normally need the : str annotation. but no need to change that.

InnerEye/ML/dataset/full_image_dataset.py

Alberto Santamaria-Pang added 2 commits April 27, 2021 18:40

Updated Changelog for bug #428

b383c84

Updated changelog for bug: 428

ant0nsc suggested changes Apr 28, 2021


Alberto Santamaria-Pang added 2 commits April 28, 2021 20:11

added test unit: test_converts_channels_to_file_paths

62fbd67

added test unit: test_converts_channels_to_file_paths

d985bdf

asantamariapang changed the title ~~Alberto~~ Added a sanity check to ensure there are no missing channels/files in csv prior training/inference. Apr 29, 2021

ant0nsc reviewed Apr 29, 2021

InnerEye/ML/lightning_base.py (outdated)

ant0nsc reviewed Apr 29, 2021

Tests/ML/datasets/test_dataset.py (outdated)

ant0nsc suggested changes Apr 29, 2021


Alberto Santamaria-Pang added 10 commits May 3, 2021 09:51

Update PR number

c06d9cc

Updated PR number

45c3929

Updated PR number

e7e8a2d

Updated PR number

841928d

Stored the result of get_dataset_splits in a variable instead of usin…

8ffdce2

…g function call Changed to: split_dataset = self.config.get_dataset_splits() for split_data in [split_dataset.train, split_dataset.val, split_dataset.test]:

renamed variable name

9347511

removed assert statement

9d7d81b

Added patient_id to file error description and replaced "\n" by os.li…

16dd1ab

…nesep

Fixed typo

f44d316

Updated unit test test_converts_channels_to_file_paths

4576858

Tested two conditions: 1: No errors reported if no channels nor files are missing 2: Errors reported for missing channels and files

ant0nsc suggested changes May 11, 2021


Alberto Santamaria-Pang added 3 commits May 11, 2021 17:36

Restored version to master.

63e6368

Improved, simplified unit test

650ab01

Optimized code, removed channel_failure_flag variable

850a07e

ant0nsc suggested changes May 12, 2021


Alberto Santamaria-Pang added 3 commits May 12, 2021 06:38

Removed unused flag.

103f4cc

Removed space from jdkName variable value.

3c7c624

Changed test unit function and method name.

91ff0e3

ant0nsc previously approved these changes May 12, 2021


Merge branch 'main' into alberto

ec1a4a2

asantamariapang dismissed ant0nsc’s stale review via ec1a4a2 May 12, 2021 15:02

Alberto Santamaria-Pang added 4 commits May 12, 2021 08:23

Added new line to comply with flake8: "W292 no newline at end of file".

25d57d7

Fixed type hints to address errors from mypy typecheck.

58a2b93

Fixed unit test when adding sanity check for files.

997d76a

Image files from csv are copied to: "downloaded", "mounted" and "root" directories.

Files from data.csv are copied before mounting lightning container.

734d5e6

ant0nsc suggested changes May 14, 2021


Alberto Santamaria-Pang added 3 commits May 14, 2021 06:26

Revert "Files from data.csv are copied before mounting lightning cont…

8bf822b

…ainer." This reverts commit 734d5e6.

Revert "Fixed unit test when adding sanity check for files."

864ef81

This reverts commit 997d76a.

Recursively copy "train_and_test_data" directory to download_path, mo…

5d8663f

…unt_path, test_output_dirs.root_dir

ant0nsc suggested changes May 14, 2021


Alberto Santamaria-Pang added 2 commits May 14, 2021 07:24

Avoid copying data.csv twice and store "train_and_test_data" in varia…

16b1fd9

…ble.

Reverted changes.

79b3d14

ant0nsc previously approved these changes May 14, 2021


Fixed extra space and fixed unnecessary type annotation.

eb30f8e

asantamariapang dismissed ant0nsc’s stale review via eb30f8e May 14, 2021 17:30

ant0nsc previously approved these changes May 14, 2021


Shruthi42 reviewed May 17, 2021

InnerEye/ML/dataset/full_image_dataset.py (outdated)

InnerEye/ML/dataset/full_image_dataset.py (outdated)

Removed unused variables and removed optional data type for 'local_da…

d512c51

…taset_root_folder'.

asantamariapang dismissed ant0nsc’s stale review via d512c51 May 17, 2021 17:43

Alberto Santamaria-Pang added 2 commits May 17, 2021 11:06

Removed whitespace.

9e77224

Added condition to verify if "self.config.local_dataset is None" and …

8fd2086

…no data type casting.

ant0nsc approved these changes May 17, 2021


Shruthi42 approved these changes May 18, 2021


ant0nsc merged commit c2c3729 into main May 18, 2021

ant0nsc deleted the alberto branch May 18, 2021 09:18

ant0nsc linked an issue May 18, 2021 that may be closed by this pull request

Ensure that dataset.csv is checked right at the start of training #428

Closed

ant0nsc mentioned this pull request May 18, 2021

Ensure that dataset.csv is checked right at the start of training #428

Closed

asantamariapang restored the alberto branch May 21, 2021 14:19

asantamariapang deleted the alberto branch May 21, 2021 15:48
