Skip to content
This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Added a sanity check to ensure there are no missing channels/files in csv prior training/inference. #447

Merged
merged 35 commits into from
May 18, 2021

Conversation

asantamariapang
Copy link
Contributor

@asantamariapang asantamariapang commented Apr 28, 2021

Fixed bug #428: ensure that dataset.csv is checked right at the start of training. This applies to segmentation models.

Added method validated_channel_ids in file dataset_util.py.
The method returns the full path for files specified in the training, validation and testing datasets.

  1. Given any channel type (e.g., image_channels or ground_truth_ids or mask_id) defined in the config file there is associated row value for any given subject.
  2. Exits if there is no corresponding corresponding mapping from subject to channel.

Alberto Santamaria-Pang added 2 commits April 27, 2021 18:40
Added validation for segmentation pipeline to ensure all data (mask, region, channels) are provided and file names do exist.

Core method is validated_channel_ids (InnerEye\Ml\dataset_util.py) and called from: InnerEyeContainer->setup (lighting_base.py) and
    load_dataset_sources->get_paths_for_channel_ids (full_image_dataset.py)
Updated changelog for bug: 428
Copy link
Contributor

@ant0nsc ant0nsc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no test for this code yet.

CHANGELOG.md Outdated Show resolved Hide resolved
InnerEye/ML/dataset/full_image_dataset.py Outdated Show resolved Hide resolved
InnerEye/ML/lightning_base.py Outdated Show resolved Hide resolved
InnerEye/ML/lightning_base.py Outdated Show resolved Hide resolved
InnerEye/ML/utils/dataset_util.py Outdated Show resolved Hide resolved
InnerEye/ML/utils/dataset_util.py Outdated Show resolved Hide resolved
InnerEye/ML/lightning_base.py Outdated Show resolved Hide resolved
InnerEye/ML/lightning_base.py Outdated Show resolved Hide resolved
@asantamariapang asantamariapang changed the title Alberto Added a sanity check to ensure there are no missing channels/files in csv prior training/inference. Apr 29, 2021
1) Addressed #447, except get_dataset_splits

2) Added unit test 'test_converts_channels_to_file_paths'

3) Fixed issue when running in Ubuntu 20.04: https://github.com/microsoft/InnerEye-DeepLearning/runs/2453669197?check_suite_focus=true
Tests/ML/datasets/test_dataset.py Outdated Show resolved Hide resolved
InnerEye/ML/dataset/full_image_dataset.py Outdated Show resolved Hide resolved
InnerEye/ML/dataset/full_image_dataset.py Outdated Show resolved Hide resolved
InnerEye/ML/dataset/full_image_dataset.py Outdated Show resolved Hide resolved
Tests/ML/datasets/test_dataset.py Outdated Show resolved Hide resolved
Alberto Santamaria-Pang added 10 commits May 3, 2021 09:51
…g function call

Changed to:
split_dataset = self.config.get_dataset_splits()
for split_data in [split_dataset.train, split_dataset.val, split_dataset.test]:
Tested two conditions:
1: No errors reported if no channels nor files are missing
2: Errors reported for missing channels and files
.idea/InnerEye-DeepLearning.iml Outdated Show resolved Hide resolved
@@ -6,11 +6,12 @@
<envs>
<env name="PYTHONUNBUFFERED" value="1" />
</envs>
<option name="SDK_HOME" value="" />
<option name="SDK_HOME" value="wsl:https://Ubuntu-20.04/home/alberto/miniconda3/envs/InnerEye/bin/python" />
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please undo

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Restored file version.


for channel_id in channels:
row = rows.loc[rows[CSV_CHANNEL_HEADER] == channel_id]
channel_failure_flag: bool = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can remove this flag, and replace it with a check for failed_channel_info being empty

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed flag and added "else" condition. In general can't check check for failed_channel_info being empty since accumulates errors from multiple channels for same subject.

Comment on lines 438 to 440
# ["1", "train_and_test_data/id1_channel1.nii.gz", "channel1", "1"],
# ["1", "train_and_test_data/id1_channel1.nii.gz", "channel2", "1"],
# ["1", "train_and_test_data/id1_mask.nii.gz", "mask", "1"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment saying why this is commented out?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comment explaining why is commented out.

Tests/ML/datasets/test_dataset.py Outdated Show resolved Hide resolved
@@ -282,15 +282,13 @@ def converts_channels_to_file_paths(channels: List[str],

for channel_id in channels:
row = rows.loc[rows[CSV_CHANNEL_HEADER] == channel_id]
channel_failure_flag: bool = False
if len(row) == 0:
failed_channel_info += f"Patient {patient_id} does not have channel '{channel_id}'" + os.linesep
channel_failure_flag = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this flag is no longer used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed unused flag.

@@ -4,7 +4,7 @@
<content url="file:https://$MODULE_DIR$">
<sourceFolder url="file:https://$MODULE_DIR$" isTestSource="false" />
</content>
<orderEntry type="jdk" jdkName="3.7 @ Ubuntu 20.04" jdkType="Python SDK" />
<orderEntry type="jdk" jdkName="3.7 @ Ubuntu-20.04" jdkType="Python SDK" />
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if you manually set back to "3.7 @ Ubuntu 20.04"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Manually set back to "3.7 @ Ubuntu 20.04".

@@ -262,6 +264,40 @@ def _load_dataset_sources(self) -> Dict[str, PatientDatasetSource]:
)


def converts_channels_to_file_paths(channels: List[str],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naming conventions would suggest to call that "convert..." rather than "converts...."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed naming conventions from "converts..." to "convert..."

ant0nsc
ant0nsc previously approved these changes May 12, 2021
@Shruthi42
Copy link
Contributor

/azp run

Comment on lines 193 to 203
df = pandas.read_csv(str(full_ml_test_data_path(DATASET_CSV_FILE_NAME)), usecols=['filePath'])

path = test_output_dirs.root_dir / "mounted"
path.mkdir(exist_ok=True)

train_and_test_data_path = path / "train_and_test_data"
train_and_test_data_path.mkdir(exist_ok=True)

for filePath in set(df['filePath'].values):
shutil.copy(full_ml_test_data_path() / filePath, path / filePath)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as discussed, it would be simpler to make a change in _test_mount_for_lightning_container. There, the dataset.csv file is already copied to both "mounted" and "downloaded" folders.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented in _test_mount_for_lightning_container.

Comment on lines 237 to 250
if not is_lightning_model:
# Consider three directories, i) "downloaded", ii) "mounted", iii) "train_and_test_data" Path for ""
# represents "train_and_test_data" and is empty since the string "train_and_test_data" is in data.csv
# file and corresponds to the case container.local_dataset == root
for myPath in ["downloaded", "mounted", ""]:

path = test_output_dirs.root_dir / myPath
path.mkdir(exist_ok=True)

train_and_test_data_path = path / "train_and_test_data"
train_and_test_data_path.mkdir(exist_ok=True)

for filePath in set(df['filePath'].values):
shutil.copy(full_ml_test_data_path() / filePath, path / filePath)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as discussed, it would be simpler to make a change in _test_mount_for_lightning_container. There, the dataset.csv file is already copied to both "mounted" and "downloaded" folders.
shutil.copytree(full_ml_test_data_path("train_and_test_data"), path) or shutil.copytree(full_ml_test_data_path("train_and_test_data"), path / "train_and_test_data")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented changes to recursively copy full_ml_test_data_path("train_and_test_data") to path / "train_and_test_data"

.gitignore Outdated
Comment on lines 159 to 160
*.iml
.idea/InnerEye-DeepLearning.iml
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the project file should remain checked in.

@@ -4,7 +4,7 @@
<content url="file:https://$MODULE_DIR$">
<sourceFolder url="file:https://$MODULE_DIR$" isTestSource="false" />
</content>
<orderEntry type="jdk" jdkName="3.7 @ Ubuntu 20.04" jdkType="Python SDK" />
<orderEntry type="jdk" jdkName="Python 3.7 (InnerEye)" jdkType="Python SDK" />
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please undo

Comment on lines 148 to 152
if (path / "train_and_test_data").is_dir():
shutil.rmtree(path / "train_and_test_data")

# Creates directory structure and copy data
shutil.copytree(full_ml_test_data_path("train_and_test_data"), path / "train_and_test_data")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have this pet peeve about repeated code, and in particular repeated constants. Would be good to store "train_and_test_data" in a variable.

Comment on lines 246 to 247
# With runs outside of AzureML, a local dataset should be used as-is. Azure dataset ID is ignored here.
shutil.copy(full_ml_test_data_path(DATASET_CSV_FILE_NAME), root / DATASET_CSV_FILE_NAME)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this still needed? I thought your code is already copying everything in the test... method?

ant0nsc
ant0nsc previously approved these changes May 14, 2021
@@ -143,13 +143,14 @@ def _test_mount_for_lightning_container(test_output_dirs: OutputFolderForTests,
download_path = test_output_dirs.root_dir / "downloaded"
mount_path = test_output_dirs.root_dir / "mounted"
if not is_lightning_model:
train_and_test_data : str = "train_and_test_data"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't normally need the : str annotation. but no need to change that.

ant0nsc
ant0nsc previously approved these changes May 14, 2021
@ant0nsc ant0nsc merged commit c2c3729 into main May 18, 2021
@ant0nsc ant0nsc deleted the alberto branch May 18, 2021 09:18
@ant0nsc ant0nsc linked an issue May 18, 2021 that may be closed by this pull request
@asantamariapang asantamariapang restored the alberto branch May 21, 2021 14:19
@asantamariapang asantamariapang deleted the alberto branch May 21, 2021 15:48
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ensure that dataset.csv is checked right at the start of training
3 participants