Dataset Creation

This document describes the dataset formats used by InnerEye for segmentation and classification tasks. After creating the dataset, upload it to AzureML blob storage, as described in the AzureML documentation.

Segmentation Datasets

This section walks through the process of creating a dataset in the format expected by the InnerEye package. However, if your dataset is in DICOM-RT format, you should instead use the InnerEye-CreateDataset tool. After creating the dataset, you can also analyze the structures in it.

Segmentation datasets should have the input scans and ground truth segmentations in Nifti format.

InnerEye expects segmentation datasets to have the following structure:

  • Each subject has one or more scans, and one or more segmentation masks. There should be one segmentation mask for each ground truth structure (anatomical structure that the model should segment).

  • For convenience, scans and ground truth masks for different subjects can live in separate folders, but this is not required.

  • Inside the root folder for the dataset, there should be a file dataset.csv, containing the following fields at minimum:

    • subject: A unique positive integer assigned to every patient.
    • channel: The imaging channel or ground truth structure described by this row.
    • filePath: Path to the file for this scan or structure. We support nifti (nii, nii.gz), numpy (npy, npz) and hdf5 (h5).
      • For HDF5 files, suffix the path with | separators (see the sketch after this list):
        • For images: |<dataset_name>|<channel index>
        • For segmentation binary masks: |<dataset_name>|<channel index>
        • For segmentation multimaps: |<dataset_name>|<channel index>|<multimap value>
          • Multimaps are encoded as 0=background and a distinct positive integer for each class.
        • The expected dimensions are (channel, Z, Y, X).
      • For numpy or nifti files, the expected format is just the path to the file.
        • Images should be encoded as float32 with dimensions (X, Y, Z).
        • Segmentations should be encoded as binary masks with dimensions (X, Y, Z).

    Additional supported fields include acquisition_date, institutionId, seriesID and tags (meant for miscellaneous labels).
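
As a concrete illustration of the encodings described above, here is a minimal sketch (using nibabel; the array shapes, voxel size and file names are made up for this example) of writing one subject's float32 scan and a matching binary ground truth mask:

import os

import nibabel as nib
import numpy as np

os.makedirs("subjectID1", exist_ok=True)

# Image: float32 with dimensions (X, Y, Z); the shape is arbitrary here.
image = np.random.rand(256, 256, 200).astype(np.float32)
# Ground truth: a binary mask with exactly the same dimensions.
heart = (image > 0.9).astype(np.uint8)

# The voxel size (1.01mm x 1.01mm x 1.5mm here) is carried in the affine.
affine = np.diag([1.01, 1.01, 1.5, 1.0])
nib.save(nib.Nifti1Image(image, affine), "subjectID1/ct.nii.gz")
nib.save(nib.Nifti1Image(heart, affine), "subjectID1/heart.nii.gz")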

For example, for a CT dataset with two structures, heart and lung, to be segmented, the dataset folder could look like:

dataset_folder_name
├──dataset.csv
├──subjectID1/
│  ├── ct.nii.gz
│  ├── heart.nii.gz
│  └── lung.nii.gz
├──subjectID2/
│  ├── ct.nii.gz
│  ├── heart.nii.gz
│  └── lung.nii.gz
├──...

The dataset.csv for this dataset would look like:

subject,filePath,channel
1,subjectID1/ct.nii.gz,ct
1,subjectID1/heart.nii.gz,structure1
1,subjectID1/lung.nii.gz,structure2
2,subjectID2/ct.nii.gz,ct
2,subjectID2/heart.nii.gz,structure1
2,subjectID2/lung.nii.gz,structure2

Note: The paths in the dataset.csv file should not be absolute paths, but relative to the folder that contains dataset.csv.
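
If you have many subjects, generating dataset.csv with a short script is less error-prone than writing it by hand. The following is a sketch only, assuming the exact folder layout shown above (subject folders named subjectID1, subjectID2, ... each containing ct, heart and lung files):

import csv
from pathlib import Path

root = Path("dataset_folder_name")
with open(root / "dataset.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["subject", "filePath", "channel"])
    for subject, folder in enumerate(sorted(root.glob("subjectID*")), start=1):
        # Paths are written relative to the dataset folder, per the note above.
        writer.writerow([subject, f"{folder.name}/ct.nii.gz", "ct"])
        writer.writerow([subject, f"{folder.name}/heart.nii.gz", "structure1"])
        writer.writerow([subject, f"{folder.name}/lung.nii.gz", "structure2"])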

Image size requirements

The images in a dataset must adhere to these constraints:

  • All images, across all subjects, must have already undergone geometric normalization, i.e., all images must have approximately the same voxel size. For example, if all images for subject 1 have voxel size 1.5mm x 1.01mm x 1.01mm, and all images for subject 2 have voxel size 1.51mm x 0.99mm x 0.99mm, this should be fine. In particular, this constraint does not mean that voxels need to be isotropic.
  • All images for a particular subject must have the same dimensions. In the above example, if subjectID1/ct.nii.gz has size 200 x 256 x 256, then subjectID1/heart.nii.gz and subjectID1/lung.nii.gz must have exactly the same dimensions.
  • It is not required that images for different subjects have the same dimensions.

All these constraints are automatically checked and guaranteed if the raw data is in DICOM format and you are using the InnerEye-CreateDataset tool to convert them to Nifti format. Geometric normalization can also be turned on as a pre-processing step.
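
If your data did not come through InnerEye-CreateDataset, you can spot-check these constraints yourself. A rough sketch with nibabel (the 0.1mm tolerance is an arbitrary choice for this example, not an InnerEye requirement):

import nibabel as nib
import numpy as np

def check_subject(paths, reference_spacing, tolerance=0.1):
    headers = [nib.load(p).header for p in paths]
    # All images for one subject must have exactly the same dimensions.
    shapes = {h.get_data_shape() for h in headers}
    assert len(shapes) == 1, f"Inconsistent shapes: {shapes}"
    # Voxel sizes must be approximately equal across all subjects.
    for h in headers:
        assert np.allclose(h.get_zooms(), reference_spacing, atol=tolerance), \
            f"Voxel size {h.get_zooms()} deviates from {reference_spacing}"

check_subject(["subjectID1/ct.nii.gz", "subjectID1/heart.nii.gz",
               "subjectID1/lung.nii.gz"], reference_spacing=(1.01, 1.01, 1.5))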

Uploading to Azure

When running in Azure, you need to upload the folder containing the dataset (i.e., the file dataset.csv and the images referenced therein) to the storage account for datasets. This is the storage account you created in the Azure setup, Step 4.

The best way of uploading the data is via Azure Storage Explorer. Please follow the installation instructions first.

  • Find your Azure subscription in the "Explorer" bar, and inside of that, the "Storage Accounts" field, and the storage account you created for datasets.
  • That storage account should have a section "Blob Containers". Check if there is a container called "datasets" already. If not, create one using the context menu.
  • Navigate into the "datasets" container.
  • Then use "Upload/Upload Folder" and choose the folder that contains your dataset (dataset_folder_name in the above example). Leave all other settings in the upload dialog at their default.
  • This will start the upload. Depending on the number of files, this can take some time.

Creating a model configuration

For the above dataset structure for heart and lung segmentation, you would then create a model configuration that contains at least the following fields:

from InnerEye.ML.config import PhotometricNormalizationMethod, SegmentationModelBase

class HeartLungModel(SegmentationModelBase):
    def __init__(self) -> None:
        super().__init__(
            azure_dataset_id="dataset_folder_name",
            # Adjust this to where your dataset_folder is on your local box
            local_dataset="/home/me/dataset_folder_name",
            image_channels=["ct"],
            ground_truth_ids=["heart", "lung"],
            # Segmentation architecture
            architecture="UNet3D",
            feature_channels=[32],
            # Size of patches that are used for training, as (z, y, x) tuple 
            crop_size=(64, 224, 224),
            # Reduce this if you see GPU out of memory errors
            train_batch_size=8,
            # Size of patches that are used when evaluating the model
            test_crop_size=(128, 512, 512),
            inference_stride_size=(64, 256, 256),
            # Use CT Window and Level as image pre-processing
            norm_method=PhotometricNormalizationMethod.CtWindow,
            level=40,
            window=400,
            # Learning rate settings
            l_rate=1e-3,
            min_l_rate=1e-5,
            l_rate_polynomial_gamma=0.9,
            num_epochs=120,
            )

The local_dataset field is required if you want to run the InnerEye toolbox on your own VM, and you want to consume the dataset from local storage. If you want to run the InnerEye toolbox inside of AzureML, you need to supply the azure_dataset_id, pointing to a folder in Azure blob storage. This folder should reside in the datasets container in the storage account that you designated for storing your datasets, see the setup instructions.

Analyzing segmentation datasets

Once you have created your Azure dataset, either by the process described here or with the CreateDataset tool, you may want to analyze it in order to detect images and structures that are outliers with respect to a number of statistics, and which therefore may be erroneous or unsuitable for your application. This can be done using the analyze command provided by InnerEye-CreateDataset.

Classification Datasets

Classification datasets should have a dataset.csv and a folder containing the image files. The dataset.csv should have at least the following fields:

  • subject: The subject ID, a unique positive integer assigned to every image
  • path: Path to the image file for this subject
  • value:
    • For binary classification, a (binary) ground truth label. This can be "true" and "false" or "0" and "1".
    • For multi-label classification, the set of all positive labels for the image, separated by a | character. For example, "0|2|4" for a sample with true labels 0, 2 and 4, and "" for a sample in which all labels are false.
    • For regression, a scalar value.

These, and other fields which can be added to dataset.csv, are described in the examples below.

For each entry (subject ID, label value, etc.) needed to construct a single input sample, the entry value is read from the channels and columns specified for that entry.

A simple example

Let's look at how to construct a dataset.csv (and changes we will need to make to the model config file in parallel):

SubjectID, FilePath, Label 
1, images/image1.npy, True
2, images/image2.npy, False

This is the simplest possible dataset.csv. It has two images with subject IDs 1 and 2, stored at images/image1.npy and images/image2.npy. This is a classification dataset, since the label values are binary.
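
For reference, the image files here are plain numpy arrays saved to disk, e.g. (the shape is an arbitrary assumption for this sketch):

import os

import numpy as np

os.makedirs("images", exist_ok=True)
image = np.random.rand(128, 224, 224).astype(np.float32)  # illustrative shape
np.save("images/image1.npy", image)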

To use this dataset.csv, we need to make some additions to the model config. We will use the GlaucomaPublicExt config from the sample tasks in this example. The class should now resemble:

class GlaucomaPublicExt(GlaucomaPublic):
    def __init__(self) -> None:
        super().__init__(azure_dataset_id="name_of_your_dataset_on_azure",
                         subject_column="SubjectID",
                         image_file_column="FilePath",
                         label_value_column="Label")

The parameters subject_column, channel_column, image_file_column and label_value_column tell InnerEye which columns in the csv contain the subject identifiers, channel names, image file paths and labels.

NOTE: If any of the *_column parameters are not specified, InnerEye will look for these entries under the default column names if default names exist. See the CSV headers in csv_util.py for all the defaults.

Using channels in dataset.csv

Channels are fields in dataset.csv which can be used to filter rows. They are typically used when there are multiple images or labels per subject (for example, if multiple images were taken across a period of time for each subject).

A slightly more complex dataset.csv would be the following:

SubjectID, Channel, FilePath, Label
1, image_feature_1, images/image_1_feature_1.npy,
1, image_feature_2, images/image_1_feature_2.npy,
1, label, , True
2, image_feature_1, images/image_2_feature_1.npy,
2, image_feature_2, images/image_2_feature_2.npy,
2, label, , False

The config file would be

class GlaucomaPublicExt(GlaucomaPublic):
    def __init__(self) -> None:
        super().__init__(azure_dataset_id="name_of_your_dataset_on_azure",
                         subject_column="SubjectID",
                         channel_column="Channel",
                         image_channels=["image_feature_1", "image_feature_2"],
                         image_file_column="FilePath",
                         label_channels=["label"],
                         label_value_column="Label")

The added parameters image_channels and label_channels tell InnerEye to search for image file paths for each subject in rows labelled with image_feature_1 or image_feature_2 and for label values in the rows labelled with label. Thus, in this dataset, each sample will have 2 image features (read from rows with Channel set to image_feature_1 and image_feature_2) and the associated label (read from the row with Channel set to label).

NOTE: There are no defaults for the *_channels parameters, so these must be set as parameters.
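
To see what channel filtering does, here is a conceptual sketch in pandas (not InnerEye's actual data loading code) of how one sample for subject 1 would be assembled from the csv above:

import pandas as pd

df = pd.read_csv("dataset.csv", skipinitialspace=True)
rows = df[df["SubjectID"] == 1]
# Image file paths come from the rows whose channel is in image_channels.
image_paths = rows[rows["Channel"].isin(["image_feature_1", "image_feature_2"])]["FilePath"].tolist()
# The label value comes from the row whose channel is in label_channels.
label = rows[rows["Channel"] == "label"]["Label"].iloc[0]
print(image_paths, label)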

Recognized columns in dataset.csv and filtering based on channels

Other recognized fields, apart from subject, channel, file path and label are numerical features and categorical features. These are extra scalar and categorical values to be used as model input.

Any unrecognized columns (any column which is both not described in the model config and has no default) will be converted to a dict of key-value pairs and stored in an object of type GeneralSampleMetadata in the sample. For example:

SubjectID, Channel, FilePath, Label, Tag, weight, class
1, image_time_1, images/image_1_time_1.npy, True, , ,
1, image_time_2, images/image_1_time_2.npy, False, , ,
1, scalar, , , , 0.5,
1, categorical, , , , , 2
1, tags, , ,foo, ,
2, image_time_1, images/image_2_time_1.npy, True, , ,
2, image_time_2, images/image_2_time_2.npy, True, , ,
2, tags, , , bar, ,
2, scalar, , , , 0.3,
2, categorical, , , , , 4

The config file would be
class GlaucomaPublicExt(GlaucomaPublic):
    def __init__(self) -> None:
        super().__init__(azure_dataset_id="name_of_your_dataset_on_azure",
                         subject_column="SubjectID",
                         channel_column="Channel",
                         image_channels=["image_time_1", "image_time_2"],
                         image_file_column="FilePath",
                         label_channels=["image_time_2"],
                         label_value_column="Label",
                         non_image_feature_channels=["scalar"],
                         numerical_columns=["weight"],
                         categorical_columns="class")

In this example, weight is a scalar feature read from the csv, and class is a categorical feature. The extra field Tag is not a recognized field, and so the dataloader will return the tags in the form of key:value pairs for each sample.

Filtering on channels: This example also shows why filtering values by channel is useful: In this example, each subject has 2 images taken at different times with different label values. By using label_channels=["image_time_2"], we can use the label associated with the second image for all subjects.

Multi-label classification datasets

Classification datasets can be multi-label, i.e. they can have more than one label associated with every sample. In this case, in the label column, separate the (numerical) ground truth labels with a pipe character (|) to provide multiple ground truth labels for the sample.

Note that only multi-label datasets are supported; multi-class datasets (where the labels are mutually exclusive) are not supported.

For example, the dataset.csv for a multi-label task with 4 classes (0, 1, 2, 3) would look like the following:

SubjectID, Channel, FilePath, Label
1, image_feature_1, images/image_1_feature_1.npy,
1, image_feature_2, images/image_1_feature_2.npy,
1, label, , 0|2|3
2, image_feature_1, images/image_2_feature_1.npy,
2, image_feature_2, images/image_2_feature_2.npy,
2, label, , 1|2
3, image_feature_1, images/image_3_feature_1.npy,
3, image_feature_2, images/image_3_feature_2.npy,
3, label, , 1
4, image_feature_1, images/image_4_feature_1.npy,
4, image_feature_2, images/image_4_feature_2.npy,
4, label, ,

Note that the label field for sample 4 is left empty; this indicates that all labels are negative for sample 4. In multi-label tasks, the negative class (all ground truth classes being false for a sample) should not be considered a separate class, and should be encoded by an empty label field.

The labels which are true for each sample in the dataset.csv shown above are:

  • Sample 1: 0, 2, 3
  • Sample 2: 1, 2
  • Sample 3: 1
  • Sample 4: No labels are true for this sample
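
The pipe-separated format maps naturally to a multi-hot encoding. As an illustrative sketch (not InnerEye's internal parsing code) for the 4-class example above:

def to_multi_hot(label: str, num_classes: int) -> list:
    # An empty label field means all classes are negative.
    hot = [0.0] * num_classes
    for index in filter(None, label.split("|")):
        hot[int(index)] = 1.0
    return hot

assert to_multi_hot("0|2|3", 4) == [1.0, 0.0, 1.0, 1.0]  # sample 1
assert to_multi_hot("", 4) == [0.0, 0.0, 0.0, 0.0]       # sample 4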

The config file would be

class GlaucomaPublicExt(GlaucomaPublic):
    def __init__(self) -> None:
        super().__init__(azure_dataset_id="name_of_your_dataset_on_azure",
                         subject_column="SubjectID",
                         channel_column="Channel",
                         image_channels=["image_feature_1", "image_feature_2"],
                         image_file_column="FilePath",
                         label_channels=["label"],
                         label_value_column="Label",
                         class_names=["class0", "class1", "class2", "class3"])

The added parameter class_names gives the string name corresponding to each ground truth class index. In multi-label configs, the class_names parameter must be specified, so that InnerEye can recognize that the task is a multi-label task and parse the dataset.csv accordingly. In binary tasks, the class_names field can optionally be set to a list with a single string in it corresponding to the name of the positive class.