create dataset and datamodule classes #14

Closed · wants to merge 1 commit

Conversation

jacob-rosenthal (Collaborator)

Create SlideDataset and TileDataset objects. TileDataset supports both tile-level and slide-level labels.
Also create a DataModule object, similar to the one in PyTorch Lightning.

Inputs are pandas DataFrames specifying filepaths and labels (still need to add support for masks as labels).
The DataModule object then creates the dataloaders.

Pseudocode:

camelyon = DataModule(**kwargs)

train_dataloader = camelyon.train_tile_dataloader()
valid_dataloader = camelyon.valid_tile_dataloader()
test_dataloader = camelyon.test_tile_dataloader()

One benefit of this is that we avoid having to specify file structure, since the input dataframes would contain the paths to the slides and the tiles. This helps keep it more modular.
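The proposal above can be sketched as follows. This is a hypothetical, dependency-free illustration of the proposed API, not PathML's actual implementation: plain lists of dicts stand in for the pandas DataFrames, and the `*_tile_dataloader` methods return `TileDataset` objects directly rather than wrapping them in `torch.utils.data.DataLoader`.

```python
from typing import Dict, List


class TileDataset:
    """Dataset of tiles; each record carries a filepath plus a tile-level
    label (and could also carry a slide-level label)."""

    def __init__(self, tile_records: List[Dict]):
        # each record: {"path": <tile filepath>, "label": <tile label>}
        self.records = tile_records

    def __len__(self):
        return len(self.records)

    def __getitem__(self, ix):
        rec = self.records[ix]
        # a real implementation would load the image at rec["path"];
        # here we just return the path and label
        return rec["path"], rec["label"]


class DataModule:
    """Bundles train/valid/test splits and hands out dataloaders,
    so no file structure needs to be hardcoded anywhere."""

    def __init__(self, train_df, valid_df, test_df):
        self.train_df = train_df
        self.valid_df = valid_df
        self.test_df = test_df

    def train_tile_dataloader(self):
        return TileDataset(self.train_df)

    def valid_tile_dataloader(self):
        return TileDataset(self.valid_df)

    def test_tile_dataloader(self):
        return TileDataset(self.test_df)
```

Because the splits are passed in as data rather than discovered on disk, swapping in a different dataset only means passing different tables to the constructor.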

jacob-rosenthal (Collaborator, Author)

Actually, datasets like camelyon would inherit from DataModule. Pseudocode:

class CamelyonDataModule(DataModule):
    def __init__(self, data_dir, **datamodule_kwargs):
        # download camelyon data to data_dir
        self.download(destination=data_dir)
        # prepend data_dir to all the filepaths in slide_anno and tile_anno
        # now we have everything to create the datamodule
        super().__init__(**datamodule_kwargs)

A user would do something like:

from pathml.datasets import CamelyonDataModule

camelyon = CamelyonDataModule(data_dir = "./data")

train_dataloader = camelyon.train_tile_dataloader()
valid_dataloader = camelyon.valid_tile_dataloader()
test_dataloader = camelyon.test_tile_dataloader()

Or the user could, of course, make their own DataModule from local data.
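A self-contained sketch of the subclassing idea above. The DataModule base here is a stub, `download()` only creates the directory, and the tile filenames are hypothetical placeholders standing in for the real annotation tables; none of this is PathML's actual API.

```python
import os


class DataModule:
    """Minimal stub of the proposed base class."""

    def __init__(self, tile_paths):
        self.tile_paths = tile_paths


class CamelyonDataModule(DataModule):
    def __init__(self, data_dir, **datamodule_kwargs):
        # download camelyon data to data_dir (stubbed here)
        self.download(destination=data_dir)
        # prepend data_dir to all filepaths in the annotation tables
        relative = ["tile_001.png", "tile_002.png"]  # placeholder annotations
        tile_paths = [os.path.join(data_dir, p) for p in relative]
        # now we have everything needed to build the datamodule
        super().__init__(tile_paths=tile_paths, **datamodule_kwargs)

    def download(self, destination):
        # placeholder for the real download step
        os.makedirs(destination, exist_ok=True)
```

The subclass owns the dataset-specific details (download location, annotation tables), while the split/dataloader machinery stays in the base class.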

jacob-rosenthal (Collaborator, Author)

Should also include some way to pass a Pipeline object as part of the DataModule. Maybe all Pipelines should take a dir argument specifying where the tiles are written. Initializing the DataModule would download the data to data_dir; a second step would run the given preprocessing pipeline on each image (in parallel, ideally); finally, dataloaders could be created from the tiles. This way everything is packaged together: the WSI dataset, the preprocessing pipeline, and the train/valid/test splits.
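The packaging idea above can be sketched as follows. `Pipeline.apply` and `run_pipeline` are hypothetical names chosen for illustration; the real PathML interface may differ, and the parallel execution is only noted in a comment.

```python
class Pipeline:
    """Preprocessing pipeline; always takes a dir argument specifying
    where the extracted tiles are written."""

    def __init__(self, tile_dir):
        self.tile_dir = tile_dir

    def apply(self, slide_path):
        # placeholder: a real pipeline would tile and preprocess the slide,
        # writing results under self.tile_dir
        return f"{self.tile_dir}/{slide_path}.tiles"


class DataModule:
    """Bundles the WSI dataset and its preprocessing pipeline."""

    def __init__(self, slide_paths, pipeline):
        self.slide_paths = slide_paths
        self.pipeline = pipeline

    def run_pipeline(self):
        # run the pipeline on each slide; a real implementation could
        # parallelize this step with multiprocessing
        return [self.pipeline.apply(p) for p in self.slide_paths]
```

Keeping the pipeline on the DataModule means a single object records how the tiles were produced as well as how they are split and loaded.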

@jacob-rosenthal jacob-rosenthal changed the base branch from master to dev October 16, 2020 14:40
ryanccarelli (Contributor)

New pull request opened after discussions in meeting.
