create dataset and datamodule classes #14

Closed · wants to merge 1 commit

Conversation

jacob-rosenthal (Collaborator)

Create SlideDataset and TileDataset objects. TileDataset supports both tile-level and slide-level labels.
Also create a DataModule object, similar to the one in PyTorch Lightning.

Inputs are pandas DataFrames specifying filepaths and labels (still need to add support for masks as labels).
The DataModule object then creates the dataloaders.

Pseudocode:

camelyon = DataModule(**kwargs)

train_dataloader = camelyon.train_tile_dataloader()
valid_dataloader = camelyon.valid_tile_dataloader()
test_dataloader = camelyon.test_tile_dataloader()

One benefit of this is that we avoid having to specify file structure, since the input dataframes would contain the paths to the slides and the tiles. This helps keep it more modular.
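The proposal above can be sketched as follows. This is a hypothetical, dependency-free illustration of the proposed API, not PathML's actual implementation: plain lists of dicts stand in for the pandas DataFrames, and the `*_tile_dataloader` methods return `TileDataset` objects directly rather than wrapping them in `torch.utils.data.DataLoader`.

```python
from typing import Dict, List


class TileDataset:
    """Dataset of tiles; each record carries a filepath plus a tile-level
    label (and could also carry a slide-level label)."""

    def __init__(self, tile_records: List[Dict]):
        # each record: {"path": <tile filepath>, "label": <tile label>}
        self.records = tile_records

    def __len__(self):
        return len(self.records)

    def __getitem__(self, ix):
        rec = self.records[ix]
        # a real implementation would load the image at rec["path"];
        # here we just return the path and label
        return rec["path"], rec["label"]


class DataModule:
    """Bundles train/valid/test splits and hands out dataloaders,
    so no file structure needs to be hardcoded anywhere."""

    def __init__(self, train_df, valid_df, test_df):
        self.train_df = train_df
        self.valid_df = valid_df
        self.test_df = test_df

    def train_tile_dataloader(self):
        return TileDataset(self.train_df)

    def valid_tile_dataloader(self):
        return TileDataset(self.valid_df)

    def test_tile_dataloader(self):
        return TileDataset(self.test_df)
```

Because the splits are passed in as data rather than discovered on disk, swapping in a different dataset only means passing different tables to the constructor.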

jacob-rosenthal (Collaborator, Author)

Actually, datasets like camelyon would inherit from DataModule. Pseudocode:

class CamelyonDataModule(DataModule):
    def __init__(self, data_dir, **datamodule_kwargs):
        # download camelyon data to data_dir
        self.download(destination=data_dir)
        # prepend data_dir to all the filepaths in slide_anno and tile_anno
        # now we have everything to create the datamodule
        super().__init__(**datamodule_kwargs)

A user would do something like:

from pathml.datasets import CamelyonDataModule

camelyon = CamelyonDataModule(data_dir = "./data")

train_dataloader = camelyon.train_tile_dataloader()
valid_dataloader = camelyon.valid_tile_dataloader()
test_dataloader = camelyon.test_tile_dataloader()

Or the user could, of course, make their own DataModule from local data.
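A self-contained sketch of the subclassing idea above. The DataModule base here is a stub, `download()` only creates the directory, and the tile filenames are hypothetical placeholders standing in for the real annotation tables; none of this is PathML's actual API.

```python
import os


class DataModule:
    """Minimal stub of the proposed base class."""

    def __init__(self, tile_paths):
        self.tile_paths = tile_paths


class CamelyonDataModule(DataModule):
    def __init__(self, data_dir, **datamodule_kwargs):
        # download camelyon data to data_dir (stubbed here)
        self.download(destination=data_dir)
        # prepend data_dir to all filepaths in the annotation tables
        relative = ["tile_001.png", "tile_002.png"]  # placeholder annotations
        tile_paths = [os.path.join(data_dir, p) for p in relative]
        # now we have everything needed to build the datamodule
        super().__init__(tile_paths=tile_paths, **datamodule_kwargs)

    def download(self, destination):
        # placeholder for the real download step
        os.makedirs(destination, exist_ok=True)
```

The subclass owns the dataset-specific details (download location, annotation tables), while the split/dataloader machinery stays in the base class.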

jacob-rosenthal (Collaborator, Author)

Should also include some way to pass a Pipeline object as part of the DataModule. Maybe all Pipelines should take a dir argument specifying where the tiles are written. Initializing the DataModule would download the data to data_dir; a second step would run the given preprocessing pipeline on each image (in parallel, ideally); finally, dataloaders could be created from the tiles. This way everything is packaged together: the WSI dataset, the preprocessing pipeline, and the train/valid/test splits.
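The packaging idea above can be sketched as follows. `Pipeline.apply` and `run_pipeline` are hypothetical names chosen for illustration; the real PathML interface may differ, and the parallel execution is only noted in a comment.

```python
class Pipeline:
    """Preprocessing pipeline; always takes a dir argument specifying
    where the extracted tiles are written."""

    def __init__(self, tile_dir):
        self.tile_dir = tile_dir

    def apply(self, slide_path):
        # placeholder: a real pipeline would tile and preprocess the slide,
        # writing results under self.tile_dir
        return f"{self.tile_dir}/{slide_path}.tiles"


class DataModule:
    """Bundles the WSI dataset and its preprocessing pipeline."""

    def __init__(self, slide_paths, pipeline):
        self.slide_paths = slide_paths
        self.pipeline = pipeline

    def run_pipeline(self):
        # run the pipeline on each slide; a real implementation could
        # parallelize this step with multiprocessing
        return [self.pipeline.apply(p) for p in self.slide_paths]
```

Keeping the pipeline on the DataModule means a single object records how the tiles were produced as well as how they are split and loaded.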

@jacob-rosenthal jacob-rosenthal changed the base branch from master to dev October 16, 2020 14:40
ryanccarelli (Contributor)

New pull request opened after discussions in meeting.
