
Tarfiles #27

Open. Wants to merge 54 commits into base: master.
Conversation

@bganglia (Contributor) commented Aug 6, 2020

This pull request adds support for opening datasets from tarfiles, zipfiles, and folders containing any combination of tarfiles and/or zipfiles, whether the images are in DICOM format or regular image formats such as PNG, and whether a serial or parallel dataloader is used.

TODO:

  • Test cache
  • Tidy code
  • Use/test TarDataset with the rest of the classes
    • MIMIC_Dataset (tar)
    • NIH_Dataset (tar)
    • NLM_Dataset (zip)
    • PC_Dataset (tar)
    • RSNA_Pneumonia_Dataset (folder)
    • NIH_Google_Dataset (tar)
    • CheX_Dataset (zip)
    • Openi_Dataset (tar)
    • COVID-19 Dataset? (folder)
    • NLMTB_Dataset x 2 (zip)
  • Cache imgid -> path_in_archive mappings
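The "any combination of tarfiles, zipfiles, and folders" dispatch described above could be sketched roughly as follows. This is a hypothetical helper for illustration, not the PR's actual code; `detect_storage` is an invented name.

```python
import os
import tarfile
import zipfile

def detect_storage(path):
    """Label how the images at `path` are stored.

    Hypothetical sketch of the kind of dispatch this PR enables;
    the actual names and structure in the PR may differ.
    """
    if os.path.isdir(path):       # plain folder of images
        return "folder"
    if tarfile.is_tarfile(path):  # .tar (and compressed variants)
        return "tar"
    if zipfile.is_zipfile(path):  # .zip
        return "zip"
    raise ValueError("Unrecognized storage format: " + path)
```

The directory check comes first because `tarfile.is_tarfile` and `zipfile.is_zipfile` expect a regular file.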

@bganglia bganglia closed this Aug 6, 2020
@bganglia bganglia reopened this Aug 8, 2020
@bganglia bganglia marked this pull request as draft August 8, 2020 05:45
```python
def __init__(self, imgpath):
    if imgpath.endswith(".tar"):
        self.tarred = tarfile.open(imgpath)
        self.tar_paths = self.tarred.getmembers()
```
Member:

So you said this takes a lot of time. I see how this approach is super robust. I did some tests for time and it seems like just 30 seconds for the MIMIC data. What about caching this using a dict keyed on the file path? It would speed things up if multiple objects are created. But it seems like a reasonable price to pay.
The alternative is to skip it and assume a file structure but that is not as nice.

Contributor Author:

Ok, using a dict, the second load is around 10x faster on my machine.

The dict could also be pickled so there is only one slow load. That option would lead to issues if someone wanted to change the tarfile, although I don't know why they would do that.
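The dict-based caching discussed here might look roughly like this sketch; `_member_cache` and `get_tar_members` are illustrative names, not the PR's actual code.

```python
import tarfile

# Module-level cache keyed by archive path, so repeated Dataset
# constructions in one process reuse the result of the slow
# getmembers() scan instead of rescanning the archive.
_member_cache = {}

def get_tar_members(imgpath):
    if imgpath not in _member_cache:
        with tarfile.open(imgpath) as tar:
            _member_cache[imgpath] = tar.getmembers()
    return _member_cache[imgpath]
```

Pickling `_member_cache` to disk would extend the speedup across runs, at the cost of the stale-archive concern mentioned here.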

@ieee8023 (Member) commented Aug 9, 2020:

Doing hashing and caching a file could be nice, but it could also be annoying to debug and create more issues (like if there are no write permissions).


```python
class TarDataset(Dataset):
    def __init__(self, imgpath):
        if imgpath.endswith(".tar"):
```
Member:

What about using tarfile.is_tarfile? This will allow for compressed files, which people may want to use.

Contributor Author:

Good idea, thanks
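The suggested check might look like the following sketch; `open_if_tar` is an illustrative name, not the PR's code.

```python
import tarfile

def open_if_tar(imgpath):
    # tarfile.is_tarfile inspects the file's magic bytes rather
    # than its name, so .tar.gz / .tar.bz2 archives are accepted
    # even though they don't end in ".tar".
    if tarfile.is_tarfile(imgpath):
        return tarfile.open(imgpath)  # default mode "r:*" autodetects compression
    return None
```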

```python
#print(img_path)
img = imread(img_path)
img = self.get_image(imgid)
```
Member:

You added this for NIH_Dataset but it doesn't extend TarDataset.

Contributor Author:

Ah, let me fix that

@ieee8023 (Member) left a comment:

I meant it the other way. All the datasets should just become TarDatasets right? This feature makes it much easier to work with all these datasets.

@bganglia (Contributor Author) commented Aug 8, 2020

Ah, ok. I agree; it was just a temporary thing because I have only tested this with MIMIC so far. I can fix it the other way and add a quick test.

@bganglia (Contributor Author) commented Aug 10, 2020

I was running into some weird multiple inheritance issues when I tried creating a ZipDataset class and then having datasets that could be initialized from either one.

I think that it would be easier to extend to zipfiles, etc. if instead of having a TarDataset class, there was a "storage interface" object that takes a file path in the constructor and has a .get_image() method returning a numpy array. Then the dataloaders could all inherit from a class that picks the right type of storage interface, like this:
https://github.com/bganglia/torchxrayvision/blob/34daddb2551821cce5b5e5d13f7c4b7bba10b56e/torchxrayvision/datasets.py#L267

Instead of having a self.get_image() method, the datasets could just call self.interface.get_image().
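A minimal sketch of that storage-interface idea, with hypothetical class names (the PR's real code is at the link above). The sketch exposes raw bytes rather than a decoded numpy array so it doesn't depend on an image decoder; a dataset would wrap `get_bytes` with `imread` to produce the numpy array described here.

```python
import os
import tarfile

class FolderInterface:
    """Reads files directly from a directory."""
    def __init__(self, path):
        self.path = path

    def get_bytes(self, name):
        with open(os.path.join(self.path, name), "rb") as f:
            return f.read()

class TarInterface:
    """Reads files out of a (possibly compressed) tar archive."""
    def __init__(self, path):
        self.tar = tarfile.open(path)

    def get_bytes(self, name):
        return self.tar.extractfile(name).read()

def make_interface(path):
    # Hypothetical factory that picks the right storage interface,
    # so datasets can compose with it instead of inheriting from
    # TarDataset/ZipDataset.
    if os.path.isdir(path):
        return FolderInterface(path)
    if tarfile.is_tarfile(path):
        return TarInterface(path)
    raise ValueError("Unsupported storage: " + path)
```

Composition sidesteps the multiple-inheritance problem: each dataset holds a `self.interface` and never cares which concrete class it got.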

@bganglia bganglia marked this pull request as ready for review August 28, 2020 15:24