
Tarfiles #27 (Open)

wants to merge 54 commits into base: master

Changes shown from 25 of 54 commits

Commits:
9f3eb7b
Open MIMIC from tarfile
bganglia Aug 3, 2020
0414355
Merge branch 'master' of https://github.com/ieee8023/torchxrayvision …
bganglia Aug 6, 2020
289747d
Merge branch 'master' of https://github.com/ieee8023/torchxrayvision …
bganglia Aug 6, 2020
303832c
revert whitespace
bganglia Aug 8, 2020
da2490b
don't use get_image() in NIH_Dataset
bganglia Aug 8, 2020
8fb3f03
NIH_Dataset extends TarDataset
bganglia Aug 8, 2020
395e5e4
Store tarfiles in dictionary
bganglia Aug 8, 2020
fa69973
use getnames instead of getmembers
bganglia Aug 8, 2020
abbbfec
use O(n) method for determining imgid from tar_path
bganglia Aug 9, 2020
2ba6f5d
random data in MIMIC format
bganglia Aug 9, 2020
cacc3ad
script for generating random MIMIC data
bganglia Aug 9, 2020
ecbf302
track random MIMIC data
bganglia Aug 9, 2020
04f1a32
tarfile test using random MIMIC data
bganglia Aug 9, 2020
90129ab
fix test directory
bganglia Aug 9, 2020
0aa52a7
use .close() on tarfile and regenerate test directory
bganglia Aug 9, 2020
349babb
support for tarfiles in NIH dataset
bganglia Aug 9, 2020
6999bd3
Inherit from TarDataset in PC_Dataset
bganglia Aug 10, 2020
842ddf8
Storage-agnostic dataset
bganglia Aug 10, 2020
37afa4e
Inherit from storage agnostic loader
bganglia Aug 10, 2020
bbd4007
tidy up tarfile code
bganglia Aug 10, 2020
34daddb
remove previous TarDataset, ZipDataset classes
bganglia Aug 10, 2020
727d9ff
Scripts for generating test data
bganglia Aug 13, 2020
d2ae7c0
Test data
bganglia Aug 13, 2020
41b50c4
Tests for zip, tar in MIMIC, NIH, and PC
bganglia Aug 13, 2020
48d8170
clean up storage classes
bganglia Aug 13, 2020
5c4117e
save progress
bganglia Aug 26, 2020
2773c69
inherit from Dataset in NIH_Dataset
bganglia Aug 26, 2020
7ffc252
Add code for automated tests with script-generated data
bganglia Aug 26, 2020
68a71ae
script for writing random data
bganglia Aug 26, 2020
ec9777b
fall back on .index() instead of trying to load a cached version in .…
bganglia Aug 26, 2020
29498a6
support multiprocessing
bganglia Aug 27, 2020
3674357
Clean up new code for tests and format interfaces
bganglia Aug 27, 2020
ccec9ae
write partial metadata files with subset of columns
bganglia Aug 27, 2020
c091734
Improve caching
bganglia Aug 27, 2020
e56a565
fix tests
bganglia Aug 28, 2020
1dde4b7
fix error in data-generation script
bganglia Aug 28, 2020
1628db4
create .torchxrayvision if it does not already exist
bganglia Aug 28, 2020
124467c
fix line adding .torchxrayvision
bganglia Aug 28, 2020
28816e5
Commit sample data for testing NLM_TB datasets, instead of auto-gener…
bganglia Aug 28, 2020
ce38e57
Commit covid test cases
bganglia Aug 28, 2020
281935c
Include parallel tests again
bganglia Aug 28, 2020
9c2c9d2
trycatch on reading/writing stored_mappings, with disk_unwriteable_ou…
bganglia Aug 28, 2020
7c6aebb
work when .torchxrayvision is not writeable
bganglia Aug 28, 2020
cb97e70
remove some print statements
bganglia Aug 28, 2020
950ae96
add test simulating an unwriteable disk
bganglia Aug 28, 2020
300c9d7
use filesystem instead of dictionary
bganglia Aug 28, 2020
218fa75
rewrite data generation scripts as python, not bash scripts; add para…
bganglia Aug 30, 2020
b22cead
cleanup: better variable names and use blake2b instead of hash (works…
bganglia Aug 31, 2020
ae09bc9
Add test for asserting a dataset loads faster the second time
bganglia Aug 31, 2020
30c043b
Don't invoke duration test, to avoid spurious errors
bganglia Aug 31, 2020
bfdebf2
Call on new data generation script
bganglia Aug 31, 2020
0f7ea51
simplify and improve documentation
bganglia Sep 5, 2020
71c7a50
reorganize
bganglia Sep 19, 2020
1715b9d
Fix path length in CheX_Dataset
bganglia Sep 19, 2020
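The commits above converge on a storage-agnostic loading layer: a dataset receives an imgpath that may be a plain folder, a .tar, or a .zip, and reads members by relative path, listing archive contents with getnames()/namelist() rather than the heavier getmembers(). A minimal sketch of that idea follows; the class and method names here are hypothetical illustrations, not the PR's actual API.

```python
import tarfile
import zipfile
from pathlib import Path


class StorageAgnosticReader:
    """Read file bytes from a folder, a .tar, or a .zip by relative path.

    Hypothetical sketch of the approach the commits describe
    ("Storage-agnostic dataset", "use getnames instead of getmembers");
    the PR's real classes differ.
    """

    def __init__(self, imgpath):
        self.imgpath = str(imgpath)
        if self.imgpath.endswith(".tar"):
            self.archive = tarfile.open(self.imgpath)
            # getnames() lists paths without building full member objects
            self.names = set(self.archive.getnames())
            self.kind = "tar"
        elif self.imgpath.endswith(".zip"):
            self.archive = zipfile.ZipFile(self.imgpath)
            self.names = set(self.archive.namelist())
            self.kind = "zip"
        else:
            self.kind = "folder"

    def read(self, relpath):
        relpath = str(relpath)
        if self.kind == "tar":
            return self.archive.extractfile(relpath).read()
        if self.kind == "zip":
            return self.archive.read(relpath)
        return (Path(self.imgpath) / relpath).read_bytes()
```

With this shape, a dataset's __getitem__ only ever calls reader.read(path), so folder, tar, and zip storage yield byte-identical items, which is exactly what the format tests below check.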
Binary file added tests/NIH_test_data/folder/00000001_000.png
Binary file added tests/NIH_test_data/folder/00000002_000.png
Binary file added tests/NIH_test_data/folder/00000003_001.png
Binary file added tests/NIH_test_data/folder/00000005_000.png
Binary file added tests/NIH_test_data/folder/00000006_000.png
Binary file added tests/NIH_test_data/folder/00000007_000.png
Binary file added tests/NIH_test_data/folder/00000008_000.png
Binary file added tests/NIH_test_data/folder/00000009_000.png
Binary file added tests/NIH_test_data/folder/00000010_000.png
Binary file added tests/NIH_test_data/folder/00000011_000.png
Binary file added tests/NIH_test_data/tar.tar
Binary file not shown.
Binary file added tests/NIH_test_data/zip.zip
Binary file not shown.
Binary file added tests/PC_test_data/tar.tar
Binary file not shown.
Binary file added tests/PC_test_data/zip.zip
Binary file not shown.
156 changes: 156 additions & 0 deletions tests/gen_mimic.py
@@ -0,0 +1,156 @@
import numpy as np
import tarfile
import pandas as pd
from PIL import Image
import random
import argparse
from pathlib import Path
import os

from random_data import write_random_images


def show(x):
    # Debug helper: print a value and pass it through unchanged.
    print(x)
    return x


mimic_metadata_filename = "mimic-cxr-2.0.0-metadata.csv"
mimic_csvdata_filename = "mimic-cxr-2.0.0-negbio.csv"


def generate_random_metadata(n, dimensions):
    columns = "dicom_id,subject_id,study_id,PerformedProcedureStepDescription,ViewPosition,Rows,Columns,StudyDate,StudyTime,ProcedureCodeSequence_CodeMeaning,ViewCodeSequence_CodeMeaning,PatientOrientationCodeSequence_CodeMeaning".split(",")
    performed_procedure_step_descriptions = {
        "CHEST (PA AND LAT)": {
            "n_views": 2,
            "view_position": ["LATERAL", "PA"],
            "procedure_code_meaning": "CHEST (PA AND LAT)",
            "view_code_meaning": ["lateral", "postero-anterior"],
            "orientation_code_meaning": ["Erect", "Recumbent"],
        }
    }

    def random_hex(length):
        # Random hex string, used for dicom_id segments.
        hex_chars = list("0123456789abcdef")
        return "".join(np.random.choice(hex_chars, length))

    def random_id(length):
        # Random id string; hex characters, matching the committed sample data.
        # (Renamed from the builtin-shadowing hex()/int() helpers.)
        return random_hex(length)

    def generate_random_row(dimensions):
        performed_procedure_step_description = random.choice(
            list(performed_procedure_step_descriptions)
        )
        procedure = performed_procedure_step_descriptions[performed_procedure_step_description]
        n_views = procedure["n_views"]
        view_index = random.randint(0, n_views - 1)
        view_position = procedure["view_position"][view_index]
        procedure_code_meaning = procedure["procedure_code_meaning"]
        view_code_meaning = procedure["view_code_meaning"][view_index]
        # Currently unsure how/if view codes are mapped to orientations
        orientation_code_meaning = random.choice(procedure["orientation_code_meaning"])
        subject_id = random_id(8)
        study_id = random_id(8)
        meta_row = {
            "dicom_id": "-".join(random_hex(8) for i in range(4)),
            "subject_id": subject_id,
            "study_id": study_id,
            "PerformedProcedureStepDescription": performed_procedure_step_description,
            "ViewPosition": view_position,
            "Rows": dimensions[0],
            "Columns": dimensions[1],
            "StudyDate": 0,
            "StudyTime": 0,
            "ProcedureCodeSequence_CodeMeaning": procedure_code_meaning,
            "ViewCodeSequence_CodeMeaning": view_code_meaning,
            "PatientOrientationCodeSequence_CodeMeaning": orientation_code_meaning,
        }

        def random_pred():
            return random.choice(["1.0", "-1.0", "0.0", ""])

        csv_row = {
            "subject_id": subject_id,
            "study_id": study_id,
            "Atelectasis": random_pred(),
            "Cardiomegaly": random_pred(),
            "Consolidation": random_pred(),
            "Edema": random_pred(),
            "Enlarged Cardiomediastinum": random_pred(),
            "Fracture": random_pred(),
            "Lung Lesion": random_pred(),
            "Lung Opacity": random_pred(),
            "No Finding": random_pred(),
            "Pleural Effusion": random_pred(),
            "Pleural Other": random_pred(),
            "Pneumonia": random_pred(),
            "Pneumothorax": random_pred(),
            "Support Devices": random_pred(),
        }
        return meta_row, csv_row

    meta_rows, csv_rows = show(list(zip(*show(
        [generate_random_row(dimensions) for i in range(n)]
    ))))

    return pd.DataFrame(meta_rows), pd.DataFrame(csv_rows)


def generate_test_images(random_metadata, extracted, tarname, zipname, dimensions):
    paths = []
    for _, row in random_metadata.iterrows():
        subjectid = row["subject_id"]
        studyid = row["study_id"]
        dicom_id = row["dicom_id"]
        img_fname = os.path.join("p" + subjectid[:2], "p" + subjectid, "s" + studyid, dicom_id + ".jpg")
        paths.append(Path("files") / img_fname)
    write_random_images(paths, extracted, tarname, zipname, dimensions)


def generate_test_data(n, directory, dimensions=(224, 224), tarname=None, zipname=None, extracted=None):
    directory = Path(directory)
    if tarname is None:
        tarname = directory / "images-224.tar"
    if zipname is None:
        zipname = directory / "images-224.zip"
    if extracted is None:
        extracted = directory / "images-224"
    random_metadata, random_csvdata = generate_random_metadata(n, dimensions)
    generate_test_images(random_metadata, extracted, tarname, zipname, dimensions)
    random_metadata.to_csv(
        directory / mimic_metadata_filename,
        index=False
    )
    random_metadata.to_csv(
        directory / (mimic_metadata_filename + ".gz"),
        compression="gzip",
        index=False
    )
    random_csvdata.to_csv(
        directory / mimic_csvdata_filename,
        index=False
    )
    random_csvdata.to_csv(
        directory / (mimic_csvdata_filename + ".gz"),
        compression="gzip",
        index=False
    )


# Example layout:
# ./images-224/files/p17/p17387118/s56770356/b983f94c-b77ad35d-8a4aa372-2faf6503-5ec94835.jpg

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("n")
    parser.add_argument("directory")
    parser.add_argument("x")
    parser.add_argument("y")
    parser.add_argument("tarfile", default=None, nargs="?")
    parser.add_argument("extracted", default=None, nargs="?")
    args = parser.parse_args()
    generate_test_data(
        n=int(args.n),
        directory=args.directory,
        dimensions=(int(args.x), int(args.y)),
        tarname=args.tarfile,
        extracted=args.extracted,
    )
2 changes: 2 additions & 0 deletions tests/gen_mimic.sh
@@ -0,0 +1,2 @@
python3 gen_mimic.py 10 gen_mimic 224 224

Binary file added tests/gen_mimic/images-224.tar
Binary file not shown.
Binary file added tests/gen_mimic/images-224.zip
Binary file not shown.
11 changes: 11 additions & 0 deletions tests/gen_mimic/mimic-cxr-2.0.0-metadata.csv
@@ -0,0 +1,11 @@
dicom_id,subject_id,study_id,PerformedProcedureStepDescription,ViewPosition,Rows,Columns,StudyDate,StudyTime,ProcedureCodeSequence_CodeMeaning,ViewCodeSequence_CodeMeaning,PatientOrientationCodeSequence_CodeMeaning
dc7a8b5e-4ad1afd8-8f994433-643c39f8,346b3f01,14b48a44,CHEST (PA AND LAT),LATERAL,224,224,0,0,CHEST (PA AND LAT),lateral,Recumbent
9622b5f2-3a3e307e-f5ca9164-375a6f70,11a8f57f,bf2ecf26,CHEST (PA AND LAT),PA,224,224,0,0,CHEST (PA AND LAT),postero-anterior,Erect
8cf7e7b2-46bdc39f-31564b20-71ee6b6d,4eedf9fe,f8521632,CHEST (PA AND LAT),PA,224,224,0,0,CHEST (PA AND LAT),postero-anterior,Erect
6a301286-456d8528-c59bd118-195ce98c,bc0ee611,89cfed09,CHEST (PA AND LAT),LATERAL,224,224,0,0,CHEST (PA AND LAT),lateral,Erect
67dc7d13-32da76c2-ff22a175-5d4c3ced,54e08d2a,1c41417d,CHEST (PA AND LAT),PA,224,224,0,0,CHEST (PA AND LAT),postero-anterior,Erect
b69fd4aa-55745241-f15af2bb-b979004a,3b3b7d36,f076c36f,CHEST (PA AND LAT),PA,224,224,0,0,CHEST (PA AND LAT),postero-anterior,Erect
7e393cbe-6eac27c9-469708b4-dc5f22ef,310345ea,274bcf57,CHEST (PA AND LAT),LATERAL,224,224,0,0,CHEST (PA AND LAT),lateral,Erect
a0082cb3-93689ac6-4fbbad4e-ddd68866,bfb2d1b6,2c5d98ee,CHEST (PA AND LAT),LATERAL,224,224,0,0,CHEST (PA AND LAT),lateral,Recumbent
4197fbbc-1ed2f09e-8a82308b-971009a3,3d404d22,1e752e16,CHEST (PA AND LAT),LATERAL,224,224,0,0,CHEST (PA AND LAT),lateral,Erect
baf27490-fe6678a0-fe0b7379-665d78ba,6145fd64,7968e508,CHEST (PA AND LAT),PA,224,224,0,0,CHEST (PA AND LAT),postero-anterior,Recumbent
Binary file added tests/gen_mimic/mimic-cxr-2.0.0-metadata.csv.gz
Binary file not shown.
11 changes: 11 additions & 0 deletions tests/gen_mimic/mimic-cxr-2.0.0-negbio.csv
@@ -0,0 +1,11 @@
subject_id,study_id,Atelectasis,Cardiomegaly,Consolidation,Edema,Enlarged Cardiomediastinum,Fracture,Lung Lesion,Lung Opacity,No Finding,Pleural Effusion,Pleural Other,Pneumonia,Pneumothorax,Support Devices
346b3f01,14b48a44,-1.0,-1.0,0.0,0.0,-1.0,-1.0,,-1.0,0.0,,,,,
11a8f57f,bf2ecf26,1.0,1.0,0.0,0.0,-1.0,0.0,-1.0,-1.0,0.0,1.0,1.0,-1.0,,1.0
4eedf9fe,f8521632,,,1.0,0.0,1.0,1.0,,,,1.0,0.0,-1.0,-1.0,
bc0ee611,89cfed09,0.0,,0.0,-1.0,-1.0,1.0,1.0,1.0,,0.0,-1.0,0.0,-1.0,0.0
54e08d2a,1c41417d,1.0,-1.0,0.0,,,-1.0,-1.0,1.0,-1.0,-1.0,,,1.0,0.0
3b3b7d36,f076c36f,,0.0,0.0,0.0,,1.0,0.0,0.0,0.0,-1.0,,,1.0,0.0
310345ea,274bcf57,,0.0,0.0,-1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,-1.0
bfb2d1b6,2c5d98ee,-1.0,,,,,,1.0,,1.0,0.0,0.0,0.0,-1.0,-1.0
3d404d22,1e752e16,0.0,0.0,1.0,-1.0,0.0,-1.0,1.0,1.0,0.0,-1.0,-1.0,1.0,-1.0,1.0
6145fd64,7968e508,,-1.0,0.0,,-1.0,1.0,1.0,0.0,1.0,1.0,,1.0,,
Binary file added tests/gen_mimic/mimic-cxr-2.0.0-negbio.csv.gz
Binary file not shown.
2 changes: 2 additions & 0 deletions tests/generate_all.sh
@@ -0,0 +1,2 @@
bash gen_mimic.sh
bash generate_test_data.sh
29 changes: 29 additions & 0 deletions tests/generate_test_data.py
@@ -0,0 +1,29 @@
from pathlib import Path
import pandas as pd
from random_data import write_random_images
import argparse


def generate_test_data(metadata_file, filename_column, size, test_data_folder):
    test_data_folder = Path(test_data_folder)
    write_random_images(
        metadata_file[filename_column],
        test_data_folder / "folder",
        test_data_folder / "tar.tar",
        test_data_folder / "zip.zip",
        size
    )


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("metadata_file")
    parser.add_argument("filename_column")
    parser.add_argument("x")
    parser.add_argument("y")
    parser.add_argument("test_data_folder")
    args = parser.parse_args()
    generate_test_data(
        pd.read_csv(args.metadata_file),
        args.filename_column,
        (int(args.x), int(args.y)),
        args.test_data_folder
    )
2 changes: 2 additions & 0 deletions tests/generate_test_data.sh
@@ -0,0 +1,2 @@
python3 generate_test_data.py pc.csv ImageID 2 2 PC_test_data
python3 generate_test_data.py nih.csv "Image Index" 2 2 NIH_test_data
68 changes: 67 additions & 1 deletion tests/test_dataloaders.py
@@ -1,6 +1,9 @@
import pytest
import pickle
import torchxrayvision as xrv

import os
from pathlib import Path

dataset_classes = [xrv.datasets.NIH_Dataset,
xrv.datasets.PC_Dataset,
xrv.datasets.NIH_Google_Dataset,
@@ -45,3 +48,66 @@ def test_dataloader_merging_incorrect_alignment():

    assert "incorrect pathology alignment" in str(excinfo.value)


def test_mimic_tar():
    print(os.getcwd())
    # Load tarred and untarred datasets
    mimic_test_dir = Path("tests/gen_mimic")
    metacsvpath = mimic_test_dir / "mimic-cxr-2.0.0-metadata.csv"
    csvpath = mimic_test_dir / "mimic-cxr-2.0.0-negbio.csv"
    tarred = xrv.datasets.MIMIC_Dataset(
        imgpath=mimic_test_dir / "images-224.tar",
        csvpath=csvpath,
        metacsvpath=metacsvpath,
    )
    extracted = xrv.datasets.MIMIC_Dataset(
        imgpath=mimic_test_dir / "images-224" / "files",
        csvpath=csvpath,
        metacsvpath=metacsvpath
    )
    # Assert items are the same
    for tarred_item, extracted_item in zip(tarred, extracted):
        assert pickle.dumps(tarred_item) == pickle.dumps(extracted_item)


def all_equal(items):
    if len(items) == 1:
        return True
    return all(item == items[0] for item in items[1:])


def _test_opening_formats(dataset_class, imgpaths, **kwargs):
    sources = [dataset_class(imgpath=path, **kwargs) for path in imgpaths]
    for one_item_from_each in zip(*sources):
        assert all_equal([pickle.dumps(item) for item in one_item_from_each])


def test_mimic_formats():
    _test_opening_formats(
        xrv.datasets.MIMIC_Dataset,
        imgpaths=[
            "tests/gen_mimic/images-224/files",
            "tests/gen_mimic/images-224.tar",
            "tests/gen_mimic/images-224.zip"
        ],
        csvpath="tests/gen_mimic/mimic-cxr-2.0.0-negbio.csv",
        metacsvpath="tests/gen_mimic/mimic-cxr-2.0.0-metadata.csv"
    )


def test_nih_formats():
    _test_opening_formats(
        xrv.datasets.NIH_Dataset,
        imgpaths=[
            "tests/NIH_test_data/folder",
            "tests/NIH_test_data/tar.tar",
            "tests/NIH_test_data/zip.zip"
        ],
        csvpath="tests/nih.csv"
    )


def test_pc_formats():
    _test_opening_formats(
        xrv.datasets.PC_Dataset,
        imgpaths=[
            "tests/PC_test_data/folder",
            "tests/PC_test_data/tar.tar",
            "tests/PC_test_data/zip.zip"
        ],
        csvpath="tests/pc.csv"
    )
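Several commits in this PR ("Improve caching", "use blake2b instead of hash", "trycatch on reading/writing stored_mappings", "work when .torchxrayvision is not writeable") describe caching the archive's member index on disk and degrading gracefully when the cache directory cannot be written. The sketch below illustrates that strategy under those assumptions; the function and cache-file names are hypothetical, not the PR's actual API.

```python
import hashlib
import pickle
import tarfile
from pathlib import Path


def load_member_index(tar_path, cache_dir=Path.home() / ".torchxrayvision"):
    """Return {basename: full path inside tar}, caching the listing on disk.

    Hypothetical sketch: blake2b of (path, size, mtime) keys the cache, and
    any failure to read or write the cache falls back to an in-memory index.
    """
    tar_path = Path(tar_path)
    stat = tar_path.stat()
    # blake2b gives a stable key, unlike Python's per-process hash()
    key = hashlib.blake2b(
        f"{tar_path.resolve()}:{stat.st_size}:{stat.st_mtime}".encode()
    ).hexdigest()
    cache_file = Path(cache_dir) / f"{key}.pkl"
    try:
        with open(cache_file, "rb") as f:
            return pickle.load(f)
    except (OSError, EOFError, pickle.UnpicklingError):
        pass  # no usable cache: list the archive directly
    with tarfile.open(tar_path) as tf:
        index = {Path(name).name: name for name in tf.getnames()}
    try:
        cache_file.parent.mkdir(parents=True, exist_ok=True)
        with open(cache_file, "wb") as f:
            pickle.dump(index, f)
    except OSError:
        pass  # cache dir unwriteable: keep the in-memory index only
    return index
```

The second call for the same archive then hits the pickle instead of re-listing the tar, which is the speedup the "dataset loads faster the second time" test above was written to observe.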