Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parameterize dtype for h5path with SlideData constructor #335

Open
wants to merge 8 commits into
base: dev
Choose a base branch
from

Conversation

tddough98
Copy link
Collaborator

@tddough98 tddough98 commented Oct 14, 2022

Currently, PathML stores all images with float16, forcing all image inputs to be upcast or downcast to this data type, which increases storage size or loses information. There already is a dtype parameter in the SlideData constructor, but it's only used to assist the BioFormatsBackend in loading images correctly. This repurposes that parameter to control what dtype h5py uses when writing image data.

I also changed masks to stored as ENUM and use the strongest compression setting as boolean masks are highly compressible and easily compressed. The compression made a huge difference in file size, and using (HDFView)[https://www.hdfgroup.org/downloads/hdfview/] showed a compression ratio of 100-200x for masks. The ENUM data type is stored as an 8-bit integer (https://docs.h5py.org/en/stable/special.html#enumerated-types) but at least this is less than using float16.

@tddough98 tddough98 changed the title Parameterize dtype for h5path with Slide constructor Parameterize dtype for h5path with SlideData constructor Oct 14, 2022
@ryanccarelli
Copy link
Contributor

@jacob-rosenthal Do I remember you saying that certain transforms expect float16? If you remember which we should tag and document these

@jacob-rosenthal
Copy link
Collaborator

Thanks @tddough98 , this looks great! The reason we had to add the dtype parameter in the first place was as a workaround because some of our collaborators were running into problems with images where the dtype from the metadata didn't match the dtype of the image data itself. So the parameter was added so that people could read images in e.g. float16 even if the image metadata said int8. From what I can tell looking at the code, this PR will still keep that behavior?
(Side note, we should add an explicit test for this using one of the misi images to be sure that they are always supported, maybe @MohamedOmar2020 you can help provide a small misi image we can add to the test suite?)

And yes, @ryanccarelli I think we did run into problems with some of the image transforms being picky about dtypes, and found that float16 worked well usually. I think since float16 remains the default, this should still be fine, since things would only break if people manually change dtype themselves

So from what I can tell, this PR seems like strictly an improvement, keeping existing behavior but making it cleaner & more flexible

@tddough98
Copy link
Collaborator Author

I reverted storing all masks as boolean. I was focused on masks made by TissueDetectionHE, which are always boolean, but now realize that instance segmentation masks from mesmer or hovernet need to use different values for each instance. Now, masks are saved with the same datatype as the numpy array for the mask.

@jacob-rosenthal
Copy link
Collaborator

Ah yes that is a good catch. I think that should have been caught by the tests? But looks like the tests haven't run for this PR yet, I guess because it is a PR that isn't merging into dev or master (the only branches that trigger tests).

Base automatically changed from merge-master to dev October 26, 2022 19:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants