
Shasum of gzipped files depends on use of Matlab vs. Octave & host #457

Open
cmaumet opened this issue Mar 1, 2018 · 11 comments

Comments

@cmaumet
Member

cmaumet commented Mar 1, 2018

Hi everyone,

In a NIDM-Results pack:

  • we store the shasum of the files we refer to.
  • we store gzipped images (.nii.gz) (in order to save space).

But the shasums of the gzipped files differ:

  • depending on whether the images were compressed using Matlab or Octave
  • depending on the host system (local macOS versus Ubuntu on Travis CI) when Octave is launched via Docker

Differences in shasum can be explained by the fact that different gzip implementations (with different settings) were used to compress the images. But this defeats our initial goal of being able to identify common images across multiple NIDM graphs (for reconciliation).
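For context (not from the thread itself): the mismatch can be reproduced with any gzip implementation, because the gzip container embeds metadata such as the modification time in its header. A minimal Python sketch, using SHA-256 for illustration and a block of zero bytes as a stand-in for the image payload:

```python
import gzip
import hashlib

# Stand-in for the raw (uncompressed) NIfTI image bytes.
data = b"\x00" * 1024

# The gzip header embeds a modification time, so two otherwise identical
# compressions can yield different archives -- and different digests.
gz_a = gzip.compress(data, mtime=0)
gz_b = gzip.compress(data, mtime=1)

print(hashlib.sha256(gz_a).hexdigest() == hashlib.sha256(gz_b).hexdigest())  # False

# The payloads are identical, so digests of the decompressed bytes agree.
print(hashlib.sha256(gzip.decompress(gz_a)).hexdigest()
      == hashlib.sha256(gzip.decompress(gz_b)).hexdigest())  # True
```

Different compression levels, comment fields, or OS bytes in the header have the same effect: the archive's digest changes while the payload's digest does not.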

As a workaround, we could additionally store the shasum of the file before compression.

What are your thoughts on this?

Note: This issue was identified with @gllmflndn when testing the SPM-NIDM-Results exporter in Octave at incf-nidash/nidmresults-spm#46 and briefly discussed on NIDM call (Jan. 29th, 2018).

@nicholst
Contributor

nicholst commented Mar 1, 2018

+1 on this... especially considering that gzip can be called with different options (e.g. compression level) and can even include optional comment fields, this approach was always rather fragile. It's annoying, but I don't see a workaround.

@satra
Contributor

satra commented Mar 1, 2018

+1 on storing non-zipped sums. but in general, since a change of a single bit can affect a shasum, these are not good substitutes for anything other than identity.

we have always considered more flexible hashes to match the binary blob, the header, etc. we can describe an image based on the overall hash, the blob being the same, the header being the same, and so on.
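One way to realize such piecewise hashing for a single-file NIfTI-1 image is sketched below. This is an assumption-laden sketch: it splits at the fixed 348-byte NIfTI-1 header size and treats everything after it as the data blob, ignoring vox_offset and header extensions (a real implementation would use something like nibabel instead).

```python
import hashlib

NIFTI1_HEADER_SIZE = 348  # fixed size of the NIfTI-1 header

def piecewise_hashes(raw):
    """Hash the header and the rest of an uncompressed .nii file separately.

    Sketch only: a real implementation would honor vox_offset and header
    extensions rather than splitting at a fixed offset.
    """
    header, blob = raw[:NIFTI1_HEADER_SIZE], raw[NIFTI1_HEADER_SIZE:]
    return {
        "sha256_header": hashlib.sha256(header).hexdigest(),
        "sha256_blob": hashlib.sha256(blob).hexdigest(),
    }
```

Two files whose headers differ only in, say, the description field would then still agree on sha256_blob.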

@cmaumet
Member Author

cmaumet commented Mar 5, 2018

We discussed this on NIDM call on March 5th.

@cmaumet - to write up a proposal on how to store the original shasum (including pros and cons).

@satra
Contributor

satra commented Mar 5, 2018

given that shasums are bit-dependent, what is the likelihood of two unzipped nifti files having the same shasum when run through the same processing, say in spm and fsl?

i.e. should we start moving towards breaking down the information content into pieces that we want to query on.

@cmaumet
Member Author

cmaumet commented Mar 5, 2018

@satra: if two pipelines reused the same data?

@satra
Contributor

satra commented Mar 5, 2018

@cmaumet - yes. i worry there are too many pieces in the nifti file that would be different.

so the only thing consistent would be at the level of the input data. and if that's the case, then the SHASUM as it stands currently would be fine to refer to input data.

@cmaumet
Member Author

cmaumet commented Mar 5, 2018

@satra - what would be your suggested update for NIDM? Creating separate entities for the header & image data of each file?

@satra
Contributor

satra commented Mar 5, 2018

@cmaumet - perhaps it would be useful to know what sort of equality comparisons you are planning to make?

@nicholst
Contributor

nicholst commented Mar 5, 2018 via email

@satra
Contributor

satra commented Mar 6, 2018

@nicholst - that is correct. hence my question of what types of comparisons to make.

i used the phrase "same same but different" for an ohbm brainhack project last year, to illustrate issues with similarity. two files can be similar on the basis of:

image similarity

  • imaging modality
  • imaging object (brain, cerebellum, spinal cord, ...)
  • image subject to transformation

graph similarity

  • processing applied
  • participant/group similarity
  • participant characteristics (age, gender, zygosity, clinical diagnosis, ...)

for this specific issue, perhaps we should focus on attributes directly/easily extractable from the image. we want a set of comparison attributes associated with an image. we could insert new attributes into the file, or create a new companion entity of similarity measures, i.e. when are two files similar.

i do think this topic is worth a good discussion. we should determine what aspects of similarity we get:

  • directly from attributes
  • through processing of the graph
  • through processing of the image

and what use cases these pieces of information are intended to help address.
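As one entirely hypothetical shape for such a companion entity, the similarity attributes could be grouped by how they were obtained (none of these keys are NIDM terms; they only illustrate the structure):

```python
# Hypothetical companion entity grouping similarity attributes by provenance.
# None of these keys or values are NIDM terms; they illustrate the structure.
similarity_entity = {
    "from_attributes": {          # directly from attributes
        "imaging_modality": "fMRI",
        "imaged_object": "brain",
    },
    "from_graph": {               # through processing of the graph
        "processing_applied": ["realign", "smooth"],
    },
    "from_image": {               # through processing of the image
        "sha256_header": "…",     # digest of the header bytes
        "sha256_blob": "…",       # digest of the image data
    },
}
```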

@khelm
Contributor

khelm commented Mar 6, 2018

FYI - There is a similar discussion regarding the use of owl:sameAs, which points out that owl:sameAs is often used to convey "represents", "very similar to", "same thing but in a different context", etc., some of which are relevant to @satra's discussion above.
