Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Recommend gzip header fields be set to empty values #1349

Merged
merged 2 commits into from
Nov 14, 2022

Conversation

kousu
Copy link
Contributor

@kousu kousu commented Nov 4, 2022

By removing extraneous metadata, pipelines generating BIDS datasets can avoid making unnecessary changes.

We are motivated to this since we are working on combining BIDS with datalad in our workflows, and if a pipeline regenerates hundreds of images with no real change, then we rapidly bloat our storage costs and obscures real changes in git log. datalad (thanks to its dependency git-annex) avoids the bandwidth/storage costs of downstream users, since most of our users download once only, or are always downloading fresh HEAD copies to processing nodes, but our internal storage costs are a big factor for us.

This can be implemented with gzip -n in most environments, or equally by running strip-nondeterminism over a BIDS dataset before publishing.

closes #136

@codecov
Copy link

codecov bot commented Nov 4, 2022

Codecov Report

Base: 88.65% // Head: 88.65% // No change to project coverage 👍

Coverage data is based on head (92efb97) compared to base (e70278c).
Patch has no changes to coverable lines.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1349   +/-   ##
=======================================
  Coverage   88.65%   88.65%           
=======================================
  Files          11       11           
  Lines        1084     1084           
=======================================
  Hits          961      961           
  Misses        123      123           

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@effigies
Copy link
Collaborator

effigies commented Nov 4, 2022

I don't think we can make this a requirement as it will invalidate many existing datasets. I would support making it a recommendation, which would produce validation warnings, assuming there is a straightforward way to check for this metadata in Javascript.

@kousu
Copy link
Contributor Author

kousu commented Nov 4, 2022

Cool :). I thought you might say something like that. That would be okay, but I wanted to start out as strong as possible.

Before totally giving up on the "MUST", let me point out that strip-nondeterminism should be able to rapidly re-validate all those datasets. We've had to deal with breaking BIDS changes in our lab already that involved more complicated fixes than that.

@effigies
Copy link
Collaborator

effigies commented Nov 4, 2022

Agreed that the fix is easy, and some labs will be willing to apply it. Apart from concern about making BIDS a moving target and generally annoying people who've done their bit, my perspective here is the OpenNeuro one, where we host data archives (so reproducibility of the archive isn't a concern) but don't generally have permission to update datasets on the user's behalf.

It is true that there've been breaks in the past; most breaking changes in the validator involved rules that were previously unenforced. There are some others where an ambiguity had to be resolved in one direction or another. We do honestly try to keep these to a minimum, and will generally try to accommodate existing practice except when it makes interpretation of the dataset ambiguous or incoherent.

For your lab/center's use case, the validator allows warnings to be promoted to errors, so internally you could have more strict validation. OpenNeuro does this for missing authors, as we can't generate DOIs without this metadata.

So my vote remains for a backwards-compatible SHOULD, though I would support moving to MUST in a BIDS 2.0.

xref bids-standard/bids-validator#1400

@kousu
Copy link
Contributor Author

kousu commented Nov 4, 2022

the validator allows warnings to be promoted to errors

Oh okay! I hadn't noticed that yet. Thank you. That's good enough for me :)

@kousu kousu marked this pull request as ready for review November 5, 2022 04:07
@effigies effigies changed the title [SCHEMA] Insist .nii.gz be as reproducible as possible [ENH] Recommend gzip header fields be set to empty values Nov 14, 2022
By removing extraneous metadata, pipelines generating BIDS datasets can avoid making unnecessary changes.

We are motivated to this since we are working on combining BIDS with datalad in our workflows, and if a pipeline regenerates hundreds of images with no change, then we rapidly bloat our storage costs and obscures real changes in `git log`. datalad (thanks to git-annex) avoids the bandwidth/storage costs of downstream users, since most of our users download once only, or are always downloading fresh HEAD copies to processing nodes, but our internal storage costs are a big factor for us.

This can be implemented with `gzip -n` in most environments, or equally by running https://salsa.debian.org/reproducible-builds/strip-nondeterminism over a BIDS dataset before publishing.
@effigies
Copy link
Collaborator

Rebase changed apparent commit time, but it was authored on Nov 4, so this has passed the 5 day review period. Merging.

@effigies effigies merged commit 9231875 into bids-standard:master Nov 14, 2022
effigies added a commit to effigies/bids-specification that referenced this pull request Nov 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

mismatch between name of a *.nii.gz file and the *.nii file it contains
3 participants