-
Notifications
You must be signed in to change notification settings - Fork 157
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH] Recommend gzip header fields be set to empty values #1349
Conversation
Codecov ReportBase: 88.65% // Head: 88.65% // No change to project coverage 👍
Additional details and impacted files@@ Coverage Diff @@
## master #1349 +/- ##
=======================================
Coverage 88.65% 88.65%
=======================================
Files 11 11
Lines 1084 1084
=======================================
Hits 961 961
Misses 123 123 Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
I don't think we can make this a requirement as it will invalidate many existing datasets. I would support making it a recommendation, which would produce validation warnings, assuming there is a straightforward way to check for this metadata in Javascript. |
Cool :). I thought you might say something like that. That would be okay, but I wanted to start out as strong as possible. Before totally giving up on the "MUST", let me point out that |
Agreed that the fix is easy, and some labs will be willing to apply it. Apart from concern about making BIDS a moving target and generally annoying people who've done their bit, my perspective here is the OpenNeuro one, where we host data archives (so reproducibility of the archive isn't a concern) but don't generally have permission to update datasets on the user's behalf. It is true that there've been breaks in the past; most breaking changes in the validator involved rules that were previously unenforced. There are some others where an ambiguity had to be resolved in one direction or another. We do honestly try to keep these to a minimum, and will generally try to accommodate existing practice except when it makes interpretation of the dataset ambiguous or incoherent. For your lab/center's use case, the validator allows warnings to be promoted to errors, so internally you could have more strict validation. OpenNeuro does this for missing authors, as we can't generate DOIs without this metadata. So my vote remains for a backwards-compatible SHOULD, though I would support moving to MUST in a BIDS 2.0. |
Oh okay! I hadn't noticed that yet. Thank you. That's good enough for me :) |
By removing extraneous metadata, pipelines generating BIDS datasets can avoid making unnecessary changes. We are motivated to this since we are working on combining BIDS with datalad in our workflows, and if a pipeline regenerates hundreds of images with no change, then we rapidly bloat our storage costs and obscures real changes in `git log`. datalad (thanks to git-annex) avoids the bandwidth/storage costs of downstream users, since most of our users download once only, or are always downloading fresh HEAD copies to processing nodes, but our internal storage costs are a big factor for us. This can be implemented with `gzip -n` in most environments, or equally by running https://salsa.debian.org/reproducible-builds/strip-nondeterminism over a BIDS dataset before publishing.
Rebase changed apparent commit time, but it was authored on Nov 4, so this has passed the 5 day review period. Merging. |
By removing extraneous metadata, pipelines generating BIDS datasets can avoid making unnecessary changes.
We are motivated to this since we are working on combining BIDS with datalad in our workflows, and if a pipeline regenerates hundreds of images with no real change, then we rapidly bloat our storage costs and obscures real changes in
git log
. datalad (thanks to its dependency git-annex) avoids the bandwidth/storage costs of downstream users, since most of our users download once only, or are always downloading fresh HEAD copies to processing nodes, but our internal storage costs are a big factor for us.This can be implemented with
gzip -n
in most environments, or equally by runningstrip-nondeterminism
over a BIDS dataset before publishing.closes #136