[ENH] Recommend gzip header fields be set to empty values #1349

kousu · 2022-11-04T19:17:16Z

By removing extraneous metadata, pipelines generating BIDS datasets can avoid making unnecessary changes.

We are motivated to this since we are working on combining BIDS with datalad in our workflows, and if a pipeline regenerates hundreds of images with no real change, then we rapidly bloat our storage costs and obscures real changes in git log. datalad (thanks to its dependency git-annex) avoids the bandwidth/storage costs of downstream users, since most of our users download once only, or are always downloading fresh HEAD copies to processing nodes, but our internal storage costs are a big factor for us.

This can be implemented with gzip -n in most environments, or equally by running strip-nondeterminism over a BIDS dataset before publishing.

closes #136

codecov · 2022-11-04T19:24:39Z

Codecov Report

Base: 88.65% // Head: 88.65% // No change to project coverage 👍

Coverage data is based on head (92efb97) compared to base (e70278c).
Patch has no changes to coverable lines.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #1349   +/-   ##
=======================================
  Coverage   88.65%   88.65%           
=======================================
  Files          11       11           
  Lines        1084     1084           
=======================================
  Hits          961      961           
  Misses        123      123

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

effigies · 2022-11-04T19:33:13Z

I don't think we can make this a requirement as it will invalidate many existing datasets. I would support making it a recommendation, which would produce validation warnings, assuming there is a straightforward way to check for this metadata in Javascript.

kousu · 2022-11-04T19:38:59Z

Cool :). I thought you might say something like that. That would be okay, but I wanted to start out as strong as possible.

Before totally giving up on the "MUST", let me point out that strip-nondeterminism should be able to rapidly re-validate all those datasets. We've had to deal with breaking BIDS changes in our lab already that involved more complicated fixes than that.

effigies · 2022-11-04T20:25:43Z

Agreed that the fix is easy, and some labs will be willing to apply it. Apart from concern about making BIDS a moving target and generally annoying people who've done their bit, my perspective here is the OpenNeuro one, where we host data archives (so reproducibility of the archive isn't a concern) but don't generally have permission to update datasets on the user's behalf.

It is true that there've been breaks in the past; most breaking changes in the validator involved rules that were previously unenforced. There are some others where an ambiguity had to be resolved in one direction or another. We do honestly try to keep these to a minimum, and will generally try to accommodate existing practice except when it makes interpretation of the dataset ambiguous or incoherent.

For your lab/center's use case, the validator allows warnings to be promoted to errors, so internally you could have more strict validation. OpenNeuro does this for missing authors, as we can't generate DOIs without this metadata.

So my vote remains for a backwards-compatible SHOULD, though I would support moving to MUST in a BIDS 2.0.

xref bids-standard/bids-validator#1400

kousu · 2022-11-04T20:30:15Z

the validator allows warnings to be promoted to errors

Oh okay! I hadn't noticed that yet. Thank you. That's good enough for me :)

By removing extraneous metadata, pipelines generating BIDS datasets can avoid making unnecessary changes. We are motivated to this since we are working on combining BIDS with datalad in our workflows, and if a pipeline regenerates hundreds of images with no change, then we rapidly bloat our storage costs and obscures real changes in `git log`. datalad (thanks to git-annex) avoids the bandwidth/storage costs of downstream users, since most of our users download once only, or are always downloading fresh HEAD copies to processing nodes, but our internal storage costs are a big factor for us. This can be implemented with `gzip -n` in most environments, or equally by running https://salsa.debian.org/reproducible-builds/strip-nondeterminism over a BIDS dataset before publishing.

effigies · 2022-11-14T16:46:27Z

Rebase changed apparent commit time, but it was authored on Nov 4, so this has passed the 5 day review period. Merging.

effigies mentioned this pull request Nov 4, 2022

Warning when gzip headers contain non-BIDS filenames bids-standard/bids-validator#1400

Open

kousu marked this pull request as ready for review November 5, 2022 04:07

effigies changed the title ~~[SCHEMA] Insist .nii.gz be as reproducible as possible~~ [ENH] Recommend gzip header fields be set to empty values Nov 14, 2022

effigies approved these changes Nov 14, 2022

View reviewed changes

kousu added 2 commits November 14, 2022 08:52

"MUST" -> "SHOULD"

92efb97

effigies force-pushed the patch-1 branch from d26517e to 92efb97 Compare November 14, 2022 13:52

effigies requested a review from DimitriPapadopoulos as a code owner November 14, 2022 13:52

DimitriPapadopoulos approved these changes Nov 14, 2022

View reviewed changes

effigies merged commit 9231875 into bids-standard:master Nov 14, 2022

effigies added a commit to effigies/bids-specification that referenced this pull request Nov 17, 2022

NF: Add GZIP_HEADER_DATA issue, following bids-standardgh-1349

da4cfa7

effigies added a commit that referenced this pull request Dec 1, 2022

NF: Add GZIP_HEADER_DATA issue, following gh-1349 (#1360)

a1c7dfc

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ENH] Recommend gzip header fields be set to empty values #1349

[ENH] Recommend gzip header fields be set to empty values #1349

kousu commented Nov 4, 2022 •

edited by Remi-Gau

Loading

codecov bot commented Nov 4, 2022 •

edited

Loading

effigies commented Nov 4, 2022

kousu commented Nov 4, 2022

effigies commented Nov 4, 2022

kousu commented Nov 4, 2022

effigies commented Nov 14, 2022

[ENH] Recommend gzip header fields be set to empty values #1349

[ENH] Recommend gzip header fields be set to empty values #1349

Conversation

kousu commented Nov 4, 2022 • edited by Remi-Gau Loading

codecov bot commented Nov 4, 2022 • edited Loading

Codecov Report

effigies commented Nov 4, 2022

kousu commented Nov 4, 2022

effigies commented Nov 4, 2022

kousu commented Nov 4, 2022

effigies commented Nov 14, 2022

kousu commented Nov 4, 2022 •

edited by Remi-Gau

Loading

codecov bot commented Nov 4, 2022 •

edited

Loading