Modify dset/attr builders based on sidecar JSON #677

rly · 2021-11-11T02:38:40Z

Motivation

This is DEMO code to demonstrate how HDMF might support the reading of file modifications from a sidecar JSON. I made up the formatting/schema of the JSON file. It would look like the below.

In this setup, the JSON file has a list of versions, each of which has a label, description, and list of changes.
Each change specifies an object id, relative path to the dataset/attribute being changed from the group/dataset with the object id, and the new value, which can be a scalar, list, or nested list.

The builder values are replaced after the file is fully built by HDF5IO.

Lots of details to be worked out (changing data types? changing shape? compound dtypes?). Let me know what you think of this approach. @oruebel @ajtritt @bendichter

{
    "versions": [
        {
            "label": "version 2",
            "description": "change attr1 from 'old' to 'my experiment' and my_data from [1, 2, 3] to [4, 5, 6, 7]",
            "changes": [
                {
                    "object_id": "3350e602-5073-4bd8-b835-2c771e8b11f9",
                    "relative_path": "attr1",
                    "new_value": "my experiment"
                },
                {
                    "object_id": "3350e602-5073-4bd8-b835-2c771e8b11f9",
                    "relative_path": "my_data",
                    "new_value": [
                        4,
                        5,
                        6,
                        7
                    ]
                }
            ]
        },
        {
            "label": "version 3",
            "description": "change sub_foo/my_data from [-1, -2, -3] to [[0]]",
            "changes": [
                {
                    "object_id": "a0eebdae-5c13-455c-bcf6-2e5b630a3079",
                    "relative_path": "my_data",
                    "new_value": [
                        [
                            0
                        ]
                    ]
                }
            ]
        }
    ]
}
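To make the lookup concrete, here is a minimal sketch of how such a file could be applied. It is not the code in this PR: the `objects` dict is a hypothetical stand-in for the built file (plain dicts keyed by object id, with values keyed by relative path), and `apply_versions` is an illustrative name.

```python
import json

# Hypothetical stand-in for the built file: plain dicts keyed by object_id.
# Real HDMF builders are richer objects; this only illustrates the lookup.
objects = {
    "3350e602-5073-4bd8-b835-2c771e8b11f9": {"attr1": "old", "my_data": [1, 2, 3]},
    "a0eebdae-5c13-455c-bcf6-2e5b630a3079": {"my_data": [-1, -2, -3]},
}

# Compact copy of the example sidecar above (descriptions omitted).
SIDECAR_TEXT = """
{"versions": [
  {"label": "version 2", "changes": [
    {"object_id": "3350e602-5073-4bd8-b835-2c771e8b11f9",
     "relative_path": "attr1", "new_value": "my experiment"},
    {"object_id": "3350e602-5073-4bd8-b835-2c771e8b11f9",
     "relative_path": "my_data", "new_value": [4, 5, 6, 7]}]},
  {"label": "version 3", "changes": [
    {"object_id": "a0eebdae-5c13-455c-bcf6-2e5b630a3079",
     "relative_path": "my_data", "new_value": [[0]]}]}
]}
"""


def apply_versions(objects, sidecar):
    """Apply every change of every version, in file order, in place."""
    for version in sidecar["versions"]:
        for change in version["changes"]:
            target = objects[change["object_id"]]
            target[change["relative_path"]] = change["new_value"]
    return objects


apply_versions(objects, json.loads(SIDECAR_TEXT))
```

Because versions are applied in order, a later version that touches the same path overrides an earlier one.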

Checklist

Did you update CHANGELOG.md with your changes?
Have you checked our Contributing document?
Have you ensured the PR clearly describes the problem and the solution?
Is your contribution compliant with our coding style? This can be checked running flake8 from the source directory.
Have you checked to ensure that there aren't other open Pull Requests for the same change?
Have you included the relevant issue number using "Fix #XXX" notation where XXX is the issue number? By including "Fix #XXX" you allow GitHub to close issue #XXX when the PR is merged.

codecov · 2021-11-11T02:45:36Z

Codecov Report

Base: 87.65% // Head: 87.18% // Decreases project coverage by -0.47% ⚠️

Coverage data is based on head (fee5245) compared to base (61eec5c).
Patch coverage: 69.47% of modified lines in pull request are covered.

Additional details and impacted files

@@            Coverage Diff             @@
##              dev     #677      +/-   ##
==========================================
- Coverage   87.65%   87.18%   -0.48%     
==========================================
  Files          44       45       +1     
  Lines        8845     9082     +237     
  Branches     2051     2108      +57     
==========================================
+ Hits         7753     7918     +165     
- Misses        777      829      +52     
- Partials      315      335      +20

Impacted Files	Coverage Δ
src/hdmf/spec/spec.py	`90.86% <ø> (ø)`
src/hdmf/build/builders.py	`94.33% <62.79%> (-5.67%)`	⬇️
src/hdmf/backends/builderupdater.py	`66.66% <66.66%> (ø)`
src/hdmf/backends/__init__.py	`100.00% <100.00%> (ø)`
src/hdmf/backends/io.py	`96.49% <100.00%> (+0.26%)`	⬆️
src/hdmf/build/__init__.py	`100.00% <100.00%> (ø)`
src/hdmf/build/errors.py	`100.00% <100.00%> (ø)`
src/hdmf/build/objectmapper.py	`93.12% <100.00%> (+0.39%)`	⬆️
src/hdmf/backends/hdf5/h5tools.py	`82.35% <0.00%> (+0.22%)`	⬆️
... and 1 more

oruebel · 2021-11-11T18:41:41Z

Let me know what you think of this approach.

Generally, this approach makes sense to me for updates on read.

Sidecar file schema

The general content in your example looks good.
The label of the version I think should require semantic versioning. In the API then, a user should only be able to indicate whether the changes are a major, minor, or patch version bump and the API would then set the version automatically. Users can add their own comments in the description.
We should look at PROV to see if following the PROV-JSON schema for describing these changes will work. This would make this part of the sidecar file idea more accessible to web standards and PROV tools and potentially help with maintaining this functionality if we can use PROV tools. This library may be of interest for this https://prov.readthedocs.io/en/latest/readme.html. It may very well be that this is not suitable here, but I think it is worth taking a look at.
We should require an ISO 8601 datetime stamp to be included with each version.
Even if we don't use PROV, it will be useful to have a key for an "agent" to record who made the changes related to a particular version
new_value should be renamed to simply value
We should include a dtype field here. JSON doesn't know the difference between int16 and int32, so we'll need to cast the data to correct type after read. This would also help address your question of "changing dtype" as this would mean specifying the dtype in the change but not the value.
Similarly to dtype, you could also allow for a key shape in the changes section (although I'm not sure it is necessary, unless you have to reshape the dataset, but I'm not sure what the use-case for that is right now).
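To illustrate the dtype point: `json.loads` returns plain Python integers with no width, so the sidecar has to carry the dtype separately. The sketch below uses only the standard library and checks whether the values fit a named integer dtype before any cast; `INT_RANGES` and `check_fits` are illustrative names, and with NumPy the cast itself would typically be `numpy.asarray(value, dtype=dtype)`.

```python
import json

# Inclusive value ranges for a few signed integer dtype names.
INT_RANGES = {
    "int8": (-2**7, 2**7 - 1),
    "int16": (-2**15, 2**15 - 1),
    "int32": (-2**31, 2**31 - 1),
    "int64": (-2**63, 2**63 - 1),
}


def check_fits(value, dtype):
    """Return True if every integer in a scalar or nested list fits the dtype."""
    if isinstance(value, list):
        return all(check_fits(item, dtype) for item in value)
    low, high = INT_RANGES[dtype]
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool) and low <= value <= high


# JSON carries no integer width: the parsed values are plain Python ints.
change = json.loads('{"value": [4, 5, 300], "dtype": "int16"}')
fits_declared = check_fits(change["value"], change["dtype"])
fits_int8 = check_fits(change["value"], "int8")
```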

Questions

Updating datasets with object_references and probably also updating compound_datasets is not supported yet in the current design, correct?
You mentioned "changing data types? changing shape?" I'm not sure what the use-case for these operations is. The shape and dtype of the data are dictated by the schema, so why would we reshape or cast a dataset in a valid NWB file?

rly · 2021-11-11T19:41:06Z

Let me know what you think of this approach.

Generally, this approach makes sense to me for updates on read.

Sidecar file schema

The general content in your example looks good.

The label of the version I think should require semantic versioning. In the API then, a user should only be able to indicate whether the changes are a major, minor, or patch version bump and the API would then set the version automatically. Users can add their own comments in the description.

How is semantic versioning defined for data? What constitutes a major, minor, or patch version bump?

We should look at PROV to see if following the PROV-JSON schema for describing these changes will work. This would make this part of the sidecar file idea more accessible to web standards and PROV tools and potentially help with maintaining this functionality if we can use PROV tools. This library may be of interest for this prov.readthedocs.io/en/latest/readme.html. It may very well be that this is not suitable here, but I think it is worth taking a look at.

It's worth noting that if we go with PROV, there will be a tradeoff between having the changes be accessible using PROV tools and human-readability/simplicity. Is "a user wants to edit this file by hand" a primary user story that we want to support?

We should require an ISO 8601 datetime stamp to be included with each version.

👍

Even if we don't use PROV, it will be useful to have a key for an "agent" to record who made the changes related to a particular version

👍

new_value should be renamed to simply value

👍

We should include a dtype field here. JSON doesn't know the difference between int16 and int32, so we'll need to cast the data to correct type after read. This would also help address your question of "changing dtype" as this would mean specifying the dtype in the change but not the value.

👍

Similarly, to dtype you could also allow for a key shape in the changes section (although I'm not sure it is necessary, unless you have to reshape the dataset, but I'm not sure what the use-case for that is right now).

👍

Questions

Updating datasets with object_references and probably also updating compound_datasets is not supported yet in the current design, correct?

You mentioned "changing data types? changing shape?" I'm not sure what the use-case for these operations is. The shape and dtype of the data are dictated by the schema, so why would we reshape or cast a dataset in a valid NWB file?

Some type specs allow a choice from multiple shapes and data types, e.g. TimeSeries.

oruebel · 2021-11-11T22:14:15Z

It's worth noting that if we go with PROV, there will be a tradeoff between having the changes be accessible using PROV tools and human-readability/simplicity.

I agree that this is a concern. Understanding PROV is not trivial. Another main concern is that I believe the PROV data model assumes that entities are immutable, i.e., all changes to an entity (e.g., a dataset or attribute in our case) would be assumed to be represented by the creation of a new entity. In this case, however, we want to update the entity rather than create a new one. The following paper may be relevant for this topic: https://link.springer.com/chapter/10.1007/978-3-319-98379-0_7

Ultimately, PROV may or may not be the right approach for this, but I think it's worth spending a little bit of time looking at it to see if it makes sense. If we decide against it, then at least we have a clear answer for folks why we chose not to use it.

Some type specs allow a choice from multiple shapes and data types, e.g. TimeSeries.

Correct, but I don't understand why one would need to change the shape of the data after the fact (unless you are replacing the values altogether).

How is semantic versioning defined for data?

I think semantic versioning can be applied fairly straight-forward here:

Given a version number MAJOR.MINOR.PATCH, increment the:

MAJOR version when you make incompatible data changes. I.e., when adding new main data type, e.g., a new ElectricalSeries, or when changing the dtype or shape of an object.
MINOR version when you add functionality in a backwards-compatible manner. Here this would mean, e.g., adding columns to a DynamicTable.
PATCH version when you make backwards-compatible bug fixes, i.e., update the value of an existing dataset or attribute (without changing shape or dtype).

Since we have a data schema, I think it may be possible to codify this to automatically determine version numbers based on the type of changes made. Providing some strict versioning rules from the start I think will make things easier down the line.
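A rough sketch of what codifying these rules might look like. The change category names in `BUMP_FOR_CHANGE` are hypothetical labels invented for this example; only the major/minor/patch mapping comes from the rules above.

```python
# Hypothetical change categories; the mapping follows the rules listed above.
BUMP_FOR_CHANGE = {
    "add_data_type": "major",
    "change_dtype": "major",
    "change_shape": "major",
    "add_column": "minor",
    "update_value": "patch",
}
SEVERITY = {"patch": 0, "minor": 1, "major": 2}


def next_version(current, change_kinds):
    """Return the version that follows `current` for the given kinds of change."""
    if not change_kinds:
        return current  # nothing changed, so no bump
    major, minor, patch = (int(part) for part in current.split("."))
    # The most severe change in the set decides which component is bumped.
    bump = max((BUMP_FOR_CHANGE[kind] for kind in change_kinds), key=SEVERITY.get)
    if bump == "major":
        return f"{major + 1}.0.0"
    if bump == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"
```

For example, a set of changes that adds a column and also updates a value would be treated as a minor bump, since the most severe change wins.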

This being said, I think it makes sense to have both a user-defined label key and a separate version key for a version number. I.e., for now, I would suggest leaving the label as is and just adding a version key that requires a semantic versioning string (which initially could just be set by a user and then as we make progress can be codified further).

Question: Which changes do we allow via the sidecar file?
I think (at least for now) we should only allow replacing datasets and attributes, while changes to individual values or adding new containers should be done in the HDF5 file. This means the following cases would not be supported (at least in the first iteration) and would only be possible via the APIs by modifying the HDF5 file:

- Adding rows to a dynamic table
- Adding a new recording or processing result (i.e., add a new neurodata_type)
- Adding a new group
- Updating datasets with object_references
- Updating datasets with compound data type
- ....

You could still record those changes as a version in the sidecar file (essentially by having a record with an empty changes key), so you would still have a record of the changes, but the changes would then not be reversible.

Question: How do we envision the sidecar file to be created in the API?
I think it would be worth having a separate discussion about this via Zoom to sketch out in pseudocode how this would look. My current thinking is that we would have a separate class to interact with sidecar files, which would serve as a "low-level" interface for making modifications to a sidecar file directly. Container classes would then need to provide high-level options to make changes, e.g., when changing an attribute value those changes would be routed to the sidecar file (either directly or when calling write, depending on the change).

Question: How to deal with larger changes?
I think for now, we probably only need to worry about smaller datasets, but it may be worth considering how we want to deal with larger changes as well. Instead of recording new values directly in the sidecar file, I could imagine that we may want to record changes in separate HDF5 files and load the values from there. This would not just allow for more compact storage of binary data but would also make dealing with compound data types, object references, and adding of new containers (without modifying the main file) possible.

bendichter · 2021-11-30T18:51:54Z

What if you want to remove values? e.g. what if you have the age of a subject and then realize that this age is incorrect but you don't know what the correct value is.

rly · 2021-11-30T20:28:22Z

What if you want to remove values? e.g. what if you have the age of a subject and then realize that this age is incorrect but you don't know what the correct value is.

You would set value: null

rly · 2021-12-01T18:37:45Z

Latest example JSON:

{
    "versions": [
        {
            "label": "2.0.0",
            "description": "change attr1 from 'old' to 'my experiment' and my_data from [1, 2, 3] to [4, 5]",
            "datetime": "2020-10-29T19:15:15.789Z",
            "agent": "John Doe",
            "changes": [
                {
                    "object_id": "e7fa8789-e446-49b3-944b-004763362aa1",
                    "relative_path": "attr1",
                    "value": "my experiment"
                },
                {
                    "object_id": "e7fa8789-e446-49b3-944b-004763362aa1",
                    "relative_path": "my_data",
                    "value": [
                        4,
                        5
                    ],
                    "dtype": "int32"
                }
            ]
        },
        {
            "label": "3.0.0",
            "description": "change sub_foo/my_data from [-1, -2, -3] to [[0]], delete my_data/attr2, and change dtype of my_data",
            "datetime": "2021-11-30T20:16:16.790Z",
            "agent": "Jane Doe",
            "changes": [
                {
                    "object_id": "26b95d5a-b632-407d-921a-8e255752b0f7",
                    "relative_path": "my_data",
                    "value": [
                        [
                            0
                        ]
                    ]
                },
                {
                    "object_id": "e7fa8789-e446-49b3-944b-004763362aa1",
                    "relative_path": "my_data/attr2",
                    "value": null
                },
                {
                    "object_id": "e7fa8789-e446-49b3-944b-004763362aa1",
                    "relative_path": "my_data",
                    "value": [
                        6,
                        7
                    ],
                    "dtype": "int8"
                }
            ]
        },
        {
            "label": "3.0.1",
            "description": "change my_data from [4, 5] to [6, 7]",
            "datetime": "2021-11-30T20:17:16.790Z",
            "agent": "Jane Doe",
            "changes": [
                {
                    "object_id": "e7fa8789-e446-49b3-944b-004763362aa1",
                    "relative_path": "my_data",
                    "value": [
                        6,
                        7
                    ]
                }
            ]
        }
    ],
    "schema_version": "0.1.0"
}

rly · 2021-12-01T18:40:42Z

TODO:

create jsonschema for sidecar JSON
validate sidecar JSON before read
set new dtype correctly
create documentation
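Until the jsonschema exists, the "validate before read" step could be approximated with a few hand-written checks. This is only a stdlib sketch: the sets of required keys are assumptions read off the latest example JSON, and `validate_sidecar` is an illustrative name rather than anything in this PR.

```python
from datetime import datetime

# Assumed required keys, taken from the latest example JSON in this thread.
VERSION_KEYS = {"label", "description", "datetime", "agent", "changes"}
CHANGE_KEYS = {"object_id", "relative_path", "value"}


def validate_sidecar(sidecar):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if "schema_version" not in sidecar:
        problems.append("missing schema_version")
    for i, version in enumerate(sidecar.get("versions", [])):
        missing = VERSION_KEYS - version.keys()
        if missing:
            problems.append(f"versions[{i}] missing {sorted(missing)}")
        stamp = version.get("datetime")
        if stamp is not None:
            try:
                # Older Python versions reject a trailing "Z", so normalize it.
                datetime.fromisoformat(str(stamp).replace("Z", "+00:00"))
            except ValueError:
                problems.append(f"versions[{i}] has invalid datetime {stamp!r}")
        for j, change in enumerate(version.get("changes", [])):
            missing = CHANGE_KEYS - change.keys()
            if missing:
                problems.append(f"versions[{i}].changes[{j}] missing {sorted(missing)}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a reader report everything wrong with a hand-edited file in one pass.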

Nice to have:

allow loading different versions from HDMFIO
allow modification of object references
allow modification of links
allow modification of compound dtypes
allow modification of slices of datasets or attributes

bendichter · 2021-12-02T16:18:22Z

I feel like "label" should be "version"

bendichter · 2021-12-02T16:34:56Z

Can groups be added? Would they be added implicitly if we gave a relative path that does not exist yet, or should we create them explicitly? For instance, if "a" exists as a group but "b" does not, and you want to create a dataset "d" inside of "b" inside of "a", then you might try adding "a/b/d" with some value, and have "b" implicitly created. Alternatively, we might want to require that "b" is created explicitly before we can add a dataset to it.

Can neurodata_type instances be added? (I'd lean towards no)

rly · 2021-12-02T17:38:16Z

Under the current setup, new HDF5 elements cannot be added. But I see your point that we may want that, especially for datasets and attributes. Let's say a user wants to add "related_publications" when the dataset doesn't exist.

For new neurodata_type instances, I also lean toward no. At that point, I would recommend modifying the file directly.

rly · 2021-12-06T19:29:46Z

TODO:

add uuid to each version so that different sidecar jsons can be more easily compared. e.g., two people make two different "version": "2.0.0" and we want to tell the difference between them and maybe even make a version map.
- con: the file is more complicated to create by hand
add "operation" key that can take value "replace", "delete", etc. this will make it easier to expand functionality without breaking existing functionality

rly · 2021-12-06T19:59:56Z

Comment from @yarikoptic

Versioning within this sidecar file is messy. Provenance is half-baked. Use a dedicated version system. If Version 2 changes A to B and Version 3 changes B to C, do we need to store the record of B? Just store that C replaces A. Just store an ordered list of changes.

Consider using "full_path" instead of "object_id" & "relative_path"

New example:

{
    "operations": [
        {
            "type": "replace",
            "description": "change attr1 to 'my experiment'",
            "object_id": "e7fa8789-e446-49b3-944b-004763362aa1",
            "relative_path": "attr1",
            "value": "my experiment"
        },
        {
            "type": "replace",
            "description": "change my_data to [4, 5] (int32)",
            "object_id": "e7fa8789-e446-49b3-944b-004763362aa1",
            "relative_path": "my_data",
            "value": [
                4,
                5
            ],
            "dtype": "int32"
        },
        {
            "type": "remove",
            "description": "delete my_data/attr2",
            "object_id": "e7fa8789-e446-49b3-944b-004763362aa1",
            "relative_path": "my_data/attr2"
        }
    ],
    "schema_version": "0.1.0"
}
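A minimal sketch of how an ordered list of operations like this could be applied. As before, `objects` is a hypothetical stand-in (one dict per object id, values keyed by relative path) and `apply_operations` is an illustrative name, not the implementation in this PR.

```python
import json

# Hypothetical stand-in for one built object: values keyed by relative path.
objects = {
    "e7fa8789-e446-49b3-944b-004763362aa1": {
        "attr1": "old",
        "my_data": [1, 2, 3],
        "my_data/attr2": "to be removed",
    }
}


def apply_operations(objects, operations):
    """Apply operations in list order; later entries win over earlier ones."""
    for op in operations:
        target = objects[op["object_id"]]
        path = op["relative_path"]
        if op["type"] == "replace":
            target[path] = op["value"]
        elif op["type"] == "remove":
            target.pop(path, None)  # drop the element if it is present
        else:
            raise ValueError(f"unsupported operation type: {op['type']}")
    return objects


sidecar = json.loads("""
{"operations": [
  {"type": "replace", "object_id": "e7fa8789-e446-49b3-944b-004763362aa1",
   "relative_path": "attr1", "value": "my experiment"},
  {"type": "replace", "object_id": "e7fa8789-e446-49b3-944b-004763362aa1",
   "relative_path": "my_data", "value": [4, 5], "dtype": "int32"},
  {"type": "remove", "object_id": "e7fa8789-e446-49b3-944b-004763362aa1",
   "relative_path": "my_data/attr2"}
], "schema_version": "0.1.0"}
""")
apply_operations(objects, sidecar["operations"])
```

Raising on an unknown "type" keeps the door open for new operation types without silently ignoring ones an older reader does not understand.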

Also should check that when a file is read and then exported to a new file, the new data is present. When it is opened in append mode, do not replace the values (but note that if the file or container is modified, then the new values would be written! This could result in partial overwrites, where change A is applied but change B is not, depending on which container is modified between read and write).

Advanced edge case: File A links to File B with its own sidecar JSON

bendichter · 2021-12-07T13:48:10Z

I have an idea that I don't think should be in this MVP, but I wanted to make a note of it and see what you all thought. Another "type" might be "add_alternate_value", which would be used as a value if the first value was somehow inaccessible. This could be a good solution for providing a link to a file that might be locally or remotely stored.

rly · 2021-12-08T00:10:17Z

I have an idea that I don't think should be in this MVP, but I wanted to make a note of it and see what you all thought. Another "type" might be "add_alternate_value", which would be used as a value if the first value was somehow inaccessible. This could be a good solution for providing a link to a file that might be locally or remotely stored.

This idea could work, but feels hacky and I would prefer to build this kind of functionality (alternate paths to look up a linked file) into core HDMF. I'm drafting an extension for that. If that takes too long though to release, then we can test out this solution.

oruebel · 2021-12-08T13:49:30Z

I'm drafting an extension for that. If that takes too long though to release, then we can test out this solution.

Just an idea. One option could also be to use ExternalResources to assign alternate online paths. In this scenario, the dataset would store the original local filepath and via ExternalResources one could associate an arbitrary number of external online resources with it.

oruebel · 2021-12-08T14:01:57Z

New example:

I think it may be useful to add the following keys at the top level (i.e., not for each operation, but for the whole sidecar file):

description : Describe the purpose and summarize changes
label and/or version : Even when using version control systems to manage sidecar files it will be useful to indicate the version of the sidecar file in the file itself. I.e., this would be to help with versioning the sidecar file, not to version individual changes as in the original proposal.
authors : List of authors of the sidecar file (or alternatively we could make it best practice to update the author field in NWB if the authors of the sidecar file are different from the authors of the NWB file).

rly · 2022-04-12T08:23:04Z

See https://hdmf--677.org.readthedocs.build/en/677/sidecar.html for documentation on what is currently supported in this PR. Feedback is appreciated.

I preserved some code that I wrote for other types of modifications in builderupdater.py but they are not currently used.

oruebel · 2021-11-11T17:30:52Z

src/hdmf/backends/hdf5/h5tools.py

+        returns='The same input GroupBuilder, now modified.',
+        rtype='GroupBuilder'
+    )
+    def update_builder_from_sidecar(self, **kwargs):


We could also add a post_read_builder function to HDMFIO itself to provide a standard place for I/O backends to update builders after read
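One way such a hook could look, as a sketch only. `SketchIO` and `SidecarSketchIO` are hypothetical classes written for this example and are not the real HDMFIO; plain dicts stand in for builders.

```python
from abc import ABC, abstractmethod


class SketchIO(ABC):
    """Hypothetical stand-in for an I/O base class; not the real HDMFIO."""

    @abstractmethod
    def read_builder(self):
        """Backend-specific: build and return the file's builder."""

    def post_read_builder(self, builder):
        """Hook called after read_builder; the default leaves the builder as is."""
        return builder

    def read(self):
        # Every backend gets the hook applied in the same place.
        return self.post_read_builder(self.read_builder())


class SidecarSketchIO(SketchIO):
    """Toy backend that overlays sidecar values on what was read."""

    def __init__(self, stored, overrides):
        self._stored = stored
        self._overrides = overrides

    def read_builder(self):
        return dict(self._stored)

    def post_read_builder(self, builder):
        builder.update(self._overrides)
        return builder


io = SidecarSketchIO({"attr1": "old"}, {"attr1": "my experiment"})
result = io.read()
```

Backends that do not need post-processing would simply inherit the no-op default.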

oruebel · 2022-04-19T21:27:15Z

docs/source/sidecar.rst

+Users may want to update part of an HDMF file without rewriting the entire file.
+To do so, HDMF supports the use of a "sidecar" JSON file that lives adjacent to the HDMF file on disk and
+specifies modifications to the HDMF file. Only a limited set of modifications are supported; for example, users can
+delete a dataset or attribute but cannot create a new dataset or attribute.


Suggested change

delete a dataset or attribute but cannot create a new dataset or attribute.

hide a dataset or attribute so that it will not be read by HDMF but cannot create a new dataset or attribute.

I think delete is misleading since we are not actually deleting any data from a file but the JSON file can only indicate that the dataset/attribute should be ignored on read (maybe hide or invalid would be more precise).

Does delete also apply to groups?

Good point. I'll make the change. For now, I have not allowed hiding of groups because the use case is unclear. But it is technically not very different from hiding of datasets.

I think a main use-case for hiding groups would be instances of a data_type, e.g., to hide a TimeSeries that for some reason contains bad data. If it's trivial, then I think hiding groups is something we could allow, but if it adds a lot of complexity then I would hold off until a specific need arises.

oruebel · 2022-04-19T21:29:21Z

docs/source/sidecar.rst

+The sidecar JSON file can be validated using the ``sidecar.schema.json`` JSON schema file
+located at the root of the HDMF repository.


Are sidecar files automatically validated by the validator as well?

oruebel · 2022-04-19T21:47:41Z

src/hdmf/backends/builderupdater.py

+        """Update the file builder in-place with the values specified in the sidecar JSON."""
+        # the sidecar json must have the same name as the file but with suffix .json
+        f_builder, path = getargs('file_builder', 'file_path', kwargs)
+        sidecar_path = Path(path).with_suffix('.json')


If I understand this correctly, then this assumes that the sidecar file has the same name but different suffix than the main file. While this is a good default strategy, I'm wondering whether this will work in DANDI. I think there external files (e.g., videos) are placed in a folder with the same name as file, and I'm wondering whether they would place the sidecar file there instead?
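For reference, this is what the `with_suffix` call in the diff does to a path, next to a purely hypothetical second location inside a folder named after the file. The folder layout is an assumption made for illustration, not something confirmed about DANDI, and `candidate_sidecar_paths` is an illustrative name.

```python
from pathlib import PurePosixPath


def candidate_sidecar_paths(file_path):
    """Return places a reader might look for the sidecar, in priority order."""
    path = PurePosixPath(file_path)
    # Same rule as the diff: swap the final suffix for ".json".
    adjacent = path.with_suffix(".json")
    # Hypothetical alternative: a folder named after the file, next to it.
    in_folder = path.with_suffix("") / adjacent.name
    return [adjacent, in_folder]


candidates = candidate_sidecar_paths("data/sub-01.nwb")
```

`PurePosixPath` is used here only so the result is the same on every platform; the diff itself uses `Path`.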

oruebel · 2022-04-19T21:50:05Z

docs/source/sidecar.rst

+Modifying an HDMF File with a Sidecar JSON File
+===============================================
+
+Users may want to update part of an HDMF file without rewriting the entire file.


Suggested change

Users may want to update part of an HDMF file without rewriting the entire file.

Users may want to update part of an HDMF file without rewriting the entire file.

I think it would be useful to elaborate a little bit on this to clarify the intent and scope of the sidecar file, i.e., this is for small updates and corrections only.

yarikoptic · 2022-04-21T19:04:03Z

reordered for TL;DR:

Note: The default (or required) NWB/HDMF sidecar JSON file need not be named [nwb_file_base_name].json. It could be [nwb_file_base_name].overwrite.json to be more clear about its function.

I think this is a good idea/alternative!

I only worry about .overwrite being "too long" to be considered by some tools to be a part of the extension.

e.g. FWIW git-annex would not consider it to be an extension to be used in annexed key but it is not critical, just wanted to mention

$> ls -ld 123.overwrite.json
lrwxrwxrwx 1 yoh yoh 118 Apr 21 14:35 123.overwrite.json -> .git/annex/objects/fm/24/MD5E-s0--d41d8cd98f00b204e9800998ecf8427e.json/MD5E-s0--d41d8cd98f00b204e9800998ecf8427e.json

# you can see that it is just .json not .overwrite.json.  Let's try shorter:

$> touch 123.over.json 
$> git annex add 123.over.json
add 123.over.json 
ok
(recording state in git...)

$> ls -ld 123.over.json
lrwxrwxrwx 1 yoh yoh 128 Apr 21 14:35 123.over.json -> .git/annex/objects/6P/1w/MD5E-s0--d41d8cd98f00b204e9800998ecf8427e.over.json/MD5E-s0--d41d8cd98f00b204e9800998ecf8427e.over.json

But what about .nwb.json ???? kinda cute ;) If it would be possible for such overwrite files to be expressive enough to pretty much populate from scratch an .nwb file (i.e. start without any .nwb) would make even more sense to e.g. encapsulate all desired metadata (before running acquisition) without actual data file, and then eventually to produce the .nwb with data to accompany it.

having decided on that we would need to tune up BIDS specification to allow for .nwb files to be accompanied with .overwrite.nwb (or .nwb.json) files. Filed SCHEMA: satellite (.overwrite.nwb) file for the base file (.nwb) bids-standard/bids-specification#1087 to preclude problems/clarify.

individual short answers

@yarikoptic Sorry, I am a little confused. Are you saying that because BIDS allows sidecar JSON files and NWB/HDMF (will) allow sidecar JSON files, then it would not be clear which schema to use for which sidecar JSON file?

yes

And therefore we should have one sidecar JSON file that marries the two?

that was my ugly idea (I now love yours more)

I think the BIDS sidecar JSON file serves a different purpose than the NWB/HDMF one which is to update/overwrite data from the bulk data file, but maybe I am mistaken. If it does serve a different purpose, I think two separate files would be best.

agree

rly · 2022-04-21T19:21:39Z

I only worry about .overwrite being "too long" to be considered by some tools to be a part of the extension.

e.g. FWIW git-annex would not consider it to be an extension to be used in annexed key but it is not critical, just wanted to mention

I like [nwb_file_base_name].overwrite.json. I'm a bit confused about the issue with the extension being too long. Why does the extension need to be fully preserved when the file is being renamed by git-annex? Renaming the file would break the cross-file linkages anyway. (aside: it seems that git annex can be configured to allow overwrite.json to be considered an extension (annex.maxextensionlength has default value of 4)).
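On the Python side, the two-part extension behaves as sketched below with `pathlib`: only the last component counts as the suffix, which is similar in spirit to a tool that caps the extension length. This says nothing about git-annex internals; it only shows what HDMF code would see.

```python
from pathlib import PurePosixPath

# Build the proposed sidecar name from the data file name.
sidecar = PurePosixPath("123.nwb").with_suffix(".overwrite.json")

# Only the final component counts as the suffix; suffixes lists all of them.
last_suffix = sidecar.suffix
all_suffixes = sidecar.suffixes
```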

oruebel · 2022-04-21T19:31:05Z

@yarikoptic Sorry, I am a little confused. Are you saying that because BIDS allows sidecar JSON files and NWB/HDMF (will) allow sidecar JSON files, then it would not be clear which schema to use for which sidecar JSON file?

Shouldn't the reference to schema be typically part of the comment in the first line of the JSON file?

yarikoptic · 2022-04-21T23:15:54Z

Shouldn't the reference to schema be typically part of the comment in the first line of the JSON file?

please clarify/give example since AFAIK JSON doesn't even have comments supported (unlike its superset YAML).
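A minimal sketch of the point about comments, using only Python's standard `json` module: a JSON document that starts with a `//` comment line fails to parse, while a regular top-level `"$schema"` key parses fine. `"$schema"` is a JSON Schema convention and is only assumed here, not confirmed as what this PR uses; the schema URL and the empty `versions` list are made-up placeholders.

```python
import json

# A sidecar that tries to name its schema in a leading comment (not valid JSON).
WITH_COMMENT = '''// schema: sidecar.schema.json
{"versions": []}
'''

# A sidecar that names its schema in a regular key instead.
# "$schema" is a JSON Schema convention; the URL is a hypothetical placeholder.
WITH_SCHEMA_KEY = '''{
    "$schema": "https://example.org/sidecar.schema.json",
    "versions": []
}
'''


def try_parse(text):
    """Return the parsed document, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


comment_result = try_parse(WITH_COMMENT)   # the comment line makes parsing fail
key_result = try_parse(WITH_SCHEMA_KEY)    # parses to a regular dict
schema_ref = key_result["$schema"]         # schema reference read like any other key
```

With a key like this, a reader can decide which schema applies before interpreting the rest of the file, which would also help tell a BIDS sidecar from an NWB/HDMF one.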

yarikoptic · 2022-04-21T23:22:46Z

Why does the extension need to be fully preserved when the file is being renamed by git-annex?

it doesn't need to, but:

such a feature was requested in the early days of git-annex and before the inception of datalad, to be able to use tools which first dereference the symlink but then need to know the extension of the file ;)
it shouldn't matter for pynwb as long as symlinks are not dereferenced before the extension is "assessed"
it is a choice the user makes by choosing the backend to use for annex'ed keys. If the backend name ends with E (e.g. MD5E, which is the default for datalad datasets, vs. plain MD5, which would then come without an extension), it is the one which keeps the extension

I like [nwb_file_base_name].overwrite.json.

eh, I wish you said

I like [nwb_file].json.

;-)
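To make the two naming proposals concrete, here is a small sketch that derives both candidate sidecar names from an NWB file path and reports the length of each dot-separated suffix. The helper names `sidecar_candidates` and `suffix_lengths` are hypothetical, neither is part of HDMF, and the length check only illustrates the `annex.maxextensionlength` concern raised above; it does not reproduce git-annex's actual extension detection.

```python
from pathlib import PurePosixPath


def sidecar_candidates(nwb_path):
    """Return the two sidecar names proposed in this thread for an NWB file.

    Both schemes are proposals from the discussion, not a confirmed HDMF API.
    """
    path = PurePosixPath(nwb_path)
    return {
        # [nwb_file_base_name].overwrite.json: replaces the ".nwb" suffix
        "overwrite": path.with_name(path.stem + ".overwrite.json"),
        # [nwb_file].json: appends ".json" to the full file name
        "appended": path.with_name(path.name + ".json"),
    }


def suffix_lengths(path):
    """Length of each dot-separated suffix, without the leading dot."""
    return [len(suffix) - 1 for suffix in PurePosixPath(path).suffixes]


names = sidecar_candidates("data/sub-01.nwb")
for label, candidate in names.items():
    print(label, candidate, suffix_lengths(candidate))
```

The `.overwrite` part is 9 characters long, which is longer than the default of 4 mentioned for `annex.maxextensionlength`, whereas both parts of `.nwb.json` fit within it.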

oruebel · 2022-08-18T06:06:02Z

The latest HDF5 release, 1.13.2, adds the Onion virtual file driver (VFD). According to the release notes: “The onion VFD allows creating ‘versioned’ HDF5 files. File open/close operations after initial file creation will add changes to an external ‘onion’ file (.onion extension by default) instead of the original file. Each written revision can be opened independently.” (see here for the release notes and here for an in-depth description of the Onion VFD). While 1.13 is an experimental release, it may be interesting to try it and see how Onion compares with this JSON sidecar approach.

rly added 3 commits November 10, 2021 18:06

Add first at reading sidecar modifications

9e4ba60

Pretty-print json

1f53919

Update to work if json is not present

dafc650

oruebel reviewed Nov 11, 2021

src/hdmf/backends/hdf5/h5tools.py

Refactor BuilderUpdater functionality to sep class

de5fefe

Merge branch 'dev' into sidecar_mods

3f1f8f2

rly added 2 commits November 30, 2021 12:23

Handle changing sub-dataset attr, add sidecar fields

036fa1e

Use semantic versioning in version label

b4b5419

rly added 2 commits December 1, 2021 11:55

Add jsonschema for sidecar json

151c69d

Add validation to read

32d1397

Saksham20 mentioned this pull request Dec 3, 2021

Video files organize dandi/dandi-cli#841

Merged

Update to use new schema. More tests needed

933ef40

Update tests (more to do)

393e5b3

rly added 3 commits April 11, 2022 09:20

Merge branch 'sidecar_mods' of https://github.com/hdmf-dev/hdmf into …

393ffdf

…sidecar_mods

Update documentation, refactor, and add test cases

729e989

Update

ecd244d

rly marked this pull request as ready for review April 12, 2022 08:19

rly changed the title ~~[WIP] Update builders with modifications from sidecar JSON~~ Replace values of or delete dset/attr builders based on sidecar JSON Apr 12, 2022

rly changed the title ~~Replace values of or delete dset/attr builders based on sidecar JSON~~ Modify dset/attr builders based on sidecar JSON Apr 12, 2022

rly requested review from oruebel and bendichter April 12, 2022 08:21

rly added 4 commits April 12, 2022 01:25

Add link to sidecar json schema

168f4a9

Add examples to doc

1c57573

Update sidecar.rst

62ed248

Merge branch 'dev' into sidecar_mods

7078ca1

oruebel requested changes Apr 21, 2022

rly and others added 3 commits April 21, 2022 00:58

Update sidecar.rst

9faf7a2

Update docs/source/sidecar.rst

827d61d

Co-authored-by: Oliver Ruebel

Update sidecar.rst

ef22dc5

yarikoptic mentioned this pull request Apr 21, 2022

SCHEMA: satellite (.overwrite.nwb) file for the base file (.nwb) bids-standard/bids-specification#1087

Open

yarikoptic mentioned this pull request Jul 5, 2022

RF "files.py" validation/metadata-loading to support BIDS dandi/dandi-cli#1044

Closed

yarikoptic mentioned this pull request Aug 11, 2022

Enable asset upload dandi/dandi-archive#705

Open

rly added 2 commits August 30, 2022 17:01

Merge branch 'dev' into sidecar_mods

2bb7185

Merge branch 'dev' into sidecar_mods

fee5245

rly marked this pull request as draft January 13, 2024 08:10

rly mentioned this pull request Mar 28, 2024

Example for changing metadata (e.g. subject id) in NWB file NeurodataWithoutBorders/lindi#33

Closed

	delete a dataset or attribute but cannot create a new dataset or attribute.
	hide a dataset or attribute so that it will not be read by HDMF but cannot create a new dataset or attribute.

		The sidecar JSON file can be validated using the ``sidecar.schema.json`` JSON schema file
		located at the root of the HDMF repository.

	Users may want to update part of an HDMF file without rewriting the entire file.
