This issue was moved to a discussion.

Zarr Extension? #781

Closed
rabernat opened this issue Apr 24, 2020 · 10 comments

@rabernat

Following up on the discussion in #713, I thought it would be good to raise a standalone issue about how STAC may interact with the Zarr format. Zarr is a storage protocol for multidimensional arrays + metadata. (Think of it as an alternative to HDF.) It began with a Python implementation, but, using the Zarr V2 Spec, implementations now exist in Java, C++, Julia, and JavaScript.

Zarr can store data in anything that resembles a key-value store: both filesystem (keys are filenames) and cloud object store (keys are object names) work well. Consequently, zarr works well as a cloud-optimized format for storing geospatial data, and we are using it extensively in the Pangeo project (e.g. https://catalog.pangeo.io/).

Zarr stores its metadata in json files and stores the arrays in binary chunks (i.e. fragments), which are optionally compressed. So the layout of a zarr group may look something like this:
[image: diagram of a Zarr group layout]

where the dark boxes are json and the light boxes are binary data. In this way, Zarr resembles Parquet, but aimed at ND-arrays rather than tabular data.
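
As a rough illustration of that layout, here is a minimal sketch that builds a toy Zarr V2 group by hand in a plain Python dict, using only the standard library. The array name `temperature`, the sizes, and the attribute values are made up for the example; a real store would be written with a Zarr implementation rather than like this.

```python
import json
import struct
import zlib

# A plain dict stands in for the key-value store (filesystem or object store).
store = {}

# Group metadata and user attributes are small JSON documents.
store[".zgroup"] = json.dumps({"zarr_format": 2}).encode()
store[".zattrs"] = json.dumps({"title": "toy example"}).encode()

# Array metadata: a hypothetical 4x4 float64 array split into 2x2 chunks.
store["temperature/.zarray"] = json.dumps({
    "zarr_format": 2,
    "shape": [4, 4],
    "chunks": [2, 2],
    "dtype": "<f8",
    "compressor": {"id": "zlib", "level": 1},
    "fill_value": 0.0,
    "order": "C",
    "filters": None,
}).encode()

# Each chunk lives under a key made of its grid indices joined by ".".
for i in range(2):
    for j in range(2):
        values = [float(10 * i + j)] * 4       # one 2x2 chunk, row-major
        raw = struct.pack("<4d", *values)      # little-endian float64
        store[f"temperature/{i}.{j}"] = zlib.compress(raw, 1)

# List the keys and say which are JSON metadata and which are binary chunks.
for key in sorted(store):
    name = key.rsplit("/", 1)[-1]
    kind = "json" if name.startswith(".z") else "binary chunk"
    print(f"{key:24s} {kind}")
```

None of the chunk objects can be interpreted without the `.zarray` document next to them, which is the point made in the next paragraph.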

If Zarr were simply a standalone file format, analogous to COG, it would be reasonable to catalog Zarr using STAC Items. However, Zarr arrays consist of many different files / objects, none of which is meaningful on its own. In that sense, Zarr is more akin to a STAC Catalog or Collection. Zarr arrays also tend to be aggregated at a higher level than individual images / scenes. This notebook, for example, shows how Xarray represents a particular Zarr store:

[image: Xarray representation of the Zarr store]

In this example, all of the timesteps from the original NetCDF files have been concatenated in a single Zarr array (which consists of thousands of individual chunks). We have a data cube, rather than a data square. (Note that the number of dimensions in Zarr is arbitrary, and we have examples with up to 6 dimensions, e.g. scenario, ensemble, time, height, lat, lon).
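
To give a feel for where "thousands of chunks" comes from, here is a small sketch of the arithmetic for a regular chunk grid: the number of chunks is the product, over dimensions, of the array length divided by the chunk length, rounded up. The shape and chunk sizes below are hypothetical and not taken from any real dataset.

```python
import math

def chunk_grid(shape, chunks):
    """Number of chunks along each dimension of a regular chunk grid."""
    return [math.ceil(s / c) for s, c in zip(shape, chunks)]

# Hypothetical sizes for (scenario, ensemble, time, height, lat, lon).
shape = (3, 10, 1200, 20, 180, 360)
chunks = (1, 1, 12, 20, 180, 360)

grid = chunk_grid(shape, chunks)
n_chunks = math.prod(grid)
print(grid, n_chunks)  # [3, 10, 100, 1, 1, 1] 3000
```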

We are putting more and more data into the cloud in Zarr format (e.g. CMIP6), and we would like to be able to catalog this data using STAC. However, the lack of clarity about how to refer to Zarr data in STAC is a barrier.

I have a few main questions for this community:

  1. Looking closely at the STAC Item spec, the main challenge appears to be the requirement that the item have a single datetime property. Since we are generally storing all the timesteps from a given product, it doesn't make sense to specify a single datetime. Is there any chance of an option to specify a datetime range at the Item level?

  2. Alternatively, we could avoid the Item problem by referring to Zarr via a Collection or Catalog. Would it make sense to refer to a Zarr array or group via a Collection? The Standalone Collection concept appears to allow for a collection with no links to Items.

  3. Another issue is that the Zarr spec already defines a convention for storing metadata (conveniently, in .json files). It would be redundant to duplicate all this metadata in STAC Item or Collection. But if STAC were made aware of Zarr via an extension, processors could easily crawl this metadata directly. Does that idea sound feasible?

If we can resolve these issues, I believe we can move forward with generating STAC catalogs for all of our cloud-based Zarr data.

I also note that this discussion is somewhat orthogonal to the ESM collection spec discussion in NCAR/esm-collection-spec#21. One should be able to catalog Zarr with STAC, regardless of whether or not one is using the ESM data model.

@m-mohr
Collaborator

m-mohr commented Apr 24, 2020

Short of time at the moment, so just a heads up to some issues:

  1. Looking closely at the STAC Item spec, the main challenge appears to be the requirement that the item have a single datetime property. Since we are generally storing all the timesteps from a given product, it doesn't make sense to specify a single datetime. Is there any chance of an option to specify a datetime range at the Item level?

I'll support you on this for sure. I already tried to get this into STAC, but was rejected: #613. Hopefully it's getting on the agenda again.

2. Alternatively, we could avoid the Item problem by referring to Zarr via a Collection or Catalog. Would it make sense to refer to a Zarr array or group via a Collection? The Standalone Collection concept appears to allow for a collection with no links to Items.

I feel like Zarr groups would mostly be collections or catalogs and Zarr arrays would be items, but that's more a feeling than anything profound.

3. Another issue is that the Zarr spec already defines a convention for storing metadata (conveniently, in .json files). It would be redundant to duplicate all this metadata in STAC Item or Collection. But if STAC were made aware of Zarr via an extension, processors could easily crawl this metadata directly. Does that idea sound feasible?

This sounds a bit like a use-case for #757.

@cholmes
Contributor

cholmes commented Apr 24, 2020

Looking closely at the STAC Item spec, the main challenge appears to be the requirement that the item have a single datetime property. Since we are generally storing all the timesteps from a given product, it doesn't make sense to specify a single datetime. Is there any chance of an option to specify a datetime range at the Item level?

I'll support you on this for sure. I already tried to get this into STAC, but was rejected: #613. Hopefully it's getting on the agenda again.

So I'm still not sure about having a default item have a single value and a time range on the same footing, where implementors can just pick either and clients then have to know about both and be able to handle either. I think we should do one by default - we originally did have a range, but it felt weird there too to shoehorn everything into a range.

What I am very much for is extensions being able to say 'for this extension a range is all that makes sense', and zarr and datacubes are perfect examples of that. I could see something like a special value of 'datetime', like it is 'null' or says 'range' or something, which indicates that a single datetime is meaningless, but that the date time range values are present.

Or maybe we just somehow allow extensions to override how the datetime works, and can say for their data the date time range is required, and the datetime field is optional. I know that probably gets messy with the json schema, and perhaps it's a bit looser in actual validation, but the spec defines it more strongly.

In short - the original idea was that extensions should refine their meaning of the time, and I think it makes sense to flesh things out so datacube and zarr extensions can make the range the required thing, and the single datetime less important.

But I do think it's important the core spec doesn't start with 'you can use a single time or a range' - that we have a default recommendation: Pick a single datetime. If you can't then use or define an extension that uses a range.
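
The rule floated above (a null `datetime` signalling that only a range is meaningful) could be sketched as a small check like the one below. The property names `start_datetime` and `end_datetime` are not named in this thread; they are an assumption here, borrowed from the names later STAC releases use for the range fields. The timestamps are invented.

```python
from datetime import datetime

def parse(ts):
    """Parse an RFC 3339 timestamp that ends in 'Z' (UTC, whole seconds)."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")

def check_item_time(properties):
    """Return 'instant' or 'range' for an Item's properties, or raise ValueError."""
    if properties.get("datetime") is not None:
        parse(properties["datetime"])
        return "instant"
    # datetime is null or absent: both ends of the range must be present.
    start = properties.get("start_datetime")
    end = properties.get("end_datetime")
    if start is None or end is None:
        raise ValueError("null datetime requires start_datetime and end_datetime")
    if parse(start) > parse(end):
        raise ValueError("start_datetime is after end_datetime")
    return "range"

# A single scene keeps the default single datetime.
scene = {"datetime": "2020-04-24T00:00:00Z"}

# A data cube covering many timesteps only has a meaningful range.
cube = {
    "datetime": None,
    "start_datetime": "1850-01-01T00:00:00Z",
    "end_datetime": "2014-12-31T00:00:00Z",
}

print(check_item_time(scene))  # instant
print(check_item_time(cube))   # range
```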

@cholmes
Contributor

cholmes commented Apr 24, 2020

As for general responses on zarr:

I think the key thing for us to figure out in STAC is whether a datacube (zarr or otherwise, thinking netcdf fits too) that has tangible assets should be defined as a Collection or an Item.

  • Collection: we'd need to add 'assets' at the collection level.
  • Item: we'd need a stronger / clearer way to define it with a range.

These both feel like critical things to figure out for 1.0-beta1 (not saying we necessarily add each, but we need to decide and show how to handle this and other similar use cases).

I have a similar feeling that zarr arrays likely make sense as items, but like Matthias I'm not certain, as I have not worked with zarr enough. But it feels like that's the level it could be useful to 'search' on. Like you find a specific model / model run, just like you find a specific 'image' from a larger catalog. I don't know that there'd need to be 'items' defined for every array (though there could be if it's easy) - you could just pick key ones to expose. I liked the idea discussed on the call that you might have a number of 'collections' for CMIP, like all X type of simulations. And then I think items below that make sense, as I'm hesitant to have big hierarchies of collections (though I suppose it is possible).

  3. Another issue is that the Zarr spec already defines a convention for storing metadata (conveniently, in .json files). It would be redundant to duplicate all this metadata in STAC Item or Collection. But if STAC were made aware of Zarr via an extension, processors could easily crawl this metadata directly. Does that idea sound feasible?

This sounds a bit like a use-case for #757.

I think it's a wee bit different, but definitely similar. For that one I'd propose a new JSON structure for those adding more metadata. I think this one is simpler - the zarr extension would just require that there's an asset with a role 'zarr-metadata' (or whatever makes sense to call it) and it provides the link to that. That's in line with assets in general - you can refer to metadata, and I think it'd make good sense for zarr to define and (possibly) require a role / file type that has that.
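
A minimal sketch of what such an asset might look like, written as a Python dict. Everything specific here is an assumption for illustration: the store URL, the Collection id and extent, the role name `zarr-metadata` (only floated above, not defined anywhere), and the choice to point at `.zmetadata`, which only exists if the store has consolidated metadata. It also assumes a STAC version that allows assets on a Collection.

```python
import json

# Hypothetical URL of a Zarr group in object storage.
STORE_URL = "https://example.com/data/sst.zarr"

collection = {
    "type": "Collection",
    "stac_version": "1.0.0",  # assumed; Collection-level assets are needed
    "id": "sst-zarr",
    "description": "Toy Collection pointing at a single Zarr group.",
    "license": "CC-BY-4.0",
    "extent": {
        "spatial": {"bbox": [[-180.0, -90.0, 180.0, 90.0]]},
        "temporal": {"interval": [["2002-06-01T00:00:00Z", None]]},
    },
    "links": [],
    "assets": {
        "metadata": {
            # Consolidated Zarr metadata, if the store provides it.
            "href": f"{STORE_URL}/.zmetadata",
            "type": "application/json",
            "roles": ["zarr-metadata"],  # hypothetical role name
        }
    },
}

def assets_with_role(doc, role):
    """Return the assets of a STAC document that carry the given role."""
    return {
        key: asset
        for key, asset in doc.get("assets", {}).items()
        if role in asset.get("roles", [])
    }

found = assets_with_role(collection, "zarr-metadata")
print(json.dumps(found, indent=2))
```

A client that understands the role could follow the `href` and read the Zarr metadata directly instead of expecting it to be duplicated in STAC.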

@m-mohr
Collaborator

m-mohr commented May 6, 2020

I think the key thing for us to figure out in STAC is whether a datacube (zarr or otherwise, thinking netcdf fits too) that has tangible assets should be defined as a Collection or an Item.

  • Collection: we'd need to add 'assets' at the collection level.
  • Item: we'd need a stronger / clearer way to define it with a range.

We have PRs for both now: Collection Assets #800, Item Date Ranges #798

@rabernat
Author

rabernat commented Sep 2, 2020

I wanted to check in on this issue. I apologize for losing track of it somewhat. (That happened with a lot of projects this spring / summer).

Since both #800 and #798 have been merged, it should be possible for us to represent our zarr stores in STAC. That's what @charlesbluca has been working on in https://github.com/charlesbluca/pangeo-datastore-stac/.

I'd love to push this forward as part of the STAC data sprint next week. Please help us figure out where contributions are needed.

@m-mohr
Collaborator

m-mohr commented Sep 2, 2020

I think we need to finalize the esm-collection-spec PRs, especially the written docs. I think there were one or two open questions to be answered by you guys. What do you need from us to finish them? Let me know if there's anything I can help with. @rabernat
From an external perspective it felt like there was not much going on after the last call, but I probably just didn't find what has happened recently?!

@rabernat
Author

rabernat commented Sep 3, 2020

Thanks @m-mohr. Yes we definitely stalled on our end, for various reasons. Honestly I think the biggest thing we are missing is a high-level view of how we as a community / project are going to be interacting with STAC going forward. There are lots of moving pieces that have to move together.

I think we need to finalize the esm-collection-spec PRs, especially the written docs.

I think there may be two related but distinct issues here.

Over in https://github.com/charlesbluca/pangeo-datastore-stac, we are working on a STAC catalog that exposes individual Zarr groups in a hierarchy. Our problem seems to be that we don't know whether it is "right." Since we are not (yet) using STAC on the client side, we have no way to check if the catalog is usable. In pangeo-data/pangeo-datastore-stac#1 we are discussing validators. If you could provide some advice on how to test or validate our catalog, even informally / non-rigorously, ideally from Python, that would help us bring some closure to that step of the project.
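
For the informal / non-rigorous end of that spectrum, a standard-library sketch along these lines might be a starting point. It is not a substitute for JSON Schema validation: it only walks `child` links and checks that a few top-level fields are present. The field lists are an assumption based on the Catalog and Collection specs, the hrefs are treated as plain dictionary keys rather than resolved URLs, and the sample documents are invented.

```python
# Top-level fields assumed to be required for each document kind.
REQUIRED = {
    "catalog": {"stac_version", "id", "description", "links"},
    "collection": {"stac_version", "id", "description", "links", "license", "extent"},
}

def classify(doc):
    """Guess whether a document is meant to be a Collection or a Catalog."""
    return "collection" if "extent" in doc or "license" in doc else "catalog"

def crawl(docs, root_href):
    """Walk 'child' links from the root; return {href: [problems]}.

    docs maps an href to its already-parsed JSON document.
    """
    problems = {}
    seen = set()
    stack = [root_href]
    while stack:
        href = stack.pop()
        if href in seen:
            continue
        seen.add(href)
        doc = docs.get(href)
        if doc is None:
            problems[href] = ["unresolved link"]
            continue
        missing = sorted(REQUIRED[classify(doc)] - set(doc))
        if missing:
            problems[href] = [f"missing field: {name}" for name in missing]
        for link in doc.get("links", []):
            if link.get("rel") == "child":
                stack.append(link["href"])
    return problems

# Invented two-level catalog; the Collection lacks a license on purpose.
docs = {
    "catalog.json": {
        "stac_version": "1.0.0",
        "id": "root",
        "description": "Toy root catalog",
        "links": [{"rel": "child", "href": "ocean/collection.json"}],
    },
    "ocean/collection.json": {
        "stac_version": "1.0.0",
        "id": "ocean",
        "description": "Toy collection",
        "extent": {},
        "links": [{"rel": "child", "href": "ocean/missing.json"}],
    },
}

for href, issues in sorted(crawl(docs, "catalog.json").items()):
    print(href, issues)
```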

The next step would be to try to plug these catalogs into one of the pretty STAC front ends so we can browse the catalog interactively. We would then work towards rewriting / replacing our existing Pangeo catalog website, an intake / flask app living at https://github.com/pangeo-data/pangeo-datastore-flask, with something JavaScript-based.

@cholmes
Contributor

cholmes commented Sep 3, 2020

Let's try to have a session on stac + zarr at the sprint next week to push things forward. I was talking to @abarciauskas-bgse and she's interested in STAC plus zarr, and will be working to add STAC to https://registry.opendata.aws/mur/ soon. That might actually be an easier dataset for us to all start on, since it looks a bit more focused - thinking through the huge zarrs hurts my brain. She's also going to give an 'intro session' on zarr, which I'm hoping to attend and which will hopefully get me a bit more up to speed to be able to contribute to the conversations.

@rabernat
Author

rabernat commented Sep 3, 2020

I tried submitting our catalog to the STAC index and it kind of worked!

https://stacindex.org/catalogs/pangeo

I consider this a big win!

EDIT: Apparently we don't actually have the Zarr datasets in that catalog yet, so there is nothing to see: https://stacindex.org/catalogs/pangeo#/5cxrtDQ39vyM7sLrEWnCPyCvZ/29SzVx8vurgxGhvU3r8P6ihBY26pLRMo8G6J8nh/3X3qqSmiHVwdZKGoFG4X18X3eZCMvVrENzTFPHEmajbPHKB

@m-mohr
Collaborator

m-mohr commented Sep 3, 2020

Nice! It doesn't support showing collection assets yet (radiantearth/stac-browser#38) so it doesn't show anything to download, but other than that, it's cool!
Let's discuss further steps at the data sprint. ESM collection spec shouldn't be too much work....
Validation is also working again in Python and Node, so that should be mostly solved by then. See also https://github.com/m-mohr/stac-node-validator/blob/master/COMPARISON.md

@m-mohr m-mohr added this to the new extensions milestone May 4, 2021
@radiantearth radiantearth locked and limited conversation to collaborators Apr 4, 2023
@PowerChell PowerChell converted this issue into discussion #1222 Apr 4, 2023
