Connection with and contribution to the MaRDA metadata extractors registry #207

Open
ml-evs opened this issue Dec 19, 2023 · 5 comments

@ml-evs

ml-evs commented Dec 19, 2023

Hi RosettaSciIO devs, just wanted to make a connection with a MaRDA working group we've been running this year that focuses on interoperability of metadata extraction in materials science/chemistry. We have developed our proof-of-concept registry, schema and API for describing extractor code, their target file types and their automatic execution, with the aim to enhance discoverability of existing initiatives in our field, and promote best practices for scientific ETL.

I came across RosettaSciIO a little while ago and was really happy that our designs look similar and compatible; would there be any interest in contributing the various RosettaSciIO file formats and extractor definitions to our registry? I think this could be readily scripted from your existing yaml definitions.

I won't go into too much more detail (there's info at all the links above if you are interested) -- I'd be very happy to speak to any of the developers about this in the new year, otherwise I'll try to do the work myself (with the benefit of enhanced discoverability of RosettaSciIO and the various formats you support). If you'd like to hear more, we will be wrapping up the WG with a presentation in January (see marda-alliance/metadata_extractors#21 for details).

cc @PeterKraus and @CSSFrancis (who I think I spoke to over email about this a while ago)

@jlaehne
Contributor

jlaehne commented Dec 21, 2023

In Germany, there is also a pretty huge activity in that direction from the FAIRmat consortium, where I am also trying to promote connections.

Indeed, the motivation behind splitting RosettaSciIO out of HyperSpy was to create exactly such synergies, without the need for other projects to rely on the much larger parent project.

Personally, I think that the current developers, being focused on the interface to HyperSpy, where there remains enough work to do, don't have much capacity at the moment to implement interfaces to specific other projects, but would welcome any such links/contributions - and could of course help in case something needs to be adapted on the rsciio side. In particular, it could be relevant that we have started a discussion in #89 on how to generalize the metadata handling to simplify maintenance and improve interoperability (see also the related topic hyperspy/hyperspy#2095).

@ericpre
Member

ericpre commented Dec 21, 2023

Thank you @ml-evs for getting in touch.

As @jlaehne mentioned, there are limited resources in the hyperspy community, because the development is driven by each contributor's needs for their research, and as a community we make sure that the work is done in a way that works well, is future-proof and is useful for the wider community. I am mentioning this explicitly because, unlike most other projects, we don't have resources (hyperspy doesn't have and never had dedicated funding) that we can allocate to achieve a specific aim/task; instead, we are people working together of our own free will because we think that this is useful/needed!

To expand a bit on what @jlaehne already said, I will try to summarise the situation on metadata handling in the hyperspy community, in the hope that it helps understanding:

  • we are not metadata handling experts, but we need to use some metadata. We follow a pragmatic approach of reading all possible metadata, dumping it in the original_metadata dictionary and parsing the items that we need into the metadata dictionary, following the structure defined in the hyperspy metadata specification
  • for obvious reasons, our current approach has limitations (even if it has been very useful for many years) and I think that we have reached a point where there is a consensus that improving it has become a priority.
  • we haven't put much effort into it so far and we are aware that there are various interesting initiatives (see, for example, the discussion in Improve metadata handling #89) but these things are complicated and I think that we don't have a good understanding of their relevance for our needs and how we could use them. Moreover, we have had very limited discussions on this topic.
  • currently, there is no plan or strategy on how to improve the current situation.
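
A minimal sketch of the pragmatic approach described in the first bullet, for readers unfamiliar with it. This is not the RosettaSciIO implementation: the vendor keys (`SampleName`, `AcqDate`, `LaserPower`), the `KEY_MAP` table and the `split_metadata` helper are all hypothetical, and the `General` section is used only as an illustration of a nested metadata structure.

```python
from typing import Any

# Hypothetical mapping from vendor-specific keys to (section, field) pairs
# in a nested "metadata" dictionary. Real readers define their own mappings.
KEY_MAP = {
    "SampleName": ("General", "title"),
    "AcqDate": ("General", "date"),
}


def split_metadata(raw: dict[str, Any]) -> dict[str, dict]:
    """Keep every raw entry, and copy only the mapped ones to a tidy structure."""
    metadata: dict[str, dict] = {}
    for raw_key, (section, field) in KEY_MAP.items():
        if raw_key in raw:
            metadata.setdefault(section, {})[field] = raw[raw_key]
    # Nothing is discarded: the untouched vendor metadata stays available.
    return {"original_metadata": dict(raw), "metadata": metadata}


result = split_metadata({"SampleName": "Si wafer", "LaserPower": 5.0})
print(result["metadata"])
print(result["original_metadata"])
```

The limitation mentioned in the second bullet is visible here: anything without an entry in the mapping (such as `LaserPower`) is only reachable through the vendor-specific key, so each reader needs its own hand-maintained mapping.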

@ml-evs, I have been through some of the documentation available and I couldn't figure out what happens to the metadata once it has been extracted. How are end users expected to use it?
For me to check that I understand correctly, can you please confirm that the following statements are correct:

  • The aim of your working group is to provide an infrastructure that can extract data/metadata from as many files as possible without parsing it to a standardised structure
  • Adding RosettaSciIO to the MaRDA metadata extractors registry will help discovery of existing parsers in rosettasciio but will not help the hyperspy community to improve the handling of metadata.

@PeterKraus

Thanks for such a fruitful discussion!

We (MaRDA Extractors WG) are of course aware of the FAIRmat folks - and they're aware of us, as you can see from Markus Scheidgen's contributions to the discussions in the repo. My mentor is also one of the task leaders in FAIRmat Area 3, and both of us (@ml-evs and I) are involved in planning a workshop in Berlin (madices.github.io) on related issues. But it's also understandable that there's a healthy degree of skepticism, as it's easy to over-promise.

As for the metadata discussion, it's a difficult topic, and we've spent the best part of a year on it with a similar conclusion: kicking most of the tough bits down the road. My own parsers (as part of yadg) do not have a metadata spec. My strategy was basically to learn from what others do and then slowly implement it in my code; I'm trying to follow the NetCDF convention, but ultimately, if only I and my co-workers use my code, nobody cares about parsing the metadata that we don't need.

To answer the last two questions:

> The aim of your working group is to provide an infrastructure that can extract data/metadata from as many files as possible without parsing it to a standardised structure
  • primary goal is to have a searchable registry of extractors and filetypes
  • secondary goal is to have the extractors tested for installation and usage with known example files of these filetypes
  • tertiary goal would be to begin standardising the extractor output - i.e. (meta)-data schema

Currently, we have a proof of concept for the first two, and at least a mechanism for requesting metadata or data from the extractor for the third one, but we won't be able to get further without momentum, examples, and consensus, which is why we're having this discussion.

> Adding RosettaSciIO to the MaRDA metadata extractors registry will help discovery of existing parsers in rosettasciio but will not help the hyperspy community to improve the handling of metadata.

Well, we cannot guarantee the former, and cannot promise help on the latter, but getting your code in a maintained list of "you can extract these files using these codes" cannot hurt discovery.

@CSSFrancis
Member

@ml-evs thanks for starting this discussion. I must admit that I had plans to follow this through a bit more but as @ericpre pointed out we are kind of limited in our development time. I'm currently trying to graduate which has made me focus my efforts a bit lately.

> Currently, we have a proof of concept for the first two, and at least a mechanism for requesting metadata or data from the extractor for the third one, but we won't be able to get further without momentum, examples, and consensus, which is why we're having this discussion.

I've looked through some of the .yaml files you have created for MaRDA (for example, this is the Renishaw yaml) and it makes me think that it would be a good idea to add to our .yaml file, as that might help us to organize our file readers by subject, data type etc. and would help with interoperability.

For example, things that we should probably add:

  • subject
  • description
  • supported_filetypes

Maybe this is a good place to start. In my opinion there isn't a good reason not to have more information in the .yaml file, and I'd rather have it defined in one place than split between different repositories. It's not a huge ask to get people to add that information as well.
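
As a rough illustration of how the three suggested fields could be tracked across readers, here is a small sketch that reports which of them a given specification still lacks. The example `spec` dictionary, its `name` and `file_extensions` keys, and the `missing_fields` helper are assumptions made for illustration; they do not reproduce the actual RosettaSciIO or MaRDA schema.

```python
# Fields proposed in this comment; everything else here is hypothetical.
SUGGESTED_FIELDS = ("subject", "description", "supported_filetypes")


def missing_fields(spec: dict) -> list[str]:
    """Return the suggested fields that are absent from a reader specification."""
    return [field for field in SUGGESTED_FIELDS if field not in spec]


# A made-up specification, as it might look after loading a .yaml file
# into a dictionary.
spec = {
    "name": "Renishaw",
    "file_extensions": ["wdf"],
    "description": "Reader for Renishaw .wdf files",
}

print(missing_fields(spec))
```

A check like this could run over every reader's specification to list what still needs filling in before the definitions are exported elsewhere.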

@ml-evs
Author

ml-evs commented Dec 21, 2023

All sounds good to me! I'm happy to prepare an example of what one given rosetta extractor would look like in our format (though this will now have to wait until the new year).

> I've looked through some of the .yaml files you have created for MaRDA (for example, this is the Renishaw yaml) and it makes me think that it would be a good idea to add to our .yaml file, as that might help us to organize our file readers by subject, data type etc. and would help with interoperability.

Just wanted to chime in with the rendered page at https://marda-registry.fly.dev/filetypes/renishaw-wdf and the API version of this yaml file at https://marda-registry.fly.dev/api/v0.3.0/filetypes/renishaw-wdf so you get an idea of the registry connection.
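
For anyone wanting to script against these two endpoints, a minimal sketch follows. The URL pattern is taken from the two links above; the helper names `registry_urls` and `fetch_filetype` are hypothetical, the structure of the JSON response is not described in this thread, and the `v0.3.0` path may change, so treat this as an assumption-laden starting point only.

```python
import json
from urllib.request import urlopen

BASE_URL = "https://marda-registry.fly.dev"
API_VERSION = "v0.3.0"  # taken from the link above; may change over time


def registry_urls(filetype_id: str) -> dict[str, str]:
    """Build the rendered-page and API URLs for a registry filetype ID."""
    return {
        "page": f"{BASE_URL}/filetypes/{filetype_id}",
        "api": f"{BASE_URL}/api/{API_VERSION}/filetypes/{filetype_id}",
    }


def fetch_filetype(filetype_id: str):
    """Fetch the API entry and decode it as JSON (response shape not assumed)."""
    with urlopen(registry_urls(filetype_id)["api"], timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


urls = registry_urls("renishaw-wdf")
print(urls["page"])
print(urls["api"])
```

Calling `fetch_filetype("renishaw-wdf")` needs network access and a reachable registry, so it is left out of the example run above.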

Hope to follow up on this connection soon, have a good winter break everyone!
