
Develop search capability for data in multiple THREDDS catalogs #648

Open
rsignell-usgs opened this issue Sep 27, 2016 · 23 comments

Comments

@rsignell-usgs
Contributor

Here at the September Unidata Users Committee meeting, Unidata Director Mohan listed "Data Discoverability" as a major potential theme for the 2016 Strategic Plan. I agree this would be a great thing to work on, and Unidata is in a great position to do it: there are already many THREDDS servers out in the community serving data, with ncISO services available to create ISO metadata, and there are many catalog services that can ingest ISO metadata and provide standardized CSW or OpenSearch catalog interfaces.
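
For concreteness, the ncISO service on a TDS exposes an ISO 19115 XML record for each dataset over plain HTTP, so harvesting is just a GET request. A minimal sketch (the dataset path and query parameters below are placeholders, and the exact URL form depends on how the TDS is configured):

# Minimal sketch: fetch one ISO 19115 record from a TDS ncISO endpoint.
# "some/dataset/path" is a placeholder, not a real dataset on thredds.ucar.edu,
# and the query parameters may differ depending on the TDS configuration.
import requests

iso_url = ("https://thredds.ucar.edu/thredds/iso/some/dataset/path"
           "?dataset=some/dataset/path")
resp = requests.get(iso_url, timeout=60)
resp.raise_for_status()

# Save the record so a catalog service (pycsw, GeoNetwork, etc.) can ingest it.
with open("record.iso.xml", "wb") as f:
    f.write(resp.content)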

@rsignell-usgs
Contributor Author

rsignell-usgs commented Sep 27, 2016

Here's one approach that we use in IOOS.

@lesserwhirls
Collaborator

I agree this is a great thing to work on. The approach here at Unidata has been that "search" (Data Discoverability) encompasses so much, and there are so many experts in that area (which we are not), that rather than build yet another one-off solution, we decided to work with those experts to ensure that the TDS can provide the information they need to do their magic. From what I understand, ISO metadata has been the most useful. Now, it seems there are standard services that can ingest the ISO metadata and provide pretty nice search and discoverability capability.

It seems to me that we are at (or past) the point where we at Unidata should be reaching out to the community, as well as doing some in-house evaluations, to see if there is a solution that we could recommend for use with the TDS.

One obvious solution would be pyCSW, which I know you've worked with. Do you think that would be a good place to start? Note that here I consider any brokering solutions, such as GI-CAT, to be a separate topic.
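
For reference, once ISO records are loaded into a pyCSW instance, discovery becomes a standard CSW query; a minimal sketch using OWSLib (the endpoint URL below is just a placeholder, not an existing service):

# Minimal sketch of a CSW GetRecords query with OWSLib against a pycsw endpoint.
# The URL is a placeholder; point it at whatever CSW instance holds the records.
from owslib.csw import CatalogueServiceWeb
from owslib import fes

csw = CatalogueServiceWeb("https://example.org/pycsw/csw.py", timeout=60)
query = fes.PropertyIsLike("csw:AnyText", "%THREDDS%")
csw.getrecords2(constraints=[query], maxrecords=10, esn="full")

for rec_id, rec in csw.records.items():
    print(rec_id, "-", rec.title)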

@lesserwhirls
Collaborator

Ok, I think we should start by evaluating the IOOS workflow. Opinions?

@rsignell-usgs
Contributor Author

rsignell-usgs commented Sep 27, 2016

Here's an actual example that uses the harvester, a script that harvests datasets from
https://thredds.ucar.edu/thredds/catalog.xml

https://gist.github.com/kwilcox/60b8a3e771987f96adf0c6b1e77ede24
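
Independent of that gist, the basic crawling pattern with thredds_crawler is short; roughly (a sketch from memory, so the dataset/service attribute names should be checked against the library):

# Rough sketch: crawl a THREDDS catalog and collect each matched dataset's
# ISO service URL. Attribute and key names here reflect my reading of
# thredds_crawler and may need adjusting.
from thredds_crawler.crawl import Crawl

c = Crawl("https://thredds.ucar.edu/thredds/idd/forecastModels.xml",
          select=[".*Best.*"])

iso_urls = []
for ds in c.datasets:
    for svc in ds.services:
        if svc.get("service", "").lower() == "iso":
            iso_urls.append(svc["url"])

print(len(iso_urls), "ISO endpoints found")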

@dopplershift
Member

Elsewhere I've been having a discussion about thredds_crawler + siphon, but first we need to do something about thredds_crawler's license: GPL 😱

@lesserwhirls
Collaborator

Ouch...yeah, that's a problem. 😞

@rsignell-usgs
Contributor Author

@kwilcox, would it be a big deal to change to another license?
@dopplershift , what do you prefer, MIT?

@dopplershift
Member

@kwilcox already said in email "That really isn't the correct license for thredds_crawler. NOAA/IOOS should figure that out with RPS before we move forward with using it for anything. IMO it should be public domain."

My preference is anything permissive--I usually go MIT or BSD 3-clause.

@dopplershift
Member

To be clear, my problem with GPL is that anything "derived" from it, which even includes me looking at the code for ideas, would then have to be GPL as well.

@rsignell-usgs
Contributor Author

@dpsnowden, @shane-axiom, @lukecampbell, any reason we couldn't do MIT license here, or CC0 (which we've been recommended to use for government-developed software...)?

@lukecampbell

IANAL

I can't comment on the thredds-crawler thing; that's above my pay grade. But public domain for software that was developed and distributed by a non-government entity is dangerous because it opens up avenues for liability, which is why the majority of permissive licenses contain limited-liability clauses (and some include attribution requirements).

I would prefer to see MIT as well. I've brought it up, and discussions are taking place outside of my realm of responsibility.

@lukecampbell

lukecampbell commented Sep 27, 2016

And, you're right @dopplershift about GPL, it's like an open source infection: anything that touches it must be GPL (with a few exceptions I'll omit for brevity). If the license is changed, any derivative or linked software, like https://github.com/axiom-data-science/thredds_iso_harvester, can become more permissive.

@srstsavage

I changed the thredds-iso-harvester license to the Unlicense, which is public domain and does include a liability section.

@lukecampbell

I'd rather not debate copyright law, and again IANAL, but technically, because thredds-iso-harvester uses thredds_crawler, it is currently in violation of thredds_crawler's license, which is GPLv3.

@lukecampbell

That's why they can't use thredds_crawler in siphon, because it's currently licensed under GPLv3.

@srstsavage

srstsavage commented Sep 27, 2016

Yes, good point. I reverted thredds-iso-harvester to GPL 3 for now. Cue Kafka.

Can you ping this issue if/when thredds_crawler gets a license update?

@lukecampbell

I'm hopeful that the license will be changed soon.

@lukecampbell

@shane-axiom We moved the thredds_crawler project from asascience-open to ioos and changed the license to MIT.

@dopplershift
Member

That's great. Thanks guys! 🎉

@srstsavage

@lukecampbell Thanks Luke, I updated thredds-iso-harvester's license to MIT as well.

@rsignell-usgs
Contributor Author

rsignell-usgs commented Mar 15, 2017

@lesserwhirls , @dopplershift , I'm guessing this has slipped off the radar screen, but here's an example of how easy it is to harvest the ISO records from Unidata datasets.

This example harvests the ISO records from "Best" time series forecast models using Axiom's docker container for the thredds_iso_harvester:

$ do_harvest unidata.py

where do_harvest is:

#!/bin/bash
docker run --rm -v $(pwd)/$1:/srv/harvest.py -v $(pwd)/iso:/srv/iso \
  axiom/thredds_iso_harvester

and unidata.py is:

from thredds_iso_harvester.harvest import ThreddsIsoHarvester
from thredds_crawler.crawl import Crawl

skip = Crawl.SKIPS
select = ['.*\/Best']

ThreddsIsoHarvester(catalog_url="https://thredds.ucar.edu/thredds/idd/forecastModels.xml",
    skip=skip, select=select,
    out_dir="/srv/iso/unidata")

Running this script should take just 1 or 2 minutes, and will create 50+ ISO records in a ./iso/unidata subdirectory.

The beauty of this technique is that you don't need a custom Python environment, or even any Python at all! You just need Docker.

@lesserwhirls
Collaborator

> @lesserwhirls , @dopplershift , I'm guessing this has slipped off the radar screen, but here's an example of how easy it is to harvest the ISO records from Unidata datasets.

In part, yes; in another part, several of our machines run SunOS, and running a Python stack on that can be quite... ummm... what's the word I'm looking for, @dopplershift? And Docker? Fuhgeddaboudit. If you were to do a demo at the spring user comm showing what kind of search capabilities this enables, that would be awesome!

@rsignell-usgs
Contributor Author

rsignell-usgs commented Mar 18, 2017

> If you were to do a demo at the spring user comm showing what kind of search capabilities this enables, that would be awesome!

@lesserwhirls, I'd love to give a demo of harvesting multiple THREDDS catalogs, then querying the resulting catalog from a Jupyter notebook and from TerriaJS.

The only problem is that I already asked to give a presentation on ERDDAP for obs data. Would it be too much to do both?

Here's an example of exploring some of the Unidata THREDDS forecast models, with the datasets dynamically populated via a CSW query to the IOOS catalog:

Jupyter Example: https://gist.github.com/anonymous/0a3a8ec292a4a480a0c01b89ef3a297e

TerriaJS Example: http://gamone.whoi.edu/terriajs/#clean&proxy/_60s/https://raw.githubusercontent.com/USGS-CMG/terriajs-dive/master/examples/csw_unidata.json
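
For anyone who doesn't want to open the notebook, the heart of the CSW query is only a few lines of OWSLib; roughly (the endpoint URL, search text, and bounding box below are illustrative stand-ins, not copied from the gist):

# Rough sketch: find "Best" forecast-model records in a CSW catalog by
# combining a free-text filter with a bounding box. The endpoint URL is an
# illustrative placeholder for whatever CSW holds the harvested ISO records.
from owslib.csw import CatalogueServiceWeb
from owslib import fes

csw = CatalogueServiceWeb("https://example.org/csw", timeout=60)

text_filter = fes.PropertyIsLike("csw:AnyText", "%Best%")
bbox_filter = fes.BBox([-130.0, 20.0, -60.0, 55.0])  # lon/lat box over CONUS
csw.getrecords2(constraints=[fes.And([text_filter, bbox_filter])], maxrecords=20)

for rec in csw.records.values():
    # Each record's references typically carry the service endpoints
    # (OPeNDAP, WMS, ...) that a notebook or a TerriaJS config can point at.
    print(rec.title)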

(Screenshots of the Jupyter and TerriaJS examples were attached here: 2017-03-18_12-50-56, 2017-03-18_12-50-17, 2017-03-18_12-49-09.)
