404's on retrieving data archives via bio-cache #272

Open
jhpoelen opened this issue Sep 8, 2018 · 21 comments
@jhpoelen

jhpoelen commented Sep 8, 2018

Hi!

I found https://collections.ala.org.au/ws/dataResource/dr3561 via https://collections.ala.org.au/ws/dataResource but got a 404 when retrieving the content associated with the public archive URL http://biocache.ala.org.au/archives/dr3561/dr3561_ror_dwca.zip . Is this expected?

Related to bio-guoda/preston#1 .

$ wget https://biocache.ala.org.au/archives/dr3561/dr3561_ror_dwca.zip
--2018-09-08 16:56:42--  https://biocache.ala.org.au/archives/dr3561/dr3561_ror_dwca.zip
Resolving biocache.ala.org.au (biocache.ala.org.au)... 54.79.49.195, 52.65.238.196
Connecting to biocache.ala.org.au (biocache.ala.org.au)|54.79.49.195|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2018-09-08 16:56:43 ERROR 404: Not Found.
@ansell
Contributor

ansell commented Sep 9, 2018

The underlying reason is that we don't generate archives for data resources that have been registered but do not yet have any records loaded.

It would be ideal if the collections.ala.org.au service didn't generate the URLs in this case to avoid confusion. I have a feeling that the archive URL is generated by code using a pattern without checking if the archive exists or there are records in biocache.ala.org.au for the data resource.
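The archive URLs quoted in this thread all share one shape, so pattern-based generation of the kind suspected above might look roughly like this. It is a sketch inferred only from the observed URLs; `archive_url` is a hypothetical helper, not code taken from collections.ala.org.au.

```python
def archive_url(uid: str, base: str = "https://biocache.ala.org.au/archives") -> str:
    """Build an archive URL from a data resource uid.

    Nothing here checks that the archive exists, which is the
    behaviour suspected in the comment above.
    """
    return f"{base}/{uid}/{uid}_ror_dwca.zip"


# prints https://biocache.ala.org.au/archives/dr3561/dr3561_ror_dwca.zip
print(archive_url("dr3561"))
```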

@jhpoelen
Author

@ansell thanks for the clarification. I was wondering whether one could use the "status" fields (e.g., "status" : "identified", see below) as a way to filter out the resources that have been registered, but not yet loaded. If so, which status would indicate a loaded resource?

From https://collections.ala.org.au/ws/dataResource/dr3561 -

{
  "name": "Aboriginal Cultural Heritage (NLP)",
  "acronym": null,
  "uid": "dr3561",
  "guid": null,
  "address": null,
  "phone": null,
  "email": null,
  "pubShortDescription": null,
  "pubDescription": "The Aboriginal Cultural Heritage project increases Aboriginal engagement and participation in sustainable NRM as a part of the Sustainable Communities program.",
  "techDescription": null,
  "focus": null,
  "state": null,
  "websiteUrl": null,
  "alaPublicUrl": "https://collections.ala.org.au/public/show/dr3561",
  "networkMembership": null,
  "hubMembership": [],
  "taxonomyCoverageHints": [],
  "attributions": [],
  "dateCreated": "2016-02-14T11:35:57Z",
  "lastUpdated": "2016-02-14T11:35:57Z",
  "userLastModified": "Data services",
  "provider": {
    "name": "Aboriginal Cultural Heritage (NLP)",
    "uri": "https://collections.ala.org.au/ws/dataProvider/dp2243",
    "uid": "dp2243"
  },
  "rights": null,
  "licenseType": "other",
  "licenseVersion": null,
  "citation": null,
  "resourceType": "records",
  "dataGeneralizations": null,
  "informationWithheld": null,
  "permissionsDocument": null,
  "permissionsDocumentType": "Other",
  "contentTypes": [],
  "hasMappedCollections": false,
  "status": "identified",
  "provenance": null,
  "harvestFrequency": 0,
  "lastChecked": null,
  "dataCurrency": null,
  "harvestingNotes": null,
  "publicArchiveAvailable": false,
  "publicArchiveUrl": "http:https://biocache.ala.org.au/archives/dr3561/dr3561_ror_dwca.zip",
  "gbifArchiveUrl": "https://biocache.ala.org.au/archives/dr3561/dr3561_ror_dwca.zip",
  "downloadLimit": 0,
  "gbifDataset": false,
  "isShareableWithGBIF": true,
  "verified": false,
  "gbifRegistryKey": null,
  "doi": null
}

@jhpoelen
Author

jhpoelen commented Sep 10, 2018

Or perhaps "publicArchiveAvailable": false ?

@ansell
Contributor

ansell commented Sep 10, 2018

The best I can recommend at this point is checking the value of the status field, to check if it is set to dataAvailable, rather than identified or something else. I am not very familiar with the collections.ala.org.au API, but looking at some datasets that do have archives, it appears that the status field is the best avenue.

publicArchiveAvailable sounds like it should be the way to go, but it appears to be inconsistent, with it being false on some datasets I have looked at so far which have archives available.
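A minimal sketch of the status check recommended above, assuming the data resource JSON has already been parsed into a Python dict. The function name `looks_loaded` is hypothetical, and the sample record is trimmed to field values quoted earlier in this thread.

```python
def looks_loaded(resource: dict) -> bool:
    """Return True when the status field suggests records have been loaded."""
    return resource.get("status") == "dataAvailable"


# Trimmed-down copy of the dr3561 record quoted above.
dr3561 = {"uid": "dr3561", "status": "identified", "publicArchiveAvailable": False}

print(looks_loaded(dr3561))  # prints False
```

The check deliberately ignores `publicArchiveAvailable`, since that field was reported above to be inconsistent.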

@jhpoelen
Author

Thanks for sharing your insights.

I poked around and found an example (dr6504) with "status" : "dataAvailable", but it seems to have a URL that returns a 404 (e.g., http://biocache.ala.org.au/archives/dr6504/dr6504_ror_dwca.zip ).

In this particular example, the "publicArchiveAvailable" : false .

One idea would be to run Preston on all data resources and measure which archive URLs are active and which are not. Let me know if you'd find this useful.
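Independently of Preston, the measurement idea can be sketched with the Python standard library. This is not how Preston works; `http_status` and `partition_urls` are hypothetical helpers, and the sketch assumes the server answers HEAD requests the same way it answers GET requests.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable, List


def http_status(url: str, timeout: float = 30.0) -> int:
    """Return the HTTP status code of a HEAD request, or 0 on a connection failure."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses (such as the 404s above) arrive as exceptions.
        return err.code
    except (urllib.error.URLError, OSError):
        return 0


def partition_urls(
    urls: Iterable[str],
    status_of: Callable[[str], int] = http_status,
) -> Dict[str, List[str]]:
    """Split URLs into those answering 200 ("active") and everything else ("rotten")."""
    result: Dict[str, List[str]] = {"active": [], "rotten": []}
    for url in urls:
        key = "active" if status_of(url) == 200 else "rotten"
        result[key].append(url)
    return result
```

Calling `partition_urls` with the default `status_of` issues real network requests, one per URL. Passing a different callable makes it possible to try the bookkeeping offline.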

{
  "name": "Advisory List of Threatened Invertebrate Fauna in Victoria 2009",
  "acronym": null,
  "uid": "dr6504",
  "guid": null,
  "address": null,
  "phone": null,
  "email": null,
  "pubShortDescription": null,
  "pubDescription": "Victorian advisory list for Invertebrate Fauna",
  "techDescription": "This list was first uploaded by Paul Skeen on the Wed Oct 12 01:06:06 UTC 2016.It contains [totalRecords:636, successfulItems:0] taxa.",
  "focus": null,
  "state": null,
  "websiteUrl": "http:https://lists.ala.org.au/speciesListItem/list/dr6504?max=10",
  "alaPublicUrl": "https://collections.ala.org.au/public/show/dr6504",
  "networkMembership": null,
  "hubMembership": [],
  "taxonomyCoverageHints": [],
  "attributions": [],
  "dateCreated": "2016-10-12T01:06:27Z",
  "lastUpdated": "2016-10-12T01:06:28Z",
  "userLastModified": "Species list upload",
  "rights": null,
  "licenseType": "other",
  "licenseVersion": null,
  "citation": null,
  "resourceType": "species-list",
  "dataGeneralizations": null,
  "informationWithheld": null,
  "permissionsDocument": null,
  "permissionsDocumentType": "Other",
  "contentTypes": [
    "species list"
  ],
  "hasMappedCollections": false,
  "status": "dataAvailable",
  "provenance": null,
  "harvestFrequency": 0,
  "lastChecked": null,
  "dataCurrency": null,
  "harvestingNotes": null,
  "publicArchiveAvailable": false,
  "publicArchiveUrl": "http:https://biocache.ala.org.au/archives/dr6504/dr6504_ror_dwca.zip",
  "gbifArchiveUrl": "https://biocache.ala.org.au/archives/dr6504/dr6504_ror_dwca.zip",
  "downloadLimit": 0,
  "gbifDataset": false,
  "isShareableWithGBIF": true,
  "verified": false,
  "gbifRegistryKey": null,
  "doi": null
}

@ansell
Contributor

ansell commented Sep 11, 2018

Also filtering for only "resourceType": "records" could be useful. In this case dr6504 represents a species-list, which are located in lists.ala.org.au and are not occurrence records so we don't currently dump them to Darwin Core Archives. Species lists should be representable using Darwin Core Archives, with a different rowType to occurrence records, so they could be exported in future.
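Combining this suggestion with the earlier status check gives a two-field filter. The sketch below assumes parsed JSON dicts; `may_have_archive` is a hypothetical name, and the sample records are trimmed to values quoted in this thread.

```python
def may_have_archive(resource: dict) -> bool:
    """Keep only occurrence-record resources whose status is dataAvailable."""
    return (
        resource.get("status") == "dataAvailable"
        and resource.get("resourceType") == "records"
    )


examples = [
    {"uid": "dr3561", "status": "identified", "resourceType": "records"},
    {"uid": "dr6504", "status": "dataAvailable", "resourceType": "species-list"},
]

candidates = [r["uid"] for r in examples if may_have_archive(r)]
print(candidates)  # prints []
```

Both examples are filtered out: dr3561 by its status and dr6504 by its resource type.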

@jhpoelen
Author

Using your suggestion, I found a dataResource with "status": "dataAvailable" and "resourceType": "records" with a 404 at https://biocache.ala.org.au/archives/dr122/dr122_ror_dwca.zip. Can you reproduce? If so, please let me know if there are additional criteria that can be used.
Thanks for your patience.

{
  "name": "AIMS - LTM Nearshore Corals (OBIS Australia)",
  "acronym": null,
  "uid": "dr122",
  "guid": null,
  "address": null,
  "phone": null,
  "email": null,
  "pubShortDescription": null,
  "pubDescription": "Surveys of coral species richness were carried out at nearshore reefs of the Great Barrier Reef, Australia in conjunction with surveys of size structure and percentage cover of hard and soft coral communities. Species lists (Presence / Absence) were compiled at 2m and 5m below datum at two sites on 33 reefs between Mackay and Cooktown (latitude 16-23 degrees South) in 2004. The aim of the study was to document the status of nearshore coral communities in this region to serve both as a baseline against which future change could be compared and also identify communities potentially at risk from anthropogenic activities. Hard corals were identified to species level (although on occasion identification was limited to genus) and soft corals were identified to genus.",
  "techDescription": null,
  "focus": null,
  "state": null,
  "websiteUrl": null,
  "alaPublicUrl": "https://collections.ala.org.au/public/show/dr122",
  "networkMembership": null,
  "hubMembership": [
    {
      "uid": "dh3",
      "name": "Ocean Biogeographic Information System",
      "uri": "https://collections.ala.org.au/ws/dataHub/dh3"
    }
  ],
  "taxonomyCoverageHints": [],
  "attributions": [],
  "dateCreated": "2010-09-14T00:05:26Z",
  "lastUpdated": "2011-07-05T04:21:50Z",
  "userLastModified": "[email protected]",
  "provider": {
    "name": "Institute of Marine and Coastal Sciences, Rutgers University",
    "uri": "https://collections.ala.org.au/ws/dataProvider/dp18",
    "uid": "dp18"
  },
  "rights": "Acknowledge the use of records from this dataset in the form appearing in the 'Citation' field and acknowledge the use of the OBIS facility. Recognise the limitatons of data in OBIS.",
  "licenseType": null,
  "licenseVersion": null,
  "citation": "AIMS - Status of Nearshore Reefs of the GBR: H Sweatman, A Thompson, S Delean, J Davidson, S Neale",
  "resourceType": "records",
  "dataGeneralizations": null,
  "informationWithheld": null,
  "permissionsDocument": null,
  "permissionsDocumentType": null,
  "contentTypes": [],
  "connectionParameters": {
    "protocol": "DIGIR",
    "resource": "aims_ltm_ns",
    "url": "http:https://iobis.marine.rutgers.edu/digir2/DiGIR.php",
    "termsForUniqueKey": [
      "institutionCode",
      "collectionCode",
      "catalogNumber"
    ]
  },
  "hasMappedCollections": false,
  "status": "dataAvailable",
  "provenance": "Published dataset",
  "harvestFrequency": 0,
  "lastChecked": null,
  "dataCurrency": null,
  "harvestingNotes": null,
  "publicArchiveAvailable": false,
  "publicArchiveUrl": "http:https://biocache.ala.org.au/archives/dr122/dr122_ror_dwca.zip",
  "gbifArchiveUrl": "https://biocache.ala.org.au/archives/dr122/dr122_ror_dwca.zip",
  "downloadLimit": 0,
  "gbifDataset": false,
  "isShareableWithGBIF": true,
  "verified": false,
  "gbifRegistryKey": null,
  "doi": null
}

@ansell
Contributor

ansell commented Sep 11, 2018

That data resource had records in the past (the collections.ala.org.au page shows downloads of records in the past, but none recently https://collections.ala.org.au/public/show/dr122 ), but has had all of its records deleted since then. Because it had records in the past, it likely received the dataAvailable flag at some point and has not had it revoked when the records were deleted.

I have switched its dataAvailable flag back to identified.

There are possibly others in the same category that you can identify using the "Usage statistics" on the public collections HTML page. Otherwise, once those issues are cleared, dataAvailable should be a fairly reliable flag.

More generally, we have not exported data for a few months while we have transitioned to a new version of biocache-store/biocache-service/cassandra/solr, but it is on our todo list to refresh the archive dumps and get them back to being automatically refreshed monthly. You may also find some new datasets that have been created and loaded into the new system for the first time which will not have exported archives yet, but will after we restart the archive creation process.

@ansell
Contributor

ansell commented Sep 11, 2018

Just to clarify the underlying cause for dr122 a little further. If datasets have no records, we don't create archives for them. The previous archives have in those cases remained available in the past. However, I went through recently and cleared out old archives that were not being refreshed because of errors or not having any records. When doing that, I didn't change their "status" in collections.ala.org.au at that point, because I was unaware of its existence at the time. In the future if I detect these errors I will add the status change to the todo list for fixing them.

@jhpoelen
Author

@ansell thanks for clarifying the background. Hoping to get started on integrating the ALA so that the 404s can be easily picked up by Preston and others who might be interested. Does it make sense to leave this issue open until the missing archives are removed / re-generated?

@ansell
Contributor

ansell commented Sep 18, 2018

Yes, I have a separate task open for regenerating the files and will do a verification after that process.

@ansell ansell added the bug label Oct 28, 2018
@ansell ansell self-assigned this Oct 28, 2018
@jhpoelen
Author

jhpoelen commented Apr 4, 2019

hey @ansell - just checking in on this issue. Did you get a chance to scrub the ALA dataset archives? Would this be a good time to start indexing them?

@jhpoelen
Author

@ansell was just looking at indexing the ALA bio-cache - is this a good time to start indexing the ala datasets?

@ansell
Contributor

ansell commented Sep 20, 2019

Hi,

I tried to regenerate the archives, but my plan was foiled by some software issues that will need a software engineer to look at them.

Have you run your code recently to know where the remaining issues are?

Thanks,

Peter

@jhpoelen
Author

Just completed a Preston run using newly added ALA support (bio-guoda/preston#1) and found that, out of 1778 active archive URLs (https://collections.ala.org.au/ws/dataResource?status=dataAvailable), 895 were unavailable (or rotten) and 1883 were active. Total data volume ~5 GB. Does this reflect the size of the ALA corpus?

Please see attached lists for more info.

active-urls.txt
rotten-urls.txt

@ansell
Contributor

ansell commented Sep 23, 2019

The majority of those that I have reviewed are unpublished datasets from our internal data collection systems, DigiVol ( https://volunteer.ala.org.au/ ), and Biocollect ( https://biocollect.ala.org.au/ ). We are storing the metadata for those in collections.ala.org.au, and although they currently say dataAvailable, their records are not currently in biocache.ala.org.au so they will not appear in the dumps (not sure what a better way for that would be).

At some point in the mid-term (1-2 years) we will be switching collections.ala.org.au to use the GBIF registry software so I doubt that changes to collections.ala.org.au to change this behaviour will be made before then.

I will test the export with the latest biocache-store snapshot today to see how it goes, which may add some datasets that are published, but weren't exported yet, but it won't pick up the DigiVol and Biocollect unpublished cases.

@ansell
Contributor

ansell commented Sep 23, 2019

Some of them are also lists from https://lists.ala.org.au/ , which may be "private" in some cases, but in most cases you can get those using the Lists API. https://api.ala.org.au/#ws92

@ansell
Contributor

ansell commented Sep 23, 2019

You may have more success looking for dumps with this query, which doesn't include species lists:

https://collections.ala.org.au/ws/dataResource?status=dataAvailable&resourceType=records
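For scripted use, the same query can be assembled from its two parameters. This is a small sketch; `data_resource_query` is a hypothetical helper, and only the endpoint and the `status` and `resourceType` parameters come from the URL above.

```python
from urllib.parse import urlencode

BASE = "https://collections.ala.org.au/ws/dataResource"


def data_resource_query(status: str = "dataAvailable", resource_type: str = "records") -> str:
    """Build the listing URL filtered by status and resource type."""
    params = {"status": status, "resourceType": resource_type}
    return f"{BASE}?{urlencode(params)}"


# prints https://collections.ala.org.au/ws/dataResource?status=dataAvailable&resourceType=records
print(data_resource_query())
```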

@ansell
Contributor

ansell commented Sep 23, 2019

I have copied all of our latest archive exports and there are some new data resources since last time, so you should see a drop in the number missing out of the 754 data resources that you see with the query above.

@ansell
Contributor

ansell commented Sep 23, 2019

Regarding the sizes, we are sending a limited subset of fields to GBIF, so the size doesn't reflect the total data resource sizes:

https://github.com/AtlasOfLivingAustralia/biocache-store/blob/master/src/main/scala/au/org/ala/biocache/export/DwCACreator.scala#L32-L74

@jhpoelen
Author

Thanks for all the info: very insightful.

re: resource sizes. This makes me wonder: Is there another way to access ALA records that better reflects the ALA corpus (incl. checklists excl. images)?

PS. Nice to see that you are using scala!

@ansell ansell removed their assignment Jan 1, 2021