ALA is downgrading and obfuscating species records #97
This issue isn't a trivial one - parsing of the original 'Scientific Name' is failing badly. The worst result from ALA's point of view as a record provider is that good species records are being relegated to order, family, etc. and hidden from search. In the beetles dataset I downloaded 2 December 2015, I've now done a preliminary audit of the higher categories order > subgenus looking for lower-ranking taxa inappropriately 'downgraded' in the 'Matched Scientific Name' field. 4243 names have been miscategorised, accounting for 31029 records. Some of the reasons for the parsing failure are obvious, some aren't. A few are hilarious, like NSW's Myall Woodland threatened ecological community being parsed as the eucnemid genus Myall, e.g. https://biocache.ala.org.au/occurrences/6b709005-9565-4d5b-9318-0660df50a70c Data cleaning at provider end isn't going to solve this problem, although it would help. ALA's string parser just isn't doing its job, even on clean strings.
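An audit of this kind could be sketched roughly as below. This is an illustration only, not the script used for the audit described above: the 'Taxon Rank' column name, the list of higher ranks and the crude binomial test are all assumptions, and the sample rows are made up to mirror examples from this thread.

```python
import re
from collections import Counter

# Ranks above species that a supplied binomial should not end up matched to.
# This list is an assumption for illustration, not ALA's rank vocabulary.
HIGHER_RANKS = {
    "order", "suborder", "superfamily", "family", "subfamily",
    "tribe", "subtribe", "genus", "subgenus",
}

# Crude binomial test: capitalised genus followed by a lowercase epithet.
BINOMIAL = re.compile(r"^[A-Z][a-z]+ [a-z]{2,}\b")


def find_downgrades(rows):
    """Count records whose supplied binomial was matched at a higher rank.

    Each row is a dict with 'Scientific Name', 'Matched Scientific Name'
    and a hypothetical 'Taxon Rank' key.
    """
    tally = Counter()
    for row in rows:
        supplied = row["Scientific Name"].strip()
        matched = row["Matched Scientific Name"].strip()
        rank = row["Taxon Rank"].strip().lower()
        if BINOMIAL.match(supplied) and rank in HIGHER_RANKS:
            tally[(supplied, matched, rank)] += 1
    return tally


# Made-up sample rows modelled on the examples in this thread.
sample = [
    {"Scientific Name": "Dorysthenes walkeri",
     "Matched Scientific Name": "Cerambycidae", "Taxon Rank": "family"},
    {"Scientific Name": "Dorysthenes walkeri",
     "Matched Scientific Name": "Cerambycidae", "Taxon Rank": "family"},
    {"Scientific Name": "Amblystogenium pacificum",
     "Matched Scientific Name": "Coleoptera", "Taxon Rank": "order"},
    {"Scientific Name": "Paederus",
     "Matched Scientific Name": "Paederus", "Taxon Rank": "genus"},
]

for (supplied, matched, rank), n in find_downgrades(sample).most_common():
    print(n, supplied, matched, rank, sep="\t")
```

With a real download, the rows could come from `csv.DictReader` instead of the inline list. The binomial test is deliberately strict, so it will miss subspecies, hybrids and names with authorship in unusual positions.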
Thanks again Bob - this is an ongoing issue for us (it's never "done") but we've neglected it a bit in the past year. We're currently working on updating our name indexing and parsing software, so these examples will be useful in improving the software's name matching accuracy. I just wanted to point out that mis-matched or unmatched names are not (completely) hidden from the search - we provide an option in the advanced search page to search for the "raw/verbatim/unprocessed name". But I agree it would be far better to have the matching work the way a human would expect it to.
"We're currently working on updating our name indexing and parsing software...improving the software's name matching accuracy" How will you assess any improvement, and more generally, how do you check to see if your name-matching software is doing what it's supposed to do? Mine was an external audit. How does your internal audit work?
We've been collecting reports of incorrectly identified names, which we will test with (and fix), on top of the names we've fixed in the past (to make sure we don't break them). We also use the GBIF names parsing library, which contains its own set of tests. We are currently in discussions with a panel of botanical experts who are also providing us with a list of problems in botanical name matching in the ALA. If you could provide an export list (CSV) of names you've identified - with correct matching or higher taxonomy, then this would greatly improve our ability to make such improvements.
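The regression approach described here (re-run previously reported names after each change to make sure nothing breaks) might look something like the sketch below. It is not ALA's test suite: `run_regression` and `naive_matcher` are hypothetical names, and the stand-in matcher exists only to show a failing case.

```python
def run_regression(match_name, cases):
    """Return the cases where the matcher's output differs from the expected name.

    match_name: any callable taking a supplied name string and returning
                the matched name (hypothetical interface).
    cases:      iterable of (supplied_name, expected_matched_name) pairs.
    """
    failures = []
    for supplied, expected in cases:
        actual = match_name(supplied)
        if actual != expected:
            failures.append((supplied, expected, actual))
    return failures


def naive_matcher(supplied):
    """Stand-in matcher that keeps only the first word (for illustration)."""
    return supplied.split()[0]


# Reported names with the match a human would expect.
cases = [
    ("Dorysthenes walkeri", "Dorysthenes walkeri"),
    ("Paederus", "Paederus"),
]

failures = run_regression(naive_matcher, cases)
for supplied, expected, actual in failures:
    print(f"{supplied!r}: expected {expected!r}, got {actual!r}")
```

Keeping the cases in a CSV that grows with every reported problem would let the same check run before each release of the matching software.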
"If you could provide an export list (CSV) of names you've identified - with correct matching or higher taxonomy, then this would greatly improve our ability to make such improvements." That's 4243 beetle names in the order > subgenus sets alone. You want the correct matching for each, and you're paying me how much to do that for you?
I was under the impression you already had something like this, so apologies for the confusion or misunderstanding about the nature of your data. I'm not privy to the details of your audit so I'm only going on what has been provided in these issues and in emails. I was not asking or expecting you to do hours of unpaid work for us. I was simply asking if you did have data that would assist us in improving our names matching, and you were in a position to be able to provide that to us, then we would greatly appreciate it :-)
I can give you a list of likely reasons for parsing failure, with one example each. I'll post it tomorrow.
As promised, at the end of this post is a list (with examples) of 7 suggested reasons for inappropriate taxonomic 'downgrading'.
Another strange feature of name matching is inconsistency. As with Amblystogenium pacificum (mentioned in the first post in this thread), the same supplied name string can be matched by ALA in more than one way. In the following pairs, the first item is number of records, the second is (supplied) scientific name, the third is matched scientific name and the last is matched taxon rank.
1 Paederus Paederus genus
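Inconsistent matching of this kind could be detected with a grouping step along the lines of the sketch below. The function name and the tuple layout (supplied name, matched name, matched rank) are assumptions for illustration; the sample tuples mirror the Amblystogenium pacificum and Dorysthenes walkeri cases from the first post.

```python
from collections import defaultdict


def find_inconsistent_matches(records):
    """Map each supplied name to its distinct (matched name, rank) outcomes,
    keeping only names that were matched in more than one way.

    records: iterable of (supplied_name, matched_name, matched_rank) tuples.
    """
    outcomes = defaultdict(set)
    for supplied, matched, rank in records:
        outcomes[supplied].add((matched, rank))
    return {
        name: sorted(matches)
        for name, matches in outcomes.items()
        if len(matches) > 1
    }


# Illustrative tuples based on cases described in this thread.
records = [
    ("Amblystogenium pacificum", "Coleoptera", "order"),
    ("Amblystogenium pacificum", "Carabidae", "family"),
    ("Dorysthenes walkeri", "Cerambycidae", "family"),
    ("Dorysthenes walkeri", "Cerambycidae", "family"),
]

for name, matches in find_inconsistent_matches(records).items():
    print(name, matches)
```

Comparing on the exact supplied string keeps the check simple. Normalising whitespace or case first would catch more pairs but could also merge strings that the provider intended to be different.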
In auditing ALA's Coleoptera (2 December 2015 download) I've so far found hundreds of cases in which apparently good species records have been 'downgraded' during processing to higher taxon records and hidden from searching. Below are three examples.
(1) Amblystogenium pacificum is a carabid beetle. The Australian Antarctic Data Centre provided two records to ALA from islands in the French Southern Territories. A search for Amblystogenium pacificum on the ALA website returns no results. Despite both AADC records being uploaded with Amblystogenium as the genus item and pacificum as the specific epithet, one record has been assigned to Coleoptera:
https://biocache.ala.org.au/occurrences/82e887ca-0c35-4ef0-a9f2-878a1f0012df
and the other to Carabidae:
https://biocache.ala.org.au/occurrences/e6c7f45b-c388-4a57-bbb8-c12cb68088da
The name 'Amblystogenium pacificum' is not in AFD but has a listing in CoL:
https://www.catalogueoflife.org/col/details/species/id/762976f3892d25893864cf272ff31260
(2) 88 specimen records of the ant Paratrechina minutula were provided to ALA by the Australian Museum. The Museum misspelled 'Paratrechina' as 'Paratrachina' in its upload. Although all 88 records give Hymenoptera for order and Formicidae for family, ALA has listed them all as subtribe Paratrachina in order Coleoptera, family Buprestidae, e.g.
https://biocache.ala.org.au/occurrences/cd45395e-0a02-4686-8d86-06dbfe36b21a
These 88 records don't appear among ALA's ant records.
(3) 10 specimen records of the beetle Dorysthenes walkeri were provided to ALA by the Australian Museum. A search for "Dorysthenes walkeri" on the ALA website returns no results. All 10 records have been 'downgraded' to Cerambycidae, e.g.
https://biocache.ala.org.au/occurrences/e54e8567-4146-441c-9b9c-90afb412f5fc
The name 'Dorysthenes walkeri' is not in AFD but has a listing in CoL:
https://www.catalogueoflife.org/col/details/species/id/b7f5669190f3e08bcf25a1d51217c922