ALA is downgrading and obfuscating species records #97

Open
Mesibov opened this issue Dec 7, 2015 · 8 comments

@Mesibov

Mesibov commented Dec 7, 2015

In auditing ALA's Coleoptera (2 December 2015 download) I've so far found hundreds of cases in which apparently good species records have been 'downgraded' during processing to higher taxon records and hidden from searching. Below are three examples.

(1) Amblystogenium pacificum is a carabid beetle. The Australian Antarctic Data Centre provided two records to ALA from islands in the French Southern Territories. A search for Amblystogenium pacificum on the ALA website returns no results. Despite both AADC records being uploaded with Amblystogenium as the genus item and pacificum as the specific epithet, one record has been assigned to Coleoptera:

https://biocache.ala.org.au/occurrences/82e887ca-0c35-4ef0-a9f2-878a1f0012df

and the other to Carabidae:

https://biocache.ala.org.au/occurrences/e6c7f45b-c388-4a57-bbb8-c12cb68088da

The name 'Amblystogenium pacificum' is not in AFD but has a listing in CoL:

https://www.catalogueoflife.org/col/details/species/id/762976f3892d25893864cf272ff31260

(2) 88 specimen records of the ant Paratrechina minutula were provided to ALA by the Australian Museum. The Museum misspelled 'Paratrechina' as 'Paratrachina' in its upload. Although all 88 records give Hymenoptera for order and Formicidae for family, ALA has listed them all as subtribe Paratrachina in order Coleoptera, family Buprestidae, e.g.

https://biocache.ala.org.au/occurrences/cd45395e-0a02-4686-8d86-06dbfe36b21a

These 88 records don't appear among ALA's ant records.

(3) 10 specimen records of the beetle Dorysthenes walkeri were provided to ALA by the Australian Museum. A search for "Dorysthenes walkeri" on the ALA website returns no results. All 10 records have been 'downgraded' to Cerambycidae, e.g.

https://biocache.ala.org.au/occurrences/e54e8567-4146-441c-9b9c-90afb412f5fc

The name 'Dorysthenes walkeri' is not in AFD but has a listing in CoL:

https://www.catalogueoflife.org/col/details/species/id/b7f5669190f3e08bcf25a1d51217c922

@Mesibov
Author

Mesibov commented Dec 9, 2015

This issue isn't a trivial one: parsing of the original 'Scientific Name' is failing badly. The worst result from ALA's point of view as a record provider is that good species records are being relegated to order, family, etc. and hidden from search.

In the beetles dataset I downloaded 2 December 2015, I've now done a preliminary audit of the higher categories order > subgenus looking for lower-ranking taxa inappropriately 'downgraded' in the 'Matched Scientific Name' field. 4243 names have been miscategorised, accounting for 31029 records.

Some of the reasons for the parsing failure are obvious, some aren't. A few are hilarious, like NSW's Myall Woodland threatened ecological community being parsed as the eucnemid genus Myall, e.g.

https://biocache.ala.org.au/occurrences/6b709005-9565-4d5b-9318-0660df50a70c

Data cleaning at provider end isn't going to solve this problem, although it would help. ALA's string parser just isn't doing its job, even on clean strings.
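
An audit of the kind described above could be sketched roughly as follows. This is a minimal illustration, not the procedure actually used: the field names (`supplied_name`, `matched_name`, `matched_rank`), the rank labels and the binomial pattern are all assumptions, and the check only flags species-like strings matched above species rank (it would not catch a 'Genus sp.' string matched to family).

```python
import re

# Ranks above species, as they might appear in a matched-rank field (assumed labels).
HIGHER_RANKS = {"order", "suborder", "superfamily", "family", "subfamily",
                "tribe", "subtribe", "genus", "subgenus"}

# Crude binomial test: capitalised genus, optional (Subgenus), lowercase epithet.
BINOMIAL = re.compile(r"^[A-Z][a-z]+(?: \([A-Z][a-z]+\))? ([a-z]{2,})\b")

# Tokens that follow a genus name but are not specific epithets.
NOT_EPITHETS = {"sp", "spp", "cf", "nr", "aff"}


def looks_like_species(name):
    """True if the supplied string starts like 'Genus epithet' (not 'Genus sp.')."""
    m = BINOMIAL.match(name.strip())
    return bool(m) and m.group(1) not in NOT_EPITHETS


def downgraded(records):
    """Return records whose supplied name looks like a species but whose match is a higher rank."""
    return [r for r in records
            if looks_like_species(r["supplied_name"])
            and r["matched_rank"].lower() in HIGHER_RANKS]
```

Run over a download with those three fields, `downgraded` would return candidate rows for manual checking; strings such as 'Myall Woodland' are deliberately not treated as species names by the pattern.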

@nickdos
Contributor

nickdos commented Dec 9, 2015

Thanks again Bob - this is an ongoing issue for us (it's never "done") but we've neglected it a bit in the past year. We're currently working on updating our name indexing and parsing software, so these examples will be useful in improving the software's name matching accuracy.

I just wanted to point out that mis-matched or unmatched names are not (completely) hidden from the search: we provide an option on the advanced search page to search for the "raw/verbatim/unprocessed name". But I agree it would be far better to have the matching work the way a human would expect it to.

@Mesibov
Author

Mesibov commented Dec 10, 2015

"We're currently working on updating our name indexing and parsing software...improving the software's name matching accuracy"

How will you assess any improvement, and more generally, how do you check to see if your name-matching software is doing what it's supposed to do? Mine was an external audit. How does your internal audit work?

@nickdos
Contributor

nickdos commented Dec 10, 2015

We've been collecting reports of incorrectly identified names, which we will test with (and fix) alongside the names we've already fixed in the past (to make sure we don't break them). We also use the GBIF names parsing library, which contains its own set of tests. We are currently in discussions with a panel of botanical experts who are also providing us with a list of problems in botanical name matching in the ALA.

If you could provide an export list (CSV) of names you've identified - with correct matching or higher taxonomy, then this would greatly improve our ability to make such improvements.
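
A regression check of the sort described here might look something like the sketch below. It is only an illustration under stated assumptions: the CSV column names (`supplied_name`, `expected_name`, `expected_rank`) are hypothetical, the expected values in the sample are illustrative, and `matcher` stands in for whatever name-matching function is under test (it is not an ALA or GBIF API).

```python
import csv
import io


def load_cases(csv_text):
    """Parse rows of supplied_name, expected_name, expected_rank (assumed column names)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def run_regression(matcher, cases):
    """Return (supplied, expected, actual) for every case the matcher gets wrong.

    `matcher` is any callable taking a supplied name and returning (matched_name, rank).
    """
    failures = []
    for case in cases:
        expected = (case["expected_name"], case["expected_rank"])
        actual = matcher(case["supplied_name"])
        if actual != expected:
            failures.append((case["supplied_name"], expected, actual))
    return failures


# Illustrative cases only, using names reported in this issue.
SAMPLE_CSV = """supplied_name,expected_name,expected_rank
Amblystogenium pacificum,Amblystogenium pacificum,species
Dorysthenes walkeri,Dorysthenes walkeri,species
"""
```

Keeping previously fixed names in the same file means each change to the matcher is checked against old and new cases together.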

@Mesibov
Author

Mesibov commented Dec 10, 2015

"If you could provide an export list (CSV) of names you've identified - with correct matching or higher taxonomy, then this would greatly improve our ability to make such improvements."

That's 4243 beetle names in the order > subgenus sets alone. You want the correct matching for each, and you're paying me how much to do that for you?

@nickdos
Contributor

nickdos commented Dec 10, 2015

I was under the impression you already had something like this, so apologies for the confusion or misunderstanding about the nature of your data. I'm not privy to the details of your audit, so I'm only going on what has been provided in these issues and in emails. I was not asking or expecting you to do hours of unpaid work for us. I was simply saying that if you did have data that would assist us in improving our names matching, and you were in a position to provide it to us, then we would greatly appreciate it :-)

@Mesibov
Author

Mesibov commented Dec 10, 2015

I can give you a list of likely reasons for parsing failure, with one example each. I'll post it tomorrow.

@Mesibov
Author

Mesibov commented Dec 10, 2015

As promised, at the end of this post is a list (with examples) of 7 suggested reasons for inappropriate taxonomic 'downgrading'.

Another strange feature of name matching is inconsistency. As with Amblystogenium pacificum (mentioned in the first post in this thread), the same supplied name string can be matched by ALA in more than one way. In the following pairs, the first item is number of records, the second is (supplied) scientific name, the third is matched scientific name and the last is matched taxon rank.

Records  Supplied name  Matched name    Matched rank
      1  Paederus       Paederus        genus
     37  Paederus       STAPHYLINIDAE   family
      1  Osorius        COLEOPTERA      order
     95  Osorius        STAPHYLINIDAE   family
    200  Gonipterus     CURCULIONIDAE   family
      3  Gonipterus     Gonipterus      genus
    445  Cisseis        BUPRESTIDAE     family
      4  Cisseis        COLEOPTERA      order
      3  Carpelimus     COLEOPTERA      order
     42  Carpelimus     STAPHYLINIDAE   family
     40  Calosoma       Calosoma        genus
      3  Calosoma       CARABIDAE       family
      4  Anomala sp.    Anomala         genus
      2  Anomala sp.    SCARABAEIDAE    family
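
Inconsistencies like these could be picked up automatically by grouping on the supplied name and counting distinct matches. The sketch below is one possible approach, not a description of how the audit was done; the tuple layout (records, supplied name, matched name, matched rank) simply mirrors the listing above, and only a few of its rows are reproduced.

```python
from collections import defaultdict

# A few rows from the listing above: (records, supplied name, matched name, matched rank).
ROWS = [
    (1, "Paederus", "Paederus", "genus"),
    (37, "Paederus", "STAPHYLINIDAE", "family"),
    (200, "Gonipterus", "CURCULIONIDAE", "family"),
    (3, "Gonipterus", "Gonipterus", "genus"),
]


def inconsistent_matches(rows):
    """Map each supplied name matched in more than one way to its {(match, rank): records}."""
    by_name = defaultdict(dict)
    for count, supplied, matched, rank in rows:
        key = (matched, rank)
        by_name[supplied][key] = by_name[supplied].get(key, 0) + count
    return {name: matches for name, matches in by_name.items() if len(matches) > 1}
```

Any supplied name that appears in the result has been matched to at least two different taxa or ranks.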

  1. Name-matching isn't looking to CoL for non-Australian species
    The teak borer Stromatium barbatum (Fabricius, 1775) [https://www.padil.gov.au/pests-and-diseases/pest/main/135566] is matched to family Cerambycidae.

  2. Name-matching isn't fuzzy, and fails strings with minor spelling errors
    'Metriorhynchus moereus Lea' is matched with family Lycidae, while 'Metriorrhynchus moerens Lea, 1909' is correctly matched with its synonym Porrostoma (Porrostoma) moerens.

  3. Name-matching is following taxonomic fashions too closely
    265 records for 'Scydmaenidae' are matched to order Coleoptera. In Australia, the stone beetle family Scydmaenidae was only recently proposed as Scydmaeninae, a subfamily of Staphylinidae. Scydmaenidae is still recognised as a family in the NCBI taxonomy (used by Encyclopedia of Life) and CoL.

  4. Name-matching isn't always following synonyms carefully
    'Austrolema (Zeugophora) williamsi Reid' is matched to subgenus Zeugophora (Pedrillia). Zeugophora (Pedrillia) williamsi Reid, 1989 is in AFD.

  5. Name-matching sometimes doesn't follow synonyms at all
    'Ptenidula producta Deane, 1932' is matched to order Coleoptera. Ptenidula producta Deane, 1932 is listed in AFD as a synonym of Actidium producta (Deane, 1932).

  6. Inconsistent treatment of genus 'sp.' names
    Most genus 'sp.' names are matched to the genus concerned, but 745 records of 'Heteronyx sp.' are assigned to family Scarabaeidae instead of genus Heteronyx.

  7. No apparent reason
    'Diomus bunya Pang & Slipinski' is matched to genus Diomus. Diomus bunya Pang & Slipinski, 2010 is in AFD.
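
As a rough illustration of reasons 2 and 5, the sketch below combines a synonym lookup with a fuzzy fallback from Python's standard `difflib` module. It is an assumption-laden toy, not ALA's matcher: the three-entry `INDEX` is a stand-in for a real name index, the 0.85 cutoff is arbitrary, and `canonical` just keeps the first two words, which would mishandle names that include a subgenus.

```python
import difflib

# Tiny stand-in index: known name -> accepted name (taken from the examples above).
INDEX = {
    "Metriorrhynchus moerens": "Porrostoma (Porrostoma) moerens",
    "Ptenidula producta": "Actidium producta",
    "Diomus bunya": "Diomus bunya",
}


def canonical(name):
    """Crude authority stripping: keep only the first two whitespace-separated words."""
    return " ".join(name.split()[:2])


def resolve(name, cutoff=0.85):
    """Return the accepted name for an exact, synonym or near-miss spelling, else None."""
    key = canonical(name)
    if key in INDEX:
        return INDEX[key]
    # Fall back to the single closest known spelling above the similarity cutoff.
    close = difflib.get_close_matches(key, list(INDEX), n=1, cutoff=cutoff)
    return INDEX[close[0]] if close else None
```

With this index, `resolve("Metriorhynchus moereus Lea")` falls through to the fuzzy step and lands on the same accepted name as the correctly spelled string. A fuzzy fallback can also merge genuinely different names that happen to be spelled alike, so near-miss matches would probably need flagging for review rather than silent acceptance.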
