Make all SOLR taxa fields case insensitive #76

nickdos · 2015-08-19T06:04:42Z

From @nickdos on September 10, 2014 5:19

Or add a case-insensitive copyField.

As reported by a user:

On 9 Sep 2014, at 6:23 pm, Gioia, Paul wrote:

species download - https://bie.ala.org.au/ws/download?q={q}&fq={fq} - If you search on kingdom, Plantae works fine, but Animalia doesn’t until you realise the searches are case sensitive: Plantae and Fungi are in title case while ANIMALIA is upper case. Either the data should be consistent or the querying should be case insensitive; otherwise you don’t know what you can query for.

Copied from original issue: AtlasOfLivingAustralia/biocache-service#5

nickdos · 2015-08-19T06:04:43Z

Indexing issue - goes in biocache-store

nickdos · 2015-08-21T02:02:12Z

This might break stuff so will need testing

djtfmartin · 2015-08-24T09:06:03Z

I'm not sure the solution here is to make all SOLR fields case insensitive. I think this would increase the size of the index dramatically, which has big implications for not a lot of benefit.

Here's a quicker/easier alternative:

https://biocache.ala.org.au/ws/occurrences/search?q=Animalia

When the above query is run, we match the string "Animalia" to the GUID for the taxon kingdom:ANIMALIA. We then search with the left/right values associated with the GUID. This works for any level in the hierarchy (not just major Linnaean ranks, e.g. subfamily). The match is case insensitive, and it also has the smarts to parse taxonomic names with authorship strings etc.

So if we just extend this support to taking a query "kingdom:animalia", parsing it and doing the same matching as above, then we get case insensitive searches and we get more accurate searches for all taxonomic ranks (which you don't get if you just make all fields case insensitive).

Make sense?

FQs shouldn't be case insensitive as they are intended to be exact. They differ from Qs in this aspect.

djtfmartin · 2015-08-24T09:08:45Z

Also - just noticed the original query raised by Paul was for the BIE not the Biocache...

adam-collins · 2015-08-24T12:39:31Z

Changing to a case insensitive SOLR field might work in BIE because it does not have facet listing services. Is that right?

A test on 50999 records indicated that a case insensitive index with solr.TextField requires less storage space than a case sensitive index with solr.StrField. That is the overall result; it may vary from field to field.

fieldType string as case sensitive StrField, 50999 records, solr data size after optimize 303480KB

    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true" />

fieldType string as case insensitive TextField, 50999 records, solr data size after optimize 266488KB

    <fieldType name="string" class="solr.TextField">
        <analyzer type="index">
            <tokenizer class="solr.KeywordTokenizerFactory" />
            <filter class="solr.LowerCaseFilterFactory" />
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.KeywordTokenizerFactory" />
            <filter class="solr.LowerCaseFilterFactory" />
        </analyzer>
    </fieldType>

Wildcard searching and exact searching appear to operate the same. Stored values do retain case.

Unfortunately, facet listings return only lower case values, so I do not think we can use it in biocache-service.

It looks like the biocache-hubs param taxa operates with a GUID search: https://biocache.ala.org.au/occurrences/search?taxa=animalia&facet=off

biocache-service with q=Animalia currently searches the defaultSearchField text, which is case insensitive and contains the contents of many fields. It is therefore more likely to include unintended results and require further filtering, e.g. https://biocache.ala.org.au/ws/occurrence/facets?q=Fungi&facets=kingdom

Is it worth implementing taxa= in biocache-service as it is in biocache-hubs?

djtfmartin · 2015-08-24T12:54:41Z

Yeah, facet listings return the indexed values, not the stored ones. I think we came to the conclusion last time we looked at this that we'd need to store/index the fields we want to be case insensitive twice.
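As a rough sketch of the "index twice" idea, assuming standard Solr schema.xml conventions: the original field keeps its case sensitive type for facet listings, and a lowercased, search-only copy is filled via copyField. The names string_ci and kingdom_ci are made up for illustration and are not taken from the actual biocache schema.

    <!-- Hypothetical type: whole value kept as a single token, lowercased -->
    <fieldType name="string_ci" class="solr.TextField">
        <analyzer>
            <tokenizer class="solr.KeywordTokenizerFactory" />
            <filter class="solr.LowerCaseFilterFactory" />
        </analyzer>
    </fieldType>

    <!-- Case sensitive field, still used for facet listings and exact fq matches -->
    <field name="kingdom" type="string" indexed="true" stored="true" />
    <!-- Hypothetical lowercased copy, indexed for searching only -->
    <field name="kingdom_ci" type="string_ci" indexed="true" stored="false" />
    <copyField source="kingdom" dest="kingdom_ci" />

With something like this, kingdom_ci:animalia should match records indexed as ANIMALIA, while facets on kingdom keep returning the original case. The trade-off is one extra indexed field per case insensitive field.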

> Is it worth implementing taxa= in biocache-service as it is in biocache-hubs?

Yes, probably. The aim for the services was to make clients as dumb as possible. So if there's search term mangling in biocache-hubs, it would be better to push this back to the service if possible. That way we don't have multiple clients (SP, biocache, outside world) all replicating the logic.

nickdos added the enhancement label Aug 19, 2015

nickdos mentioned this issue Aug 19, 2015

Make all SOLR taxa fields case insensitive AtlasOfLivingAustralia/biocache-service#5

Closed

nickdos added this to the Sprint 3 milestone Aug 21, 2015

nickdos mentioned this issue Feb 5, 2019

Make certain search (only) fields case insensitive #322

Open
