Make all SOLR taxa fields case insensitive #76

Open
nickdos opened this issue Aug 19, 2015 · 6 comments
@nickdos
Contributor

nickdos commented Aug 19, 2015

From @nickdos on September 10, 2014 5:19

Or add a case-insensitive copyField.

As reported by a user:

On 9 Sep 2014, at 6:23 pm, Gioia, Paul [email protected] wrote:

species download - https://bie.ala.org.au/ws/download?q={q}&fq={fq} - If you search on kingdom, Plantae works fine, but Animalia doesn’t until you realise the searches are case sensitive, and Plantae, Fungi are in title case while ANIMALIA is upper case. Either the data should be consistent or the querying case-insensitive, otherwise you don’t know what you can query for.

Copied from original issue: AtlasOfLivingAustralia/biocache-service#5

@nickdos
Contributor Author

nickdos commented Aug 19, 2015

Indexing issue - goes in biocache-store

@nickdos
Contributor Author

nickdos commented Aug 21, 2015

This might break stuff so will need testing

nickdos added this to the Sprint 3 milestone Aug 21, 2015
@djtfmartin
Member

I'm not sure the solution here is to make all SOLR fields case insensitive. I think this will cause the index to increase dramatically in size, which will have big implications for not a lot of benefit.

Here's a quicker/easier alternative:

https://biocache.ala.org.au/ws/occurrences/search?q=Animalia

When the above query is run, we match the string "Animalia" to the GUID for the taxon kingdom:ANIMALIA. We then search with the left/right values associated with that GUID. This works for any level in the hierarchy (not just the major Linnaean ranks, e.g. subfamily). The match is case insensitive, and it also has the smarts to parse taxonomic names with authorship strings etc.

So if we just extend this support to taking a query "kingdom:animalia", parsing it and doing the same matching as above, then we get case insensitive searches and we get more accurate searches for all taxonomic ranks (which you don't get if you just make all fields case insensitive).

Make sense?

FQs shouldn't be case insensitive, as they are intended to be exact. They differ from Qs in this respect.
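To make the proposal above concrete, here is a minimal Python sketch of parsing a rank-qualified query such as "kingdom:animalia", matching the name case-insensitively, and turning the match into a left/right range search. It is only an illustration: the `Taxon` records, the GUID strings, the left/right numbers and the `lft` field name are all hypothetical, and it does not attempt the authorship parsing that the real name matching does.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Taxon:
    guid: str
    rank: str
    name: str
    left: int
    right: int


# Hypothetical sample rows; real GUIDs and left/right values come from the name index.
TAXA = [
    Taxon("guid-animalia", "kingdom", "ANIMALIA", 1, 1000),
    Taxon("guid-plantae", "kingdom", "Plantae", 1001, 2000),
    Taxon("guid-fungi", "kingdom", "Fungi", 2001, 3000),
]


def parse_ranked_query(q: str) -> Tuple[Optional[str], str]:
    """Split 'kingdom:animalia' into ('kingdom', 'animalia'); bare terms have no rank."""
    rank, sep, name = q.partition(":")
    if sep:
        return rank.strip().lower(), name.strip()
    return None, q.strip()


def match_taxon(q: str) -> Optional[Taxon]:
    """Case-insensitive name match, restricted to the rank when one is given."""
    rank, name = parse_ranked_query(q)
    wanted = name.casefold()
    for taxon in TAXA:
        if taxon.name.casefold() == wanted and rank in (None, taxon.rank):
            return taxon
    return None


def to_range_query(q: str) -> Optional[str]:
    """Build a range query over the matched taxon's left/right values."""
    taxon = match_taxon(q)
    if taxon is None:
        return None
    # 'lft' is an assumed field name used only for this sketch.
    return f"lft:[{taxon.left} TO {taxon.right}]"
```

With this shape, "Animalia", "kingdom:animalia" and "kingdom:ANIMALIA" all resolve to the same range, while the indexed kingdom field itself stays untouched.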

@djtfmartin
Member

Also - just noticed the original query raised by Paul was for the BIE not the Biocache...

@adam-collins
Contributor

Changing to a case insensitive SOLR field might work in BIE because it does not have facet listing services. Is that right?

A test on 50999 records indicated that a case-insensitive index with solr.TextField requires less storage space than a case-sensitive index with solr.StrField. That is the overall result at least; it may vary from field to field.

  • fieldType string as case-sensitive solr.StrField, 50999 records, Solr data size after optimize: 303480 KB
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
  • fieldType string as case-insensitive solr.TextField, 50999 records, Solr data size after optimize: 266488 KB
    <fieldType name="string" class="solr.TextField">
        <analyzer type="index">
            <tokenizer class="solr.KeywordTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.KeywordTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
    </fieldType>

Wildcard searching and exact searching appear to operate the same. Stored values do retain case.

Unfortunately, facet listings return only lowercase values, so I do not think we can use it in biocache-service.
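The behaviour described above can be imitated in a few lines of Python. This is a toy model, not Solr itself: `TinyIndex` and `keyword_lowercase_analyzer` are made-up names, and the analyzer only mimics what a keyword tokenizer followed by a lowercase filter does to a single value (one token, lowercased).

```python
from collections import Counter
from typing import Dict, List


def keyword_lowercase_analyzer(value: str) -> str:
    """Mimic a keyword tokenizer plus lowercase filter: one token, lowercased."""
    return value.lower()


class TinyIndex:
    """Toy single-field index that keeps stored values and indexed terms apart."""

    def __init__(self) -> None:
        self.stored: List[str] = []   # values exactly as submitted
        self.indexed: List[str] = []  # terms after analysis

    def add(self, value: str) -> None:
        self.stored.append(value)
        self.indexed.append(keyword_lowercase_analyzer(value))

    def search(self, term: str) -> List[str]:
        """The query term goes through the same analyzer; results are stored values."""
        needle = keyword_lowercase_analyzer(term)
        return [s for s, t in zip(self.stored, self.indexed) if t == needle]

    def facet(self) -> Dict[str, int]:
        """Facet counts are built from indexed terms, so the original case is lost."""
        return dict(Counter(self.indexed))
```

Adding "ANIMALIA" and "Plantae" to such an index and searching for "animalia" returns the stored "ANIMALIA", but the facet listing only ever shows "animalia" and "plantae".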

It looks like the biocache-hubs param taxa operates with a GUID search (https://biocache.ala.org.au/occurrences/search?taxa=animalia&facet=off), whereas biocache-service with q=Animalia currently searches the defaultSearchField text, which is case insensitive and contains the contents of many fields. The latter is more likely to include unintended results and require further filtering (see https://biocache.ala.org.au/ws/occurrence/facets?q=Fungi&facets=kingdom). Is it worth implementing taxa= in biocache-service as it is in biocache-hubs?

@djtfmartin
Member

Yeah, facet listings return the indexed values, not the stored ones. I think we came to the conclusion last time we looked at this that we'd need to store/index the fields we want to be case insensitive twice.
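As a rough sketch of the "index it twice" idea, the Python below keeps the original value for facet listings and adds a lowercased copy for searching. The `_lc` suffix and the helper names are assumptions made for this example only; in Solr the copy would normally be produced by the schema rather than by client code, and the extra field costs additional index space.

```python
from typing import Dict, Iterable, List, Optional

Doc = Dict[str, Optional[str]]


def build_doc(record: Doc, fields: Iterable[str], suffix: str = "_lc") -> Doc:
    """Copy the record and add a lowercased twin for each listed field."""
    doc = dict(record)
    for field in fields:
        value = record.get(field)
        if value is not None:
            doc[field + suffix] = value.lower()
    return doc


def search(docs: List[Doc], field: str, term: str, suffix: str = "_lc") -> List[Doc]:
    """Case-insensitive exact match against the lowercased twin field."""
    key = field + suffix
    return [d for d in docs if d.get(key) == term.lower()]


def facet(docs: List[Doc], field: str) -> Dict[str, int]:
    """Facet on the original field so listings keep their original case."""
    counts: Dict[str, int] = {}
    for d in docs:
        value = d.get(field)
        if value is not None:
            counts[value] = counts.get(value, 0) + 1
    return counts
```

Searching goes through the lowercased copy, while facets are built from the untouched field, so "ANIMALIA" and "Plantae" still appear in listings exactly as indexed.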

Is it worth implementing taxa= in biocache-service as it is in biocache-hubs?

Yes, probably. The aim for the services was to make clients as dumb as possible. So if there's search term mangling in biocache-hubs, it would be better to push this back to the service if possible. That way we don't have multiple clients (SP, biocache, outside world) all replicating the logic.
