Best practice for N-gram and set Lucene param with Clouseau #2635

natcohen · 2020-03-04T14:56:25Z

CouchDB/Clouseau indexing allows analyzers, but what about n-gram tokenization? What is the best practice for n-grams? Should we generate the n-grams ourselves within the JavaScript index function, or can we take advantage of Lucene's n-gram support?

Also, how can we set Lucene parameters, such as allowing a leading wildcard (https://lucene.apache.org/core/4_0_0/queryparser/org/apache/lucene/queryparser/classic/QueryParserBase.html#setAllowLeadingWildcard(boolean))?

rnewson · 2020-03-12T08:05:25Z

We don't expose the NGram analyzers in Clouseau today but we'd consider merging a pull request if you want to add them.

We don't support setting that parameter either, and I don't think we'd accept a patch to allow it given it has such bad performance implications.

natcohen · 2020-03-12T14:19:22Z

@rnewson I'd love to contribute and add the n-gram analyzer. Unfortunately I don't know Erlang, and working on Clouseau is a bit overwhelming since the project seems quite complex with very little documentation... I'm also not an expert in Java, so that doesn't help either!

Regarding the leading wildcard parameter, it was just an example! I don't plan to use it but wanted to know if there was a way to use all the parameters Lucene offers.

natcohen · 2020-03-19T19:28:34Z

@rnewson Partial search is widely used, especially for auto-complete. Any chance someone can help expose the n-gram analyzer? I have posted an issue to get some guidance here but Clouseau doesn't seem super active!

PS: There are other useful analyzers that would be great to expose, such as edge n-gram...

rnewson · 2020-03-31T09:08:42Z

Hi @natcohen, sorry for the silence.

Appreciate the desire to help, but things move forward in this project when folks contribute. It's useful to highlight a desire for this feature, though. If someone works on it to a reviewable standard, I'm sure someone will have time to help it over the last few steps.

streunerlein · 2023-09-01T11:59:56Z

@natcohen I see your efforts here and appreciate them a lot. We are also eagerly looking out for an n-gram analyzer in Clouseau - but it seems to be very low priority. If you are looking for auto-complete, we've had good experience with prefix searches using a wildcard at the end (*) - that works out of the box.

Prefix-Search (aka words starting with X)
Out of the box: index the field normally and query with a trailing wildcard (value*).

Suffix-Search (aka words ending with X)
The other way around, suffix search (tokens ending with a given string), is trickier, but we are working on a prototype that might just allow it:

1. Index the string as you normally would (index("field", "value")) to allow prefix search.
2. Index the string again in a different field, but reversed (index("r_field", "eulav")), to allow suffix search.

When you search, query both fields with the input, reversing it for the reversed field:
field:value* r_field:eulav*

which will search for tokens starting with "value" or ending with "value".
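
The two steps and the combined query above could be sketched roughly as follows. This is a minimal sketch and not the prototype itself: the doc.name property and the reverse and buildQuery helpers are made up for illustration, and index is stubbed here only so the snippet runs on its own (inside a design document CouchDB supplies it). It also assumes the input contains no Lucene special characters that would need escaping.

```javascript
// Stand-in for the index() function that CouchDB provides inside a search
// index function; it only records calls so this sketch is runnable.
const indexed = [];
function index(field, value) {
  indexed.push([field, value]);
}

// Hypothetical helper: reverse a string, e.g. "value" -> "eulav".
function reverse(s) {
  return Array.from(s).reverse().join("");
}

// Body of the index function as it might appear in a design document.
// doc.name is an assumed property name.
function indexDoc(doc) {
  if (typeof doc.name === "string") {
    index("field", doc.name);            // allows prefix search
    index("r_field", reverse(doc.name)); // allows suffix search
  }
}

// Hypothetical helper: build the query string for a user's input.
function buildQuery(input) {
  return "field:" + input + "* r_field:" + reverse(input) + "*";
}

indexDoc({ name: "value" });
console.log(indexed);
console.log(buildQuery("value"));
```

Note that the analyzer runs on the reversed text too, so analyzers that stem words or drop stop words may not behave sensibly on the r_field values.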

What is left is infix search, i.e. words containing "value", for which I think an n-gram analyzer is the only way to go.
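
Short of an n-gram analyzer, the workaround asked about at the top of this thread is to generate the n-grams in the JavaScript index function itself. A rough sketch of that idea, untested against Clouseau: the ngrams helper, the g_field field name, the doc.name property and the gram sizes (3 to 4) are all assumptions for illustration, and index is stubbed so the snippet runs standalone. Expect the index to grow considerably with this approach.

```javascript
// Stand-in for the index() function that CouchDB provides inside a search
// index function; it only records calls so this sketch is runnable.
const indexed = [];
function index(field, value) {
  indexed.push([field, value]);
}

// Hypothetical helper: all substrings of `word` with length min..max,
// lowercased.
function ngrams(word, min, max) {
  const chars = Array.from(word.toLowerCase());
  const grams = [];
  for (let n = min; n <= max; n++) {
    for (let i = 0; i + n <= chars.length; i++) {
      grams.push(chars.slice(i, i + n).join(""));
    }
  }
  return grams;
}

// Body of the index function as it might appear in a design document.
// doc.name is an assumed property name.
function indexDoc(doc) {
  if (typeof doc.name === "string") {
    // Split into words first so that no gram spans a space.
    const grams = doc.name
      .split(/\s+/)
      .filter(Boolean)
      .flatMap((w) => ngrams(w, 3, 4));
    if (grams.length > 0) {
      // Space-separated, assuming the field's analyzer splits on whitespace
      // and therefore emits one token per gram.
      index("g_field", grams.join(" "));
    }
  }
}

indexDoc({ name: "value" });
console.log(indexed);
```

Under these assumptions a query such as g_field:alu would match the document, but only inputs of 3 or 4 characters correspond to an indexed gram; longer inputs would have to be split into grams on the query side as well.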

natcohen added enhancement needs-triage labels Mar 4, 2020

natcohen changed the title ~~Best practice for N-grame and set Lucene param with Clouseau~~ Best practice for N-gram and set Lucene param with Clouseau Mar 4, 2020

wohali added patches-welcome and removed needs-triage labels Mar 13, 2020
