Best practice for N-gram and set Lucene param with Clouseau #2635

Open
natcohen opened this issue Mar 4, 2020 · 5 comments

@natcohen

natcohen commented Mar 4, 2020

CouchDB/Clouseau indexing allows analyzers, but what about n-gram tokenization? What is the best practice for n-grams? Should we use an algorithm to generate n-grams within the index JavaScript function? Or can we take advantage of Lucene's n-gram functionality?

Also how can we set Lucene parameters such as allowing leading wildcard (https://lucene.apache.org/core/4_0_0/queryparser/org/apache/lucene/queryparser/classic/QueryParserBase.html#setAllowLeadingWildcard(boolean))?

@natcohen natcohen changed the title from "Best practice for N-grame and set Lucene param with Clouseau" to "Best practice for N-gram and set Lucene param with Clouseau" on Mar 4, 2020
@rnewson
Member

rnewson commented Mar 12, 2020

We don't expose the NGram analyzers in Clouseau today but we'd consider merging a pull request if you want to add it.

We don't support setting of that parameter either, and I don't think we'd accept a patch to allow it given it has such bad performance implications.

@natcohen
Author

@rnewson I'd love to contribute and add the n-gram analyzer. Unfortunately I don't know Erlang, and working on Clouseau is a bit overwhelming since the project seems quite complex with very little documentation... I'm also not an expert in Java, so that doesn't help either!

Regarding the leading wildcard parameter, it was just an example! I don't plan to use it but wanted to know if there was a way to use all the parameters Lucene offers.

@natcohen
Author

natcohen commented Mar 19, 2020

@rnewson Partial search is widely used, especially for auto-complete. Any chance someone can help expose the n-gram analyzer? I have posted an issue to get some guidance here, but Clouseau doesn't seem super active!

PS There are other useful analyzers that would be great to expose, such as edge n-gram...

@rnewson
Member

rnewson commented Mar 31, 2020

Hi @natcohen, sorry for the silence.

I appreciate the desire to help, but things move forward in this project when folks contribute. It's useful to highlight a desire for this feature, though. If someone works on it to a reviewable standard, I'm sure someone will have time to help it over the last few steps.

@streunerlein

streunerlein commented Sep 1, 2023

@natcohen I see your efforts here and appreciate them a lot. We are also eagerly looking out for an n-gram analyzer in Clouseau, but it seems to be very low priority. If you are looking for auto-complete, we've had good experience with prefix searches using a trailing wildcard (*), which works out of the box.

Prefix-Search (aka words starting with X)
Out of the box: index the field normally and query with a trailing wildcard (value*).
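
A minimal sketch of building such a prefix query on the client side, assuming a single search token and a hypothetical index field called `name`. The `escapeLucene` and `prefixQuery` helpers are illustrative names, not part of CouchDB or Clouseau; the escaping covers the characters that Lucene's classic query syntax treats as operators.

```javascript
// Escape characters that Lucene's classic query syntax treats as operators,
// so user input cannot change the meaning of the query.
function escapeLucene(text) {
  return text.replace(/[+\-!(){}\[\]^"~*?:\\\/&|]/g, "\\$&");
}

// Build a prefix query such as `name:val*` for a single search token.
// `field` is a hypothetical index field name chosen for illustration.
function prefixQuery(field, input) {
  return field + ":" + escapeLucene(input.trim()) + "*";
}

console.log(prefixQuery("name", "val"));
```

The resulting string would be passed as the `q` parameter of the search request. Input containing whitespace is not handled here and would need to be split into separate tokens first.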

Suffix-Search (aka words ending with X)
The other way around, suffix search (tokens ending with a given string), is trickier, but we are working on a prototype that might just allow it:

  • Index string as you normally would (index("field", "value")) to allow prefix search
  • Index the string again in a different field but reversed (index("r_field", "eulav")) to allow suffix search

When you search, query both fields: pass the input as-is to the normal field and reversed to the reversed field:
field:value* r_field:eulav*

which will search for tokens starting with "value" or ending with "value".
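
The approach above could be sketched roughly as follows. This is an illustration under assumptions, not the prototype mentioned: `name` and `r_name` are hypothetical field names, and `index` is passed in as a parameter so the sketch runs standalone (inside a CouchDB design document, `index` is provided by the search runtime and the helper would have to be defined inside the index function itself).

```javascript
// Reverse a string. This simple version works on UTF-16 code units, so it
// does not handle surrogate pairs or combining characters correctly.
function reverse(text) {
  return text.split("").reverse().join("");
}

// Body of a hypothetical index function: index the value twice,
// once as-is and once reversed.
function indexDoc(doc, index) {
  if (typeof doc.name !== "string") return;
  index("name", doc.name);            // supports prefix search
  index("r_name", reverse(doc.name)); // supports suffix search
}

// Build the combined query: the input on the normal field, and the
// reversed input on the reversed field, each with a trailing wildcard.
function prefixOrSuffixQuery(input) {
  return "name:" + input + "* r_name:" + reverse(input) + "*";
}

// Demonstration with a stub that records the index() calls.
var calls = [];
indexDoc({ name: "value" }, function (field, text) {
  calls.push([field, text]);
});
console.log(calls);
console.log(prefixOrSuffixQuery("value"));
```

Reversing the whole string also reverses the word order, but each token still ends up reversed, which is what a token-level suffix match needs. Query input should additionally be escaped for Lucene special characters before use.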

What is left is infix search, i.e. words containing "value", for which I think an n-gram analyzer is the only way to go.
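
Until such an analyzer is exposed, one possible workaround, along the lines of the original question about generating n-grams inside the index function, is to emit the n-grams as whitespace-separated tokens in a separate field. The sketch below is untested against Clouseau and rests on assumptions: `name_ngram` is a hypothetical field, the gram lengths 3 to 5 are arbitrary, and the field is assumed to use an analyzer that splits on whitespace without stemming. As above, `index` is passed in so the sketch runs standalone.

```javascript
// Collect the distinct substrings of `word` whose length is between
// minLen and maxLen (inclusive), in order of first appearance.
function ngrams(word, minLen, maxLen) {
  var seen = {};
  var grams = [];
  for (var n = minLen; n <= maxLen; n++) {
    for (var i = 0; i + n <= word.length; i++) {
      var gram = word.substr(i, n);
      if (!Object.prototype.hasOwnProperty.call(seen, gram)) {
        seen[gram] = true;
        grams.push(gram);
      }
    }
  }
  return grams;
}

// Body of a hypothetical index function: lowercase the value, split it
// into words, and index all 3..5 character grams as one token list.
function indexDoc(doc, index) {
  if (typeof doc.name !== "string") return;
  var words = doc.name.toLowerCase().split(/\s+/);
  var grams = [];
  for (var w = 0; w < words.length; w++) {
    grams = grams.concat(ngrams(words[w], 3, 5));
  }
  if (grams.length > 0) index("name_ngram", grams.join(" "));
}

// Demonstration with a stub that records the index() calls.
var calls = [];
indexDoc({ name: "Value" }, function (field, text) {
  calls.push([field, text]);
});
console.log(calls);
```

A query such as `name_ngram:alu` would then match documents containing "value", provided the search input is lowercased and its length falls within the indexed gram range. The trade-off is a noticeably larger index, since every word contributes many tokens.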
