Make suggester's string distance pluggable or configurable #24655
Comments
I think it's misleading and very expert-only if we allow a non-edit-distance-like metric. See the comment on the spellchecker:
I really think we can't do much here without causing a massive amount of confusion.
@s1monw I see; basically, it's not going to work optimally unless all costs in our custom metric are greater than or equal to 1. In other words, Damerau-Levenshtein has to be the lower bound for the custom function. I agree that this would be really confusing in the general case, because it's not really possible to select a proper number of max inspections. We could add a weighted edit distance and enforce weights to be >= 1, but this will only help the "fat-fingers" use case, and only to some degree. What do you think about adding an option to not ignore an exact match of the same term? This will not help with the "fat-fingers" use case, but it would be a viable solution for the other two use cases.
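A minimal sketch of the weighted-edit-distance idea discussed here (the function and weight table are hypothetical, not part of Elasticsearch): per-pair substitution weights are clamped to >= 1, so plain Damerau-Levenshtein remains a lower bound for the custom metric.

```python
def weighted_edit_distance(a, b, sub_weights=None):
    """Restricted Damerau-Levenshtein with per-pair substitution weights,
    each clamped to >= 1 so the unweighted distance stays a lower bound."""
    sub_weights = sub_weights or {}
    m, n = len(a), len(b)
    # dp[i][j] = cost of transforming a[:i] into b[:j]
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = float(i)
    for j in range(n + 1):
        dp[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                cost = 0.0
            else:
                # Clamp: no substitution may be cheaper than a standard edit.
                cost = max(1.0, sub_weights.get((a[i - 1], b[j - 1]), 1.0))
            dp[i][j] = min(dp[i - 1][j] + 1.0,      # deletion
                           dp[i][j - 1] + 1.0,      # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dp[i][j] = min(dp[i][j], dp[i - 2][j - 2] + 1.0)  # transposition
    return dp[m][n]
```

With an empty weight table this reduces to ordinary Damerau-Levenshtein; a weight below 1 is silently raised to 1, which is exactly the constraint needed to keep max-inspections-style candidate pruning sound.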
We do this in our code, AFAIK: we already have the original term and we take it into account when we try to find good suggestions.
Yes, but there seems to be no way to figure out whether the original term existed in the index or was so different that we couldn't find any suggestions for it at all. For example:
produces:
I think this is very difficult to prove on a single-doc, single-term example. You can play with the confidence level to get suggestions in such an index, but I personally think it's pretty useless. We do get these suggestions, but they don't pass the cutoff frequency, which is OK in that particular index IMO.
I am not sure I follow. The first suggestion is an exact match, but there is no way for a user to figure this out. I am not really sure how adding more documents to the index is going to help here.
Spellcheckers work based on frequencies, docs, etc. In a real corpus this will very likely work.
…by default
In example indices it can be very confusing when perfect matches are not suggested, since scores are equivalent to the cut-off score. Relates to elastic#24655
@imotov I added a little change that makes this come up, but it also might bring up nonsense in other situations. You can't come up with one-doc examples and expect them to work, sorry.
cc @elastic/es-search-aggs |
Is this feature on the backlog? |
I am still not getting any result for exact match. :-((( |
Did you guys find any solution for that? I am new here.
Pinging @elastic/es-search (Team:Search) |
Pinging @elastic/es-search-relevance (Team:Search Relevance) |
Summary
The current implementation allows choosing among the Damerau-Levenshtein algorithm (two implementations), the Levenshtein algorithm, the Jaro-Winkler algorithm, and an n-gram-based algorithm with a non-configurable n-gram size of 2. I would like to discuss the possibility of making the string distance algorithm pluggable, or of extending the set of algorithms and making them configurable.
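For reference, the existing choice is made per request via the term suggester's `string_distance` option (the index and field names below are illustrative):

```
POST /my-index/_search
{
  "suggest": {
    "spell-check": {
      "text": "strnig",
      "term": {
        "field": "title",
        "string_distance": "damerau_levenshtein",
        "max_edits": 2
      }
    }
  }
}
```

The accepted values are `internal`, `damerau_levenshtein`, `levenshtein`, `jaro_winkler`, and `ngram`; this issue asks for that closed list to become open or configurable.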
Rationale
While this selection covers the most frequently used generic string distance algorithms, it falls short in several important use cases where more information about the frequency of certain mistakes is available and certain types of mistakes are anticipated. Here are a few of the use cases:
Wrong Keyboard Layout
The first use case, and the main motivation for this issue, was a recent post in the Russian category of the Elasticsearch forum. It's somewhat common for users with multiple keyboard layouts installed on their computer to start typing in the wrong layout. So, in the case of the standard Russian keyboard layout and the QWERTY layout, a user could type
ыекштп instead of string, which is essentially a single mistake (the wrong layout) counted as 6 (the length of the string). We can handle this use case during search using char_filters, but it wouldn't work in the suggester because we don't provide suggestions for exact matches.
"Fat-fingers"
Even with a single keyboard layout, some typos are much more common than others. So, by lowering the substitution penalty for commonly mistyped letters, we can improve the quality of suggestions. A typical approach to this problem is a weighted edit distance with a frequency-based confusion matrix.
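A sketch of how a frequency-based confusion matrix could feed such a weighted distance. The counts and the scaling formula are invented for illustration; only the >= 1 clamp reflects the constraint discussed earlier in this thread.

```python
# Illustrative typo-log counts (invented data): how often the first character
# was typed when the second was intended.
CONFUSION_COUNTS = {
    ("m", "n"): 90,  # adjacent keys, frequently confused
    ("q", "z"): 2,   # rarely confused
}
MAX_COUNT = max(CONFUSION_COUNTS.values())

def substitution_cost(a: str, b: str, base: float = 2.0) -> float:
    """Map observed confusion counts to a cost in [1.0, base].

    Frequent confusions approach 1.0, unseen pairs cost the full base,
    and the result never drops below 1.0, so plain Damerau-Levenshtein
    remains a lower bound for the weighted metric.
    """
    count = CONFUSION_COUNTS.get((a, b), 0)
    return max(1.0, base - (base - 1.0) * count / MAX_COUNT)
```

Such a cost function could be plugged into any weighted-edit-distance implementation as its substitution table; the key design point is that weighting only discounts anticipated mistakes relative to the `base`, never below a standard single edit.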
Phonetic correction
Misspellings are different from typos, but like typos they have the characteristic that certain mistakes are more frequent than others. Similar to the wrong-keyboard-layout use case, we can easily handle this in search using a phonetic token filter, but if a term misspelled by a user has exactly the same phonetic hash as the term found by the suggester, we would not make the suggestion.
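To make the phonetic-hash collision concrete, here is a minimal sketch of Soundex, the classic phonetic encoder (Elasticsearch's phonetic analysis plugin offers this and stronger algorithms; the simplification below ignores some edge rules of the full standard). A misspelling such as "smyth" hashes identically to "smith", so a suggester that refuses to suggest for exact (hash) matches would stay silent:

```python
def soundex(word: str) -> str:
    """Simplified Soundex: first letter plus up to three consonant codes."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    result = word[0].upper()
    prev = codes.get(word[0], "")  # the first letter's code is not emitted
    for c in word[1:]:
        code = codes.get(c, "")
        if code and code != prev:  # collapse runs of the same code
            result += code
        if c not in "hw":          # h and w do not reset the previous code
            prev = code
    return (result + "000")[:4]    # pad/truncate to the standard 4 characters
```

Here `soundex("smith")` and `soundex("smyth")` both yield `S530`: exactly the situation described above, where the misspelled input and the correct indexed term collide on the same hash.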