
Make suggester's string distance pluggable or configurable #24655

Open
imotov opened this issue May 12, 2017 · 14 comments
Labels
>enhancement · :Search Relevance/Suggesters ("Did you mean" and suggestions as you type) · Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch)

Comments


imotov commented May 12, 2017

Summary

The current implementation allows choosing between the Damerau-Levenshtein algorithm (two implementations), the Levenshtein algorithm, the Jaro-Winkler algorithm, and an n-gram-based algorithm with a non-configurable n-gram size (2). I would like to discuss the possibility of making the string distance algorithm pluggable, or extending the set of algorithms and making them configurable.
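For context, the choice available today is exposed through the term suggester's `string_distance` option (accepted values are `internal`, `damerau_levenshtein`, `levenshtein`, `jaro_winkler`, and `ngram`); a request sketch against a hypothetical `my_index` with a `text` field:

```console
GET my_index/_search
{
  "suggest": {
    "my-suggestion": {
      "text": "strng",
      "term": {
        "field": "text",
        "string_distance": "jaro_winkler"
      }
    }
  }
}
```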

Rationale

While this selection covers the most frequently used generic string distance algorithms, it falls short in several important use cases where more information about the frequency of certain mistakes is available and certain types of mistakes are anticipated. Here are a few of the use cases:

Wrong Keyboard Layout

The first use case, and the main motivation for this issue, was a recent post in the Russian category of the Elasticsearch forum. It is fairly common for users with multiple keyboard layouts installed on their computer to start typing in the wrong layout. So, in the case of the standard Russian keyboard layout and the QWERTY layout, a user could type ыекштп instead of string, which is essentially a single mistake (wrong layout) but is counted as 6 edits (the length of the string). We can handle this use case during search using char_filters, but that wouldn't work in the suggester because we don't provide suggestions for exact matches.
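The layout confusion itself is mechanical, so it can be modeled as a character remapping rather than as six independent edits. A minimal sketch in Python (the letter tables are illustrative and cover only lowercase letters of QWERTY and the standard Russian ЙЦУКЕН layout):

```python
# Key-for-key correspondence between the two layouts, in QWERTY key order.
QWERTY = "qwertyuiopasdfghjklzxcvbnm"
JCUKEN = "йцукенгшщзфывапролдячсмить"

# Translation table: Russian-layout character -> character on the same key in QWERTY.
TO_QWERTY = str.maketrans(JCUKEN, QWERTY)

def layout_normalize(term: str) -> str:
    """Map characters typed in the Russian layout back to their QWERTY key positions."""
    return term.translate(TO_QWERTY)

print(layout_normalize("ыекштп"))  # -> string
```

Applied before (or alongside) the suggester's distance computation, this turns the whole-string layout mistake back into an exact or near-exact match.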

"Fat-fingers"

Even with a single keyboard layout, some typos are much more common than others. So, by lowering the substitution penalty for commonly mistyped letters, we can improve the quality of suggestions. A typical approach to this problem is a weighted edit distance with a frequency-based confusion matrix.
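As a sketch of that approach (the costs below are made up; a real confusion matrix would be estimated from query logs), a weighted Levenshtein where substitutions between commonly confused key pairs are cheaper might look like:

```python
def weighted_edit_distance(a: str, b: str, sub_cost: dict) -> float:
    """Levenshtein distance with per-character-pair substitution costs.

    sub_cost maps (char_a, char_b) -> cost; unlisted pairs cost 1.0.
    Insertions and deletions always cost 1.0.
    """
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else sub_cost.get((a[i - 1], b[j - 1]), 1.0)
            d[i][j] = min(d[i - 1][j] + 1.0,      # deletion
                          d[i][j - 1] + 1.0,      # insertion
                          d[i - 1][j - 1] + sub)  # (weighted) substitution
    return d[m][n]

# 'n' and 'm' sit on adjacent keys, so make that substitution cheap:
confusion = {("n", "m"): 0.3, ("m", "n"): 0.3}
```

With this matrix, `weighted_edit_distance("forn", "form", confusion)` comes out at 0.3, while an unrelated single substitution such as `"forn"` vs `"fort"` still costs 1.0, so the fat-finger candidate ranks closer.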

Phonetic correction

Misspellings are different from typos, but like typos they share the characteristic that certain mistakes are more frequent than others. Similar to the wrong keyboard layout use case, we can easily handle this in search using a phonetic token filter, but if a term misspelled by a user has exactly the same phonetic hash as the term found by the suggester, we would not make the suggestion.
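To illustrate the phonetic-hash collision, here is a minimal Soundex sketch (Elasticsearch's phonetic token filter supports richer encoders such as Metaphone; this simplified version skips some edge cases of the full algorithm):

```python
def soundex(word: str) -> str:
    """Simplified Soundex: first letter plus up to three consonant-class digits."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit
    word = word.lower()
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":  # 'h' and 'w' do not separate identical codes
            prev = code
    return (result + "000")[:4]
```

Since, for instance, "robert" and "rupert" both hash to R163, an exact phonetic-hash match between the user's misspelling and an indexed term carries a strong correction signal that the current suggester cannot exploit.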

imotov added the :Search Relevance/Suggesters ("Did you mean" and suggestions as you type), discuss, and >feature labels on May 12, 2017

s1monw commented May 15, 2017

I think it's misleading and very expert-oriented if we allow a non-edit-distance-like metric. See the comment on the spellchecker:

/**
   * Set the string distance metric.
   * The default is {@link #INTERNAL_LEVENSHTEIN}
   * <p>
   * Note: because this spellchecker draws its candidates from the term
   * dictionary using Damerau-Levenshtein, it works best with an edit-distance-like
   * string metric. If you use a different metric than the default,
   * you might want to consider increasing {@link #setMaxInspections(int)}
   * to draw more candidates for your metric to rank.
   */

I really think we can't do much here without causing a massive amount of confusion?


imotov commented May 15, 2017

@s1monw I see, basically, it's not going to work optimally unless all costs in our custom metric are greater than or equal to 1. In other words, Damerau-Levenshtein has to be a lower bound for the custom function. I agree that this would be really confusing in the general case, because it's not really possible to select a proper max inspections value. We could add a weighted edit distance and enforce weights to be >= 1, but this would only help the "Fat-fingers" use case, and only to some degree.
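A quick numeric illustration of why sub-1 weights break candidate drawing (all numbers hypothetical):

```python
max_edits = 2        # candidates are drawn with Damerau-Levenshtein distance <= max_edits
plain_edits = 3      # a term three plain substitutions away is never drawn as a candidate...
sub_weight = 0.5     # ...but with a 0.5 substitution weight its custom distance would be:
custom_distance = plain_edits * sub_weight

# 1.5 < 2: under the custom metric this term would outrank candidates that
# were drawn, yet the metric never gets to see it.
assert custom_distance < max_edits
```

With all weights >= 1, the plain edit distance is a lower bound on the custom score, so no candidate outside the drawing radius can beat one inside it.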

What do you think about adding an option to not ignore exact match of the same term? This will not help with "Fat-fingers" use case, but it will be a viable solution for other two use cases.


s1monw commented May 16, 2017

What do you think about adding an option to not ignore exact match of the same term? This will not help with "Fat-fingers" use case, but it will be a viable solution for other two use cases.

We do this in our code AFAIK - we already have the original term and we take it into account when we try to find good suggestions.


imotov commented May 16, 2017

Yes, but there seems to be no way to figure out whether the original term existed in the index or was so different that we couldn't find any suggestions for it at all. For example:

DELETE my_index
PUT my_index/doc/1?refresh
{
  "text": "foobarbaz"
}
GET my_index/_search
{
  "size": 0,
  "suggest": {
        "exact" : {
            "text" : "foobarbaz",
            "term" : {
                "field" : "text",
                "suggest_mode": "always"
            }
        },
        "way-off" : {
            "text" : "bazbarfoo",
            "term" : {
                "field" : "text",
                "suggest_mode": "always"
            }
        },
        "pretty-close" : {
            "text" : "foobarba",
            "term" : {
                "field" : "text",
                "suggest_mode": "always"
            }
        }
    }
}

produces:

{
  "took": 11,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": 0,
    "hits": []
  },
  "suggest": {
    "exact": [
      {
        "text": "foobarbaz",
        "offset": 0,
        "length": 9,
        "options": []
      }
    ],
    "way-off": [
      {
        "text": "bazbarfoo",
        "offset": 0,
        "length": 9,
        "options": []
      }
    ],
    "pretty-close": [
      {
        "text": "foobarba",
        "offset": 0,
        "length": 8,
        "options": [
          {
            "text": "foobarbaz",
            "score": 0.875,
            "freq": 1
          }
        ]
      }
    ]
  }
}


s1monw commented May 16, 2017

I think this is very difficult to prove on a single-doc, single-term example. You can play with the confidence level to get suggestions in such an index, but I personally think it's pretty useless. We do get these suggestions, but they don't pass the cutoff frequency, which is OK in that particular index IMO.
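For reference, playing with the confidence level on the earlier example might look like this (illustrative request; the term suggester's `confidence` option defaults to 1.0, and lowering it lets candidates through that score no better than the input term):

```console
GET my_index/_search
{
  "size": 0,
  "suggest": {
    "exact": {
      "text": "foobarbaz",
      "term": {
        "field": "text",
        "suggest_mode": "always",
        "confidence": 0.0
      }
    }
  }
}
```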


imotov commented May 16, 2017

I think this is very difficult to prove on a single-doc, single-term example.

I am not sure I follow. The first suggestion is an exact match, but there is no way for a user to figure this out. I am not really sure how adding more documents to the index is going to help here.


s1monw commented May 16, 2017

I am not sure I follow. The first suggestion is an exact match, but there is no way for a user to figure this out. I am not really sure how adding more documents to the index is going to help here.

Spellcheckers work based on frequencies, docs, etc. In a real corpus this will very likely work.

s1monw added a commit to s1monw/elasticsearch that referenced this issue May 16, 2017
…by default

In example indices it can be very confusing when perfect matches are not suggested
since scores are equivalent to the cut-off score.

Relates to elastic#24655

s1monw commented May 16, 2017

@imotov I added a little change that makes this come up, but it also might bring up bullshit in other situations. You can't come up with 1-doc examples and expect them to work, sorry.


jimczi commented Mar 19, 2018

cc @elastic/es-search-aggs

rjernst added the Team:Search (Meta label for search team) label on May 4, 2020
@Gilthans

Is this feature on the backlog?
When debugging bad search results, 'fat fingers' is probably the top culprit in terms of 'how could we have known what the user wanted'.
Even allowing weighting of specific character replacements, or weighting transposition over replacement, could greatly improve many search results IMO.

@sujata1993

I am still not getting any result for exact match. :-(((

@sujata1993

Did you guys get any solution for that? I am new here.

@elasticsearchmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

javanna added the Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch) label and removed the Team:Search (Meta label for search team) label on Jul 12, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)
