Make suggester's string distance pluggable or configurable #24655
Comments
I think it's misleading and very expert-only if we allow a non-edit-distance-like metric. See the comment on the spellchecker:
I really think we can't do much here without causing a massive amount of confusion.
@s1monw I see; basically, it's not going to work optimally unless all costs in our custom metric are greater than or equal to 1. In other words, Damerau-Levenshtein has to be the lower bound for the custom function. I agree that this would be really confusing in the general case, because it's not really possible to select a proper number of max inspections. We could add a weighted edit distance and enforce weights to be >= 1, but this will only help the "fat-fingers" use case, and only to some degree. What do you think about adding an option to not ignore an exact match of the same term? This will not help with the "fat-fingers" use case, but it would be a viable solution for the other two use cases.
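A minimal sketch of the weighted-edit-distance idea discussed here (the function and weight table are hypothetical, not part of Elasticsearch): per-pair substitution weights are clamped to >= 1, so plain Damerau-Levenshtein remains a lower bound for the custom metric.

```python
def weighted_edit_distance(a, b, sub_weights=None):
    """Restricted Damerau-Levenshtein with per-pair substitution weights,
    each clamped to >= 1 so the unweighted distance stays a lower bound."""
    sub_weights = sub_weights or {}
    m, n = len(a), len(b)
    # dp[i][j] = cost of transforming a[:i] into b[:j]
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = float(i)
    for j in range(n + 1):
        dp[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                cost = 0.0
            else:
                # Clamp: no substitution may be cheaper than a standard edit.
                cost = max(1.0, sub_weights.get((a[i - 1], b[j - 1]), 1.0))
            dp[i][j] = min(dp[i - 1][j] + 1.0,      # deletion
                           dp[i][j - 1] + 1.0,      # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dp[i][j] = min(dp[i][j], dp[i - 2][j - 2] + 1.0)  # transposition
    return dp[m][n]
```

With an empty weight table this reduces to ordinary Damerau-Levenshtein; a weight below 1 is silently raised to 1, which is exactly the constraint needed to keep max-inspections-style candidate pruning sound.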
We do this in our code, AFAIK: we already have the original term and we take it into account when we try to find good suggestions.
Yes, but there seems to be no way to figure out whether the original term existed in the index or was so different that we couldn't find any suggestions for it at all. For example:
produces:
I think this is very difficult to prove on a single-doc, single-term example. You can play with the confidence level to get suggestions in such an index, but I personally think it's pretty useless. We do get these suggestions, but they don't pass the cutoff frequency, which is OK in that particular index IMO.
I am not sure I follow. The first suggestion is an exact match, but there is no way for a user to figure this out. I am not really sure how adding more documents to the index is going to help here.
Spellcheckers work based on frequencies, docs, etc. In a real corpus this will very likely work.
…by default
In example indices it can be very confusing when perfect matches are not suggested, since scores are equivalent to the cut-off score. Relates to elastic#24655
@imotov I added a little change that makes this come up, but it also might bring up nonsense in other situations. You can't come up with one-doc examples and expect them to work, sorry.
cc @elastic/es-search-aggs |
Is this feature on the backlog? |
I am still not getting any result for exact match. :-((( |
Did you guys find any solution for that? I am new here.
Pinging @elastic/es-search (Team:Search) |
Pinging @elastic/es-search-relevance (Team:Search Relevance) |
Summary
The current implementation allows choosing among the Damerau-Levenshtein algorithm (two implementations), the Levenshtein algorithm, the Jaro-Winkler algorithm, and an n-gram-based algorithm with a non-configurable n-gram size of 2. I would like to discuss the possibility of making the string distance algorithm pluggable, or of extending the set of algorithms and making them configurable.
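For reference, the existing choice is made per request via the term suggester's `string_distance` option (the index and field names below are illustrative):

```
POST /my-index/_search
{
  "suggest": {
    "spell-check": {
      "text": "strnig",
      "term": {
        "field": "title",
        "string_distance": "damerau_levenshtein",
        "max_edits": 2
      }
    }
  }
}
```

The accepted values are `internal`, `damerau_levenshtein`, `levenshtein`, `jaro_winkler`, and `ngram`; this issue asks for that closed list to become open or configurable.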
Rationale
While this selection covers the most frequently used generic string distance algorithms, it falls short in several important use cases where more information about the frequency of certain mistakes is available and certain types of mistakes are anticipated. Here are a few of the use cases:
Wrong Keyboard Layout
The first use case, and the main motivation for this issue, was a recent post in the Russian category of the Elasticsearch forum. It's somewhat common for users with multiple keyboard layouts installed on their computer to start typing in the wrong layout. So, in the case of the standard Russian keyboard layout and the QWERTY layout, a user could type
ыекштп instead of string, which is essentially a single mistake (the wrong layout) counted as 6 (the length of the string). We can handle this use case during search using char_filters, but it wouldn't work in the suggester because we don't provide suggestions for exact matches.
"Fat-fingers"
Even with a single keyboard layout, some typos are much more common than others. So, by lowering the substitution penalty for commonly mistyped letters, we can improve the quality of suggestions. A typical approach to this problem is a weighted edit distance with a frequency-based confusion matrix.
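A sketch of how a frequency-based confusion matrix could feed such a weighted distance. The counts and the scaling formula are invented for illustration; only the >= 1 clamp reflects the constraint discussed earlier in this thread.

```python
# Illustrative typo-log counts (invented data): how often the first character
# was typed when the second was intended.
CONFUSION_COUNTS = {
    ("m", "n"): 90,  # adjacent keys, frequently confused
    ("q", "z"): 2,   # rarely confused
}
MAX_COUNT = max(CONFUSION_COUNTS.values())

def substitution_cost(a: str, b: str, base: float = 2.0) -> float:
    """Map observed confusion counts to a cost in [1.0, base].

    Frequent confusions approach 1.0, unseen pairs cost the full base,
    and the result never drops below 1.0, so plain Damerau-Levenshtein
    remains a lower bound for the weighted metric.
    """
    count = CONFUSION_COUNTS.get((a, b), 0)
    return max(1.0, base - (base - 1.0) * count / MAX_COUNT)
```

Such a cost function could be plugged into any weighted-edit-distance implementation as its substitution table; the key design point is that weighting only discounts anticipated mistakes relative to the `base`, never below a standard single edit.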
Phonetic correction
Misspellings are different from typos, but like typos they have the characteristic that certain mistakes are more frequent than others. Similar to the wrong-keyboard-layout use case, we can easily handle this in search using a phonetic token filter, but if a term misspelled by a user has exactly the same phonetic hash as the term found by the suggester, we would not make the suggestion.
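To make the phonetic-hash collision concrete, here is a minimal sketch of Soundex, the classic phonetic encoder (Elasticsearch's phonetic analysis plugin offers this and stronger algorithms; the simplification below ignores some edge rules of the full standard). A misspelling such as "smyth" hashes identically to "smith", so a suggester that refuses to suggest for exact (hash) matches would stay silent:

```python
def soundex(word: str) -> str:
    """Simplified Soundex: first letter plus up to three consonant codes."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    result = word[0].upper()
    prev = codes.get(word[0], "")  # the first letter's code is not emitted
    for c in word[1:]:
        code = codes.get(c, "")
        if code and code != prev:  # collapse runs of the same code
            result += code
        if c not in "hw":          # h and w do not reset the previous code
            prev = code
    return (result + "000")[:4]    # pad/truncate to the standard 4 characters
```

Here `soundex("smith")` and `soundex("smyth")` both yield `S530`: exactly the situation described above, where the misspelled input and the correct indexed term collide on the same hash.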