Common Grams filter should have configuration option #36771

Aezo · 2018-12-18T13:58:30Z

Describe the feature: The common grams token filter should have a configuration option that specifies whether a common word is combined with the token to its left, the token to its right, or both. With query_mode set to true, the filter would then remove only the tokens that were joined and leave all other tokens untouched.

The configuration option could be named "join_mode". It could be configured like this -

PUT /test_index
{
  "index": {
    "analysis": {
      "filter": {
        "common_grams_left_filter": {
          "common_words_path": "my_common_words_left.txt",
          "ignore_case": "true",
          "type": "common_grams",
          "query_mode": "true",
          "join_mode": "left"
        },
        "common_grams_right_filter": {
          "common_words_path": "my_common_words_right.txt",
          "ignore_case": "true",
          "type": "common_grams",
          "query_mode": "true",
          "join_mode": "right"
        },
        "common_grams_both_filter": {
          "common_words_path": "my_common_words_both.txt",
          "ignore_case": "true",
          "type": "common_grams",
          "query_mode": "true",
          "join_mode": "both"
        }
      },
      "analyzer": {
        "common_grams_left_analyser": {
          "filter": [
            "lowercase",
            "common_grams_left_filter"
          ],
          "tokenizer": "whitespace"
        },
        "common_grams_right_analyser": {
          "filter": [
            "lowercase",
            "common_grams_right_filter"
          ],
          "tokenizer": "whitespace"
        },
        "common_grams_both_analyser": {
          "filter": [
            "lowercase",
            "common_grams_both_filter"
          ],
          "tokenizer": "whitespace"
        }
      },
      "char_filter": {
        
      }
    },
    "number_of_shards": "5",
    "number_of_replicas": "0"
  }
}

This is how the above 3 analysers will work -

join_mode: left

my_common_words_left.txt
gb

input string - "Samsung 64 GB Gold"
common_grams_left_analyser - "samsung", "64_gb", "gold"

join_mode: right

my_common_words_right.txt
rs

input string - "salt Rs 200"
common_grams_right_analyser - "salt", "rs_200"

join_mode: both (default)

This is the current behaviour. So it should be taken as the default value of join_mode.

my_common_words_both.txt
is

input string - "fox is brown"
common_grams_both_analyser - "fox_is", "is_brown"
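
The three modes described above can be modelled in a few lines of Python. This is only a sketch of the requested behaviour, not of the existing Lucene filter: join_mode is the proposed (not yet existing) option, the function name join_common_grams is hypothetical, and leaving a common word as a plain unigram when it has no neighbour on the chosen side is an assumption. The model expects lowercased, whitespace-split tokens and never emits a joined token on its own. It reproduces the three examples in this issue and may differ from the real filter on other inputs.

```python
def join_common_grams(tokens, common_words, join_mode="both"):
    """Model the proposed join_mode option (hypothetical, not a real setting)."""
    if join_mode not in ("left", "right", "both"):
        raise ValueError(f"unknown join_mode: {join_mode}")

    # Start positions of every bigram (i, i + 1) that should be produced.
    bigram_starts = set()
    for i, token in enumerate(tokens):
        if token not in common_words:
            continue
        if join_mode in ("left", "both") and i > 0:
            bigram_starts.add(i - 1)  # common word joins the token on its left
        if join_mode in ("right", "both") and i + 1 < len(tokens):
            bigram_starts.add(i)      # common word joins the token on its right

    # Every position that takes part in a bigram loses its unigram.
    joined = set()
    for start in bigram_starts:
        joined.update((start, start + 1))

    output = []
    for i, token in enumerate(tokens):
        if i in bigram_starts:
            output.append(f"{token}_{tokens[i + 1]}")
        elif i not in joined:
            output.append(token)
    return output


print(join_common_grams("samsung 64 gb gold".split(), {"gb"}, "left"))
print(join_common_grams("salt rs 200".split(), {"rs"}, "right"))
print(join_common_grams("fox is brown".split(), {"is"}, "both"))
```

The three calls print the token lists given in the examples above.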

Provide logs (if relevant):


elasticmachine · 2018-12-18T14:10:12Z

Pinging @elastic/es-search

cbuescher · 2018-12-19T11:38:35Z

@Aezo I had a short discussion about this request with another team member and we were wondering why the current behaviour doesn't work for you. For example in the "Samsung 64 GB Gold" case, this would only create one more token ("gb gold") which should be quite rare and shouldn't really result in any loss in precision. The way we currently see this feature request is that it would add some complexity without much benefit. If you could explain your pain points with the current way the filter works, this might change our understanding of the problem.

Aezo · 2018-12-24T08:46:41Z

@cbuescher It doesn't work for me because, with query_mode enabled, I don't get a token for gold.

If the search query is "Samsung Galaxy A6 (Gold)", the tokens generated would be samsung, galaxy, a6, gold. Since I have no gold token in my document, I'm losing the "gold" part of the search query. (I have "Samsung Galaxy A6 64 GB (Gold)" in my document, generating tokens as samsung, galaxy, a6, 64_gb, gb_gold).

This is a problem because, let's say I have two phones in my documents, one "Samsung Galaxy A6 64 GB (Red)" and second "Samsung Galaxy A6 64 GB (Gold)", I would like to show the Gold one on top.

Yes, disabling query_mode would start generating a gold token, but that would also start generating a gb token, which isn't desirable.

cbuescher · 2018-12-31T09:34:33Z

> If search query is "Samsung Galaxy A6 (Gold)", I'm losing the "gold" information of the search query. Because, in this case, the tokens generated would be - samsung, galaxy, a6, gold.

Just to clarify, why would you be using a different search time analyzer than the index time analyzer here?

Aezo · 2019-01-02T09:58:40Z

> Just to clarify, why would you be using a different search time analyzer than the index time analyzer here?

What made you conclude that I'm using a different search analyzer and index analyzer? I'll be using the same analyzer at both index time and search time.

romseygeek · 2019-01-02T10:35:03Z

> (I have "Samsung Galaxy A6 64 GB (Gold)" in my document, generating tokens as samsung, galaxy, a6, 64_gb, gb_gold)

This should also generate gold as a token when indexing? So at query time, Samsung Galaxy A6 Gold would match with a higher score than Samsung Galaxy A6 red

cbuescher · 2019-01-02T10:51:22Z

> What made you conclude that I'm using a different search analyzer

Sorry for that, I misread your comment and got confused by the fact that you stated you are losing the "gold" token. As @romseygeek mentioned, using the current common_grams filter should keep the "gold" token in the indexed documents, and if you use the same analyzer at index and search time there should be matches. Compare the output of:

PUT /common_grams_example
{
    "settings": {
        "analysis": {
            "analyzer" : {
                "grams": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "grams"]
                }
            },
            "filter": {
                "grams": {
                    "type": "common_grams",
                    "common_words": ["gb", "is", "rs"]
                }
            }
        }
    },
    "mappings": {
      "properties": {
        "body" : {
          "type": "text",
          "analyzer": "grams"
        }
      }
    }
}

GET /common_grams_example/_analyze
{
  "analyzer": "grams",
  "text": "Samsung Galaxy A6 64 GB (Gold)"
}

GET /common_grams_example/_analyze
{
  "analyzer": "grams",
  "text": "Samsung Galaxy A6 (Gold)"
}

Both outputs should contain the "gold" token. The 64_gb and gb_gold tokens are extra tokens in the indexed document and shouldn't prevent the query from matching or the document containing "gold" from scoring higher:

PUT /common_grams_example/_doc/1
{
  "body" : "Samsung Galaxy A6 64 GB (Gold)"
}

PUT /common_grams_example/_doc/2
{
  "body" : "Samsung Galaxy A6 64 GB (Red)"
}

POST /common_grams_example/_search
{
  "query": {
    "match": {
      "body": "Samsung Galaxy A6 (Gold)"
    }
  }
}

This returns the "Samsung Galaxy A6 64 GB (Gold)" document first.
Unless I'm missing something, the current filter should solve your use case. What would the two alternative options add?

Aezo · 2019-01-02T11:12:53Z

@cbuescher I'm sorry for not being clear. I've edited my original comment to remove that confusion.

@romseygeek @cbuescher
I've mentioned before that -

> Yes, disabling query_mode would start generating a gold token, but that would also start generating a gb token, which isn't desirable.

cbuescher · 2019-01-02T11:30:30Z

Under which circumstances is the gb token not desirable? If it is a common token, its impact on the score should be very low (considering its low idf), so it shouldn't dominate scoring.

Aezo · 2019-01-02T11:45:24Z

When you don't want a document to match based solely on the gb token. For example, if someone searches for "32 GB" and the doc contains "64 GB", and you don't want the "64 GB" doc to match, you wouldn't want to create a gb token.

Another choice would be to use shingles, but that's very limiting.

romseygeek · 2019-01-02T12:13:56Z

If someone searches for "32 GB", it will pass through the search analyzer (and hence use query_mode) and so the generated token will be 32_gb

Aezo · 2019-01-02T12:30:32Z

In that case, I'll have to use query_mode=false at index time and use query_mode=true at search time.

That means, if my doc contains "Samsung Galaxy A6 64 GB (Gold)", tokens would be samsung, galaxy, a6, 64, 64_gb, gb, gb_gold, gold.

So if someone searches for "Nintendo 64", the above doc will also get matched, which isn't right.

"64" and "GB" make sense only together, you wouldn't want to create separate tokens for them even at index time.
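
The index-time token list above follows from how the filter behaves with query_mode=false as described in this thread: every unigram is kept, and a bigram is added whenever a token or the token after it is a common word. The small Python model below is only a sketch under that assumption. The function name is hypothetical, and it is not the Lucene implementation (positions and offsets are ignored). It shows why the "64" unigram ends up in the index.

```python
def common_grams_index_mode(tokens, common_words, separator="_"):
    """Model of common_grams with query_mode=false (illustrative only)."""
    output = []
    for i, token in enumerate(tokens):
        output.append(token)  # the unigram is always kept
        if i + 1 < len(tokens):
            following = tokens[i + 1]
            # A bigram is added when either side of the pair is a common word.
            if token in common_words or following in common_words:
                output.append(f"{token}{separator}{following}")
    return output


doc_tokens = common_grams_index_mode(
    ["samsung", "galaxy", "a6", "64", "gb", "gold"], {"gb"}
)
print(doc_tokens)

# "nintendo 64" contains no common word, so the query keeps the plain "64" token,
# which is also present in the indexed document.
print("64" in doc_tokens)
```

The second print shows the overlap that makes a "Nintendo 64" query match the Samsung document.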

jimczi · 2019-01-02T13:32:22Z

> Another choice would be to use shingles, but that's very limiting.

You can also use a pattern_replace to transform any <number> GB into <number>-GB ?
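
The substitution suggested here can be tried out with a plain regular expression. The Python sketch below only illustrates the pattern. In Elasticsearch the same idea would presumably go into a pattern_replace char filter, and the exact pattern and the helper name join_number_and_unit are assumptions, not something taken from this thread.

```python
import re

# Assumed pattern: a run of digits, whitespace, then the unit "gb" (any case).
NUMBER_UNIT = re.compile(r"(\d+)\s+(gb)\b", re.IGNORECASE)


def join_number_and_unit(text):
    """Rewrite '<number> GB' as '<number>-GB' so the pair stays together."""
    return NUMBER_UNIT.sub(r"\1-\2", text)


print(join_number_and_unit("Samsung Galaxy A6 64 GB (Gold)"))
print(join_number_and_unit("Nintendo 64"))
```

Whether "64-GB" then survives as a single token depends on the tokenizer that follows. The whitespace tokenizer used in the original configuration would keep it whole.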

@romseygeek @cbuescher after reading the documentation of common_grams and the code underneath, I wonder why we don't mention phrase queries at all in the docs. It seems to me that the only purpose of the common grams filter is to speed up phrase queries that contain common terms. Usage outside of phrase queries is not expected (at least that's what I understand from the javadoc), so in its current form I don't see why it would be useful to use this filter (with any value of query_mode) in a query that is not a phrase.

Aezo · 2019-01-02T13:44:49Z

@jimczi

> You can also use a pattern_replace to transform any <number> GB into <number>-GB ?

Yeah, so there are 2 choices. But only advantage of common_grams over pattern replace is the ability to give word list via a file, if later I need to add more words.

Aezo · 2019-01-03T09:07:25Z

So are we developing this feature?

jimczi · 2019-01-03T10:13:31Z

The goal of the common_grams filter is to speed up phrase queries that contain common terms. Your feature request has a different purpose, so I don't think we should add more options to this filter. We also have an open issue to rethink the integration of common_grams, and the current thinking is that we should remove this filter entirely and make it available only as an option in the index_phrases of the text field.

> "64" and "GB" make sense only together, you wouldn't want to create separate tokens for them even at index time.

I think this is the gist of what you're trying to achieve and deserves a specific filter/solution. pattern_replace is a good start and gives you more precision than the left, right join.

> But only advantage of common_grams over pattern replace is the ability to give word list via a file, if later I need to add more words.

You don't need a file to update an analyzer.

javanna · 2024-06-19T10:01:45Z

This has been open for quite a while with no activity, and hasn't had a lot of interest. For now I'm going to close this as something we aren't planning on implementing. We can re-open it later if needed.

jkakavas added the >enhancement and :Search Relevance/Analysis labels Dec 18, 2018

cbuescher added the feedback_needed label Dec 19, 2018

jimczi removed the feedback_needed label Jan 3, 2019

rjernst added the Team:Search label May 4, 2020

javanna closed this as not planned Jun 19, 2024

javanna added the Team:Search Relevance label and removed the Team:Search label Jul 12, 2024
