Common Grams filter should have configuration option #36771

Closed
Aezo opened this issue Dec 18, 2018 · 17 comments
Labels: >enhancement, :Search Relevance/Analysis, Team:Search Relevance

Comments

@Aezo

Aezo commented Dec 18, 2018

Describe the feature: The common grams token filter should have a configuration option that specifies whether each common word is combined with the token to its left, the token to its right, or both. With query_mode set to true, the filter would then remove only the tokens that were joined and leave all other tokens untouched.

The option could be named "join_mode" and configured like this:

PUT /test_index
{
  "index": {
    "analysis": {
      "filter": {
        "common_grams_left_filter": {
          "common_words_path": "my_common_words_left.txt",
          "ignore_case": "true",
          "type": "common_grams",
          "query_mode": "true",
          "join_mode": "left"
        },
        "common_grams_right_filter": {
          "common_words_path": "my_common_words_right.txt",
          "ignore_case": "true",
          "type": "common_grams",
          "query_mode": "true",
          "join_mode": "right"
        },
        "common_grams_both_filter": {
          "common_words_path": "my_common_words_both.txt",
          "ignore_case": "true",
          "type": "common_grams",
          "query_mode": "true",
          "join_mode": "both"
        }
      },
      "analyzer": {
        "common_grams_left_analyser": {
          "filter": [
            "lowercase",
            "common_grams_left_filter"
          ],
          "tokenizer": "whitespace"
        },
        "common_grams_right_analyser": {
          "filter": [
            "lowercase",
            "common_grams_right_filter"
          ],
          "tokenizer": "whitespace"
        },
        "common_grams_both_analyser": {
          "filter": [
            "lowercase",
            "common_grams_both_filter"
          ],
          "tokenizer": "whitespace"
        }
      },
      "char_filter": {
        
      }
    },
    "number_of_shards": "5",
    "number_of_replicas": "0"
  }
}

This is how the three analysers above would work:

join_mode: left

my_common_words_left.txt
gb

input string - "Samsung 64 GB Gold"
common_grams_left_analyser - "samsung", "64_gb", "gold"

join_mode: right

my_common_words_right.txt
rs

input string - "salt Rs 200"
common_grams_right_analyser - "salt", "rs_200"

join_mode: both (default)

This is the current behaviour, so it should be the default value of join_mode.

my_common_words_both.txt
is

input string - "fox is brown"
common_grams_both_analyser - "fox_is", "is_brown"
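
The three modes above can be sketched in a few lines of Python. This is only a toy model of the proposal, not Lucene's implementation: join_mode is the hypothetical option requested here, the function name common_grams_query_mode is made up for illustration, and the sketch assumes query_mode is always on, so any unigram that ends up inside a joined token is dropped.

```python
def common_grams_query_mode(text, common_words, join_mode="both"):
    """Toy model of common_grams with query_mode=true and a hypothetical join_mode."""
    if join_mode not in ("left", "right", "both"):
        raise ValueError(f"unknown join_mode: {join_mode}")
    # Stand-in for the whitespace tokenizer followed by the lowercase filter.
    tokens = text.lower().split()
    common = {w.lower() for w in common_words}

    # Collect the start index i of every bigram (tokens[i], tokens[i + 1]).
    bigram_starts = set()
    for i, tok in enumerate(tokens):
        if tok not in common:
            continue
        if join_mode in ("left", "both") and i > 0:
            bigram_starts.add(i - 1)  # join with the token on the left
        if join_mode in ("right", "both") and i < len(tokens) - 1:
            bigram_starts.add(i)      # join with the token on the right

    out = []
    for i, tok in enumerate(tokens):
        in_bigram = i in bigram_starts or (i - 1) in bigram_starts
        if not in_bigram:
            out.append(tok)  # query_mode keeps only unigrams outside any bigram
        if i in bigram_starts:
            out.append(f"{tok}_{tokens[i + 1]}")
    return out


print(common_grams_query_mode("Samsung 64 GB Gold", ["gb"], "left"))
print(common_grams_query_mode("salt Rs 200", ["rs"], "right"))
print(common_grams_query_mode("fox is brown", ["is"], "both"))
```

Under these assumptions the three calls reproduce the token lists given for the left, right and both examples.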

Provide logs (if relevant):

@jkakavas added the >enhancement and :Search Relevance/Analysis labels on Dec 18, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-search

@cbuescher
Member

@Aezo I had a short discussion about this request with another team member, and we were wondering why the current behaviour doesn't work for you. In the "Samsung 64 GB Gold" case, for example, it would only create one more token ("gb gold"), which should be quite rare and shouldn't really cost any precision. The way we currently see this feature request, it would add some complexity without much benefit. If you could explain your pain points with the way the filter works today, that might change our understanding of the problem.

@Aezo
Author

Aezo commented Dec 24, 2018

@cbuescher It doesn't work for me because, with query_mode enabled, I don't get a token for gold.

If the search query is "Samsung Galaxy A6 (Gold)", the tokens generated would be samsung, galaxy, a6, gold. Since I have no gold token in my document, I'm losing the "gold" part of the search query. (I have "Samsung Galaxy A6 64 GB (Gold)" in my document, which generates the tokens samsung, galaxy, a6, 64_gb, gb_gold.)

This is a problem because, if I have two phones in my documents, one "Samsung Galaxy A6 64 GB (Red)" and the other "Samsung Galaxy A6 64 GB (Gold)", I would like to show the Gold one on top.

Yes, disabling query_mode would start generating a gold token, but that would also start generating a gb token, which isn't desirable.
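
The mismatch described in this comment can be reproduced with a small Python model of the existing filter. It is a simplified sketch rather than Lucene's code: the function name common_grams and the crude tokenization (lowercase, strip parentheses, split on spaces) are assumptions made for illustration, and query_mode is modelled as "drop every unigram that is part of a bigram".

```python
def common_grams(text, common_words, query_mode=False):
    """Simplified model of the existing common_grams filter (not Lucene's code)."""
    # Crude stand-in for tokenizing and lowercasing: strip parentheses, split on spaces.
    tokens = text.lower().replace("(", " ").replace(")", " ").split()
    common = {w.lower() for w in common_words}

    # A bigram is formed for every adjacent pair in which either word is common.
    starts = {i for i in range(len(tokens) - 1)
              if tokens[i] in common or tokens[i + 1] in common}

    out = []
    for i, tok in enumerate(tokens):
        in_bigram = i in starts or (i - 1) in starts
        # Index mode keeps every unigram; query mode drops those inside a bigram.
        if not query_mode or not in_bigram:
            out.append(tok)
        if i in starts:
            out.append(f"{tok}_{tokens[i + 1]}")
    return out


doc = common_grams("Samsung Galaxy A6 64 GB (Gold)", ["gb"], query_mode=True)
query = common_grams("Samsung Galaxy A6 (Gold)", ["gb"], query_mode=True)
print(doc)
print(query)
print("gold" in set(doc) & set(query))
```

With query_mode on for both texts, the document side has no standalone gold token, so the query's gold term has nothing to match. Calling the function with query_mode=False shows the extra 64, gb and gold unigrams that index mode would add.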

@cbuescher
Member

cbuescher commented Dec 31, 2018

> If search query is "Samsung Galaxy A6 (Gold)", I'm losing the "gold" information of the search query. Because, in this case, the tokens generated would be - samsung, galaxy, a6, gold.

Just to clarify, why would you be using a different search time analyzer than the index time analyzer here?

@Aezo
Author

Aezo commented Jan 2, 2019

> Just to clarify, why would you be using a different search time analyzer than the index time analyzer here?

What made you conclude that I'm using a different search analyzer and index analyzer? I'll be using the same analyzer at both index time and search time.

@romseygeek
Contributor

> (I have "Samsung Galaxy A6 64 GB (Gold)" in my document, generating tokens as samsung, galaxy, a6, 64_gb, gb_gold)

This should also generate gold as a token when indexing? So at query time, Samsung Galaxy A6 Gold would match with a higher score than Samsung Galaxy A6 red

@cbuescher
Member

> What made you conclude that I'm using a different search analyzer

Sorry for that, I misread your comment and got confused by your statement that you are losing the "gold" token. As @romseygeek mentioned, the current common_grams filter should keep the "gold" token in the indexed documents, and if you use the same analyzer at index and search time, the query should match. Compare the output of:

PUT /common_grams_example
{
    "settings": {
        "analysis": {
            "analyzer" : {
                "grams": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "grams"]
                }
            },
            "filter": {
                "grams": {
                    "type": "common_grams",
                    "common_words": ["gb", "is", "rs"]
                }
            }
        }
    },
    "mappings": {
      "properties": {
        "body" : {
          "type": "text",
          "analyzer": "grams"
        }
      }
    }
}

GET /common_grams_example/_analyze
{
  "analyzer": "grams",
  "text": "Samsung Galaxy A6 64 GB (Gold)"
}

GET /common_grams_example/_analyze
{
  "analyzer": "grams",
  "text": "Samsung Galaxy A6 (Gold)"
}

Both outputs should contain the "gold" token. The 64_gb and gb_gold tokens are extra tokens in the indexed document; they shouldn't prevent the query from matching or stop the document containing "gold" from scoring higher:

PUT /common_grams_example/_doc/1
{
  "body" : "Samsung Galaxy A6 64 GB (Gold)"
}

PUT /common_grams_example/_doc/2
{
  "body" : "Samsung Galaxy A6 64 GB (Red)"
}

POST /common_grams_example/_search
{
  "query": {
    "match": {
      "body": "Samsung Galaxy A6 (Gold)"
    }
  }
}

This returns the "Samsung Galaxy A6 64 GB (Gold)" document first.
Unless I'm missing something, the current filter should solve your use case. Otherwise, what would the two alternative options add?

@Aezo
Author

Aezo commented Jan 2, 2019

@cbuescher I'm sorry for not being clear. I've edited my original comment to remove that confusion.

@romseygeek @cbuescher
I've mentioned before that -

> Yes, disabling query_mode would start generating a gold token, but that would also start generating a gb token, which isn't desirable.

@cbuescher
Member

Under which circumstances is the gb token not desirable? If it is a common token, its impact on the score should be very low (considering its low idf), so it shouldn't dominate scoring.

@Aezo
Author

Aezo commented Jan 2, 2019

When you don't want a document to match solely on the gb token. For example, if someone searches for "32 GB" and the doc contains "64 GB", you don't want the "64 GB" doc to match, so you wouldn't want to create a gb token.

Another choice would be to use shingles, but that's very limiting.

@romseygeek
Contributor

If someone searches for "32 GB", it will pass through the search analyzer (and hence use query_mode) and so the generated token will be 32_gb

@Aezo
Author

Aezo commented Jan 2, 2019

In that case, I'll have to use query_mode=false at index time and use query_mode=true at search time.

That means, if my doc contains "Samsung Galaxy A6 64 GB (Gold)", tokens would be samsung, galaxy, a6, 64, 64_gb, gb, gb_gold, gold.

So if someone searches for "Nintendo 64", the above doc will also get matched, which isn't right.

"64" and "GB" make sense only together, you wouldn't want to create separate tokens for them even at index time.

@jimczi
Contributor

jimczi commented Jan 2, 2019

> Another choice would be to use shingles, but that's very limiting.

You can also use a pattern_replace to transform any <number> GB into <number>-GB ?

@romseygeek @cbuescher after reading the documentation of the common_grams and the code underneath I wonder why we don't mention phrase queries at all in the docs. It seems to me that the only purpose of the common grams filter is to speed up phrase queries that contain common term(s). The usage outside of phrase queries is not expected (at least that's what I understand from the javadoc) so in its current form I don't see why it could be useful to use this filter (with any value of query_mode) in a query that is not a phrase ?

@Aezo
Author

Aezo commented Jan 2, 2019

@jimczi

> You can also use a pattern_replace to transform any <number> GB into <number>-GB ?

Yeah, so there are 2 choices. But the only advantage of common_grams over pattern_replace is the ability to supply the word list via a file, in case I need to add more words later.

@Aezo
Author

Aezo commented Jan 3, 2019

So are we developing this feature?

@jimczi
Contributor

jimczi commented Jan 3, 2019

The goal of common_grams is to speed up phrase queries that contain common terms. Your feature request has a different purpose, so I don't think we should add more options to this filter. We also have an open issue to rethink the integration of common_grams; the current thinking is that we should remove this filter entirely and make it available only as an option in the index_phrases of the text field.

> "64" and "GB" make sense only together, you wouldn't want to create separate tokens for them even at index time.

I think this is the gist of what you're trying to achieve and deserves a specific filter/solution. pattern_replace is a good start and gives you more precision than the left, right join.
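
The pattern_replace idea can be illustrated with a short Python sketch. Python's re module stands in here for the Java regular expression an Elasticsearch pattern_replace would use; the pattern, the function names and the crude whitespace tokenization are all assumptions made for illustration. The sketch applies the replacement to the raw text before tokenization and assumes a whitespace-style tokenizer, as in the original config; a tokenizer that splits on hyphens may undo the join.

```python
import re

# Hypothetical pattern: a run of digits, whitespace, then the unit "gb".
UNIT_PATTERN = re.compile(r"(\d+)\s+(gb)\b", flags=re.IGNORECASE)


def join_number_and_unit(text):
    """Rewrite '<number> GB' as '<number>-GB' before tokenization."""
    return UNIT_PATTERN.sub(r"\1-\2", text)


def analyze(text):
    # Stand-in for: replacement -> whitespace tokenizer -> lowercase filter.
    cleaned = join_number_and_unit(text).replace("(", " ").replace(")", " ")
    return cleaned.lower().split()


print(analyze("Samsung Galaxy A6 64 GB (Gold)"))
print(analyze("Nintendo 64"))
print(analyze("32 GB"))
```

In this model the document keeps its gold token, while neither a standalone 64 nor a standalone gb token is produced for it, so a "Nintendo 64" or "32 GB" query would have no term in common with the "64 GB" document.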

> But only advantage of common_grams over pattern replace is the ability to give word list via a file, if later I need to add more words.

You don't need a file to update an analyzer.

@rjernst added the Team:Search label on May 4, 2020
@javanna
Member

javanna commented Jun 19, 2024

This has been open for quite a while with no activity, and hasn't had a lot of interest. For now I'm going to close this as something we aren't planning on implementing. We can re-open it later if needed.

@javanna closed this as not planned on Jun 19, 2024
@javanna added the Team:Search Relevance label and removed the Team:Search label on Jul 12, 2024