Omit noMatchDocs in a bool query #110080

Open
atsushi-matsui opened this issue Jun 24, 2024 · 11 comments · May be fixed by #110079
Labels
>enhancement, :Search/Search, Team:Search

Comments

@atsushi-matsui

Description

Problem

When a bool query lists multiple queries in its must field, a single query that matches no documents makes the whole bool query return zero hits.

As a workaround, you can set zero_terms_query, which the match query provides, to "all". However, this causes another problem: all documents are returned even when none of the queries in the must field match any document.

  • The must clause contains a query for a stop word
    If "the" is removed by the token filter, 0 hits are returned.
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": "Quick"
          }
        },
        {
          "match": {
            "title": "the"
          }
        },
        {
          "match": {
            "title": "Brown"
          }
        },
        {
          "match": {
            "title": "Fox"
          }
        }
      ]
    }
  }
}
  • zero_terms_query is set to all
    Stop words are excluded by the token filter, so we expect zero hits, but all hits are returned
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": {
              "query": "a",
              "zero_terms_query": "all"
            }
          }
        },
        {
          "match": {
            "title": {
              "query": "the",
              "zero_terms_query": "all"
            }
          }
        }
      ]
    }
  }
}

Proposal

For this reason, we would like to add an option called omit_zero_term_query to the bool query, which would ignore queries that do not match any documents.

@atsushi-matsui added the >enhancement and needs:triage labels on Jun 24, 2024
@tvernum added the :Search/Search label and removed the needs:triage label on Jun 25, 2024
@elasticsearchmachine added the Team:Search label on Jun 25, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@benwtrent
Member

Stop words are excluded by the token filter, so we expect zero hits, but all hits are returned

I don't understand this, @atsushi-matsui. Omitting a clause is the same as having that clause match all docs.

In your first example, it seems the following would work fine:

{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": {
              "query": "Quick",
              "zero_terms_query": "all"
            }
          }
        },
        {
          "match": {
            "title": {
              "query": "the",
              "zero_terms_query": "all"
            }
          }
        },
        {
          "match": {
            "title": {
              "query": "Brown",
              "zero_terms_query": "all"
            }
          }
        },
        {
          "match": {
            "title": {
              "query": "Fox",
              "zero_terms_query": "all"
            }
          }
        }
      ]
    }
  }
}

Then in your second example, omitting BOTH clauses (which is what would happen in this case), is the exact same as a match_all query. Consider the query:

{"query": {"bool": {"must": []}}}

That is the exact same as a match_all query.
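
The equivalence described above can be illustrated with a small set-based model. This is only a sketch under simplifying assumptions: it ignores scoring entirely, the document IDs are made up, and `bool_must` is a hypothetical helper, not an Elasticsearch API.

```python
from typing import FrozenSet, List

# Hypothetical index containing three documents, identified by ID.
ALL_DOCS: FrozenSet[int] = frozenset({1, 2, 3})


def bool_must(clause_hits: List[FrozenSet[int]]) -> FrozenSet[int]:
    """Intersect the hit sets of all must clauses.

    With no clauses at all, nothing restricts the result, so every
    document matches (the match_all behaviour described in the thread).
    """
    hits = ALL_DOCS
    for clause in clause_hits:
        hits = hits & clause
    return hits


quick = frozenset({1})   # documents matched by a clause that has real terms
match_all = ALL_DOCS     # what a clause becomes with zero_terms_query "all"

with_match_all = bool_must([quick, match_all])
with_clause_omitted = bool_must([quick])
empty_must = bool_must([])
```

In this model, adding a match_all clause and omitting the clause produce the same hit set, and an empty must list matches every document.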

@atsushi-matsui
Author

atsushi-matsui commented Jun 28, 2024

@benwtrent
Thanks for the reply!!!

Then in your second example, omitting BOTH clauses (which is what would happen in this case), is the exact same as a match_all query. Consider the query:

I understand that the second example is equivalent to match_all, but there are cases where we want to omit the clause, so I'll show you another example.

When building a search system with Elasticsearch in Japan, it is common to set up both a kuromoji analyzer and a 2-gram analyzer.
Here is an example of the settings.

{
  "settings": {
    "analysis": {
      "tokenizer": {
        "kuromoji_tokenizer": {
          "type": "kuromoji_tokenizer",
          "mode": "search"
        },
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 2,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "kuromoji_analyzer": {
          "type": "custom",
          "tokenizer": "kuromoji_tokenizer",
          "filter": [
            "kuromoji_baseform",
            "kuromoji_part_of_speech",
            "cjk_width",
            "stop",
            "kuromoji_stemmer",
            "lowercase"
          ]
        },
        "ngram_analyzer": {
          "type": "custom",
          "tokenizer": "ngram_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text_ja": {
        "type": "text",
        "analyzer": "kuromoji_analyzer"
      },
      "text_cjk": {
        "type": "text",
        "analyzer": "ngram_analyzer"
      }
    }
  }
}

In Japan, users commonly search by entering phrases separated by spaces, so we can build a bool query that treats each space-separated word as a phrase.
When searching for the anime "遊☆戯☆王", users sometimes enter "遊 ☆ 戯 ☆ 王", separated by spaces.
In that case, if we search both text_ja and text_cjk with zero_terms_query set to all, every document is returned, which is not a user-friendly result.

{
    "query": {
      "bool": {
        "must": [
          {
            "multi_match": {
              "query": "遊",
              "fields": ["text_ja", "text_cjk"],
              "type": "phrase",
              "zero_terms_query": "all"
            }
          },
          {
            "multi_match": {
              "query": "☆",
              "fields": ["text_ja", "text_cjk"],
              "type": "phrase",
              "zero_terms_query": "all"
            }
          },
          {
            "multi_match": {
              "query": "戯",
              "fields": ["text_ja", "text_cjk"],
              "type": "phrase",
              "zero_terms_query": "all"
            }
          },
          {
            "multi_match": {
              "query": "☆",
              "fields": ["text_ja", "text_cjk"],
              "type": "phrase",
              "zero_terms_query": "all"
            }
          },
          {
            "multi_match": {
              "query": "王",
              "fields": ["text_ja", "text_cjk"],
              "type": "phrase",
              "zero_terms_query": "all"
            }
          }
        ]
      }
    }
}

If we omit "☆" from the search, we may be able to find "遊☆戯☆王".
Omitting "☆" is equivalent to removing the "☆" clauses and setting zero_terms_query to none, as shown below.

{
    "query": {
      "bool": {
        "must": [
          {
            "multi_match": {
              "query": "遊",
              "fields": ["text_ja", "text_cjk"],
              "type": "phrase",
              "zero_terms_query": "none"
            }
          },
          {
            "multi_match": {
              "query": "戯",
              "fields": ["text_ja", "text_cjk"],
              "type": "phrase",
              "zero_terms_query": "none"
            }
          },
          {
            "multi_match": {
              "query": "王",
              "fields": ["text_ja", "text_cjk"],
              "type": "phrase",
              "zero_terms_query": "none"
            }
          }
        ]
      }
    }
}

Therefore, I would like the bool query to have an option that omits such clauses.

@atsushi-matsui
Author

The organization I work for is actually facing this problem.
Even if my proposal is not accepted, I would appreciate it if you could let me know if there is another solution!

@benwtrent
Member

@atsushi-matsui I am still not understanding, could you give me a document you would expect to match and one that wouldn't with your most recent example (thus requiring the feature change)?

I am just trying to confirm the behavior as it still isn't clear to me how omitting a clause is any different than making that clause a match_all.

@atsushi-matsui
Author

atsushi-matsui commented Jun 28, 2024

@benwtrent
I'm sorry that the issue is difficult to understand.
I will try to explain it as accurately as I can.

Index the following documents.
If a user looking for "遊☆戯☆王" enters "遊 ☆", the search system should return only the document in Example 2-1.
If zero_terms_query is set to "all" as in Example 1-1, all documents are returned, which is not the desired result.
The likely cause is that text_cjk uses a 2-gram analyzer, so the clause falls back to match_all.
If zero_terms_query is set to "none" as in Example 1-2, there are 0 hits, which is also not the desired result.
The likely cause is that text_cjk produces 0 tokens.
In such a case, the document in Example 2-1 could be retrieved by omitting the "☆" clause, for which the analyzer produces 0 tokens.
In other words, the search would be performed only for "遊" in text_ja, the field where it is valid.

# queries
### Example 1-1
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "遊",
            "fields": ["text_ja", "text_cjk"],
            "type": "phrase",
            "zero_terms_query": "all"
          }
        },
        {
          "multi_match": {
            "query": "☆",
            "fields": ["text_ja", "text_cjk"],
            "type": "phrase",
            "zero_terms_query": "all"
          }
        }
      ]
    }
  }
}

### Example 1-2
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "遊",
            "fields": ["text_ja", "text_cjk"],
            "type": "phrase",
            "zero_terms_query": "none"
          }
        },
        {
          "multi_match": {
            "query": "☆",
            "fields": ["text_ja", "text_cjk"],
            "type": "phrase",
            "zero_terms_query": "none"
          }
        }
      ]
    }
  }
}
# documents
### Example 2-1
{
  "text_ja": "遊☆戯☆王",
  "text_cjk": "遊☆戯☆王",
  "release_date": "2023-01-01",
  "views": 123
}

### Example 2-2
{
  "text_ja": "ドラゴンボール",
  "text_cjk": "ドラゴンボール",
  "release_date": "2023-01-01",
  "views": 123
}

### Example 2-3
{
  "text_ja": "ナルト",
  "text_cjk": "ナルト",
  "release_date": "2023-01-01",
  "views": 123
}
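
Until such an option exists, one possible client-side approximation of the behaviour requested above is to drop fields and clauses that yield no tokens before the request is sent. The sketch below is an assumption-laden illustration, not part of Elasticsearch: `build_must_clauses` and `toy_count_tokens` are hypothetical names, and the token counter is a crude stand-in for the real analyzers (in practice the count could come from the `_analyze` API).

```python
from typing import Callable, Dict, List


def build_must_clauses(
    terms: List[str],
    fields: List[str],
    count_tokens: Callable[[str, str], int],
) -> List[Dict]:
    """Build multi_match phrase clauses, skipping anything with no tokens."""
    clauses = []
    for term in terms:
        # Keep only the fields whose analyzer yields at least one token.
        live_fields = [f for f in fields if count_tokens(f, term) > 0]
        if not live_fields:
            continue  # no field has tokens, so the whole clause is omitted
        clauses.append({
            "multi_match": {
                "query": term,
                "fields": live_fields,
                "type": "phrase",
            }
        })
    return clauses


def toy_count_tokens(field: str, text: str) -> int:
    """Crude, hypothetical approximation of the two analyzers in this thread."""
    if field == "text_cjk":
        # 2-gram tokenizer: n characters give n - 1 bigrams (0 for one char).
        return max(len(text) - 1, 0)
    # text_ja: assume symbols are dropped and each remaining character
    # becomes one token. Real kuromoji segmentation is more complex.
    return sum(1 for ch in text if ch.isalnum())


must_clauses = build_must_clauses(
    ["遊", "☆"], ["text_ja", "text_cjk"], toy_count_tokens
)
query = {"query": {"bool": {"must": must_clauses}}}
```

For the input "遊 ☆", this leaves a single clause that searches "遊" in text_ja only. If every clause is dropped, the must list is empty and the query matches all documents, so the caller still has to decide how to handle that case.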

@atsushi-matsui
Author

atsushi-matsui commented Jun 30, 2024

If you pass "遊 ☆" to query_string as shown below, the search appears to be executed only for "遊".
Although query_string does not expose a zero_terms_query option, the source code suggests that "☆" is omitted because zero_terms_query is set to null.
I would like the bool query to provide a similar option.

{
  "query": {
    "query_string": {
      "query": "遊 ☆",
      "default_operator": "AND",
      "fields": ["text_ja", "text_cjk"], 
      "type": "phrase"
    }
  }
}

@benwtrent
Member

@atsushi-matsui for your docs, what is the mapping configured? including any custom analyzers please.

Thank you for your patience :). Excluding vs. including vs. match_none vs. match_all is tricky to reason about.

@atsushi-matsui
Author

@benwtrent

for your docs, what is the mapping configured? including any custom analyzers please.

These are the settings I used to verify the behavior.

{
  "settings": {
    "analysis": {
      "tokenizer": {
        "kuromoji_tokenizer": {
          "type": "kuromoji_tokenizer",
          "mode": "normal"
        },
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 2
        }
      },
      "analyzer": {
        "kuromoji_analyzer": {
          "type": "custom",
          "tokenizer": "kuromoji_tokenizer",
          "filter": [
            "kuromoji_stemmer",
            "lowercase"
          ]
        },
        "ngram_analyzer": {
          "type": "custom",
          "tokenizer": "ngram_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text_ja": {
        "type": "text",
        "analyzer": "kuromoji_analyzer"
      },
      "text_cjk": {
        "type": "text",
        "analyzer": "ngram_analyzer"
      }
    }
  }
}
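
With min_gram and max_gram both set to 2, as in the settings above, a single-character input cannot produce any n-gram. The following sketch mimics character n-gram generation in plain Python to show this; `char_ngrams` is a hypothetical helper that only approximates the ngram tokenizer (it assumes every character is kept, as when token_chars is not set, and it does not reproduce token positions or offsets).

```python
from typing import List


def char_ngrams(text: str, min_gram: int = 2, max_gram: int = 2) -> List[str]:
    """Return all character n-grams of text with sizes min_gram..max_gram."""
    grams = []
    for size in range(min_gram, max_gram + 1):
        # A text shorter than `size` yields an empty range, hence no grams.
        for start in range(len(text) - size + 1):
            grams.append(text[start:start + size])
    return grams


single_char = char_ngrams("遊")        # one character: no 2-gram possible
symbol_only = char_ngrams("☆")         # same for the symbol
full_title = char_ngrams("遊☆戯☆王")   # five characters: four 2-grams
```

Under these assumptions, both "遊" and "☆" give an empty token list for text_cjk, which is the zero-token case that zero_terms_query controls.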

@atsushi-matsui
Author

I created a verification environment, so please use it if you like.
https://github.com/atsushi-matsui/sample-elastic

@atsushi-matsui changed the title from "Correctly handle all hits and 0 hits in a bool query" to "Omit noMatchDocs in a bool query" on Jul 3, 2024
@atsushi-matsui
Author

Hi, @benwtrent.
I would like to know if there is any progress.

4 participants