ngram #110084

Closed

S-Dragon0302 opened this issue Jun 24, 2024 · 7 comments
Labels
:Search Relevance/Analysis, Team:Search Relevance

Comments

@S-Dragon0302

Elasticsearch Version

7.15.1

Installed Plugins

No response

Java Version

bundled

OS Version

mac

Problem Description

PUT /my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "letter_digit_tokenizer": {
          "type": "pattern",
          "pattern": "[^\\p{L}\\p{N}]+"
        }
      },
      "filter": {
        "my_ngram_filter": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 2
        }
      },
      "analyzer": {
        "my_letter_digit_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "letter_digit_tokenizer",
          "filter": [
            "lowercase",
            "my_ngram_filter"
          ]
        }
      }
    }
  }
}

GET /my_index/_analyze
{
  "analyzer": "my_letter_digit_ngram_analyzer",
  "text": "是ྂ不ྂ是ྂ发ྂ现ྂ我ྂ的ྂ字ྂ冒ྂ烟ྂ了ྂ"
}

{
  "tokens" : [ ]
}
The actual result is the empty token list above, while the expected result is:
{
  "tokens": [
    {
      "token": "是不",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "不是",
      "start_offset": 2,
      "end_offset": 5,
      "type": "word",
      "position": 1
    },
    {
      "token": "是发",
      "start_offset": 4,
      "end_offset": 7,
      "type": "word",
      "position": 2
    },
    {
      "token": "发现",
      "start_offset": 6,
      "end_offset": 9,
      "type": "word",
      "position": 3
    },
    {
      "token": "现我",
      "start_offset": 8,
      "end_offset": 11,
      "type": "word",
      "position": 4
    },
    {
      "token": "我的",
      "start_offset": 10,
      "end_offset": 13,
      "type": "word",
      "position": 5
    },
    {
      "token": "的字",
      "start_offset": 12,
      "end_offset": 15,
      "type": "word",
      "position": 6
    },
    {
      "token": "字冒",
      "start_offset": 14,
      "end_offset": 17,
      "type": "word",
      "position": 7
    },
    {
      "token": "冒烟",
      "start_offset": 16,
      "end_offset": 19,
      "type": "word",
      "position": 8
    },
    {
      "token": "烟了",
      "start_offset": 18,
      "end_offset": 21,
      "type": "word",
      "position": 9
    }
  ]
}

Steps to Reproduce

Same PUT /my_index settings, GET /my_index/_analyze request, and results as shown in the Problem Description above.

Logs (if relevant)

No response

@S-Dragon0302 added the >bug and needs:triage labels on Jun 24, 2024
@tvernum added the :Search/Search label and removed the needs:triage label on Jun 25, 2024
@elasticsearchmachine added the Team:Search label on Jun 25, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@cbuescher
Member

@S-Dragon0302 Could you please let us know what problem you are encountering? I'm going to remove the "bug" label for now, as I don't see what's missing. Also keep in mind that if this is a language-specific problem, the language-specific discuss forums (https://discuss.elastic.co/c/in-your-native-tongue/11) might be a good place to ask.

@S-Dragon0302
Author

The segmentation result is incorrect: the analysis produces no tokens at all. There should actually be a result.

@S-Dragon0302
Author

The segmentation result should be this.
{
  "tokens": [
    {
      "token": "是不",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "不是",
      "start_offset": 2,
      "end_offset": 5,
      "type": "word",
      "position": 1
    },
    {
      "token": "是发",
      "start_offset": 4,
      "end_offset": 7,
      "type": "word",
      "position": 2
    },
    {
      "token": "发现",
      "start_offset": 6,
      "end_offset": 9,
      "type": "word",
      "position": 3
    },
    {
      "token": "现我",
      "start_offset": 8,
      "end_offset": 11,
      "type": "word",
      "position": 4
    },
    {
      "token": "我的",
      "start_offset": 10,
      "end_offset": 13,
      "type": "word",
      "position": 5
    },
    {
      "token": "的字",
      "start_offset": 12,
      "end_offset": 15,
      "type": "word",
      "position": 6
    },
    {
      "token": "字冒",
      "start_offset": 14,
      "end_offset": 17,
      "type": "word",
      "position": 7
    },
    {
      "token": "冒烟",
      "start_offset": 16,
      "end_offset": 19,
      "type": "word",
      "position": 8
    },
    {
      "token": "烟了",
      "start_offset": 18,
      "end_offset": 21,
      "type": "word",
      "position": 9
    }
  ]
}

@S-Dragon0302
Author

The actual result is this.
{
"tokens" : [ ]
}

@benwtrent
Member

@S-Dragon0302

For the given text 是ྂ不ྂ是ྂ发ྂ现ྂ我ྂ的ྂ字ྂ冒ྂ烟ྂ了ྂ, the pattern tokenizer without the ngram token filter:

GET /my_index/_analyze
{
  "filter": [
    "lowercase"
  ],
  "tokenizer": {
    "type": "pattern",
    "pattern": "[^\\p{L}\\p{N}]+"
  },
  "text": "是ྂ不ྂ是ྂ发ྂ现ྂ我ྂ的ྂ字ྂ冒ྂ烟ྂ了ྂ"
}

Results in:

{
  "tokens": [
    {
      "token": "是",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "不",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 1
    },
    {
      "token": "是",
      "start_offset": 4,
      "end_offset": 5,
      "type": "word",
      "position": 2
    },
    {
      "token": "发",
      "start_offset": 6,
      "end_offset": 7,
      "type": "word",
      "position": 3
    },
    {
      "token": "现",
      "start_offset": 8,
      "end_offset": 9,
      "type": "word",
      "position": 4
    },
    {
      "token": "我",
      "start_offset": 10,
      "end_offset": 11,
      "type": "word",
      "position": 5
    },
    {
      "token": "的",
      "start_offset": 12,
      "end_offset": 13,
      "type": "word",
      "position": 6
    },
    {
      "token": "字",
      "start_offset": 14,
      "end_offset": 15,
      "type": "word",
      "position": 7
    },
    {
      "token": "冒",
      "start_offset": 16,
      "end_offset": 17,
      "type": "word",
      "position": 8
    },
    {
      "token": "烟",
      "start_offset": 18,
      "end_offset": 19,
      "type": "word",
      "position": 9
    },
    {
      "token": "了",
      "start_offset": 20,
      "end_offset": 21,
      "type": "word",
      "position": 10
    }
  ]
}

None of those tokens is longer than a single character, so the ngram filter, which requires 2-character grams (min_gram: 2), produces no output.
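
The same splitting can be reproduced outside Elasticsearch with plain Java, since the pattern tokenizer splits on whatever the configured Java regular expression matches. The combining mark attached to each character matches neither \p{L} nor \p{N}, so every mark becomes a token boundary. A minimal sketch, where the class name and the shortened sample text are only illustrative:

import java.util.regex.Pattern;

public class PatternSplitDemo {
    public static void main(String[] args) {
        // The same regular expression the pattern tokenizer is configured with:
        // split on any run of characters that are neither letters nor digits.
        Pattern delimiter = Pattern.compile("[^\\p{L}\\p{N}]+");

        // Shortened sample text; the combining mark after each character is
        // neither a letter nor a digit, so it is treated as a delimiter.
        String text = "是ྂ不ྂ是ྂ";

        for (String token : delimiter.split(text)) {
            System.out.println(token); // prints 是, 不, 是 (single-character tokens only)
        }
    }
}

With every token reduced to a single character, my_ngram_filter (min_gram: 2) drops them all, which matches the empty "tokens" : [ ] response above.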

@benwtrent
Member

Closing as expected behavior. An ngram filter that requires 2-character grams produces nothing when every input token is only one character long.

@benwtrent added the :Search Relevance/Analysis label and removed the feedback_needed and :Search/Search labels on Jul 12, 2024
@javanna added the Team:Search Relevance label and removed the Team:Search label on Jul 16, 2024