Feature Request: Span Query Based on a Known Token Index #23812
Comments
Term vectors aren't typically used at query time, but usually at fetch time to do things like highlighting. I think accessing term vectors at query time is too slow. That is why span queries use term offset/position/payloads stored in Lucene's inverted index.
I think the
Lucene has a
I don't think that this is possible. The tokens of both fields should be in a single field. Offset/positions in different fields can't be compared with each other. In your use case I think you should normalize
Apologies, I mean we could use offset/position, not term vectors.
Sounds great!
Would I use the
/cc @elastic/es-search-aggs
This has been open for quite a while, and hasn't had a lot of interest. For now I'm going to close this as something we aren't planning on implementing. We can re-open it later if needed.
It would be useful to be able to do a `span_query` relative to a particular token index. For example:
- `foo` within 10 tokens of the start of the document
- `foo` within `X` tokens of the `Y`th token in the document
- `foo` within `X` tokens of the token index stored in field `Z`
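For reference, the first case in the list above looks expressible today with the existing `span_first` query, which restricts a span match to the beginning of a field. Below is a minimal sketch of the request body as a Python dict; the field name `body` and the helper `span_first_query` are hypothetical, and the term must be given in its analyzed form (e.g., lowercased by the standard analyzer). The other two cases have no comparable building block shown here, which is what this request is about.

```python
import json


def span_first_query(field, term, end):
    """Build a search body matching `term` only near the start of `field`.

    `end` is the maximum end position permitted for the match.
    """
    return {
        "query": {
            "span_first": {
                "match": {"span_term": {field: term}},
                "end": end,
            }
        }
    }


# Hypothetical field name "body"; "foo" within the first 10 positions.
body = span_first_query("body", "foo", 10)
print(json.dumps(body, indent=2))
```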
Given that term vectors are already stored by Lucene, this should be relatively straightforward to implement in ES.
Motivation - NLP
Many ES users start with text documents as their base corpus. While ES provides a great deal of functionality for analyzing text documents (e.g., tokenization, stemming, etc.), it does not have a full range of NLP tools that are available on other platforms. Accordingly, when indexing text documents, it's common to process the document using NLP methods, extract features from the document, and then index features as separate fields in the ES index.
Unfortunately, this process makes span queries cumbersome or impossible to use in conjunction with the extracted features. It's not possible to search for documents with feature A within 10 tokens of feature B.
With this feature, these queries would not be simple, but they would be possible. During indexing, we would extract the NLP features and record their location within the document. We would then save the feature and its location in separate fields in the index. We could then run a span query relative to the location stored in the field.
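The indexing step described above could be sketched roughly as follows. This is only an illustration: the whitespace tokenizer, the regular expression, and the field names `body`, `amounts`, `value`, and `position` are all assumptions, and the recorded positions would only be meaningful if they agree with the positions produced by the analyzer configured for the text field.

```python
import re

# Assumed amount format: "$" + number + optional K/M/B suffix, e.g. "$50M".
AMOUNT = re.compile(r"^\$(\d+(?:\.\d+)?)([KMB])?$", re.IGNORECASE)
SCALE = {"K": 1e3, "M": 1e6, "B": 1e9}


def extract_amounts(text):
    """Return each dollar amount with the index of the token it came from."""
    features = []
    for position, token in enumerate(text.split()):
        match = AMOUNT.match(token.strip(".,;:()"))
        if match:
            suffix = (match.group(2) or "").upper()
            value = float(match.group(1)) * SCALE.get(suffix, 1)
            features.append({"value": value, "position": position})
    return features


text = "The merger was valued at $50M in cash."
# Document to index: raw text plus extracted features and their locations.
doc = {"body": text, "amounts": extract_amounts(text)}
```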
A Simple but Practical Use-Case
Suppose your documents are English financial documents and your users want to search for a dollar amount range related to some keyword. For example, they may search for `merger $10M-$100M`. Behind the scenes, ES should be able to find documents that mention dollar amounts between ten and one hundred million dollars within X words of "merger".
Elasticsearch doesn't have a straightforward way of extracting dollar amounts, so they will have to be extracted on the client side and stored in a separate field. Then you would be able to run a range query on that dollar amount field. However, if they are stored in a separate field, you are unable to do span queries with the "merger" keyword. Only with this feature would this query be possible.
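To make the intended semantics concrete, here is a client-side sketch of the check the requested query would perform for `merger $10M-$100M`. It is not an Elasticsearch API; the function `matches`, the whitespace tokenization, and the `value`/`position` feature layout are assumptions used only for illustration.

```python
def matches(tokens, amounts, keyword, low, high, max_distance):
    """True if an amount in [low, high] lies within max_distance tokens of keyword."""
    keyword_positions = [
        i for i, token in enumerate(tokens)
        if token.lower().strip(".,;:()") == keyword
    ]
    return any(
        low <= amount["value"] <= high
        and abs(amount["position"] - k) <= max_distance
        for amount in amounts
        for k in keyword_positions
    )


tokens = "The merger was valued at $50M in cash.".split()
# Amount features as they might have been extracted and stored at index time.
amounts = [{"value": 50e6, "position": 5}]

# "merger" is token 1 and "$50M" is token 5, so the distance is 4 tokens.
found = matches(tokens, amounts, "merger", 10e6, 100e6, max_distance=5)
```

Running this per document on the client does not scale to a large index, which is why the request asks for the position comparison to happen inside the query.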