Feature Request: Span Query Based on a Known Token Index #23812

Closed
speedplane opened this issue Mar 30, 2017 · 4 comments
Labels
>enhancement help wanted adoptme :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team

Comments

@speedplane
Contributor

speedplane commented Mar 30, 2017

It would be useful to be able to do a span_query relative to a particular token index.

For example:

  • Find documents that mention foo within 10 tokens of the start of the document
  • More generally: find documents that mention foo within X tokens of the Yth token in the document
  • Even more generally: find documents that mention foo within X tokens of the token index stored in field Z

Given that term vectors are already stored by Lucene, this should be relatively straightforward to implement in ES.

Motivation - NLP

Many ES users start with text documents as their base corpus. While ES provides a great deal of functionality for analyzing text documents (e.g., tokenization and stemming), it does not have the full range of NLP tools available on other platforms. Accordingly, when indexing text documents, it's common to process the document using NLP methods, extract features from the document, and then index those features as separate fields in the ES index.

Unfortunately, this process makes span queries cumbersome or impossible to use in conjunction with the extracted features. It's not possible to search for documents with feature A within 10 tokens of feature B.

With this feature, these queries would not be simple, but they would be possible. During indexing, we would extract the NLP features and record their location within the document. We would then save the feature and its location in separate fields in the index. We could then run a span query relative to the location stored in the field.
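The indexing step described above could look roughly like the following. This is only a sketch: the regular expression, the whitespace tokenization, and the field names `body`, `dollar_amounts`, and `dollar_amount_positions` are all assumptions made for illustration, and the token indices are only useful if they line up with the positions produced by the analyzer actually used for the field.

```python
import re

# Hypothetical pattern for amounts such as "$10M" or "$2.5B".
# Real NLP feature extraction would be considerably richer.
AMOUNT_RE = re.compile(r"^\$(\d+(?:\.\d+)?)([KMB])?$", re.IGNORECASE)
MULTIPLIERS = {"": 1, "K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def extract_amounts(text):
    """Return (amount_in_dollars, token_index) pairs using whitespace tokens."""
    features = []
    for index, token in enumerate(text.split()):
        match = AMOUNT_RE.match(token.strip(".,;:()"))
        if match:
            suffix = (match.group(2) or "").upper()
            features.append((float(match.group(1)) * MULTIPLIERS[suffix], index))
    return features


def build_document(text):
    """Build the document to index: raw text plus parallel feature fields."""
    features = extract_amounts(text)
    return {
        "body": text,                                                # assumed field name
        "dollar_amounts": [value for value, _ in features],          # assumed field name
        "dollar_amount_positions": [index for _, index in features], # assumed field name
    }


doc = build_document("The merger was valued at $50M and closed in May.")
```

The missing piece, and the subject of this request, is a span query that can consume `dollar_amount_positions` at query time.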

A Simple but Practical Use-Case

Suppose your documents are English financial documents and your users want to search for a dollar amount range related to some keyword. For example, they may search for merger $10M-$100M. Behind the scenes, ES should be able to find documents that mention dollar amounts between ten and one hundred million dollars within X words of "merger".

Elasticsearch doesn't have a straightforward way of extracting dollar amounts, so they will have to be extracted on the client side and stored in a separate field. Then you would be able to run a range query on that dollar amount field. However, if they are stored in a separate field, you are unable to do span queries with the "merger" keyword. This query would be possible only with the requested feature.

@martijnvg
Member

> Given that term vectors are already stored by Lucene, this should be relatively straightforward to implement in ES.

Term vectors aren't typically used at query time, but usually at fetch time to do things like highlighting. I think accessing term vectors at query time is too slow. That is why span queries use term offset/position/payloads stored in Lucene's inverted index.

> Find documents that mention foo within 10 tokens of the start of the document

I think the span_first query can be used for this.
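For reference, a `span_first` request body for the first case might be built like this. The field name `body` is an assumption; `end` is the maximum end position a match is allowed to have.

```python
import json


def span_first_query(field, term, end):
    """Match `term` only if it occurs within the first `end` positions of `field`."""
    return {
        "query": {
            "span_first": {
                "match": {"span_term": {field: term}},
                "end": end,
            }
        }
    }


# "foo" within the first 10 tokens of the (assumed) "body" field.
query = span_first_query("body", "foo", 10)
print(json.dumps(query, indent=2))
```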

> More generally: find documents that mention foo within X tokens of the Yth token in the document

Lucene has a SpanPositionRangeQuery, which today can't be used from the query DSL. I think we can expose it in the query DSL as a span_range query. That query should be able to match a term within a token range.
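To make the intended semantics concrete, here is a small plain-Python model of a position-range match. It is not Lucene code and not an existing query DSL clause; it assumes a half-open range `[start, end)` over 0-based token positions, and the helper names are made up for this illustration.

```python
def term_positions(tokens, term):
    """Return the 0-based positions at which `term` occurs."""
    return [i for i, token in enumerate(tokens) if token == term]


def matches_position_range(tokens, term, start, end):
    """True if `term` occurs at some position p with start <= p < end."""
    return any(start <= p < end for p in term_positions(tokens, term))


def within_x_of_y(tokens, term, x, y):
    """Express 'term within x tokens of the y-th token' as a position range."""
    return matches_position_range(tokens, term, max(0, y - x), y + x + 1)


tokens = "a b c foo d e f g".split()
near = within_x_of_y(tokens, "foo", 2, 5)  # foo is at position 3, range is [3, 8)
far = within_x_of_y(tokens, "foo", 1, 6)   # range is [5, 8), so position 3 is outside
```

With a fixed Y supplied in the query, the second bullet of the request reduces to a range check of this kind; the third bullet differs because Y would have to be read per document from another field.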

> Even more generally: find documents that mention foo within X tokens of the token index stored in field Z

I don't think that this is possible. The tokens of both fields would need to be in a single field, because offsets/positions in different fields can't be compared with each other. In your use case I think you should normalize $10M-$100M to $ during text analysis (this is ok as the actual range is stored in a different field), so that you can then do a span_range query with the 'merger' keyword.

@speedplane
Contributor Author

> Term vectors aren't typically used at query time ... That is why span queries use term offset/position/payloads stored in Lucene's inverted index.

Apologies, I meant we could use offset/position, not term vectors.

> Lucene has a SpanPositionRangeQuery query ... That query should be able to match a term within a token range.

Sounds great!

> In your use case I think you should normalize $10M-$100M to $ during text analysis (this is ok as the actual range is stored in a different field), so that you then do a span_range query with the 'merger' keyword.

Would I use the span_range here? If I normalize all dollar amounts to $ during analysis, I would just be able to use a normal span query, right? This would not be an ideal solution, but would sort of work.
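A rough sketch of that workaround, using an ordinary `span_near` query plus a range filter. The regular expression and the field names `body_normalized` and `dollar_amounts` are assumptions, and the normalization is done client-side here for brevity; in practice it would need to happen in an analysis chain that keeps `$` as a token (many tokenizers drop punctuation).

```python
import re

# Hypothetical pattern for amounts such as "$10M" or "$2.5B".
AMOUNT_TOKEN_RE = re.compile(r"\$\d+(?:\.\d+)?[KMB]?", re.IGNORECASE)


def normalize_amounts(text):
    """Replace every dollar amount with a bare "$" placeholder token."""
    return AMOUNT_TOKEN_RE.sub("$", text)


def merger_near_amount_query(slop, low, high):
    """'merger' within `slop` positions of any amount, plus a range filter."""
    return {
        "query": {
            "bool": {
                "must": [
                    {
                        "span_near": {
                            "clauses": [
                                {"span_term": {"body_normalized": "merger"}},
                                {"span_term": {"body_normalized": "$"}},
                            ],
                            "slop": slop,
                            "in_order": False,
                        }
                    }
                ],
                "filter": [
                    {"range": {"dollar_amounts": {"gte": low, "lte": high}}}
                ],
            }
        }
    }


normalized = normalize_amounts("The merger was valued at $50M and closed in May.")
query = merger_near_amount_query(10, 10_000_000, 100_000_000)
```

Note that the proximity clause and the range filter are evaluated independently: a document with an out-of-range amount near "merger" and an in-range amount elsewhere would still match, which is presumably why this only "sort of" works.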

@clintongormley clintongormley added :Search/Search Search-related issues that do not fall into other categories and removed :Query DSL labels Feb 14, 2018
@cbuescher
Member

/cc @elastic/es-search-aggs

@rjernst rjernst added the Team:Search Meta label for search team label May 4, 2020
@javanna
Member

javanna commented Jun 24, 2024

This has been open for quite a while, and hasn't had a lot of interest. For now I'm going to close this as something we aren't planning on implementing. We can re-open it later if needed.

@javanna javanna closed this as not planned Jun 24, 2024

6 participants