Feature Request: Span Query Based on a Known Token Index #23812

speedplane · 2017-03-30T06:56:57Z

It would be useful to be able to do a span_query relative to a particular token index.

For example:

- Find documents that mention foo within 10 tokens of the start of the document
- More generally: find documents that mention foo within X tokens of the Yth token in the document
- Even more generally: find documents that mention foo within X tokens of the token index stored in field Z

Given that term vectors are already stored by Lucene, this should be relatively straightforward to implement in ES.

Motivation - NLP

Many ES users start with text documents as their base corpus. While ES provides a great deal of functionality for analyzing text documents (e.g., tokenization and stemming), it lacks the full range of NLP tools available on other platforms. Accordingly, when indexing text documents, it's common to process each document with NLP methods, extract features from it, and then index those features as separate fields in the ES index.

Unfortunately, this process makes span queries cumbersome or impossible to use in conjunction with the extracted features. It's not possible to search for documents with feature A within 10 tokens of feature B.

With this feature, these queries would not be simple, but they would be possible. During indexing, we would extract the NLP features and record their location within the document. We would then save the feature and its location in separate fields in the index. We could then run a span query relative to the location stored in the field.

A Simple but Practical Use-Case

Suppose your documents are English financial documents and your users want to search for a dollar amount range related to some keyword. For example, they may search for merger $10M-$100M. Behind the scenes, ES should be able to find documents that mention dollar amounts between ten and one hundred million dollars within X words of "merger".

Elasticsearch doesn't have a straightforward way of extracting dollar amounts, so they have to be extracted on the client side and stored in a separate field. You can then run a range query on that dollar amount field. However, because the amounts are stored in a separate field, you cannot combine them in a span query with the "merger" keyword. That query would only be possible with the feature requested here.
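To make the client-side step concrete, here is a minimal sketch of extracting dollar amounts together with their token index before indexing. Everything in it is an assumption for illustration: the regular expression, the field names (body, amount_values, amount_positions), and the use of whitespace tokenization, which will not necessarily line up with the positions produced by the analyzer configured in ES.

```python
import re

# Hypothetical pattern: matches tokens such as "$10M", "$2.5B", or "$750".
AMOUNT = re.compile(r"^\$(\d+(?:\.\d+)?)([KMB]?)$")
SCALE = {"": 1, "K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def extract_amounts(text):
    """Return (token_index, value_in_dollars) pairs for whitespace tokens."""
    found = []
    for index, token in enumerate(text.split()):
        # Drop surrounding punctuation so "$50M." is still recognized.
        match = AMOUNT.match(token.strip(".,;:()"))
        if match:
            value = float(match.group(1)) * SCALE[match.group(2)]
            found.append((index, value))
    return found


def build_document(text):
    """Build an index document; all field names here are illustrative."""
    amounts = extract_amounts(text)
    return {
        "body": text,
        "amount_values": [value for _, value in amounts],
        "amount_positions": [index for index, _ in amounts],
    }
```

A range query can already run against amount_values; the open question in this request is how a span query could make use of amount_positions.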


martijnvg · 2017-03-30T08:59:44Z

> Given that term vectors are already stored by Lucene, this should be relatively straightforward to implement in ES.

Term vectors aren't typically used at query time, but usually at fetch time to do things like highlighting. I think accessing term vectors at query time is too slow. That is why span queries use term offset/position/payloads stored in Lucene's inverted index.

> Find documents that mention foo within 10 tokens of the start of the document

I think the span_first query can be used for this.
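For illustration, a span_first request body for this case could be assembled as in the sketch below. The field name body and the helper name span_first_query are assumptions, not something given in this thread.

```python
import json


def span_first_query(field, term, end):
    """Build a span_first request body.

    `end` is the maximum end position allowed for the match, so end=10
    restricts the term to the first 10 token positions of the field.
    """
    return {
        "query": {
            "span_first": {
                "match": {"span_term": {field: term}},
                "end": end,
            }
        }
    }


if __name__ == "__main__":
    # "body" is a hypothetical field name.
    print(json.dumps(span_first_query("body", "foo", 10), indent=2))
```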

> More generally: find documents that mention foo within X tokens of the Yth token in the document

Lucene has a SpanPositionRangeQuery query, which today can't be used from the query DSL. I think we can expose it in the query DSL as a span_range query. That query should be able to match a term within a token range.
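Since span_range does not exist in the query DSL, the sketch below only emulates the intended matching rule in plain Python over a token list. It assumes the rule is "the matched span starts at or after start and ends at or before end", with the end position of a single-term span being its position plus one. The function names are hypothetical, and this is not Lucene's implementation.

```python
def position_range_matches(tokens, term, start, end):
    """Return positions of `term` whose single-token span lies in [start, end).

    A term at position p is treated as the span [p, p + 1), so it matches
    when start <= p and p + 1 <= end.
    """
    return [
        p
        for p, token in enumerate(tokens)
        if token == term and start <= p and p + 1 <= end
    ]


def within_of_index(tokens, term, y, x):
    """Positions of `term` within x tokens of the token at index y."""
    start = max(0, y - x)  # clamp at the start of the document
    end = y + x + 1        # exclusive upper bound
    return position_range_matches(tokens, term, start, end)
```

Under these assumptions, "foo within X tokens of the Yth token" reduces to a position range of [Y - X, Y + X + 1).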

> Even more generally: find documents that mention foo within X tokens of the token index stored in field Z

I don't think that this is possible. The tokens of both fields should be in a single field. Offset/positions in different fields can't be compared with each other. In your use case I think you should normalize $10M-$100M to $ during text analysis (this is ok as the actual range is stored in a different field), so that you then do a span_range query with the 'merger' keyword.

speedplane · 2017-04-07T00:23:46Z

> Term vectors aren't typically used at query time ... That is why span queries use term offset/position/payloads stored in Lucene's inverted index.

Apologies, I meant that we could use offset/position, not term vectors.

> Lucene has a SpanPositionRangeQuery query ... That query should be able to match a term within a token range.

Sounds great!

> In your use case I think you should normalize $10M-$100M to $ during text analysis (this is ok as the actual range is stored in a different field), so that you then do a span_range query with the 'merger' keyword.

Would I use the span_range here? If I normalize all dollar amounts to $ during analysis, I would just be able to use a normal span query, right? This would not be an ideal solution, but would sort of work.
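As a rough sketch of that workaround, a normal span_near query over the normalized token might be built as follows. The field name body, the helper name, and the slop value are assumptions. The sketch also assumes the analyzer actually emits the placeholder token (here "$") at the position of each dollar amount, which depends on the tokenizer in use.

```python
def amount_near_keyword(field, keyword, slop, amount_token="$"):
    """Build a span_near request body: keyword within `slop` positions of
    any normalized dollar-amount token, in either order."""
    return {
        "query": {
            "span_near": {
                "clauses": [
                    {"span_term": {field: keyword}},
                    {"span_term": {field: amount_token}},
                ],
                "slop": slop,
                "in_order": False,
            }
        }
    }


# Hypothetical usage: "merger" within 10 positions of a dollar amount.
query = amount_near_keyword("body", "merger", 10)
```

This only finds documents where some dollar amount is near the keyword. It cannot tie the nearby amount to the range filter on the separate field, which is why it would only sort of work.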

cbuescher · 2018-03-13T16:25:29Z

/cc @elastic/es-search-aggs

javanna · 2024-06-24T18:37:28Z

This has been open for quite a while, and hasn't had a lot of interest. For now I'm going to close this as something we aren't planning on implementing. We can re-open it later if needed.

martijnvg added the :Query DSL and discuss labels on Mar 30, 2017

clintongormley added the help wanted, adoptme, and >enhancement labels and removed the discuss label on Mar 31, 2017

clintongormley added the :Search/Search label and removed the :Query DSL label on Feb 14, 2018

rjernst added the Team:Search label on May 4, 2020

javanna closed this as not planned on Jun 24, 2024
