Add time series support to compute engine #105397

Open
6 of 14 tasks
martijnvg opened this issue Feb 12, 2024 · 2 comments
Assignees
Labels
:Analytics/Compute Engine Analytics in ES|QL :StorageEngine/TSDB You know, for Metrics Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:StorageEngine

martijnvg commented Feb 12, 2024

This is the meta issue that tracks the work to be done on the compute engine in order to power time series support. For now at least, this doesn't include the language changes to ES|QL. The time series compute engine components should only be activated by enabling specific query pragmas, until they are more stable and the ES|QL language is ready to adopt them.

General overview

(Image: an overview of how time series aggregation can work in the compute engine, assuming no time series crosses a backing index boundary.)

The idea is that a new source operator will emit all matching documents in time series order (_tsid ascending, @timestamp descending). Documents are sorted in that order at the segment level, but not at the shard level. A page will additionally include tsid and timestamp blocks. Documents of the same time series should be contained in the block. A new time series grouping operator will make use of the sorted nature of the pages that the source operator emits and group by tsid, or by tsid and timestamp interval. The output of this operator can be used by other operators such as the HashAggregationOperator.
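To make the ordering concrete, here is a minimal sketch in Python (not the engine's Java, and not the actual operator code). It assumes documents are plain `(tsid, timestamp, value)` tuples and that each segment is already in time series order. The function names `merge_segments` and `group_by_tsid_and_bucket` are hypothetical. The sketch merges segment-level streams into one shard-level stream, then groups by tsid and time bucket by comparing only adjacent documents, which is what the sorted order makes possible.

```python
import heapq
from itertools import groupby

# Assumption for illustration: a document is a (tsid, timestamp, value) tuple.
def time_series_order(doc):
    tsid, timestamp, _ = doc
    # _tsid ascending, @timestamp descending
    return (tsid, -timestamp)

def merge_segments(segments):
    """K-way merge of per-segment sorted documents into one sorted stream."""
    return heapq.merge(*segments, key=time_series_order)

def group_by_tsid_and_bucket(docs, interval):
    """Group a sorted stream by (tsid, bucket start) without a hash table."""
    def key(doc):
        tsid, timestamp, _ = doc
        return (tsid, timestamp - timestamp % interval)
    for group_key, run in groupby(docs, key=key):
        yield group_key, [value for _, _, value in run]

# Two segments of the same shard, each sorted on its own.
segment_a = [("host-a", 130, 3.0), ("host-a", 70, 1.0), ("host-b", 90, 5.0)]
segment_b = [("host-a", 100, 2.0), ("host-b", 120, 6.0), ("host-b", 30, 4.0)]

merged = list(merge_segments([segment_a, segment_b]))
groups = list(group_by_tsid_and_bucket(merged, interval=60))
```

Because every series arrives as one contiguous run, the grouping step only needs to notice when the key changes.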

Sometimes not all samples of a time series are in the same shard. This can happen when a query targets multiple backing indices of a tsdb data stream. In this case we need to postpone the grouping of the affected time series in the new time series grouping operator: it needs to group these time series on the coordinating node (when the aggregation mode is final in AggregateExec). Initially we will build a time series grouping operator that assumes that time series are always scattered across multiple backing indices and thus performs the grouping when the aggregation mode is final. In follow-ups, we can then improve the operator to detect when time series don't cross backing index boundaries. In that case the grouping can be performed locally, when the aggregation mode is partial.
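As a rough illustration of why a series spread over several backing indices has to be finished on the coordinating node, the Python sketch below uses a deliberately simplified model. Each shard produces an intermediate `(sum, count)` state per tsid, and a final step merges those states. The names `partial_aggregate` and `final_aggregate` are hypothetical and only mirror the partial/final aggregation modes mentioned above; they are not compute engine APIs.

```python
from collections import defaultdict

def partial_aggregate(shard_docs):
    """Data node step: intermediate (sum, count) state per tsid for one shard."""
    state = defaultdict(lambda: [0.0, 0])
    for tsid, _timestamp, value in shard_docs:
        state[tsid][0] += value
        state[tsid][1] += 1
    return {tsid: tuple(s) for tsid, s in state.items()}

def final_aggregate(partials):
    """Coordinator step: merge the states of each tsid, then compute the average."""
    merged = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for tsid, (total, count) in partial.items():
            merged[tsid][0] += total
            merged[tsid][1] += count
    return {tsid: total / count for tsid, (total, count) in merged.items()}

# "host-a" has samples in both backing indices, so neither shard alone
# sees the whole series.
older_index = [("host-a", 20, 1.0), ("host-a", 10, 3.0), ("host-b", 15, 10.0)]
newer_index = [("host-a", 40, 5.0)]

averages = final_aggregate(
    [partial_aggregate(older_index), partial_aggregate(newer_index)]
)
```

If a series such as "host-b" is known to live in a single backing index, its result is already complete after the partial step, which is the optimization the follow-up work aims for.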

Initially we will only allow filtering on dimension fields; more specifically, this applies to the filters that get pushed down to the time series source operator. If filters on labels or metrics get pushed down to the source operator, we run the risk of breaking the ordered samples of a time series apart.
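A small Python sketch of that risk, using made-up data and a hypothetical `adjacent_deltas` helper: a dimension filter keeps or drops whole series, while a metric filter removes individual samples, so anything computed from neighbouring samples changes.

```python
# Assumption for illustration: samples are
# (tsid, dimensions, timestamp, value) tuples in time series order.
samples = [
    ("s1", {"host": "a"}, 30, 12.0),
    ("s1", {"host": "a"}, 20, 4.0),
    ("s1", {"host": "a"}, 10, 9.0),
    ("s2", {"host": "b"}, 30, 7.0),
    ("s2", {"host": "b"}, 10, 5.0),
]

def adjacent_deltas(docs):
    """Differences between consecutive samples of the same series (newest first)."""
    deltas = []
    for newer, older in zip(docs, docs[1:]):
        if newer[0] == older[0]:
            deltas.append((newer[0], newer[3] - older[3]))
    return deltas

# Dimension filter: series s1 survives with all of its samples.
by_dimension = [d for d in samples if d[1]["host"] == "a"]

# Metric filter: the sample of s1 at timestamp 20 is dropped, so the
# remaining neighbours are no longer truly adjacent in the series.
by_metric = [d for d in samples if d[3] > 5.0]
```

With the dimension filter, the deltas of s1 are the same as on the unfiltered data; with the metric filter, s1 yields a single delta computed across the gap.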

Tasks

  • Add a time series source operator that gets activated when a time series pragma is enabled. (Add time series source operator. #105398)
  • Implement time series grouping. #106411
  • Update the grouping operator to perform grouping locally whenever the query allows for it. For example, when a @timestamp filter in the WHERE clause doesn't cross the boundary of a backing index, or when a timestamp interval group is contained within a backing index.
  • Update Aggregator and GroupingAggregator interfaces to accept sorted pages/blocks. #106414
  • Enhance the time series grouping operator to also group by time series and time interval. A typical use case would group by time series and time interval; this is the case when the BUCKET syntax is used.
  • Integrate the time series operators in the proposed TSTATS syntax.
  • Develop an alternative parallelization strategy for tsdb indices in the time series source operator.
  • Add es|ql rate aggregate function #106415
  • Add support for a sparse index to easily navigate the documents of a time series (Sparse index for tsdb #95701). This is required for determining the last value of a metric and skipping to the last value of the next time series, and for other functionality like interpolation and geofencing. Additionally, a query may be too selective and mask documents that are valid metrics of a time series. A sparse index would allow us to access those metrics even in that case.
  • Add support for the aggregate_metric_double field type in es|ql in order to support downsampling. (Add support for aggregate_metric_double field in es|ql #110649)
  • Add support for _doc_count field in es|ql in order to support downsampling.
  • Query planning should understand the index.time_series.start_time and index.time_series.end_time index settings, so that backing indices that can never match the ES|QL query are excluded from query execution. (This is based on the WHERE filter on the @timestamp field.)
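
The last task, pruning backing indices by their time range, could look roughly like the Python sketch below. It assumes the start time is inclusive and the end time exclusive, and that the @timestamp filter has already been reduced to a half-open range `[query_from, query_to)`. The `BackingIndex` class, the `can_match` function, and the index names are hypothetical and only illustrate the overlap check.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class BackingIndex:
    name: str
    start_time: datetime  # index.time_series.start_time (assumed inclusive)
    end_time: datetime    # index.time_series.end_time (assumed exclusive)

def can_match(index, query_from, query_to):
    """True if [query_from, query_to) overlaps the time range of the index."""
    return index.start_time < query_to and query_from < index.end_time

def day(d, hour=0):
    return datetime(2024, 2, d, hour, tzinfo=timezone.utc)

# Hypothetical backing indices, one per day.
indices = [
    BackingIndex("metrics-000001", day(10), day(11)),
    BackingIndex("metrics-000002", day(11), day(12)),
    BackingIndex("metrics-000003", day(12), day(13)),
]

# WHERE @timestamp >= 2024-02-11T06:00Z AND @timestamp < 2024-02-12T00:00Z
matching = [i for i in indices if can_match(i, day(11, 6), day(12))]
```

Only the second index overlaps the filter here, so the other two would not need to take part in query execution.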

Optional:

@martijnvg martijnvg added :StorageEngine/TSDB You know, for Metrics :Analytics/Compute Engine Analytics in ES|QL labels Feb 12, 2024
@martijnvg martijnvg self-assigned this Feb 12, 2024
@elasticsearchmachine elasticsearchmachine added Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:StorageEngine labels Feb 12, 2024
@elasticsearchmachine

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine

Pinging @elastic/es-storage-engine (Team:StorageEngine)

martijnvg added a commit that referenced this issue Mar 7, 2024
This change adds an experimental time series source operator that gets enabled when the `time_series` query pragma is set. When enabled, the source operator emits documents in time series order, meaning sorted by tsid ascending and timestamp descending. Other, yet to be introduced, operators can make use of the sorted order for optimizations or computations that would otherwise not be feasible.

Example usage:

```
POST /_query?format=txt
{
  "query": "FROM cpu_tsbs | LIMIT 3",
  "pragma": {
    "time_series": true
  }
}
```

Note that this change on its own doesn't add any real functionality other than the sort order in which data gets emitted. This change is part of a series of many changes that would eventually add time series query support to ES|QL. There are many things still to be done, like adding a time series grouping operator that makes use of the sorted nature of the pages that this source operator emits, adding parallelization support, adding support for time series functions like `rate`, and much more.

Relates #105397
@siposea siposea assigned dnhatn and unassigned martijnvg Jun 24, 2024