
ESQL: top_list aggregation #109386

Merged

Conversation

Contributor

@ivancea commented Jun 5, 2024

Added the top_list(<field>, <limit>, <order>) aggregation, which collects the top N values per bucket.
Works with the same types as MAX/MIN.

  • Added the aggregation function
  • Added a template to generate the aggregators
  • Added a template to generate the <Type>BucketedSort implementations per type
    • This structure is based on the BucketedSort used in the original aggregations, modified to better fit the ESQL ecosystem (Block-based, no docs, ...)

Also added a guide on how to create aggregations. Fixes #109213.

Member

@nik9000 left a comment

Looking more at BucketedSort, you'd have to make some changes to make it work. It's very aggs-y, with methods like forLeaf that don't make sense for us. I'm not entirely sure how to change it, especially given that Block is in ESQL and BucketedSort isn't. Maybe extend it? The primitive-holding nature is good and useful, though, and it's got nice goodies like lazy heapification.
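For context, here is a minimal, self-contained sketch of the kind of per-bucket top-N collector being discussed. The names are hypothetical and this is not the actual BucketedSort or the generated ESQL code; it only illustrates the bounded-heap-per-bucket idea:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Simplified per-bucket top-N collector for long values (hypothetical sketch).
final class LongTopNPerBucket {
    private final int limit;
    private final boolean ascending;
    // One bounded heap per bucket; the heap root is the value evicted first.
    private final List<PriorityQueue<Long>> buckets = new ArrayList<>();

    LongTopNPerBucket(int limit, boolean ascending) {
        this.limit = limit;
        this.ascending = ascending;
    }

    void collect(int bucket, long value) {
        while (buckets.size() <= bucket) {
            // "asc" keeps the smallest N values, so evict from the largest end, and vice versa.
            buckets.add(new PriorityQueue<>(ascending ? Comparator.reverseOrder() : Comparator.naturalOrder()));
        }
        PriorityQueue<Long> heap = buckets.get(bucket);
        heap.offer(value);
        if (heap.size() > limit) {
            heap.poll(); // drop the value that no longer belongs in the top N
        }
    }

    // Returns the collected values for a bucket, sorted in the requested order.
    long[] topFor(int bucket) {
        long[] result = buckets.get(bucket).stream().mapToLong(Long::longValue).toArray();
        Arrays.sort(result);
        if (ascending == false) {
            for (int i = 0, j = result.length - 1; i < j; i++, j--) {
                long tmp = result[i];
                result[i] = result[j];
                result[j] = tmp;
            }
        }
        return result;
    }
}

The real implementation works on Blocks, is generated per type, and (as mentioned) can heapify lazily, but the underlying idea is the same.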

@ivancea requested a review from nik9000 on June 7, 2024 15:58
@Param(
name = "field",
type = { "double", "integer", "long" },
description = "The field to collect the top values for."
Contributor

Drop the dot at the end of the messages, please. The description of the functions looks like this: "The field to collect the top values for.,The maximum number of values to collect.,The order to calculate the top values. Either asc or desc."

Contributor Author

From what I saw, most functions (if not all; I didn't check everything) follow this convention.
The "The field to collect the top values for.,The maximum number of values to collect.,The order to..." part is just the expected result of the CSV meta test, which expects a list of values.

Contributor

That's fair, but the message looks strange with that dot followed by a comma. Please change it!
With time, maybe we'll change the rest of them.
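For illustration, the requested change to the annotation quoted above only drops the trailing period in the description string (a sketch of the same parameter; nothing else changes):

@Param(
    name = "field",
    type = { "double", "integer", "long" },
    description = "The field to collect the top values for"
)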

}
if (type == DataType.DOUBLE) {
return new TopValuesListDoubleAggregatorFunctionSupplier(inputChannels, limitValue(), orderValue());
}
Contributor

Maybe a GH issue would help here, so we don't forget about this; it's also something other folks can relate to.

Source source,
@Param(
name = "field",
type = { "double", "integer", "long" },
Contributor

I see tests that also use a date field type... should this be added here as well?

var order = orderRawValue();

if (limit <= 0) {
return new TypeResolution(format(null, "Limit must be greater than 0. Got {}", limit));
Contributor

@astefan Jun 18, 2024

All these messages can be improved by also showing which expression is faulty (by using sourceText(), i.e. the original text representation of the expression in the query); see examples in other functions.

Suggested change
return new TypeResolution(format(null, "Limit must be greater than 0. Got {}", limit));
return new TypeResolution(format(null, "Limit must be greater than 0 in [{}], found [{}]", sourceText(), limit));

}

if (order.equalsIgnoreCase(ORDER_ASC) == false && order.equalsIgnoreCase(ORDER_DESC) == false) {
return new TypeResolution(format(null, "Invalid order value. Expected [{}, {}] but got {}", ORDER_ASC, ORDER_DESC, order));
Contributor

Suggested change
return new TypeResolution(format(null, "Invalid order value. Expected [{}, {}] but got {}", ORDER_ASC, ORDER_DESC, order));
return new TypeResolution(format(null, "Invalid order value in [{}], expected [{}, {}] but got [{}]", sourceText(), ORDER_ASC, ORDER_DESC, order));

@ivancea changed the title from "ESQL: top_values_list aggregation" to "ESQL: top_list aggregation" on Jun 18, 2024
@@ -0,0 +1,100 @@
topList
Contributor

You need to add some more complex queries here.
For one, I don't see any stats top_list by field queries. Have a look at stats.csv-spec and take inspiration from there. A quick list to include:

  • stats top_list by field
  • stats top_list by field1, field2
  • stats x=top_list | sort x asc/desc
  • row values | stats ....
  • have top_list act on dynamically created values (not something coming from Lucene or from row). For example, from employees | STATS a = TOP_VALUES_LIST(salary_change, 3, "desc"), d = TOP_VALUES_LIST(salary_change, 3, "desc") by gender | sort a desc, d asc | stats ad = top_values_list(a, 9, "desc"), da = top_values_list(d, 9, "asc") | eval dad = mv_sort(da, "desc") | keep ad, dad

Member

Yeah, these are all good things. These tests are quite cheap to run, so I'm fully behind writing a bunch of them.

Contributor Author

Added some extra cases, like the ones you mentioned. I'm never sure how much to test, as at some point it feels like testing the engine itself.

Contributor

My suggestions come from the query language side of things. In my view, your csv-spec tests are great but pretty basic. I am looking for combinations of commands and functions, nested functions, repeated commands, and so on: something that can bring out issues with the planning, analysis, and optimization steps that might require some special attention from a new stats aggregation. My advice going forward is to look for such more complex queries in the already existing csv-spec tests; there are some good ones in there.

@ivancea requested review from astefan and nik9000 on June 18, 2024 13:53
Member

@nik9000 left a comment

I left a question about a test change. I think it should stay as it is now but I don't recall precisely what's up here. It's worth tracking down why we need the change.

@@ -198,7 +197,6 @@ public final void testNullIntermediateFinal() {
* return other sorts of results.
*/
protected void assertOutputFromEmpty(Block b) {
assertThat(b.elementType(), equalTo(ElementType.NULL));
Member

What's up with this change? I don't recall exactly what this line's doing, but it's probably worth calling out why we're zapping it.

Contributor Author

"Empty" means a a block with a null. So, there are 2 ways to create a block with a null: A (constant) Null block, or a block. Both support having that null there.

For this aggregation (And I believe for some others too), building a null block means short-circuiting based on some check. That check may have a cost. In the case of this aggregation, it was either keeping a flag, or "deducing" that an empty array (In BucketedSort) meant no buckets, ergo no values. So some extra logics here.

Context aside: I removed it because it looked over-fitting, and as it's called on aggregation completion. I checked the difference between a constant null block and a long block, and yeah, the long blocks allocates an extra array. So I wonder if that's the reason for this check.

I wanted to comment on this, but maybe I forgot about it. Thanks for pointing it out!
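To illustrate the difference being discussed, here is a minimal, self-contained sketch with hypothetical classes (not the actual compute engine Block API): a constant-null block only needs a position count, while a value block that happens to hold only nulls still allocates a values array plus a null mask.

// Hypothetical sketch: why an all-null result is cheaper as a constant-null block.
final class ConstantNullBlockSketch {
    final int positionCount; // no arrays allocated at all

    ConstantNullBlockSketch(int positionCount) {
        this.positionCount = positionCount;
    }

    boolean isNull(int position) {
        return true; // every position is null by definition
    }
}

final class LongBlockSketch {
    final long[] values;       // allocated even if every position ends up null
    final boolean[] nullsMask; // tracks which positions are null

    LongBlockSketch(int positionCount) {
        this.values = new long[positionCount];
        this.nullsMask = new boolean[positionCount];
    }

    boolean isNull(int position) {
        return nullsMask[position];
    }
}

Under a sketch like this, the aggregation needs some extra check (a flag, or "no buckets collected") to know it can emit the cheaper constant-null form, which is the cost being weighed above.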

Member

Got it. I'm beginning to remember why I had these. I should have left a comment a year ago. If wishes were fishes then... how does the phrase go?

Anyway, constant null blocks have some neat tricks, not so much in the compute engine, but we reuse them. I had been working on some optimizations for when you have an index with a zillion fields that are all empty and get loaded as null blocks. We reuse the same instance for these blocks a bunch, and the deserialization is a bit faster for it. And a little less memory. That's important when there are like 10,000 of the things. Which happens. A lot more than I'd like.

So aggs producing the full null block are less important than fields loading them. But it does feel useful to make the null-only block.

Contributor Author

Ok! Adding the logic then, and re-adding that check

var order = orderRawValue();

if (limit <= 0) {
return new TypeResolution(format(null, "Limit must be greater than 0 in [{}], found [{}]", sourceText(), limit));
Contributor

Can you add tests for this parameter restriction?

Contributor Author

As discussed, added an issue here to add support for automatic aggregation testing, replicating that of scalar functions. Those tests already support type resolution error testing, so this should fit well once that's in place.


if (order.equalsIgnoreCase(ORDER_ASC) == false && order.equalsIgnoreCase(ORDER_DESC) == false) {
return new TypeResolution(
format(null, "Invalid order value in [{}], expected [{}, {}] but got [{}]", sourceText(), ORDER_ASC, ORDER_DESC, order)
Contributor

Also, some tests for this restriction.


var typeResolution = isType(
field(),
dt -> dt == DataType.DATETIME || dt.isNumeric() && dt != DataType.UNSIGNED_LONG,
Contributor

Add some tests for the supported data types.

Contributor

@astefan left a comment

LGTM
Thank you for creating #109917

Member

@nik9000 left a comment

👍

@ivancea added the auto-merge label on Jun 19, 2024
# Conflicts:
#	x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/io/stream/PlanNamedTypes.java
@ivancea added and removed the auto-merge label on Jun 19, 2024
@elasticsearchmachine merged commit 2233349 into elastic:main on Jun 19, 2024
16 checks passed
@ivancea deleted the feat/esql-top-values-aggregation branch on June 19, 2024 14:49
Labels: :Analytics/ES|QL, auto-merge, ES|QL-ui, >feature, Team:Analytics, v8.15.0
Development

Successfully merging this pull request may close these issues: ESQL: Create introduction to writing aggregations