-
Notifications
You must be signed in to change notification settings - Fork 24.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ESQL: top_list aggregation #109386
ESQL: top_list aggregation #109386
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking more at BucketedSort
, you'd have to make some changes to make it work. It's very aggs-y with methods like forLeaf
which doesn't make sense for us. I'm not entirely sure how to change it, especially given that Block
is in ESQL
and BucketedSort
isn't. Maybe extend it? The primitive holding nature is good and useful though. And it's got nice goodies like lazy heapification.
...compute/src/main/java/org/elasticsearch/compute/aggregation/TopValuesListLongAggregator.java
Outdated
Show resolved
Hide resolved
...compute/src/main/java/org/elasticsearch/compute/aggregation/TopValuesListLongAggregator.java
Outdated
Show resolved
Hide resolved
...compute/src/main/java/org/elasticsearch/compute/aggregation/TopValuesListLongAggregator.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/search/sort/RawBucketedSort.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/search/sort/RawBucketedSort.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/search/sort/RawBucketedSort.java
Outdated
Show resolved
Hide resolved
...compute/src/main/java/org/elasticsearch/compute/aggregation/TopValuesListLongAggregator.java
Outdated
Show resolved
Hide resolved
...compute/src/main/java/org/elasticsearch/compute/aggregation/TopValuesListLongAggregator.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/search/sort/RawBucketedSort.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/search/sort/RawBucketedSort.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/search/sort/RawBucketedSort.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/search/sort/RawBucketedSort.java
Outdated
Show resolved
Hide resolved
# Conflicts: # x-pack/plugin/esql/qa/testFixtures/src/main/resources/meta.csv-spec
…match with Max() or Min()
@Param( | ||
name = "field", | ||
type = { "double", "integer", "long" }, | ||
description = "The field to collect the top values for." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drop the dot at the end of the messages please. The description of the functions looks like this The field to collect the top values for.,The maximum number of values to collect.,The order to calculate the top values. Either
ascor
desc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I saw, most functions (If not all, didn't check everything) follow this convention.
The The field to collect the top values for.,The maximum number of values to collect.,The order to...
is just the expected result of the CSV meta test, which expects a list of values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fair, but the error message looks strange with that dot followed by comma. Please, change it!
With time maybe we'll change the rest of them.
} | ||
if (type == DataType.DOUBLE) { | ||
return new TopValuesListDoubleAggregatorFunctionSupplier(inputChannels, limitValue(), orderValue()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a GH issue helps here to not forget about this and, also, something other folks can relate to.
Source source, | ||
@Param( | ||
name = "field", | ||
type = { "double", "integer", "long" }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see tests that also use a date
field type... should this be added here as well?
var order = orderRawValue(); | ||
|
||
if (limit <= 0) { | ||
return new TypeResolution(format(null, "Limit must be greater than 0. Got {}", limit)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All these messages can be improved by also showing which expression is faulty (by using sourceText()
ie the original text representation of the expression in the query), see examples in other functions.
return new TypeResolution(format(null, "Limit must be greater than 0. Got {}", limit)); | |
return new TypeResolution(format(null, "Limit must be greater than 0 in [{}], found [{}]", sourceText(), limit)); |
} | ||
|
||
if (order.equalsIgnoreCase(ORDER_ASC) == false && order.equalsIgnoreCase(ORDER_DESC) == false) { | ||
return new TypeResolution(format(null, "Invalid order value. Expected [{}, {}] but got {}", ORDER_ASC, ORDER_DESC, order)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return new TypeResolution(format(null, "Invalid order value. Expected [{}, {}] but got {}", ORDER_ASC, ORDER_DESC, order)); | |
return new TypeResolution(format(null, "Invalid order value in [{}], expected [{}, {}] but got [{}[", sourceText(), ORDER_ASC, ORDER_DESC, order)); |
@@ -0,0 +1,100 @@ | |||
topList |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to add some more complex queries here.
For one, I don't see any stats top_list by field
queries. Have a look at stats.csv-spec
and take inspiration from there. A quick list to include:
stats top_list by field
stats top_list by field1, field2
stats x=top_list | sort x asc/desc
row values | stats ....
- have top_list act on dynamically created values (not something coming from Lucene or from
row
). For example,from employees | STATS a = TOP_VALUES_LIST(salary_change, 3, \"desc\"), d = TOP_VALUES_LIST(salary_change, 3, \"desc\") by gender | sort a desc, d asc | stats ad = top_values_list(a, 9, \"desc\"), da = top_values_list(d, 9, \"asc\") | eval dad = mv_sort(da, \"desc\") | keep ad, dad
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, these are all good things. These tests are quite cheap to run so I'm fully behind writing a bunch of htem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some extra cases, like the ones you commented. I'm never sure about how much to test, as at some point it feels like testing the engine itself
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My suggestions come from the query languages side of things. In my view, your csv-spec tests are great but pretty basic. I am looking for combination of commands and functions, nested functions and repeating commands and so on. Something that can bring out issues with planning, analysis, optimization steps that might require some special attention from a new stats
command. My advice going forward is to seek such more complex queries in already existent csv-spec tests, there are some good ones in there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a question about a test change. I think it should stay as it is now but I don't recall precisely what's up here. It's worth tracking down why we need the change.
@@ -198,7 +197,6 @@ public final void testNullIntermediateFinal() { | |||
* return other sorts of results. | |||
*/ | |||
protected void assertOutputFromEmpty(Block b) { | |||
assertThat(b.elementType(), equalTo(ElementType.NULL)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's up with this change? I don't recall exactly what this line's doing, but it's probably worth calling out why we're zapping it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Empty" means a a block with a null. So, there are 2 ways to create a block with a null: A (constant) Null block, or a block. Both support having that null there.
For this aggregation (And I believe for some others too), building a null block means short-circuiting based on some check. That check may have a cost. In the case of this aggregation, it was either keeping a flag, or "deducing" that an empty array (In BucketedSort) meant no buckets, ergo no values. So some extra logics here.
Context aside: I removed it because it looked over-fitting, and as it's called on aggregation completion. I checked the difference between a constant null block and a long block, and yeah, the long blocks allocates an extra array. So I wonder if that's the reason for this check.
I wanted to comment this, but maybe I forgot about it. Thanks for pointing it out!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. I'm beginning to remember why I had these. I should have left a comment a year ago. If wishes were fishes then... how does the phrase go?
Anyway, constant null blocks have some neat tricks, not so much in the compute engine, but we reuse them. I had been working on some optimizations when you have an index with a zillion fields but they are empty - and get loaded as null blocks. We reuse the same instance for these blocks a bunch and the deserialization is a bit faster for it. And a little less memory. That's important when there's like 10000 of the things. Which happens. A lot more than I'd like.
So aggs producing the full null block are less important than fields loading them. But it does feel useful to make the null
-only block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok! Adding the logic then, and re-adding that check
var order = orderRawValue(); | ||
|
||
if (limit <= 0) { | ||
return new TypeResolution(format(null, "Limit must be greater than 0 in [{}], found [{}]", sourceText(), limit)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add tests for this parameter restriction?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed, added an issue here to add support for automatic aggregation testing, replicating that of scalar functions. They already support type resolution error testing, so it should fit well after having it.
|
||
if (order.equalsIgnoreCase(ORDER_ASC) == false && order.equalsIgnoreCase(ORDER_DESC) == false) { | ||
return new TypeResolution( | ||
format(null, "Invalid order value in [{}], expected [{}, {}] but got [{}]", sourceText(), ORDER_ASC, ORDER_DESC, order) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, some tests for this restriction.
|
||
var typeResolution = isType( | ||
field(), | ||
dt -> dt == DataType.DATETIME || dt.isNumeric() && dt != DataType.UNSIGNED_LONG, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add some tests for the supported data types.
# Conflicts: # x-pack/plugin/esql/compute/build.gradle
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thank you for creating #109917
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
# Conflicts: # x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/io/stream/PlanNamedTypes.java
Added
top_list(<field>, <limit>, <order>)
aggregation, that collect top N values per bucket.Works with the same types as MAX/MIN.
<Type>BucketedSort
implementations per-typeBucketedSort
structure used on the original aggregations. It was modified to better fit the ESQL ecosystem (Blocks based, no docs...)Also added a guide to create aggregations. Fixes #109213