
ESQL: top_list aggregation #109386

Merged

Conversation

Contributor

@ivancea commented Jun 5, 2024

Added the top_list(<field>, <limit>, <order>) aggregation, which collects the top N values per bucket.
Works with the same types as MAX/MIN.

  • Added the aggregation function
  • Added a template to generate the aggregators
  • Added a template to generate the <Type>BucketedSort implementations per type
    • This structure is based on the BucketedSort used in the original aggregations, modified to better fit the ESQL ecosystem (Block-based, no docs, ...)

Also added a guide on how to create aggregations. Fixes #109213.

Member

@nik9000 left a comment

Looking more at BucketedSort, you'd have to make some changes to make it work. It's very aggs-y, with methods like forLeaf that don't make sense for us. I'm not entirely sure how to change it, especially given that Block is in ESQL and BucketedSort isn't. Maybe extend it? The primitive-holding nature is good and useful, though, and it's got nice goodies like lazy heapification.
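For context, here is a minimal, self-contained sketch of the kind of per-bucket top-N collector being discussed. The names are hypothetical and this is not the actual BucketedSort or the generated ESQL code; it only illustrates the bounded-heap-per-bucket idea:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Simplified per-bucket top-N collector for long values (hypothetical sketch).
final class LongTopNPerBucket {
    private final int limit;
    private final boolean ascending;
    // One bounded heap per bucket; the heap root is the value evicted first.
    private final List<PriorityQueue<Long>> buckets = new ArrayList<>();

    LongTopNPerBucket(int limit, boolean ascending) {
        this.limit = limit;
        this.ascending = ascending;
    }

    void collect(int bucket, long value) {
        while (buckets.size() <= bucket) {
            // "asc" keeps the smallest N values, so evict from the largest end, and vice versa.
            buckets.add(new PriorityQueue<>(ascending ? Comparator.reverseOrder() : Comparator.naturalOrder()));
        }
        PriorityQueue<Long> heap = buckets.get(bucket);
        heap.offer(value);
        if (heap.size() > limit) {
            heap.poll(); // drop the value that no longer belongs in the top N
        }
    }

    // Returns the collected values for a bucket, sorted in the requested order.
    long[] topFor(int bucket) {
        long[] result = buckets.get(bucket).stream().mapToLong(Long::longValue).toArray();
        Arrays.sort(result);
        if (ascending == false) {
            for (int i = 0, j = result.length - 1; i < j; i++, j--) {
                long tmp = result[i];
                result[i] = result[j];
                result[j] = tmp;
            }
        }
        return result;
    }
}

The real implementation works on Blocks, is generated per type, and (as mentioned) can heapify lazily, but the underlying idea is the same.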

@ivancea requested a review from nik9000 on June 7, 2024 15:58
@Param(
name = "field",
type = { "double", "integer", "long" },
description = "The field to collect the top values for."
Contributor

Drop the dot at the end of the messages, please. The description of the functions looks like this: "The field to collect the top values for.,The maximum number of values to collect.,The order to calculate the top values. Either asc or desc."

Contributor Author

From what I saw, most functions (if not all; I didn't check everything) follow this convention.
The "The field to collect the top values for.,The maximum number of values to collect.,The order to..." part is just the expected result of the CSV meta test, which expects a list of values.

Contributor

That's fair, but the message looks strange with that dot followed by a comma. Please change it!
With time, maybe we'll change the rest of them.
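For illustration, the requested change to the annotation quoted above only drops the trailing period in the description string (a sketch of the same parameter; nothing else changes):

@Param(
    name = "field",
    type = { "double", "integer", "long" },
    description = "The field to collect the top values for"
)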

}
if (type == DataType.DOUBLE) {
return new TopValuesListDoubleAggregatorFunctionSupplier(inputChannels, limitValue(), orderValue());
}
Contributor

Maybe a GH issue would help here, so we don't forget about this; it's also something other folks can relate to.

Source source,
@Param(
name = "field",
type = { "double", "integer", "long" },
Contributor

I see tests that also use a date field type... should this be added here as well?

var order = orderRawValue();

if (limit <= 0) {
return new TypeResolution(format(null, "Limit must be greater than 0. Got {}", limit));
Contributor

@astefan Jun 18, 2024

All these messages can be improved by also showing which expression is faulty (by using sourceText(), i.e. the original text representation of the expression in the query); see examples in other functions.

Suggested change
return new TypeResolution(format(null, "Limit must be greater than 0. Got {}", limit));
return new TypeResolution(format(null, "Limit must be greater than 0 in [{}], found [{}]", sourceText(), limit));

}

if (order.equalsIgnoreCase(ORDER_ASC) == false && order.equalsIgnoreCase(ORDER_DESC) == false) {
return new TypeResolution(format(null, "Invalid order value. Expected [{}, {}] but got {}", ORDER_ASC, ORDER_DESC, order));
Contributor

Suggested change
return new TypeResolution(format(null, "Invalid order value. Expected [{}, {}] but got {}", ORDER_ASC, ORDER_DESC, order));
return new TypeResolution(format(null, "Invalid order value in [{}], expected [{}, {}] but got [{}]", sourceText(), ORDER_ASC, ORDER_DESC, order));

@ivancea changed the title from "ESQL: top_values_list aggregation" to "ESQL: top_list aggregation" on Jun 18, 2024
@@ -0,0 +1,100 @@
topList
Contributor

You need to add some more complex queries here.
For one, I don't see any stats top_list by field queries. Have a look at stats.csv-spec and take inspiration from there. A quick list to include:

  • stats top_list by field
  • stats top_list by field1, field2
  • stats x=top_list | sort x asc/desc
  • row values | stats ....
  • have top_list act on dynamically created values (not something coming from Lucene or from row). For example, from employees | STATS a = TOP_VALUES_LIST(salary_change, 3, "desc"), d = TOP_VALUES_LIST(salary_change, 3, "desc") by gender | sort a desc, d asc | stats ad = top_values_list(a, 9, "desc"), da = top_values_list(d, 9, "asc") | eval dad = mv_sort(da, "desc") | keep ad, dad

Member

Yeah, these are all good things. These tests are quite cheap to run, so I'm fully behind writing a bunch of them.

Contributor Author

Added some extra cases, like the ones you mentioned. I'm never sure how much to test, as at some point it feels like testing the engine itself.

Contributor

My suggestions come from the query language side of things. In my view, your csv-spec tests are great but pretty basic. I am looking for combinations of commands and functions, nested functions, repeated commands, and so on: something that can bring out issues with the planning, analysis, and optimization steps that might require some special attention from a new stats aggregation. My advice going forward is to look for such more complex queries in the already existing csv-spec tests; there are some good ones in there.

@ivancea requested review from astefan and nik9000 on June 18, 2024 13:53
Member

@nik9000 left a comment

I left a question about a test change. I think it should stay as it is now but I don't recall precisely what's up here. It's worth tracking down why we need the change.

@@ -198,7 +197,6 @@ public final void testNullIntermediateFinal() {
* return other sorts of results.
*/
protected void assertOutputFromEmpty(Block b) {
assertThat(b.elementType(), equalTo(ElementType.NULL));
Member

What's up with this change? I don't recall exactly what this line's doing, but it's probably worth calling out why we're zapping it.

Contributor Author

"Empty" means a a block with a null. So, there are 2 ways to create a block with a null: A (constant) Null block, or a block. Both support having that null there.

For this aggregation (And I believe for some others too), building a null block means short-circuiting based on some check. That check may have a cost. In the case of this aggregation, it was either keeping a flag, or "deducing" that an empty array (In BucketedSort) meant no buckets, ergo no values. So some extra logics here.

Context aside: I removed it because it looked over-fitting, and as it's called on aggregation completion. I checked the difference between a constant null block and a long block, and yeah, the long blocks allocates an extra array. So I wonder if that's the reason for this check.

I wanted to comment on this, but maybe I forgot about it. Thanks for pointing it out!
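To illustrate the difference being discussed, here is a minimal, self-contained sketch with hypothetical classes (not the actual compute engine Block API): a constant-null block only needs a position count, while a value block that happens to hold only nulls still allocates a values array plus a null mask.

// Hypothetical sketch: why an all-null result is cheaper as a constant-null block.
final class ConstantNullBlockSketch {
    final int positionCount; // no arrays allocated at all

    ConstantNullBlockSketch(int positionCount) {
        this.positionCount = positionCount;
    }

    boolean isNull(int position) {
        return true; // every position is null by definition
    }
}

final class LongBlockSketch {
    final long[] values;       // allocated even if every position ends up null
    final boolean[] nullsMask; // tracks which positions are null

    LongBlockSketch(int positionCount) {
        this.values = new long[positionCount];
        this.nullsMask = new boolean[positionCount];
    }

    boolean isNull(int position) {
        return nullsMask[position];
    }
}

Under a sketch like this, the aggregation needs some extra check (a flag, or "no buckets collected") to know it can emit the cheaper constant-null form, which is the cost being weighed above.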

Member

Got it. I'm beginning to remember why I had these. I should have left a comment a year ago. If wishes were fishes then... how does the phrase go?

Anyway, constant null blocks have some neat tricks, not so much in the compute engine, but we reuse them. I had been working on some optimizations for when you have an index with a zillion fields that are all empty and get loaded as null blocks. We reuse the same instance for these blocks a bunch, and the deserialization is a bit faster for it. And a little less memory. That's important when there are like 10,000 of the things. Which happens. A lot more than I'd like.

So aggs producing the full null block are less important than fields loading them. But it does feel useful to make the null-only block.

Contributor Author

Ok! Adding the logic then, and re-adding that check

var order = orderRawValue();

if (limit <= 0) {
return new TypeResolution(format(null, "Limit must be greater than 0 in [{}], found [{}]", sourceText(), limit));
Contributor

Can you add tests for this parameter restriction?

Contributor Author

As discussed, added an issue here to add support for automatic aggregation testing, replicating that of scalar functions. Those tests already support type resolution error testing, so this should fit well once that's in place.


if (order.equalsIgnoreCase(ORDER_ASC) == false && order.equalsIgnoreCase(ORDER_DESC) == false) {
return new TypeResolution(
format(null, "Invalid order value in [{}], expected [{}, {}] but got [{}]", sourceText(), ORDER_ASC, ORDER_DESC, order)
Contributor

Also, some tests for this restriction.


var typeResolution = isType(
field(),
dt -> dt == DataType.DATETIME || dt.isNumeric() && dt != DataType.UNSIGNED_LONG,
Contributor

Add some tests for the supported data types.

Contributor

@astefan left a comment

LGTM
Thank you for creating #109917

Member

@nik9000 left a comment

👍

@ivancea added the auto-merge label on Jun 19, 2024
# Conflicts:
#	x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/io/stream/PlanNamedTypes.java
@ivancea added and removed the auto-merge label on Jun 19, 2024
@elasticsearchmachine merged commit 2233349 into elastic:main on Jun 19, 2024
16 checks passed
@ivancea deleted the feat/esql-top-values-aggregation branch on June 19, 2024 14:49
Labels: :Analytics/ES|QL, auto-merge, ES|QL-ui, >feature, Team:Analytics, v8.15.0
Development

Successfully merging this pull request may close these issues: ESQL: Create introduction to writing aggregations