Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ES|QL: ReplaceMissingFieldWithNull could lead to ClassCastException #110150

Closed
luigidellaquila opened this issue Jun 25, 2024 · 6 comments · Fixed by #110373
Closed

ES|QL: ReplaceMissingFieldWithNull could lead to ClassCastException #110150

luigidellaquila opened this issue Jun 25, 2024 · 6 comments · Fixed by #110373
Labels
:Analytics/ES|QL AKA ESQL >bug Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)

Comments

@luigidellaquila
Copy link
Contributor

ReplaceMissingFieldWithNull local logical optimization rule replaces expressions with a NULL Literal when a field is not mapped on local node.

This is dangerous because a Literal could not match the type of the expressions that are being replaced (eg. if the command expects a NamedExpression like in this case), and it can trigger a ClassCastException.

With current feature set this optimization is safe, but we could miss the problem when we add new commands.

Notice that this optimization triggers a few other downstream optimizations that can be quite important, see LocalLogicalPlanOptimizerTests.testMissingField*

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@alex-spies
Copy link
Contributor

IMO this is a bug, so I marked this accordingly. This piece of code performs a wild, unchecked cast without any safeguards; this is bound to break again in the future, especially since we plan to add new logical plan nodes (and evolve existing ones).

@astefan
Copy link
Contributor

astefan commented Jul 2, 2024

It is important to have a test case for this. The original fix addressed the problem at theoretical level and based on assumptions, without a test that demonstrates the problem is actually fixed. @luigidellaquila I know this happened in a setup that's not easy to reproduce (multi-node I assume), but please look into a more concret, non-theoretical, and reproduceable scenario. Update this issue when you have more info about this.

@luigidellaquila
Copy link
Contributor Author

@astefan the specific problem with MV_EXPAND was easily reproducible at planning time with a unit test. The test is included in the PR, so we are pretty confident that it's fixed.

This issue is tracking a more general problem: in the base case, this rule is blindly replacing FieldAttributes with Literals, without considering the actual field type declared in the command (ie. without considering that a Literal cannot be cast to FieldAttribute).
In the future, if we add a new command that declares an attribute of any type that is not a superclass of Literal, this attribute contains a FieldAttribute, and that FieldAttribute is not in the mapping of the local node, his rule will eventually trigger a ClassCastException.

With currently supported commands there are no concrete failing scenarios, so the only thing I can do is demonstrate the problem in a unit test with a mock command #110373

@alex-spies
Copy link
Contributor

IMHO the default case of ReplaceMissingFieldWithNull should just not optimize at all; that case is just a blind hope that future logical plan node will accept literals where they normally expect field attributes, when the field is missing from the local node.

Not optimizing per default forces us to actually think about this (annoying, but not incorrect) and adding new logical plan nodes explicitly to the rule. Alternatively, we could extract the info "I'm a logical plan node whose fieldattributes can be replaced by literals, no problem!" into an interface or abstract function, but that may be overkill, too.

@luigidellaquila
Copy link
Contributor Author

ReplaceMissingFieldWithNull should just not optimize at all

An easy way to fix it would be to explicitly list all the commands that we know we want to optimize that way, and as you say, just do nothing by default.
Adding more logic to let commands declare that they can be optimized by this rule seems a bit of a dependency loop, and probably overkill, yeah.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL >bug Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)
Projects
None yet
4 participants