-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support pushing dereferences within lambdas into table scan #21957
Support pushing dereferences within lambdas into table scan #21957
Conversation
// Picked from Presto | ||
public class Subfield | ||
{ | ||
public interface PathElement |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: could be sealed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in rev2
// As a result, only support limited cases now which symbol reference has to be uniquely referenced | ||
ImmutableList.Builder<Expression> expressionsBuilder = ImmutableList.<Expression>builder() | ||
.addAll(project.getAssignments().getExpressions()); | ||
List<Expression> expressions = expressionsBuilder.build(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could just be: List<Expression> expressions = ImmutableList.copyOf(project.getAssignments().getExpressions());
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in rev2
|
||
partialTranslations = partialTranslations.entrySet().stream().filter(entry -> { | ||
ArrayFieldDereference arrayFieldDereference = (ArrayFieldDereference) entry.getValue(); | ||
return arrayFieldDereference.getTarget() instanceof Variable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Could be return arrayFieldDereference.getTarget() instanceof Variable variable && symbolReferenceNamesCount.get(variable.getTarget().getName()) == 1;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in rev2
combinedPrunedTypes = combinedPrunedTypes.union(prunedType); | ||
} | ||
} | ||
return Optional.ofNullable(combinedPrunedTypes) // Should never be null since subfields is non-empty. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If combinedPrunedTypes
should never be null, then you can just use Optional.of(combinedPrunedTypes)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in rev2
…d filters, with bug fixes, style fixes and unit tests
@martint please take a look when you get a chance, even any high level comment would be helpful. |
This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is not required right ?
booleanProperty( | ||
ENABLE_PUSH_SUBSCRIPT_LAMBDA_INTO_SCAN, | ||
"Enable Push Subscript Lambda Into Scan feature", | ||
false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have a config object to toggle the same ?
{ | ||
super(type); | ||
this.target = requireNonNull(target, "target is null"); | ||
this.elementFieldDereferences = requireNonNull(elementFieldDereference, "elementFieldDereference is null"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ImmutableList.copyOf(requireNonNull(elementFieldDereference, "elementFieldDereference is null"))
Can we add a verification with the Type as well ? i.e type instanceOf ArrayType
@Override | ||
public List<? extends ConnectorExpression> getChildren() | ||
{ | ||
return singletonList(target); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we pass the elementFieldDereferences here ? Bcoz they are not a constant/literal so if are running any logic on its children then it should be applied for the electFieldDereferences
@Override | ||
public String toString() | ||
{ | ||
return format("(%s).#%s", target, elementFieldDereferences); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't the pattern be like #index_1#index_2#...
import static java.util.Objects.requireNonNull; | ||
|
||
// Class to represent subfield. Direct referenced from Presto | ||
public class Subfield |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Subfield as in row type or ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need a dedicated abstraction in the SPI to represent this. This information is available (and should be encoded) in the structure of ConnectorExpressions passed to the connector APIs. We can have utilities in the plugin toolkit module to extract the necessary info to make it easier for connector implementers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is already represented as ArrayFieldDereference in this PR, then in this case we will still need to extract it to a form that readers can use regardless of table formats?
@@ -675,6 +678,32 @@ protected Optional<ConnectorExpression> visitFunctionCall(FunctionCall node, Voi | |||
return translateLike(node); | |||
} | |||
|
|||
// Very narrow case that only tries to extract a particular type of lambda expression | |||
// TODO: Expand the scope | |||
if ("transform".equals(functionName)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it work only for transform
- Don't we have to extend if for other functions like subscript
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
transform for now
* | ||
* TODO: Remove lambda expression after subfields are pushed down | ||
*/ | ||
public class PushSubscriptLambdaThroughFilterIntoTableScan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we could push the projections through filter and the PushProjectionsIntoTableScan
could take care of them right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not necessary, there could be project->filter->scan at the very beginning of planning for very simple queries
Thanks @Praveen2112 to take a look at my PR, will definitely revise it based on those, but before that, I really want to get a confirmation on how we are going to represent Subfields across connectors, in this way I can start rewriting all the relevant part with the new representation, currently it works very awkward that if Subfields exists, the respect it, otherwise, respect the dereference indexes and names which is the real limitation in Trino at this moment. Since that cleanup will be large, I do not want to start it before the community agrees with it. |
This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua |
This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua |
Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time. |
Description
This is to extend the enhancement discussed here #3925, and depends/extends on the original PR #4270 that is currently rebasing by @Desmeister
Since the issue and discussion had been idled for years and this kind of optimization could be critical to anyone having highly nested schema and using Unnest, I would like to use this PR to formally restart the discussion on how the community want to eventually support this and if this is on the right direction (I have a working version locally, not this one, that speeds up the query while reducing actual data processed)
From my understanding of the previous discussions, this should be done through below steps:
dereferencing
involved withUnnest
into lambda functions with subscript expressions for each of theUnnests
TableScan
, in this case, this rule will help to pushdown the dereferencing further, while for any connectors that dont support dereferencing, the rule will preserve the Lambda expression to remove columnsPushDownDereferenceThroughUnnest
and many other expression specific rules.PushDownDereferenceThroughUnnest
is not handling any unnest symbols currently, but only replicated symbols. In order to support unnest symbols, I believe at least a new expression has to be created, or subscript expression has to be extended otherwise I dont see an easy way to represent the dereferences so it can be further pushed down through other unnests in anyway. I need more guidance on how this could be done or possible with what we have now, that is why this PR in particular is not handling any complex cases like nested Unnest and only push lambdas down through project and filters in a limited way.dereferencing
intoTableScan
visitFunctionCall
inConnectorExpressionTranslator
to create a new connector expression (can be merged with existingFieldDereference
expression if possible), then passing those into existingapplyProjection
method to let connectors decide how to handle those. For this PR, onlyHiveMetadata
has implementation to handle those, other connectors will simply ignoring them. TheapplyProjection
will create new projections andHiveColumnHandle
for Hive with extendedHiveColumnProjectionInfo
.Structs
.This PR is written in a way to reduce the impacts to the existing features while I can fully validate the performance impact while gathering feedbacks and directions from the community. Therefore implementations are normally wrapped in an
if
instead of fully refactoring the existing methodI believe if this is the right direction, changes can be contributed through below phases
Array<dereferences>
withinHiveColumnProjectionInfo
toSubfields
or anything similar to that and make sure all methods that used to depend onArray<dereferences>
now depend on the new representationapplyProjection
method (or not? It can simply be a non-iterative visitor at the very end like now.)The change has been fully validated except rebasing to the latest Trino release that could have a lot of conflicts due to AST/IR refactoring
Byte scanned decreased from 423B to 359B for the sample query, we've seen large performance improvement in production queries
Additional context and related issues
I would really appreciate any kind of comments or feedbacks as without clear directions, I can't further extend this without risking of throwing everything away later. Any of the component should be easily plug in if we have a clear idea of how we want to do it otherwise.
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(X) Release notes are required, with the following suggested text: