-
Notifications
You must be signed in to change notification settings - Fork 2.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize reading of metadata for large parquet schema #22451
Conversation
lib/trino-parquet/src/main/java/io/trino/parquet/ParquetTypeUtils.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/ParquetTypeUtils.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
% comments
% Optimize ParquetTypeUtils#getPathIndex
commit, see comment there
lib/trino-parquet/src/main/java/io/trino/parquet/metadata/ColumnChunkProperties.java
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/metadata/ColumnChunkProperties.java
Outdated
Show resolved
Hide resolved
@@ -494,7 +494,8 @@ private ColumnChunkMetadata getColumnChunkMetaData(BlockMetadata blockMetaData, | |||
throws IOException | |||
{ | |||
for (ColumnChunkMetadata metadata : blockMetaData.getColumns()) { | |||
if (metadata.getPath().equals(ColumnPath.get(columnDescriptor.getPath()))) { | |||
// Column paths for nested structures have common root, so we compare in reverse to find mismatch sooner | |||
if (arrayEqualsReversed(metadata.getPath().toArray(), columnDescriptor.getPath())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ColumnPath.toArray
should copy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, but I think there should also be a boolean equals(String[] path)
method to allow equality check without needing array copy.
I think we need to switch to our version of that class eventually, it just needs moving some more parquet-mr code into Trino.
lib/trino-parquet/src/main/java/io/trino/parquet/reader/ParquetReader.java
Outdated
Show resolved
Hide resolved
lib/trino-parquet/src/main/java/io/trino/parquet/metadata/ColumnChunkProperties.java
Show resolved
Hide resolved
Avoids having unbounded and unaccounted memory usage. Also avoids the overheads associated with concurrent map look-up
I've simplified the last commit to use a different approach. |
index = columnIndex; | ||
} | ||
Map<List<String>, ColumnDescriptor> descriptorsByPath = new HashMap<>(columns.size()); | ||
for (PrimitiveColumnIO columnIO : columns) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't entirely understand it.
Above
List<PrimitiveColumnIO> columns = getColumns(fileSchema, requestedSchema);
is used to get columns
against fileSchema
. Yet here we match path
again against fileSchema
too.
I think this definitely deserves a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think even this look-up is unnecessary. I've further simplified the code to build the map directly from getColumns
Avoid unnecessary conversion to ColumnPath Compare arrays in reverse to find mismatch quicker
Existing logic was complex due to performing case-insensitive matching. This was unnecesary because fileSchema and requestedSchema already contain lower cased names. Also, since requestedSchema is derived from fileSchema, we can build descriptors map directly from result of getColumns instead of repeating look-ups in fileSchema.
Description
Additional context and related issues
Partially addresses #22434
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: