
Optimize reading of metadata for large parquet schema #22451

Merged 6 commits on Jun 25, 2024

Conversation

@raunaqmorarka (Member) commented Jun 20, 2024

Description

Additional context and related issues

Partially addresses #22434

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive, Hudi, Delta, Iceberg
* Improve performance of reading from parquet files with large schemas. ({issue}`22451`)

@findepi (Member) left a comment


LGTM
% comments
% Optimize ParquetTypeUtils#getPathIndex commit, see comment there

@@ -494,7 +494,8 @@ private ColumnChunkMetadata getColumnChunkMetaData(BlockMetadata blockMetaData,
             throws IOException
     {
         for (ColumnChunkMetadata metadata : blockMetaData.getColumns()) {
-            if (metadata.getPath().equals(ColumnPath.get(columnDescriptor.getPath()))) {
+            // Column paths for nested structures have common root, so we compare in reverse to find mismatch sooner
+            if (arrayEqualsReversed(metadata.getPath().toArray(), columnDescriptor.getPath())) {
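For context, a minimal sketch of what the new arrayEqualsReversed helper could look like; its body is not shown in this hunk, so this is an assumption based on the name and the comment above:

// Sketch only: assumed element-wise comparison starting from the last index,
// since nested column paths share a common root and diverge near the leaf.
private static boolean arrayEqualsReversed(String[] left, String[] right)
{
    if (left.length != right.length) {
        return false;
    }
    for (int i = left.length - 1; i >= 0; i--) {
        if (!left[i].equals(right[i])) {
            return false;
        }
    }
    return true;
}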
Member:

ColumnPath.toArray should copy

@raunaqmorarka (Member, Author):

True, but I think there should also be a boolean equals(String[] path) method to allow an equality check without needing an array copy.
I think we need to switch to our own version of that class eventually; it just needs moving some more parquet-mr code into Trino.
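As an illustration of that suggestion, a hypothetical equals(String[] path) overload on a Trino-owned ColumnPath replacement might look like this (no such method exists in parquet-mr's ColumnPath; elements stands in for the class's internal path array):

// Hypothetical sketch, not existing parquet-mr API
public boolean equals(String[] path)
{
    // comparing against the internal array directly avoids the defensive
    // copy that toArray() makes
    return arrayEqualsReversed(elements, path);
}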

@sopel39 dismissed their stale review June 21, 2024 11:23

some comments still

Avoids unbounded and unaccounted memory usage.
Also avoids the overhead associated with concurrent map look-ups.
@raunaqmorarka (Member, Author) commented:

I've simplified the last commit to use a different approach.
The existing logic was complex due to performing case-insensitive matching.
Since both fileSchema and requestedSchema contain lower-case column names,
we can rely on MessageType#containsPath to match paths more efficiently using its internal indexes.
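As a rough illustration of that approach (assuming parquet-mr's MessageType#containsPath(String[]); the helper name isColumnRequested is made up for this sketch):

import org.apache.parquet.schema.MessageType;

// With both schemas already lower-cased, a plain containsPath check against
// the requested schema replaces the case-insensitive string matching.
static boolean isColumnRequested(MessageType requestedSchema, String[] columnPath)
{
    return requestedSchema.containsPath(columnPath);
}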

index = columnIndex;
}
Map<List<String>, ColumnDescriptor> descriptorsByPath = new HashMap<>(columns.size());
for (PrimitiveColumnIO columnIO : columns) {
Member:

I still don't entirely understand it.
Above

List<PrimitiveColumnIO> columns = getColumns(fileSchema, requestedSchema);

is used to get columns against fileSchema. Yet here we match the path against fileSchema again.

I think this definitely deserves a comment

@raunaqmorarka (Member, Author):

Yes, I think even this look-up is unnecessary. I've further simplified the code to build the map directly from getColumns.
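A sketch of what that simplification could look like, assuming parquet-mr's PrimitiveColumnIO#getColumnDescriptor and ColumnIO#getFieldPath (the merged code may differ):

// Build the descriptors map straight from getColumns instead of
// re-resolving each path against fileSchema.
List<PrimitiveColumnIO> columns = getColumns(fileSchema, requestedSchema);
Map<List<String>, ColumnDescriptor> descriptorsByPath = new HashMap<>(columns.size());
for (PrimitiveColumnIO columnIO : columns) {
    descriptorsByPath.put(Arrays.asList(columnIO.getFieldPath()), columnIO.getColumnDescriptor());
}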

Avoid unnecessary conversion to ColumnPath
Compare arrays in reverse to find mismatch quicker
Existing logic was complex due to performing case-insensitive matching.
This was unnecessary because fileSchema and requestedSchema already contain lower-cased names.
Also, since requestedSchema is derived from fileSchema, we can build the descriptors map directly
from the result of getColumns instead of repeating look-ups in fileSchema.