
Optimize reading of metadata for large parquet schema #22451

Merged 6 commits on Jun 25, 2024

Conversation

@raunaqmorarka (Member) commented Jun 20, 2024

Description

Additional context and related issues

Partially addresses #22434

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive, Hudi, Delta, Iceberg
* Improve performance of reading from parquet files with large schemas. ({issue}`22451`)

@findepi (Member) left a comment


LGTM
% comments
% Optimize ParquetTypeUtils#getPathIndex commit, see comment there

@@ -494,7 +494,8 @@ private ColumnChunkMetadata getColumnChunkMetaData(BlockMetadata blockMetaData,
             throws IOException
     {
         for (ColumnChunkMetadata metadata : blockMetaData.getColumns()) {
-            if (metadata.getPath().equals(ColumnPath.get(columnDescriptor.getPath()))) {
+            // Column paths for nested structures have common root, so we compare in reverse to find mismatch sooner
+            if (arrayEqualsReversed(metadata.getPath().toArray(), columnDescriptor.getPath())) {
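For context, a minimal sketch of what the new arrayEqualsReversed helper could look like; its body is not shown in this hunk, so this is an assumption based on the name and the comment above:

// Sketch only: assumed element-wise comparison starting from the last index,
// since nested column paths share a common root and diverge near the leaf.
private static boolean arrayEqualsReversed(String[] left, String[] right)
{
    if (left.length != right.length) {
        return false;
    }
    for (int i = left.length - 1; i >= 0; i--) {
        if (!left[i].equals(right[i])) {
            return false;
        }
    }
    return true;
}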
Member:

ColumnPath.toArray should copy

@raunaqmorarka (Member, Author):

True, but I think there should also be a boolean equals(String[] path) method to allow an equality check without needing an array copy.
I think we need to switch to our own version of that class eventually; it just needs moving some more parquet-mr code into Trino.
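As an illustration of that suggestion, a hypothetical equals(String[] path) overload on a Trino-owned ColumnPath replacement might look like this (no such method exists in parquet-mr's ColumnPath; elements stands in for the class's internal path array):

// Hypothetical sketch, not existing parquet-mr API
public boolean equals(String[] path)
{
    // comparing against the internal array directly avoids the defensive
    // copy that toArray() makes
    return arrayEqualsReversed(elements, path);
}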

@sopel39 dismissed their stale review June 21, 2024 11:23

some comments still

Avoids unbounded and unaccounted memory usage.
Also avoids the overhead associated with concurrent map look-ups.
@raunaqmorarka (Member, Author) commented:

I've simplified the last commit to use a different approach.
The existing logic was complex due to performing case-insensitive matching.
Since both fileSchema and requestedSchema contain lower-case column names,
we can rely on MessageType#containsPath to match paths more efficiently using its internal indexes.
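As a rough illustration of that approach (assuming parquet-mr's MessageType#containsPath(String[]); the helper name isColumnRequested is made up for this sketch):

import org.apache.parquet.schema.MessageType;

// With both schemas already lower-cased, a plain containsPath check against
// the requested schema replaces the case-insensitive string matching.
static boolean isColumnRequested(MessageType requestedSchema, String[] columnPath)
{
    return requestedSchema.containsPath(columnPath);
}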

index = columnIndex;
}
Map<List<String>, ColumnDescriptor> descriptorsByPath = new HashMap<>(columns.size());
for (PrimitiveColumnIO columnIO : columns) {
Member:

I still don't entirely understand it.
Above

List<PrimitiveColumnIO> columns = getColumns(fileSchema, requestedSchema);

is used to get columns against fileSchema. Yet here we match the path against fileSchema again.

I think this definitely deserves a comment

@raunaqmorarka (Member, Author):

Yes, I think even this look-up is unnecessary. I've further simplified the code to build the map directly from getColumns.
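A sketch of what that simplification could look like, assuming parquet-mr's PrimitiveColumnIO#getColumnDescriptor and ColumnIO#getFieldPath (the merged code may differ):

// Build the descriptors map straight from getColumns instead of
// re-resolving each path against fileSchema.
List<PrimitiveColumnIO> columns = getColumns(fileSchema, requestedSchema);
Map<List<String>, ColumnDescriptor> descriptorsByPath = new HashMap<>(columns.size());
for (PrimitiveColumnIO columnIO : columns) {
    descriptorsByPath.put(Arrays.asList(columnIO.getFieldPath()), columnIO.getColumnDescriptor());
}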

Avoid unnecessary conversion to ColumnPath
Compare arrays in reverse to find mismatch quicker
Existing logic was complex due to performing case-insensitive matching.
This was unnecessary because fileSchema and requestedSchema already contain lower-cased names.
Also, since requestedSchema is derived from fileSchema, we can build the descriptors map directly
from the result of getColumns instead of repeating look-ups in fileSchema.