Support partitioning on nested ROW fields in Iceberg #15712

krvikash · 2023-01-13T13:54:41Z

Description

Fixes #15109
Support partitioning on nested ROW fields in Iceberg

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(X) Release notes are required, with the following suggested text:

# Iceberg
* Add support for partitioning on nested ROW fields. ({issue}`15712`)

ebyhr · 2023-01-17T09:35:39Z

@krvikash Could you rebase on upstream to resolve conflicts?

krvikash · 2023-01-17T10:12:03Z

Rebased the PR with latest code.

krvikash · 2023-01-18T18:12:31Z

...roduct-tests/src/main/java/io/trino/tests/product/iceberg/TestIcebergSparkCompatibility.java

 assertThat(onTrino().executeQuery(format(selectByString, trinoTableName)))
 .containsOnly(row1);
+ // TODO


TODO: the Parquet format returns null for the partitioned nested field.

If Spark Iceberg can't handle this kind of use case, please check whether an issue has already been reported for this missing functionality.
We shouldn't introduce TODOs in the code relating to Spark limitations. Trino does not depend on Spark.

Simply add an assertion that the query is failing with an expected message.

I could not find any existing issue for the Parquet problem, where Spark returns null for the partitioned nested field.

krvikash · 2023-01-23T13:48:19Z

Hi @alexjo2144 | @ebyhr | @findepi, when you get time, could you please review this?

krvikash · 2023-01-26T08:18:10Z

Rebased the PR with latest code.

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/ColumnIdentity.java

findinpath · 2023-01-26T15:33:54Z

...roduct-tests/src/main/java/io/trino/tests/product/iceberg/TestIcebergSparkCompatibility.java

 assertThat(onTrino().executeQuery(format(selectByString, trinoTableName)))
 .containsOnly(row1);
+ // TODO


If Spark Iceberg can't handle this kind of use case, please check whether an issue has already been reported for this missing functionality.
We shouldn't introduce TODOs in the code relating to Spark limitations. Trino does not depend on Spark.

Simply add an assertion that the query is failing with an expected message.

findinpath · 2023-01-26T15:34:31Z

...roduct-tests/src/main/java/io/trino/tests/product/iceberg/TestIcebergSparkCompatibility.java

+
+ Row row1 = row("a", new byte[] {15, -15, 2, -16, -2, -1}, 1001, "field1");
+ String select = "SELECT _string, _varbinary, _bigint, _struct._field FROM %s WHERE _string = 'a'";
+ // ORC: Job aborted due to stage failure: Task 0 in stage 40.0 failed 1 times, most recent failure: Lost task 0.0 in stage 40.0 (TID 70) (spark executor driver): java.lang.IndexOutOfBoundsException: Index 1 out of bounds for length 1


Same as for Parquet.

I could not find any existing issue for the Parquet problem, where Spark returns null for the partitioned nested field.

...roduct-tests/src/main/java/io/trino/tests/product/iceberg/TestIcebergSparkCompatibility.java

alexjo2144

Can you also add some predicate pushdown tests? Similar to one of these tests that use isFullyPushedDown https://github.com/trinodb/trino/blob/master/plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/BaseIcebergConnectorTest.java#L1611

alexjo2144 · 2023-01-17T16:59:21Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java

+ int parentSourceId = getParentSourceId(indexParents, field.sourceId());
+ Type sourceType = tableSchema.findType(parentSourceId);
+ if (sourceType.isMapType()) {
+ throw new TrinoException(NOT_SUPPORTED, "Partitioning field [" + field.name() + "] cannot be contained in a map");
+ }
+ if (sourceType.isListType()) {
+ throw new TrinoException(NOT_SUPPORTED, "Partitioning field [" + field.name() + "] cannot be contained in a array");
+ }
+ return requireNonNull(columnById.get(parentSourceId), () -> "Cannot find source column for partition field " + field);


This block is a bit confusing to me since it relies on getParentSourceId returning the field id that was passed to it if it's already a base column. Is an integer column the parent of itself? That's fuzzy. I might rephrase it as

 boolean isBaseColumn = !parentIndex.contains(fieldId);
 if (isBaseColumn) { ... }
 else { ... }
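A minimal sketch of the branch suggested in the comment above, using plain Java maps instead of Iceberg's schema classes. Here `parentIndex` is assumed to map a nested field id to its immediate parent's id, with no entry for top-level columns. `Kind`, `kinds`, and `resolveSourceColumnId` are hypothetical names for illustration, not code from this PR. Unlike the quoted diff, the sketch checks every ancestor for a map or array container.

```java
import java.util.Map;

final class PartitionSourceResolver
{
    // Hypothetical stand-in for the Iceberg type categories that matter here
    enum Kind { PRIMITIVE, STRUCT, MAP, LIST }

    private PartitionSourceResolver() {}

    // Returns the id of the top-level column that contains fieldId.
    // parentIndex: nested field id -> immediate parent id (top-level columns have no entry)
    // kinds: field id -> kind of that field
    static int resolveSourceColumnId(int fieldId, Map<Integer, Integer> parentIndex, Map<Integer, Kind> kinds)
    {
        boolean isBaseColumn = !parentIndex.containsKey(fieldId);
        if (isBaseColumn) {
            // A top-level column is its own source; no "parent of itself" lookup is needed
            return fieldId;
        }
        int current = fieldId;
        while (parentIndex.containsKey(current)) {
            current = parentIndex.get(current);
            Kind kind = kinds.get(current);
            if (kind == Kind.MAP) {
                throw new IllegalArgumentException("Partitioning field " + fieldId + " cannot be contained in a map");
            }
            if (kind == Kind.LIST) {
                throw new IllegalArgumentException("Partitioning field " + fieldId + " cannot be contained in an array");
            }
        }
        return current;
    }

    public static void main(String[] args)
    {
        // Field 1 is a top-level primitive; field 2 is a top-level struct with child field 3
        Map<Integer, Integer> parentIndex = Map.of(3, 2);
        Map<Integer, Kind> kinds = Map.of(1, Kind.PRIMITIVE, 2, Kind.STRUCT, 3, Kind.PRIMITIVE);
        System.out.println(resolveSourceColumnId(1, parentIndex, kinds));
        System.out.println(resolveSourceColumnId(3, parentIndex, kinds));
    }
}
```

Making the base-column case an explicit branch keeps the lookup from depending on a helper that returns its own argument when no parent exists.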

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergBucketFunction.java

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergColumnHandle.java

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergUtil.java

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/PartitionTable.java

findinpath · 2023-01-30T12:28:46Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergUtil.java

+ .map(column -> {
+ org.apache.iceberg.types.Type type = toIcebergType(column.getType());
+ if (!type.isPrimitiveType()) {
+ type = TypeUtil.assignFreshIds(type, nextNestedFieldId::getAndIncrement);


This doesn't seem right to me.
Tricks like this can get us into trouble: some synthetic columns (see org.apache.iceberg.MetadataColumns#FILE_PATH) have their id set to Integer.MAX_VALUE - 1, and incrementing that value to artificially obtain a new field id for the row can lead to problems.

I was looking for where schemaFromHandles is being used and found the following:

deletes handling in IcebergPageSourceProvider

bucketing function in IcebergNodePartitioningProvider

In both places we do have schema available.

Please investigate whether using schema in schemaFromHandles would be possible.
If yes, we could work with TypeUtil.indexParents(schema) to get only the extra "row" contents we need from the schema.

This is related to some changes in #14837

Maybe we can pull some of those changes in here

Thanks, @findinpath | @alexjo2144 for pointing this out. I have added some changes from #14837 and now I do not need to reassign indexes.

IMO, it would be better if #14837 gets merged first, because the current PR contains changes that are unrelated to supporting partitioning on nested fields.
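The hazard raised earlier in this thread can be illustrated with a small, self-contained sketch. It assumes the fresh-id counter is seeded one past the largest column id among the handles; how `nextNestedFieldId` was actually seeded in the PR is not shown in this conversation, so that seeding is an assumption. `FreshIdHazard` and `counterSeededFrom` are hypothetical names.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

final class FreshIdHazard
{
    // Value the review comment cites for org.apache.iceberg.MetadataColumns#FILE_PATH
    static final int FILE_PATH_ID = Integer.MAX_VALUE - 1;

    private FreshIdHazard() {}

    // Assumed seeding strategy: start one past the largest id among the given columns
    static AtomicInteger counterSeededFrom(List<Integer> columnIds)
    {
        int maxId = columnIds.stream().mapToInt(Integer::intValue).max().orElse(0);
        return new AtomicInteger(maxId + 1);
    }

    public static void main(String[] args)
    {
        // Two regular columns plus the synthetic file-path column
        AtomicInteger next = counterSeededFrom(List.of(1, 2, FILE_PATH_ID));
        int first = next.getAndIncrement();
        int second = next.getAndIncrement();
        // The first id lands on Integer.MAX_VALUE; the second wraps around to a negative id
        System.out.println(first == Integer.MAX_VALUE);
        System.out.println(second == Integer.MIN_VALUE);
    }
}
```

Reading the nested ids from the table schema, as proposed in the thread, avoids generating ids at all.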

github-actions · 2023-02-27T18:35:24Z

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

findepi · 2023-02-28T10:03:02Z

@krvikash @alexjo2144 @findinpath I see no approvals and some yet unresolved conversations. Where are we with the PR?

ebyhr · 2023-02-28T10:04:21Z

I believe this PR is waiting for #14837 (#15712 (comment))

krvikash · 2023-03-01T05:19:27Z

Rebased on top of #14837's commit and resolved conflicts. This PR (its 2nd commit) is now ready for review.

krvikash · 2024-03-07T09:37:32Z

(some refactoring in TestIcebergV2)

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergBucketFunction.java

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergUtil.java

findinpath · 2024-03-19T12:39:50Z

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergUtil.java

+ List<Integer> path = new ArrayList<>(requireNonNull(indexPaths.get(fieldId)));
+ if (!path.isEmpty()) {
+ // Path does not include the base field id
+ baseField = indexById.get(path.removeFirst());


It is hard to follow what is happening when we have mutations.

Why do we remove the first element from the list?

The first element is the root field id, which is needed to look up the base field but must not appear in IcebergColumnHandle.path. I updated the comment and slightly modified the logic.
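One way to make the split discussed above easier to follow is to avoid mutating the list. The sketch below assumes that a full path starts with the root field id and is empty for a top-level field; `FieldPathSplit` and `Split` are hypothetical names and this is not the logic that was merged.

```java
import java.util.List;

final class FieldPathSplit
{
    // Hypothetical holder: the root column id plus the remaining nested ids
    record Split(int baseFieldId, List<Integer> path) {}

    private FieldPathSplit() {}

    static Split split(int fieldId, List<Integer> fullPath)
    {
        if (fullPath.isEmpty()) {
            // Assumption: an empty path means fieldId is itself a top-level column
            return new Split(fieldId, List.of());
        }
        // The head identifies the base column; the tail is the path stored on the handle
        int baseFieldId = fullPath.get(0);
        List<Integer> tail = List.copyOf(fullPath.subList(1, fullPath.size()));
        return new Split(baseFieldId, tail);
    }

    public static void main(String[] args)
    {
        System.out.println(split(5, List.of(2, 4, 5)));
        System.out.println(split(1, List.of()));
    }
}
```

Because the input list is never modified, the caller can pass the indexed path directly without copying it first.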

ebyhr · 2024-04-03T05:53:16Z

Rebased on master to resolve conflicts.

findinpath · 2024-04-11T09:10:01Z

Please rebase the code to resolve conflicts.

github-actions · 2024-05-10T17:40:50Z

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergBucketFunction.java

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergUtil.java

Co-authored-by: Victoria Bukta
Co-authored-by: Yuya Ebihara

cla-bot bot added the cla-signed label Jan 13, 2023

github-actions bot added the tests:hive label Jan 13, 2023

krvikash requested review from ebyhr, findepi, findinpath and alexjo2144 January 17, 2023 09:06

krvikash force-pushed the support-nested-partitioning-fields branch 2 times, most recently from 0d38c8c to 53011d6 Compare January 17, 2023 10:11

krvikash commented Jan 18, 2023

krvikash force-pushed the support-nested-partitioning-fields branch from 53011d6 to 46d91a8 Compare January 23, 2023 13:48

krvikash force-pushed the support-nested-partitioning-fields branch from 46d91a8 to dbf698f Compare January 26, 2023 08:17

findinpath reviewed Jan 26, 2023

alexjo2144 reviewed Jan 26, 2023

findinpath reviewed Jan 30, 2023

krvikash force-pushed the support-nested-partitioning-fields branch 4 times, most recently from 9ab6531 to 9d97902 Compare February 4, 2023 10:50

krvikash mentioned this pull request Feb 8, 2023

Fix iceberg v2 table with equality delete can't deal with nested column #14837

Merged

github-actions bot added the stale label Feb 27, 2023

github-actions bot removed the stale label Feb 28, 2023

krvikash force-pushed the support-nested-partitioning-fields branch from 9d97902 to 958a762 Compare March 1, 2023 05:16

ebyhr force-pushed the support-nested-partitioning-fields branch 2 times, most recently from e2afa24 to 39c59dd Compare March 19, 2024 10:22

pajaks approved these changes Mar 19, 2024

findinpath reviewed Mar 19, 2024

findinpath reviewed Mar 19, 2024

findinpath reviewed Mar 19, 2024

ebyhr force-pushed the support-nested-partitioning-fields branch from 39c59dd to 76e3886 Compare March 21, 2024 08:18

findinpath approved these changes Mar 21, 2024

ebyhr requested review from raunaqmorarka and Praveen2112 March 21, 2024 23:21

ebyhr force-pushed the support-nested-partitioning-fields branch from 76e3886 to bfe8da9 Compare April 3, 2024 05:53

ebyhr force-pushed the support-nested-partitioning-fields branch from bfe8da9 to 1f15a36 Compare April 11, 2024 09:22

raunaqmorarka mentioned this pull request Apr 11, 2024

Add bloom filter write support to ParquetWriter #20662

Merged

findepi removed the tests:hive label Apr 18, 2024

github-actions bot added the stale label May 10, 2024

ebyhr added the stale-ignore label May 10, 2024

raunaqmorarka approved these changes May 13, 2024

ebyhr mentioned this pull request Jun 10, 2024

Fix failure when reading nested partition in Iceberg #22258

Closed

ebyhr force-pushed the support-nested-partitioning-fields branch from 1f15a36 to bf37211 Compare June 11, 2024 10:39

krvikash and others added 2 commits June 24, 2024 09:20

Rename getColumns to getTopLevelColumns in iceberg connector

db7e0ef

Support partitioning on nested ROW fields in Iceberg

336ec37

ebyhr force-pushed the support-nested-partitioning-fields branch from bf37211 to 336ec37 Compare June 24, 2024 00:26

ebyhr merged commit a48a0a6 into trinodb:master Jun 24, 2024
52 checks passed

github-actions bot added this to the 451 milestone Jun 24, 2024

krvikash deleted the support-nested-partitioning-fields branch June 24, 2024 06:23

colebow mentioned this pull request Jun 25, 2024

Add Trino 451 release notes #22504

Merged
