Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disallow writing to Hudi universal format in Delta Lake #22404

Merged
merged 1 commit into from
Aug 26, 2024

Conversation

ebyhr
Copy link
Member

@ebyhr ebyhr commented Jun 17, 2024

Description

Hudi universal format isn't managed by reader/writer features unlike icebergCompatV1(2).
We should explicitly deny write operations with delta.universalFormat.enabledFormats configuration.

Fixes #22312

Release notes

(x) This is not user-visible or is docs only, and no release notes are required.

@cla-bot cla-bot bot added the cla-signed label Jun 17, 2024
@github-actions github-actions bot added the delta-lake Delta Lake connector label Jun 17, 2024
public static final String APPEND_ONLY_CONFIGURATION_KEY = "delta.appendOnly";
public static final String COLUMN_MAPPING_MODE_CONFIGURATION_KEY = "delta.columnMapping.mode";
public static final String COLUMN_MAPPING_PHYSICAL_NAME_CONFIGURATION_KEY = "delta.columnMapping.physicalName";
public static final String MAX_COLUMN_ID_CONFIGURATION_KEY = "delta.columnMapping.maxColumnId";
public static final String ISOLATION_LEVEL_CONFIGURATION_KEY = "delta.isolationLevel";
private static final String DELETION_VECTORS_CONFIGURATION_KEY = "delta.enableDeletionVectors";
private static final String UNIVERSAL_FORMAT_CONFIGURATION_KEY = "delta.universalFormat.enabledFormats";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this documented at https://github.com/delta-io/delta/blob/master/PROTOCOL.md ?

if not, let's document where this comes from

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It comes from https://github.com/delta-io/delta/blob/master/docs/source/delta-uniform.md. I will add a code comment in the next push.

@findepi
Copy link
Member

findepi commented Jun 17, 2024

Hudi universal format isn't managed by reader/writer features unlike icebergCompatV1(2).
We should explicitly deny write operations with delta.universalFormat.enabledFormats configuration.

Sounds like a bug in universal format. Maybe we should fix that instead, instead of working around on our side?

@ebyhr
Copy link
Member Author

ebyhr commented Jun 17, 2024

I posted a question on Delta Lake Slack a few hours ago: https://delta-users.slack.com/archives/CJ70UCSHM/p1718614137068649. Let's wait for a reply from the community.

@ebyhr
Copy link
Member Author

ebyhr commented Jun 18, 2024

Received the following reply. They won't update delta protocol:

icebergCompatV1 only ensures Delta writes parquet files in iceberg compat way, but does not mandate the generation of iceberg metadata. Its possible that engines support icebergCompatV1 writes to Delta table, but did not update the iceberg metadata.

When we design the Uniform in Delta protocol in the early days, we have decided only make it "Hudi/Iceberg" compat to the protocol, because its weird to specify sth like "engine must know how to generate Hudi/Iceberg metadata" in the Delta protocol and put more specifications on those other formats. Thus as a result, we offered Hudi/Iceberg metadata generating as a feature that each engine should implement by themselves outside of Delta protocol specification.

I do acknowledge the risk of stale Hudi/Iceberg files under the current design, while in the reality there is usually 1 kind of engine writes to the Delta table and that engine should have knowledge of generating metadata for other formats.

@@ -195,6 +200,12 @@ public static boolean isDeletionVectorEnabled(MetadataEntry metadataEntry, Proto
return parseBoolean(metadataEntry.getConfiguration().get(DELETION_VECTORS_CONFIGURATION_KEY));
}

public static List<String> enabledUniversalFormats(MetadataEntry metadataEntry)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getEnabledUniversalFormats() ?

@findepi
Copy link
Member

findepi commented Jun 27, 2024

I do acknowledge the risk of stale Hudi/Iceberg files under the current design, while in the reality there is usually 1 kind of engine writes to the Delta table and that engine should have knowledge of generating metadata for other formats.

Clearly the assumption turned out to be invalid. Did they acknowledge that?

@findepi
Copy link
Member

findepi commented Jun 27, 2024

I am OK disallowing the writes without Delta lake protocol change (ie you can merge this PR), but it would be better not to workaround Delta protocol deficiencies on trino side.

Also, it's quite possible i just misunderstand. maybe they didn't put this as a writer feature intentionally, because the intent was to allow writes to Delta tables even if Hudi metadata generation is not supported by the writer. Thus, I think -- and prefer -- that we can abstain from this change altogether. As intended by Delta protocol design, we can continue to write to such tables, without updating Hudi compatibility layer.

@ebyhr you decide.

Copy link
Member

@findepi findepi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Jul 19, 2024
@mosabua
Copy link
Member

mosabua commented Jul 19, 2024

Whats the path forward here @ebyhr and @findepi ?

@github-actions github-actions bot removed the stale label Jul 22, 2024
@github-actions github-actions bot added the stale label Aug 12, 2024
@ebyhr ebyhr added stale-ignore Use this label on PRs that should be ignored by the stale bot so they are not flagged or closed. and removed stale labels Aug 13, 2024
@trinodb trinodb deleted a comment from github-actions bot Aug 13, 2024
@mosabua
Copy link
Member

mosabua commented Aug 26, 2024

Merge @ebyhr ?

@ebyhr
Copy link
Member Author

ebyhr commented Aug 26, 2024

Going to merge as-is to avoid breaking UniForm tables though the ideal fix should be happened in Delta side.

@ebyhr ebyhr merged commit 6db2db4 into trinodb:master Aug 26, 2024
24 checks passed
@ebyhr ebyhr deleted the ebi/delta-hudi-uniform branch August 26, 2024 23:08
@github-actions github-actions bot added this to the 455 milestone Aug 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed delta-lake Delta Lake connector stale-ignore Use this label on PRs that should be ignored by the stale bot so they are not flagged or closed.
Development

Successfully merging this pull request may close these issues.

Add support for reading Hudi UniForm tables in Delta Lake connector
4 participants