Write support for Bloom filters in Iceberg and Delta Lake connectors #21570

Open
1 of 2 tasks
jkylling opened this issue Apr 16, 2024 · 0 comments · May be fixed by #22525

Tasks

This task has two parts: decide which table-format properties to honor, and decide which Trino table properties to expose.

Iceberg table properties

Parquet Bloom filter write support can be configured by setting the following table properties:

write.parquet.bloom-filter-max-bytes = <number of bytes>
write.parquet.bloom-filter-enabled.column.<column-name> = true

https://github.com/apache/iceberg/blob/732fbfd516a3dfb2028fd6795f8f564f70e44742/core/src/main/java/org/apache/iceberg/TableProperties.java#L166-L171

Whenever we write to a table that has these properties set for some columns, we write Bloom filters for those columns.

We should silently ignore

write.parquet.bloom-filter-enabled.column.<column-name> = true

for unknown column names or for unsupported types.
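
The "silently ignore" rule could be sketched roughly as follows. This is only an illustration, not Trino or Iceberg code: the class name, the method name, and the idea of passing column types and supported types as plain strings are all assumptions made for the example. Only the property prefix comes from the text above.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Trino or Iceberg.
final class BloomFilterColumnsSketch
{
    // Property prefix as quoted in this issue.
    static final String ENABLED_PREFIX = "write.parquet.bloom-filter-enabled.column.";

    private BloomFilterColumnsSketch() {}

    // tableProperties: raw Iceberg table properties.
    // columnTypes: column name -> type name (assumed representation).
    // supportedTypes: type names that can carry a Bloom filter (assumed input).
    static Set<String> bloomFilterColumns(
            Map<String, String> tableProperties,
            Map<String, String> columnTypes,
            Set<String> supportedTypes)
    {
        Set<String> result = new LinkedHashSet<>();
        for (Map.Entry<String, String> entry : tableProperties.entrySet()) {
            String key = entry.getKey();
            // Only "...enabled.column.<name> = true" entries select a column.
            if (!key.startsWith(ENABLED_PREFIX) || !Boolean.parseBoolean(entry.getValue())) {
                continue;
            }
            String column = key.substring(ENABLED_PREFIX.length());
            String type = columnTypes.get(column);
            // Unknown columns and unsupported types are skipped without raising an error.
            if (type != null && supportedTypes.contains(type)) {
                result.add(column);
            }
        }
        return result;
    }
}
```

Skipping rather than failing means a table whose properties were set by another engine, or whose schema has since changed, stays writable.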

Delta Lake table properties

Bloom filter configuration is not part of the Delta Lake protocol yet. We could follow the same convention as for Iceberg. There is an issue to get this into the protocol at delta-io/delta#2751.

Trino table properties

We mirror what was done for Hive in 5041496:

  • To control the maximum Bloom filter size we use parquet_bloom_filter_max_bytes = BIGINT.
  • To select the columns we reuse the Hive table property for writing Bloom filters from Trino: parquet_bloom_filter_columns = ARRAY['<column-name>'].
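
One way the two Trino table properties might be translated into the Iceberg table properties listed earlier is sketched below. The class and method names are hypothetical, and the real connector may wire this up differently. The Iceberg property keys are the ones quoted in this issue.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.OptionalLong;

// Hypothetical mapping for illustration only; not the connector's actual code.
final class TrinoToIcebergBloomFilterSketch
{
    private TrinoToIcebergBloomFilterSketch() {}

    // bloomFilterColumns: value of parquet_bloom_filter_columns.
    // maxBytes: value of parquet_bloom_filter_max_bytes, if the user set it.
    static Map<String, String> toIcebergProperties(List<String> bloomFilterColumns, OptionalLong maxBytes)
    {
        Map<String, String> properties = new LinkedHashMap<>();
        // Emit the size limit only when it was set explicitly.
        maxBytes.ifPresent(bytes ->
                properties.put("write.parquet.bloom-filter-max-bytes", Long.toString(bytes)));
        // One "enabled" entry per selected column.
        for (String column : bloomFilterColumns) {
            properties.put("write.parquet.bloom-filter-enabled.column." + column, "true");
        }
        return properties;
    }
}
```

Storing the settings as format-level properties would let other engines that read the same table see which columns have Bloom filters enabled.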