-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add manifest file caching support for Iceberg Hive Metastore catalog #22376
Conversation
df3db37
to
d4e6ec5
Compare
b040afc
to
cfc4df9
Compare
.../java/io/trino/plugin/iceberg/catalog/snowflake/SnowflakeIcebergTableOperationsProvider.java
Outdated
Show resolved
Hide resolved
|
||
public static Map<String, String> loadManifestCachingProperties(Map<String, String> properties, IcebergConfig icebergConfig) | ||
{ | ||
properties.put(IO_MANIFEST_CACHE_ENABLED, "true"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is configuring org.apache.iceberg.ManifestFiles#CONTENT_CACHES
, right?
the cache is keyed by FileIO
and our ForwardingFileIo
is typically short-lived. how effective is this caching?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, in one of our production environments, a big sql which was planned for 12s has been reduced to 8s~
In our internal implementation, we also used a Caffeine cache with the key CatalogType
and the value FileIo
to make a ForwardingFileIo
live longer, i can also add relevant codes to our PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the 12s -> 8s improvement -- was it with or without fileio caching?
what would the numbers be if we don't do this PR and instead enable filesystem caching that we already have in trino?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the 12s -> 8s improvement -- was it with or without fileio caching?
what would the numbers be if we don't do this PR and instead enable filesystem caching that we already have in trino?
@findepi the 12s -> 8s improvement was with fileio/manifest caching.
Sorry our newest production environment is 423, so i couldn't test the performance about filesystem caching, but i saw that PrestoDB have manifest caching
and filesystem caching
both, so i thought that they may not conflict with each other~ The metadata caching in memory is also faster than metadata caching in disks in most cases.
Thank you for your advice, I have modified some codes accordingly, please review again when you have time~ The CICD error seems not related to our PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@findepi the 12s -> 8s improvement was with fileio/manifest caching.
thanks!
before merging this PR, we should understand what's the level of improvement that it brings (in isolation).
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergConfig.java
Outdated
Show resolved
Hide resolved
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergUtil.java
Outdated
Show resolved
Hide resolved
@hackeryang pls try adding a test into I had taken a stint on integrating this functionality into Iceberg as well some time ago in #20862 and the limitation on which I've stumbled was
I'm hopeful though that I missed something in my original PR and that your feature is effective for making use of the manifest cache. |
cfc4df9
to
3be5b93
Compare
We used a Caffeine cache in Sorry i haven't figure out good logic and APIs to write the ut yet, i will try to add one if i have a better idea, i saw that PrestoDB didn't add it |
Will there be ways to flush this cache (specific to a table) in the same way the metadata cache can be via SQL This allows you to have long cache times and still have data available as soon as it ingests by flushing the cache for tables after cache. its something that's currently missing from the file status cache |
This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua |
fyi @cwsteinbach .. maybe you can help here. |
I have another PR adding iceberg metadata caching through a different approach #22739 |
That sounds great @raunaqmorarka - can we conclude that we should therefore close this PR and help with reviews and input on yours? |
yup, i think so. |
Closing as discussed in the previous comments since the approach in this PR is inferior to the linked PR from @raunaqmorarka - we will continue to collaborate their and would love to get review and testing feedback. |
Description
Iceberg library has a manifest caching feature contributed by Cloudera:
https://blog.cloudera.com/12-times-faster-query-planning-with-iceberg-manifest-caching-in-impala/
PrestoDB also has similar configurations to turn on this feature: prestodb/presto#21399
Iceberg manifest caching can accelerate query planning.
Additional context and related issues
https://prestodb.io/docs/current/connector/iceberg.html#manifest-file-caching
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
(x) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text: