Implement incremental refresh for single-table, predicate-only MVs #20959

Merged
sopel39 merged 6 commits into trinodb:master from incr_ref on Jun 20, 2024

Conversation

marton-bod (Contributor) commented Mar 6, 2024

Description

In the Iceberg connector, we currently do a full refresh of MVs in all cases: we drop all the existing data from the MV storage table and then insert all of the data again.

When possible, we could be smarter during refresh. While incrementally refreshing an MV can be tricky for many operation types, such as joins, aggregations, or window functions, there are some easier wins, such as single-table, predicate-only queries. More complicated cases can be tackled in subsequent implementation phases.

In this first phase of implementation, this PR introduces incremental refresh with the following limitations:

  • no joins
  • no group by
  • no windowing
  • no select-distinct (e.g. select distinct department from ...)
  • no select function calls (e.g. select sum(price), avg(price) from ...)
  • the base table contains only appends between fromSnapshot (time of the last refresh) and currentSnapshot

(Basically nothing that would require touching/rewriting the already persisted data in the MV.)

For example, a type of MV supported for incremental refresh in this phase:

CREATE MATERIALIZED VIEW catalog.schema.mv AS
SELECT name, CASE location 
  WHEN 'UK' THEN 'United Kingdom' 
  WHEN 'US' THEN 'United States'
  ...
FROM catalog.schema.base_table 
WHERE id IN ('8e374aa5a5d8d3720458fe9c5363c35200', 'd08963f57c94e41fb7c95cd801dd85b200', ...)
AND last_active_date > DATE '2024-02-01'

Approach:

  1. During refresh time, LogicalPlanner checks the plan tree and suggests either incremental or full refresh to the connector
  2. In the connector, IcebergMetadata#beginRefreshMaterializedView() takes this information from the engine, checks a few additional things (e.g. presence of fromSnapshot for all base tables) and makes the final decision
  3. When IcebergSplitManager creates the table scan, it creates either a TableScan (full refresh) or an IncrementalAppendScan (incremental refresh), using the fromSnapshot value. It can still fall back to full refresh if fromSnapshot has already expired or the refresh window contains any deletes/overwrites (see the sketch after this list).
  4. If it's an incremental refresh, we skip the storage table truncation during IcebergMetadata#finishRefreshMaterializedView() and perform an append-only commit.
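
To illustrate step 3, here is a minimal sketch (not the actual connector code) of how the Iceberg API can plan an append-only scan between the two snapshots; the class, method, and variable names are hypothetical:

```java
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.IncrementalAppendScan;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;

final class IncrementalRefreshSketch
{
    // Hypothetical illustration of step 3: plan only the files appended
    // after the snapshot recorded at the time of the last MV refresh.
    static CloseableIterable<FileScanTask> planIncrementalScan(Table baseTable, long fromSnapshotId, long currentSnapshotId)
    {
        IncrementalAppendScan scan = baseTable.newIncrementalAppendScan()
                .fromSnapshotExclusive(fromSnapshotId) // snapshot captured at the last refresh
                .toSnapshot(currentSnapshotId);        // snapshot at the time of this refresh
        // The connector checks separately (step 3 above) whether fromSnapshot has expired or the
        // refresh window contains deletes/overwrites, and falls back to a full TableScan in that case.
        return scan.planFiles();
    }
}
```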

Additional context and related issues

Partially solves issue: #18673

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Iceberg

* Add support for incremental refresh for basic materialized views in the Iceberg connector. ({issue}`20959`)

@cla-bot cla-bot bot added the cla-signed label Mar 6, 2024
@github-actions github-actions bot added the iceberg Iceberg connector label Mar 6, 2024
@marton-bod changed the title from rdar:https://123628905 to Implement incremental refresh for single-table, predicate-only MVs on Mar 6, 2024
@marton-bod marton-bod force-pushed the incr_ref branch 2 times, most recently from 8ff865b to 53e7cc1 Compare March 7, 2024 11:14
findepi (Member) commented Mar 7, 2024

@sopel39 @alexjo2144 please take a look

marton-bod (Contributor, Author) commented:

@findepi @sopel39 @alexjo2144 Hi team, could you please take a quick look at this PR? Thank you very much!

github-actions bot commented May 3, 2024

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label May 3, 2024
marton-bod (Contributor, Author) commented:

@martint as discussed in our zoom call, I've refactored this PR so that the analyzer passes the information to the connector to make the incremental/full refresh decision, as opposed to using table properties. Please take a look and let me know what you think. Thank you!

sopel39 (Member) left a comment: lgtm % comments

@@ -86,6 +86,189 @@ protected String getStorageMetadataLocation(String materializedViewName)
return table.getParameters().get(METADATA_LOCATION_PROP);
}

sopel39 (Member):

please also add a test with update/delete. Part of scenario would be that incremental refresh should work again after compaction.

marton-bod (Contributor, Author):

Added a new test case which verifies that we fall back to FULL refresh when delete files are present. As for compaction reenabling incremental refresh, please see my comment below.

sopel39 (Member):

> As for compaction reenabling incremental refresh, please see my comment below

Would a test like that work:

  1. initial (full) refresh with some data
  2. add one row to source table
  3. incremental refresh
  4. compaction
  5. full refresh
  6. add one row to source table
  7. we are back to incremental refresh

marton-bod (Contributor, Author):

This is good, I would just modify step 4:

  1. initial (full) refresh with some data
  2. add one row to source table
  3. incremental refresh
  4. update or delete one row
  5. full refresh
  6. add one row to source table
  7. we are back to incremental refresh

marton-bod (Contributor, Author) commented Jun 12, 2024:

In the original test case you proposed, step 5 shouldn't do any refresh (essentially a noop), because step 4 (compaction) didn't add any new rows, it just bin-packed the existing rows into fewer files. Nonetheless, that's also something worthwhile to test, so I added an extra test to verify this assumption.

cla-bot commented Jun 5, 2024:

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Marton Bod.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. Check if your git client is configured with an email to sign commits: git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

sopel39 (Member) left a comment: lgtm % comments

* {@code refreshType} is a signal from the engine to the connector indicating whether the MV refresh could be done incrementally or only fully, based on the plan.
* The connector is not obligated to perform the refresh in the fashion prescribed by {@code refreshType}; it is merely a hint from the engine.
*/
default ConnectorInsertTableHandle beginRefreshMaterializedView(ConnectorSession session, ConnectorTableHandle tableHandle, List<ConnectorTableHandle> sourceTableHandles, RetryMode retryMode, RefreshType refreshType)
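
For illustration only, a connector override might consume this hint roughly as follows (a method-level sketch mirroring the excerpt above; hasKnownFromSnapshot and beginInsertForRefresh are hypothetical helpers, not code from this PR):

```java
@Override
public ConnectorInsertTableHandle beginRefreshMaterializedView(
        ConnectorSession session,
        ConnectorTableHandle tableHandle,
        List<ConnectorTableHandle> sourceTableHandles,
        RetryMode retryMode,
        RefreshType refreshType)
{
    // The engine only suggests a refresh type; the connector may downgrade
    // INCREMENTAL to FULL, e.g. when a source table has no known fromSnapshot.
    boolean incremental = refreshType == RefreshType.INCREMENTAL
            && sourceTableHandles.stream().allMatch(this::hasKnownFromSnapshot); // hypothetical helper
    return beginInsertForRefresh(session, tableHandle, incremental); // hypothetical helper
}
```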
sopel39 (Member):

@martint do you want to ack on SPI change?

@marton-bod marton-bod force-pushed the incr_ref branch 2 times, most recently from 75521bb to e2aef8d Compare June 12, 2024 22:44
When an MV refresh is executing, the planner suggests either incremental or full refresh to the connector. In this first phase, incremental refresh is suggested only when the plan tree contains nothing other than Scan/Filter/Project nodes; otherwise, full refresh is suggested. The Iceberg connector can act on this signal and use IncrementalAppendScan to scan only the 'delta' records and append them to the MV storage table (without truncation).
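
A rough sketch of that plan check (hypothetical, not the exact code from this PR): incremental refresh is only suggested when every node in the refresh query's plan is a table scan, filter, or project.

```java
import io.trino.sql.planner.plan.FilterNode;
import io.trino.sql.planner.plan.PlanNode;
import io.trino.sql.planner.plan.ProjectNode;
import io.trino.sql.planner.plan.TableScanNode;

final class IncrementalRefreshCheckSketch
{
    // Hypothetical sketch: walk the plan tree and allow only Scan/Filter/Project nodes.
    static boolean isIncrementalRefreshEligible(PlanNode node)
    {
        if (!(node instanceof TableScanNode || node instanceof FilterNode || node instanceof ProjectNode)) {
            return false;
        }
        return node.getSources().stream().allMatch(source -> isIncrementalRefreshEligible(source));
    }
}
```
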
sopel39 (Member) left a comment: looks good! A couple of remaining comments.

package io.trino.spi;

/**
* Signifies whether a Materialized View refresh operation could be done incrementally or only fully.
sopel39 (Member):

For future reference: semantics of this will most likely change when we support full scope of incremental refreshes
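
For reference, a sketch of the presumable shape of the new SPI type under discussion (the actual definition added by this PR lives in io.trino.spi):

```java
package io.trino.spi;

/**
 * Signifies whether a materialized view refresh operation could be done incrementally or only fully.
 */
public enum RefreshType
{
    FULL,
    INCREMENTAL
}
```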

@sopel39 sopel39 merged commit 0eb5bfc into trinodb:master Jun 20, 2024
99 checks passed
@github-actions github-actions bot added this to the 451 milestone Jun 20, 2024
mosabua (Member) commented Jun 26, 2024:

@marton-bod are you planning to send a small separate PR to document this?

marton-bod (Contributor, Author) commented:

@mosabua yes, I'll post a PR with the proposed doc changes. Thanks for bringing it up

cccs-nik added a commit to CybercentreCanada/trino that referenced this pull request Jul 12, 2024
* Add additional test coverage for integer number to varchar coercer

Co-authored-by: Yiqun Zhang <[email protected]>

* Allow integer number to varchar coercion in ORC unpartitioned table

Co-authored-by: Yiqun Zhang <[email protected]>

* Allow integer number to varchar coercion in Parquet unpartitioned table

* Quote current test param when test fails

* Use enhanced switch statement

* Remove redundant explicit generics

* Extract a dedicated Double to Varchar Coercer for ORC

* Allow double to varchar coercion in Parquet unpartitioned table

* Merge Hudi query runners into one

* Add support for case insensitive name cache in BigQuery

Co-authored-by: regadas <[email protected]>

* Fix failure when reading signed timestamp with time zone stats in Delta Lake

* Avoid org.locationtech.jts:jts-core exclusion in Pinot

* Improve 449 release notes

* Add docs for Hive metadata caching config

* Remove exceptions related to older version of Hive

* Simplify BaseTestHiveCoercion

Older versions of Hive don't support the Float/Real type for the Parquet table format, but Hive 3.0+
doesn't have such a restriction.

* Allow float to varchar coercion in hive table

This coercion works for partitioned tables in all formats; for unpartitioned tables
it works for the ORC and Parquet table formats.

* Improve error message for resource groups

* Gracefully handle missing spooling output stats in FTE scheduler

It is rare but possible to get empty spooling output stats for a task which completed successfully.
This may happen if we observe the FINISHED task state based on a received TaskStatus but are later unable to
successfully retrieve TaskInfo. In such a case we build the final TaskInfo based on the last known taskInfo, just
updating the taskState field. The spooling output stats will not be present.
As we need this information in FTE mode, we need to fail such a task artificially.
* Add JDBC property to use current catalog in metadata if none provided

Some BI tools don't pass a `catalog` when calling the `DatabaseMetaData`
`getTables`, `getColumns` and `getSchemas` methods. This makes the JDBC
driver search across all catalogs, which can be expensive.

This commit introduces a new boolean connection property
`assumeNullCatalogMeansCurrentCatalog` (disabled by default) to be used
with such BI tools. If enabled the driver will try to use current
`catalog` of the JDBC connection when fetching Trino metadata like
`getTables`, `getColumns`, `getSchemas` if the `catalog` argument to
those methods is passed as `null`.

Co-authored-by: Rafał Połomka <[email protected]>
Co-authored-by: Ashhar Hasan <[email protected]>

* Make sealed version of Avro Logical Types supported natively

* Refactor block building decoder creation to connector specific class

* Remove unnecessary exclusions from Pinot

* Remove unnecessary, commented code

* Rewrite JUnit Assertions to AssertJ

Changed automatically by openrewrite recipe
```
<plugin>
    <groupId>org.openrewrite.maven</groupId>
    <artifactId>rewrite-maven-plugin</artifactId>
    <version>5.32.0</version>
    <configuration>
        <activeRecipes>
            <recipe>org.openrewrite.java.testing.assertj.JUnitToAssertj</recipe>
        </activeRecipes>
    </configuration>
    <dependencies>
        <dependency>
            <groupId>org.openrewrite.recipe</groupId>
            <artifactId>rewrite-testing-frameworks</artifactId>
            <version>2.10.1</version>
        </dependency>
    </dependencies>
</plugin>
```

run
```
./mvnw rewrite:run -Dmaven.javadoc.skip=true -DskipTests=true -Dmaven.site.skip=true -Dmaven.artifact.threads=16 -T 1C -e -Dair.check.skip-all=true --no-snapshot-updates -pl '!:trino-product-tests,!:trino-product-tests-launcher,!:trino-server-rpm'
```

Some formatting was fixed manually for readability.

Then, rewrite remaining JUnit Assertions to AssertJ manually, because
some entries were not migrated by `./mvnw rewrite:run`.

* Rewrite TestNG Assertions to AssertJ

Changed automatically by openrewrite recipe
openrewrite/rewrite-testing-frameworks#520
```
<plugin>
    <groupId>org.openrewrite.maven</groupId>
    <artifactId>rewrite-maven-plugin</artifactId>
    <version>5.32.0</version>
    <configuration>
        <activeRecipes>
            <recipe>org.openrewrite.java.testing.assertj.TestNgToAssertj</recipe>
        </activeRecipes>
    </configuration>
    <dependencies>
        <dependency>
            <groupId>org.openrewrite.recipe</groupId>
            <artifactId>rewrite-testing-frameworks</artifactId>
            <version>2.11.0-SNAPSHOT</version>
        </dependency>
    </dependencies>
</plugin>
```

run
```
./mvnw rewrite:run -Dmaven.javadoc.skip=true -DskipTests=true -Dmaven.site.skip=true -Dmaven.artifact.threads=16 -T 1C -e -Dair.check.skip-all=true --no-snapshot-updates -pl '!:trino-server-rpm'
```

Some formatting was fixed manually for readability.

Then, rewrite remaining TestNG Assertions to AssertJ manually, because
some entries were not migrated by `./mvnw rewrite:run`.

* Prefer AssertJ assertions over TestNG, JUnit

* Update airbase to 157

* Update exec-maven-plugin to 3.3.0

* Pass additional type information in OrcColumn

OrcTypeKind captures only data types details while other information like precision
and scale are not a part of these information.

* Fix unsupported reads of decimals as varchar in parquet

* Allow decimal to varchar coercion for unpartitioned tables

* Fix grammar error in error message

* Simplify boolean comparison in getColumnMappingMode()

* Allow char to varchar coercion for hive tables

* Improve docs for HTTP server config

* Verify spark compatibility in native-fs special chars test

* Use Identifier.getValue when analyzing table function arguments

Otherwise, getPartitionBy and getOrderBy methods
in TableArgument SPI returns quoted column names.

* Avoid redundant null check before sizeOf()

* Remove unused method from BigQueryClient

* Convert DeleteFile to record in Iceberg

* Convert IcebergMaterializedViewDefinition to record

* Convert IcebergPartitioningHandle to record

* Convert IcebergWritableTableHandle to record

* Convert TrinoSortField to record in Iceberg

* Convert MemoryInsertTableHandle to record

* Convert MemoryDataFragment to record

* Test Memory connector config class

* Test reading liquid clustering tables in Delta Lake

* Add support for reading UniForm with Iceberg in Delta Lake

* Skip unsupported variant type in Delta Lake

* Reduce number of row groups in tests using small row groups

Avoids creating too many splits when split offsets are populated

* Fix detecting start of row group in parquet reader

* Populate split offsets when writing orc/parquet files in iceberg

Offsets can be used by readers to align splits with row-group/stripe boundaries

* Use timestampColumnMapping in ClickHouse

* Avoid using deprecated ClickHouseContainer class

* Set air.compiler.fail-warnings as true in ClickHouse

* Refactor TestMultipleDistinctAggregationToMarkDistinct

Remove deprecated symbol usage.

* Extract DistinctAggregationStrategyChooser

Extract the logic that determines direct distinct aggregation applicability,
so it can be reused in multiple optimizer rules.

* Use maximum NDV symbol to choose distinct aggregation strategy

* Move OptimizeMixedDistinctAggregations

Move class before refactoring to preserve history

* Support for multiple aggregations in OptimizeMixedDistinctAggregations

Adds support for multiple distinct aggregations in OptimizeMixedDistinctAggregations.

* Replace optimizer.optimize-mixed-distinct-aggregations

Replace optimizer.optimize-mixed-distinct-aggregations with a new
optimizer.distinct-aggregations-strategy value `pre_aggregate`. Also rename the corresponding config property optimizer.mark-distinct-strategy to
optimizer.distinct-aggregations-strategy and its values from NONE -> SINGLE_STEP and ALWAYS -> MARK_DISTINCT.

* Enable OptimizeMixedDistinctAggregations automatically

Use estimated aggregation source NDV and the number
of grouping keys to decide if pre-aggregate strategy
should be used for a given aggregation

* Remove unnecessary override method isRemotelyAccessible in IcebergSplit

* Add non-null check in MemoryConnector

* Fix example in test guideline

* Add support for TRUNCATE statement in memory connector

* Remove unsupported version check and usage in Cassandra

* Add support for truncate statement in Iceberg

* Enable TopN for non-textual types for clickhouse-connector

* Prevent OOM when expanding projection over BigQuery no columns scan

* Improve ternary operator formatting in BigQuerySplitManager

* Produce single Big Query split when row count is all that matters

Similar to what we do in other connectors. There is no reason to create
multiple splits just to return row count information.

* Use safe idiom of getting the only element

* Automatically configure BigQuery scan parallelism

* Cleanup in OrcTypeTranslator

Use enhanced switch statements and re-arrange code based on the source data type

* Fix coercion gaps for Hive tables in ORC format

There were a few differences in the coercions supported between partitioned
and unpartitioned tables for the ORC format.

* Make scheduling of remote accessible splits with addresses more strict

UniformNodeSelector or the FTE scheduler will schedule remotely accessible splits
on selected nodes if such nodes are available, and only fall back to other
nodes if the preferred nodes are no longer part of the cluster. A connector might have stale
node information when creating splits, which could result in choosing offline nodes.
Additionally, in FTE mode nodes can go down, so split addresses may no longer
be valid when the task is restarted.

Additionally, this commit simplifies UniformNodeSelector optimizedLocalScheduling,
which was hard to reason about and was not taking advantage of
recent improvements like adaptive split queue length.

Co-authored-by: Karol Sobczak <[email protected]>

* Enable query and task retries in MariaDbClient

Copy test overrides from 7b80852, as MariaDB also needs them.

* Remove unused function parameter

* Keep separate top-level classes in separate compilation units

It's considered an anti-pattern to have multiple non-nested classes in one
.java file. We even have a checker for that, but it was suppressed in
this case.

* Remove redundant cast

* Use anonymous variables

* Cleanup warnings on top-level

`@SuppressWarnings` should be applied selectively, not for the whole
compilation unit.

* Make helper methods static

* Remove redundant throws declaration

* Allow pinning to WebIdentityTokenCredentialsProvider

Allow users to only use the WebIdentityTokenCredentialsProvider instead of the default credentials provider chain.

* Extract STS client getter

* Allow pinning to WebIdentityTokenCredentialsProvider

Allow users to only use the WebIdentityTokenCredentialsProvider instead
of the default credentials provider chain.

* Extract constant to represent partition for non-partitioned Hudi table

Co-Authored-By: Yuya Ebihara <[email protected]>

* Small code fixes

* remove dead code
* fix comment typo

* Improve performance of json functions

Avoid allocating heap ByteBuffer used by InputStreamReader.

* Fix broken testTruncateTable in Iceberg Snowflake catalog

* Explicitly configure executorService for s3 multipartuploads

Previously the ForkJoin common pool, which is meant for CPU-bound operations, was used

* Increase s3.max-connections from AWS default of 50 to 500

Aligns native S3FileSystem with legacy TrinoS3FileSystem

* Verify body in doScheduleAsyncCleanupRequest

We observed rare cases where we got a non-JSON result in
doScheduleAsyncCleanupRequest, and the current logging does not provide
enough information to understand the root cause.

* Fix uncheckedCacheGet javadoc

The `uncheckedCacheGet` will propagate `ExecutionError` if `Cache.get`
throws it.

This reverts commit 3d0632d.

* Add Iceberg query runner with TASK retries enabled

* Add `exchange.azure.endpoint` configuration option

This option can be used instead of `exchange.azure.connection-string` and will use the default Azure credentials when set.

* Fix failure when reading non-primitive Parquet fields in Iceberg

* Skip getting stats on varbinary type in Delta

* Rename tests in TestDeltaLakeParquetStatisticsUtils

* Add support for stats on timestamp type in Delta Lake

* Fix format of `@DefunctConfig` values in Cassandra

* Add support for typeWidening(-preview) reader feature in Delta Lake

* Add test for HudiPlugin

* Run TestPhoenixConnectorTest in parallel

* Nessie: Support APIv2 client

By default, the Nessie API version will be inferred from the Nessie URI.
If the user has a custom URI which doesn't have version info in it,
they can configure `iceberg.nessie-catalog.client-api-version`.

* Update netty to 4.1.111.Final

* Update AWS SDK v2 to 2.26.0

* Update nessie to 0.90.2

* Update AWS SDK v1 to 1.12.741

* Update reactor-core to 3.6.7

* Update checker-qual to 3.44.0

* Update flyway to 10.15.0

* Update MongoDB to 5.1.1

* Update lucene-analysis-common to 9.11.0

* Update hudi to 0.15.0

* Update freemarker to 2.3.33

* Fix evictable cache invalidation race condition

Before the change the following race was possible between threads A, B, C:

- A calls invalidate(K)
- A increments: invalidations++
- B changes state to be cached, and therefore calls invalidate(K) too
- B increments: invalidations++
- C calls get(K)
- C reads invalidations counter
- C retrieves current token T for key K
- C reads value V for T from cache
- A reads and removes token T (same) for key K
- B attempts to read and remove token for key K, not found
- B exits invalidate(K)
- C checks invalidations counter (unchanged since its earlier read)
- C revives, i.e. re-inserts token T for key K
- B calls get(K)
- B retrieves token T (same) for key K
- B reads value V for T from cache -- despite B having called
  invalidate(K)

At least in this situation the problem is transient. Thread A will momentarily
invalidate dataCache for token T, completing the invalidation.

This commit fixes this. The bug was introduced by token reviving (commit
17faae3). This commit reverts that one
and provides a different solution to the problem that commit was
solving.

* Pin 3rd party CI action versions

GitHub allows deleting and re-publishing a tag, so referencing a third-party
action by tag name should be discouraged.

* Add encoding to error code in OAuth2 callback handler

* Rearchitect IR optimizer and evaluator

Model it as a collection of independent transformations,
similar to the plan optimizer.

* Distribute comparisons over case and switch

Transforms expressions such as:

    a = case
       when c1 then r1
       when c2 then r2
       else r3
       end

into

    case
       when c1 then a = r1
       when c2 then a = r2
       else a = r3
    end

Additionally, simplifies expressions such as:

    case
       when c then true
       else false
    end

* Enable Dependabot for GitHub Actions

Dependabot will automatically open pull requests to update GitHub
Actions. The main benefit of this is knowing that new versions are
available, especially if third-party GitHub Actions are pinned to
specific versions. Running it weekly should give enough time to review
updates, and also react swiftly to potential security updates.

* Avoid abort in testing helper method or for-loop

* Convert ResourceGroupInfo to a record class

* Make sure expression type does not change after processing with rule

* Modify Alluxio version to stable 2.9.4

* Include detailed error in S3 batch delete message

* Clarify the default value of iceberg.catalog.type in Iceberg docs

* Remove unused argument from getSparkTableStatistics

* Fix description in Redis plugin

* Optimized DATE_TRUNC function

* Update libraries-bom to 26.41.0

* Update clickhouse-jdbc to 0.6.1

* Update nimbus-jose-jwt to 9.40.0

* Update AWS SDK v2 to 2.26.1

* Update GCS connector to hadoop3-2.2.23

* Update AWS SDK v1 to 1.12.742

* Update nessie to 0.90.4

* Update elasticsearch to 7.17.22

* Set hive.metastore.catalog.dir in Iceberg and Delta query runners

* Remove redundant annotation

* Use takari-smart-builder

It's a default builder for mvnd

* Replace air.main.basedir with session.rootDirectory

* Update maven to 3.9.8

This version is aligned with mvnd-1.0.0

* Migrate iceberg caching product test to native filesystem

* Fix argument order in ST_Point() function docs

* Update docker-image version to 97

* Test allowColumnDefaults writer feature in Delta Lake

* Check duplicated field names in Iceberg

* Update jline to 3.26.2

* Update parquet to 1.14.1

* Improve code comment

* Expose driver execution stats via JMX

* Clean up code comments in JdbcJoinPushdownUtil

* Clarify required configuration for fs caching

* Track rate of pending acquires processing in BinPackingNodeAllocatorService

* Add reason to pending acquires counters in BinPackingNodeAllocatorService

* Install plugins in testing server before startup completes

When startup completes (`StartupStatus.startupComplete()`), a worker
will accept tasks (`TaskResource.failRequestIfInvalid`). The plugins
and functions need to be installed before that happens.

* Migrate unsupported writer version to integration test in Delta

* Migrate unsupported writer feature to integration test in Delta

* Remove testTrinoAlterTablePreservesChangeDataFeed

testTrinoAlterTablePreservesTableMetadata verifies
that table configuration is preserved.

* Migrate append-only to integration test in Delta

* Use correct isolation level in writing the commit info for DELETE

* Add Trino 450 release notes

* Document S3 max connections default change

* [maven-release-plugin] prepare release 450

* [maven-release-plugin] prepare for next development iteration

* Make AsyncResponse aware of the client disconnection

With the current Jersey and Jetty implementations @suspended AsyncResponse
can generate a lot of warnings that are caused by the fact that Jetty eagerly recycles
request/response objects when the HTTP/2 RST_STREAM frame is sent by the client during request aborting.

We abort in-flight requests for different reasons: either a timeout is reached or the client is closed during cleanup.

For HTTP/1 that doesn't matter, but for HTTP/2 this will cause the AsyncResponse.resume to log an error,
since the underlying request/response objects are already recycled and there is no client listening for the response.

This change also cancels Future<?> bound to the AsyncContext since there is no point in processing it anymore.

* Retain authorization during client redirects for basic auth

* Create tables under volatile directory in TestDeltaLakeBasic

catalogDir will be deleted after the test execution.

* Fix wrong entry in 450 release note

* Disallow row type without field names in Delta Lake

* Check duplicated field names in Delta Lake

Also, include exception message in DeltaLakeMetadata.addColumn.

* Implement incremental refresh for single-table, predicate-only MVs (trinodb#20959)

Add incremental refresh support for simple Iceberg MVs

When MV refresh is executing, the planner suggests either incremental or full refresh to the connector. In this first phase, incremental refresh is suggested only when the plan tree contains nothing other than Scan/Filter/Project nodes; otherwise, full refresh is performed. The Iceberg connector can act on this signal and use IncrementalAppendScan to scan only the 'delta' records and append them to the MV storage table (without truncation).

Co-authored-by: Karol Sobczak <[email protected]>

* Add Expression to Result in ExpressionAssertProvider

* Add transform function const

* Add JsonStringArrayExtract function

Optimize the common case of transform(cast(json_parse(varchar_col) as array<json>), json -> json_extract_scalar(json, jsonPath)) to
be a single, more performant call of JsonStringArrayExtract.

* Expose lease distribution stats from BinPackingNodeAllocatorService

* Update airlift to 249

* Fix product tests previously migrated to native fs to not use legacy fs

* Add non-null check in SortingFileWriter

* Allow adding and dropping fields to records wrapped in an array

* Allow changing field type in records wrapped in an array

* Refactor TestJsonStringArrayExtractScalar

* Use List/Set injection into configuration

This makes configuration more explicit in regard to expected types

* Fix reading empty files when using native S3 FS

* Fix configuration mistake

* Allow reading empty files in native GCS FS

* Workaround for reading empty files using native S3 FS

When reading an empty file from S3 as a stream using the AWS SDK v2, the SDK
incorrectly keeps reading the MD5 checksum included at the end of the
response. This has been reported upstream as
aws/aws-sdk-java-v2#3538

* Optimize IntegerNumberToDoubleCoercer#applyCoercedValue to avoid repeated reading of Block values

* Add support for prefix configuration in Iceberg REST catalog

REST catalog supports a `prefix` in the resource path as per the spec.
But it is not exposed from Trino `IcebergRestCatalogConfig`.
In Spark, this works by default because they accept a generic property bag.

* Rename getColumns to getTopLevelColumns in iceberg connector

* Support partitioning on nested ROW fields in Iceberg

Co-authored-by: Victoria Bukta <[email protected]>
Co-authored-by: Yuya Ebihara <[email protected]>

* Add Parquet Bloom filter write support to Iceberg connector

* List unhealthy containers when product tests health check fails

* Reschedule replacement tasks in case of artificial failure due to missing spooling output size

The code introduced in trinodb#22298 was
not rescheduling replacement tasks. As a result, query execution got
stuck after we observed missing spooling output stats for a task.

* Convert inner classes in ConstantPopulatingPageSource to record

* Convert Iceberg PartitionColumn to record

* Convert IcebergInputInfo to record

* Convert IcebergPageSink.PartitionColumn to record

* Convert Iceberg ReaderPageSourceWithRowPositions to record

* Convert IcebergParquetColumnIOConverter.FieldContext to record

* Convert IcebergPartitionColumn to record

* Convert PartitionTransforms.ColumnTransform to record

* Use field id in freeCurrentRowGroupBuffers for consistency

* Remove interning from ColumnChunkProperties

Avoids having unbounded and unaccounted memory usage.
Also avoids the overheads associated with concurrent map look-up

* Convert ColumnChunkProperties into a record

* Removed unused code from ColumnChunkMetadata

* Optimize ParquetReader#getColumnChunkMetaData

Avoid unnecessary conversion to ColumnPath
Compare arrays in reverse to find mismatch quicker

* Optimize ParquetTypeUtils#getDescriptors

Existing logic was complex due to performing case-insensitive matching.
This was unnecessary because fileSchema and requestedSchema already contain lower-cased names.
Also, since requestedSchema is derived from fileSchema, we can build the descriptors map directly
from the result of getColumns instead of repeating look-ups in fileSchema.

* Update jjwt to 0.12.6

* Update minio to 8.5.11

* Update RoaringBitmap to 1.1.0

* Update AWS SDK v1 to 1.12.749

* Update AWS SDK v2 to 2.26.8

* Update oauth2-oidc-sdk to 11.13

* Update metrics-core to 4.2.26

* Update Nessie to 0.91.2

* Update openlineage to 1.17.1

* Update Azure SDK

Azure SDK version 1.2.24 includes Azure Storage Blob 12.26, which
introduces a breaking change, where blob names used to create a new blob
client should not be URL-encoded anymore. Blob URLs in the storage don't
change, so after adjusting the API usage, data written by older Trino
versions should be readable by new Trino versions.

* Disable caching of /ui responses

This ensures that any sensitive information is not cached in the browser

* Allow adding non-temurin JDK distributions

This is a preparation for testing additional builds like Loom EA

* Add Loom EA build

* Verify ppc64le docker image

* Add clarifying javadoc

* Add more http proxy configs for native S3 filesystem implementation

To bring the HTTP proxy config options to parity with the legacy implementation

* Push filter through TopNRankingNode

* Fix initial value of diagnostics counter

* Fix handling RESERVED status when updating stats

* Allow uppercase characters in Iceberg quoted identifiers partition

* Add special characters missing test coverage for native-fs s3 path

* Use enhanced instanceof in KuduClientSession

* Fix failure when filter operation is applied on varbinary column in Kudu

* Remove enable-coordinator-dynamic-filters-distribution config

This has been enabled by default for a long time without issues.
It is no longer necessary to keep it configurable.

* Migrate Hive view in Delta to integration tests

* Migrate unregistering non-delta table to integration tests

* Throw TrinoException for Kudu failures

Inserting/Merging/Updating invalid data in Kudu doesn't throw a Kudu-specific
exception, so we wrap all the exceptions as Trino-specific exceptions.

* Add date data type support for Kudu

Co-authored-by: Pascal Gasp <[email protected]>

* Add t_cdf and t_pdf scalar functions

* Fix spelling errors in documentation

* Fix incorrect case optimization

Expressions of the form:

   CASE
      WHEN <cond> THEN true
      ELSE false
   END

were being simplified to <cond>, which is
incorrect due to null handling. The original
form would return false, while the simplified
form returns null.

* Allow pre-allocating memory per catalog on initialization

* Add Trino 451 release notes

* [maven-release-plugin] prepare release 451

* Merge fixes for Trino 451 (#35)

---------

Co-authored-by: praveenkrishna.d <[email protected]>
Co-authored-by: Yiqun Zhang <[email protected]>
Co-authored-by: Piotr Findeisen <[email protected]>
Co-authored-by: Yuya Ebihara <[email protected]>
Co-authored-by: regadas <[email protected]>
Co-authored-by: Manfred Moser <[email protected]>
Co-authored-by: Colebow <[email protected]>
Co-authored-by: uditkumar <[email protected]>
Co-authored-by: Łukasz Osipiuk <[email protected]>
Co-authored-by: Rafał Połomka <[email protected]>
Co-authored-by: Ashhar Hasan <[email protected]>
Co-authored-by: Jack Klamer <[email protected]>
Co-authored-by: Sasha Sheikin <[email protected]>
Co-authored-by: Mateusz "Serafin" Gajewski <[email protected]>
Co-authored-by: Raunaq Morarka <[email protected]>
Co-authored-by: Anu Sudarsan <[email protected]>
Co-authored-by: takezoe <[email protected]>
Co-authored-by: Kamil Endruszkiewicz <[email protected]>
Co-authored-by: Lukasz Stec <[email protected]>
Co-authored-by: chenjian2664 <[email protected]>
Co-authored-by: Mayank Vadariya <[email protected]>
Co-authored-by: Alexey Pavlenko <[email protected]>
Co-authored-by: Karol Sobczak <[email protected]>
Co-authored-by: Dejan Mircevski <[email protected]>
Co-authored-by: Jan Waś <[email protected]>
Co-authored-by: Grant Nicholas <[email protected]>
Co-authored-by: Keith Whitley <[email protected]>
Co-authored-by: ajantha-bhat <[email protected]>
Co-authored-by: Grzegorz Nowak <[email protected]>
Co-authored-by: Martin Traverso <[email protected]>
Co-authored-by: Jianjian <[email protected]>
Co-authored-by: senlizishi <[email protected]>
Co-authored-by: Alexander Kolesnikov <[email protected]>
Co-authored-by: Gontzal Monasterio <[email protected]>
Co-authored-by: Marius Grama <[email protected]>
Co-authored-by: Star Poon <[email protected]>
Co-authored-by: Marton Bod <[email protected]>
Co-authored-by: Brad <[email protected]>
Co-authored-by: Vikash Kumar <[email protected]>
Co-authored-by: Victoria Bukta <[email protected]>
Co-authored-by: Jonas Irgens Kylling <[email protected]>
Co-authored-by: kasiafi <[email protected]>
Co-authored-by: Pascal Gasp <[email protected]>
Co-authored-by: Emily Sunaryo <[email protected]>
Co-authored-by: neriya shulman <[email protected]>