
Add storage api support for reading table for query function in bigquery #22432

Merged
ebyhr merged 1 commit into trinodb:master on Jul 8, 2024

Conversation

@krvikash (Contributor) commented Jun 19, 2024

Description

This PR adds support for using the Storage API to execute native BigQuery queries.

It adds an optional parameter `mode` to the BigQuery `query` table function, which indicates which API should be used to run the native BigQuery query.

By default, when `mode` is not provided, REST_API is used to run the native BigQuery query.

The `mode` parameter can be set to:

1. REST_API (more details: https://cloud.google.com/bigquery/docs/reference/rest)
2. STORAGE_API (more details: https://cloud.google.com/bigquery/docs/reference/storage)

Additional context and related issues

While implementing this feature, we observed that when the native query result set is small, REST_API performs better than STORAGE_API, but when the result set is large, STORAGE_API performs better than REST_API.

Here are some numbers:

--------------------------------------------------------------------------------------
When the table has a single row and SELECT is called on it 5 times:
--------------------------------------------------------------------------------------

testNativeQuerySelectForCaseSensitiveColumnNames (WITH Storage API)
2024-05-20T03:23:59.413-0600 INFO ForkJoinPool-1-worker-1 stdout Time taken in seconds: 48

testNativeQuerySelectForCaseSensitiveColumnNames (With Rest API)
2024-05-20T06:07:00.375-0600 INFO ForkJoinPool-1-worker-1 stdout Time taken in seconds: 20

--------------------------------------------------------------------------------------
When the table has ~60K rows and SELECT is called on it once:
--------------------------------------------------------------------------------------
SELECT * FROM lineitem; (WITH Storage API)
2024-05-20T06:17:16.949-0600 INFO ForkJoinPool-1-worker-1 stdout Time taken in seconds: 47

SELECT * FROM lineitem; (With Rest API)
2024-05-20T06:13:18.182-0600 INFO ForkJoinPool-1-worker-1 stdout Time taken in seconds: 141

Using the Storage API has the overhead of creating a temporary cached table. This approach is similar to how materialized views are implemented in BigQuery.

Release notes

[x] Release notes are required, with the following suggested text:

# Section
* Add Storage API support for reading tables for the `query` function in BigQuery. ({issue}`22432`)

@ebyhr (Member) commented Jun 19, 2024

I don't think exposing an API flag is a good idea. It's an internal implementation detail that users shouldn't care about.

Why not switch between those APIs automatically based on referencedTables, totalBytesProcessed, etc. in com.google.cloud.bigquery.JobStatistics.QueryStatistics? Is it difficult to identify whether the query contains BigQuery-specific syntax?

@krvikash (Contributor Author) commented Jun 20, 2024

Hi @ebyhr, I tried running queries on the lineitem table, which has 60175 rows. Here is what I observed:

  1. With the STORAGE_API approach we get better performance than REST_API when the result set has more than 20000 rows
  2. With the REST_API approach we get better performance than STORAGE_API when the result set has less than 20000 rows

In all cases the TotalBytesProcessed value is the same (8596160), so this value is not helpful for automatically determining which API to use to execute the native query.


[1] nativeQuery=SELECT * FROM tpch.lineitem, queryMode=REST_API
TotalBytesProcessed: 8596160
Total Rows: 60175, Time taken: 25 seconds

[2] nativeQuery=SELECT * FROM tpch.lineitem, queryMode=STORAGE_API
TotalBytesProcessed: 8596160
Total Rows: 60175, Time taken: 14 seconds

[3] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 60000, queryMode=REST_API
TotalBytesProcessed: 8596160
Total Rows: 60175, Time taken: 28 seconds

[4] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 60000, queryMode=STORAGE_API
TotalBytesProcessed: 8596160
Total Rows: 60175, Time taken: 11 seconds

[5] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 50000, queryMode=REST_API
TotalBytesProcessed: 8596160
Total Rows: 50192, Time taken: 23 seconds

[6] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 50000, queryMode=STORAGE_API
TotalBytesProcessed: 8596160
Total Rows: 50192, Time taken: 11 seconds

[7] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 40000, queryMode=REST_API
TotalBytesProcessed: 8596160
Total Rows: 40278, Time taken: 18 seconds

[8] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 40000, queryMode=STORAGE_API
TotalBytesProcessed: 8596160
Total Rows: 40278, Time taken: 11 seconds

[9] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 30000, queryMode=REST_API
TotalBytesProcessed: 8596160
Total Rows: 30209, Time taken: 15 seconds

[10] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 30000, queryMode=STORAGE_API
TotalBytesProcessed: 8596160
Total Rows: 30209, Time taken: 12 seconds

[11] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 25000, queryMode=REST_API
TotalBytesProcessed: 8596160
Total Rows: 25180, Time taken: 13 seconds

[12] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 25000, queryMode=STORAGE_API
TotalBytesProcessed: 8596160
Total Rows: 25180, Time taken: 10 seconds

[13] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 20000, queryMode=REST_API
TotalBytesProcessed: 8596160
Total Rows: 20060, Time taken: 11 seconds

[14] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 20000, queryMode=STORAGE_API
TotalBytesProcessed: 8596160
Total Rows: 20060, Time taken: 11 seconds

[15] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 10000, queryMode=REST_API
TotalBytesProcessed: 8596160
Total Rows: 9965, Time taken: 7 seconds

[16] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 10000, queryMode=STORAGE_API
TotalBytesProcessed: 8596160
Total Rows: 9965, Time taken: 10 seconds

[17] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 5000, queryMode=REST_API
TotalBytesProcessed: 8596160
Total Rows: 5066, Time taken: 5 seconds

[18] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 5000, queryMode=STORAGE_API
TotalBytesProcessed: 8596160
Total Rows: 5066, Time taken: 10 seconds

[19] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 1000, queryMode=REST_API
TotalBytesProcessed: 8596160
Total Rows: 1004, Time taken: 3 seconds

[20] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 1000, queryMode=STORAGE_API
TotalBytesProcessed: 8596160
Total Rows: 1004, Time taken: 10 seconds

[21] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 200, queryMode=REST_API
TotalBytesProcessed: 8596160
Total Rows: 221, Time taken: 4 seconds

[22] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 200, queryMode=STORAGE_API
TotalBytesProcessed: 8596160
Total Rows: 221, Time taken: 8 seconds

[23] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 100, queryMode=REST_API
TotalBytesProcessed: 8596160
Total Rows: 110, Time taken: 3 seconds

[24] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 100, queryMode=STORAGE_API
TotalBytesProcessed: 8596160
Total Rows: 110, Time taken: 9 seconds

[25] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 0, queryMode=REST_API
TotalBytesProcessed: 8596160
Total Rows: 0, Time taken: 2 seconds

[26] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 0, queryMode=STORAGE_API
TotalBytesProcessed: 8596160
Total Rows: 0, Time taken: 9 seconds
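For illustration only, the crossover in these measurements (around 20,000 rows) could be captured by a simple threshold heuristic. The class and method names below are hypothetical, not connector code, and, as discussed later in the thread, the result-set size is only known after the BigQuery job has already run, which is exactly the overhead problem:

```java
public class ApiChooser {
    enum QueryApi { REST_API, STORAGE_API }

    // Hypothetical heuristic: prefer the Storage API for large result sets.
    // The ~20,000-row crossover comes from the measurements above.
    static QueryApi chooseApi(long resultRows) {
        return resultRows >= 20_000 ? QueryApi.STORAGE_API : QueryApi.REST_API;
    }

    public static void main(String[] args) {
        System.out.println(chooseApi(60_175)); // large result set
        System.out.println(chooseApi(1_004));  // small result set
    }
}
```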

@krvikash (Contributor Author)

I am exploring the possibility of using QueryStage#shuffleOutputBytes. This value is populated only after the native query has been executed; once it has, we can use it to decide which API should be used to read the data from BigQuery into Trino.

@krvikash krvikash force-pushed the krvikash/bigquery-table-functions branch from 08f7fc8 to da08ab5 Compare June 27, 2024 10:21
@krvikash (Contributor Author)

Hi @ebyhr @Praveen2112, I have removed the additional parameter approach. The Storage API is now used internally for all native queries, except when there are duplicate columns.

In terms of performance, for cases where the result set is small, the native query now takes more time than the existing solution (REST_API).

I tried using QueryStage#shuffleOutputBytes, but to obtain it we have to run the BigQuery job first and only then decide between REST_API and STORAGE_API. This adds overhead for queries that would otherwise run best via REST_API.

@ebyhr (Member) commented Jun 28, 2024

/test-with-secrets sha=da08ab54b8d950b2be3a78ee893833aaa8b58a0f

github-actions bot commented Jun 28, 2024

The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/9710216273

@ebyhr (Member) commented Jun 28, 2024

Could you check the CI failure? It looks related to this change.

@krvikash krvikash force-pushed the krvikash/bigquery-table-functions branch from da08ab5 to 9b1cd6e Compare June 28, 2024 09:57
@ebyhr (Member) commented Jun 28, 2024

/test-with-secrets sha=9b1cd6e6fbfa2244f0bb420e949003b684756b1f


The CI workflow run with tests that require additional secrets has been started: https://github.com/trinodb/trino/actions/runs/9710992320

@ebyhr (Member) commented Jul 2, 2024

Let me remove the syntax-needs-review label as this PR doesn't change syntax.

@krvikash krvikash force-pushed the krvikash/bigquery-table-functions branch from 9b1cd6e to 2c702ae Compare July 3, 2024 07:53
@krvikash (Contributor Author) commented Jul 3, 2024

Thanks @ebyhr for the review. Addressed comments.

@krvikash krvikash force-pushed the krvikash/bigquery-table-functions branch from 2c702ae to 94d6f30 Compare July 4, 2024 10:05
@krvikash krvikash force-pushed the krvikash/bigquery-table-functions branch from 94d6f30 to 1cd3b48 Compare July 4, 2024 10:56
@krvikash (Contributor Author) commented Jul 4, 2024

@Praveen2112 @ebyhr could you please run this PR with secrets?

@ebyhr (Member) commented Jul 4, 2024

/test-with-secrets sha=1cd3b48a7b3550ecee4e9057330e7ec9637ccf0e

github-actions bot commented Jul 4, 2024

The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/9793194786

@krvikash (Contributor Author) commented Jul 4, 2024

Thanks @ebyhr @Praveen2112 for the review. Addressed comments.

@ebyhr (Member) left a comment


Looks good to me except for the logic that builds the BigQuery SELECT statement in the split manager.

```java
List<String> projectedColumnsNames = getProjectedColumnNames(columns);

String query = filter
        .map(whereClause -> "SELECT " + String.join(",", projectedColumnsNames) + " FROM (" + bigQueryQueryRelationHandle.getQuery() + ") WHERE " + whereClause)
```
@ebyhr (Member):

`String.join(",", projectedColumnsNames)`

This may create an invalid query, `SELECT  FROM (...)`, if projectedColumnsNames is empty, right? I'm not sure whether such situations can really happen, though. Can we change it to `*` or throw an exception?

Also, why is projectedColumnsNames unused when filter doesn't exist?
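The failure mode is easy to reproduce in isolation. The sketch below (hypothetical class, not the actual connector code) mirrors the reviewed snippet and shows how joining an empty column list yields an invalid SELECT:

```java
import java.util.List;

public class EmptyProjectionDemo {
    // Mirrors the reviewed snippet: joins projected column names into a SELECT.
    static String buildQuery(List<String> columns, String innerQuery, String whereClause) {
        return "SELECT " + String.join(",", columns) + " FROM (" + innerQuery + ") WHERE " + whereClause;
    }

    public static void main(String[] args) {
        // With no projected columns the generated SQL is invalid: "SELECT  FROM (...)"
        System.out.println(buildQuery(List.of(), "SELECT 1", "true"));
    }
}
```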

@krvikash (Contributor Author):

> I'm not sure if such situations really happen though

For the query `SELECT count(*) FROM TABLE(bigquery.system.query(query => 'SELECT 1'))`, the projected column names list is empty. Thanks for pointing it out; fixing this. I think I can use createEmptyProjection in this case.

@krvikash (Contributor Author):

Using projected column names also fails when the query does not provide a column name and one gets generated internally, e.g. `SELECT * FROM TABLE(bigquery.system.query(query => 'SELECT 1'))`:

Caused by: com.google.cloud.bigquery.BigQueryException: Unrecognized name: f0_ at [1:8]

@krvikash (Contributor Author):

> Also using projected column names fails when the query does not provide a column name and one gets generated internally, e.g. `SELECT * FROM TABLE(bigquery.system.query(query => 'SELECT 1'))`.

For this reason, I am going back to using `*` instead of projectedColumnNames.
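A minimal sketch of the resulting approach, using hypothetical names rather than the actual connector code: wrap the native query and project `*`, so that auto-generated column names such as f0_ are never referenced:

```java
import java.util.Optional;

public class NativeQueryWrapper {
    // Sketch of wrapping a native query with an optional pushed-down filter.
    // Projecting "*" avoids referencing auto-generated column names like f0_.
    static String wrap(String nativeQuery, Optional<String> filter) {
        return filter
                .map(where -> "SELECT * FROM (" + nativeQuery + ") WHERE " + where)
                .orElse(nativeQuery);
    }

    public static void main(String[] args) {
        System.out.println(wrap("SELECT 1", Optional.empty()));
        System.out.println(wrap("SELECT 1", Optional.of("true")));
    }
}
```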


@krvikash krvikash force-pushed the krvikash/bigquery-table-functions branch from 1cd3b48 to 4e1c1f2 Compare July 5, 2024 08:04
@ebyhr (Member) commented Jul 5, 2024

/test-with-secrets sha=4e1c1f2a16112d6297ced970c44de41b4c31b014

github-actions bot commented Jul 5, 2024

The CI workflow run with tests that require additional secrets has been started: https://github.com/trinodb/trino/actions/runs/9805144765

@krvikash krvikash force-pushed the krvikash/bigquery-table-functions branch from 4e1c1f2 to 38499b7 Compare July 8, 2024 06:29
@krvikash (Contributor Author) commented Jul 8, 2024

Thanks @ebyhr. Addressed comments.

@ebyhr (Member) commented Jul 8, 2024

/test-with-secrets sha=38499b7253a36d0ee815b0f64a753bcfe8e1b836

github-actions bot commented Jul 8, 2024

The CI workflow run with tests that require additional secrets has been started: https://github.com/trinodb/trino/actions/runs/9834807023

@ebyhr ebyhr merged commit 6527107 into trinodb:master Jul 8, 2024
17 of 18 checks passed
@github-actions github-actions bot added this to the 452 milestone Jul 8, 2024
@krvikash krvikash deleted the krvikash/bigquery-table-functions branch July 8, 2024 08:49