
Add storage api support for reading table for query function in bigquery #22432

Merged
ebyhr merged 1 commit into trinodb:master on Jul 8, 2024

Conversation

@krvikash (Contributor) commented Jun 19, 2024

Description

This PR adds support for using the Storage API to execute native BigQuery queries.

It adds an optional parameter `mode` to the BigQuery `query` table function, which indicates which API should be used to run the native BigQuery query.

By default, when `mode` is not provided, REST_API is used to run the native BigQuery query.

The `mode` parameter can be set to:

1. REST_API (more details: https://cloud.google.com/bigquery/docs/reference/rest)
2. STORAGE_API (more details: https://cloud.google.com/bigquery/docs/reference/storage)

Additional context and related issues

While implementing this feature, we observed that when the native query result set is small, REST_API performs better than STORAGE_API, but when the result set is large, STORAGE_API performs better than REST_API.

Here are some numbers:

--------------------------------------------------------------------------------------
When the table has a single row and SELECT is called on it 5 times:
--------------------------------------------------------------------------------------

testNativeQuerySelectForCaseSensitiveColumnNames (WITH Storage API)
2024-05-20T03:23:59.413-0600 INFO ForkJoinPool-1-worker-1 stdout Time taken in seconds: 48

testNativeQuerySelectForCaseSensitiveColumnNames (With Rest API)
2024-05-20T06:07:00.375-0600 INFO ForkJoinPool-1-worker-1 stdout Time taken in seconds: 20

--------------------------------------------------------------------------------------
When the table has ~60K rows and SELECT is called on it once:
--------------------------------------------------------------------------------------
SELECT * FROM lineitem; (WITH Storage API)
2024-05-20T06:17:16.949-0600 INFO ForkJoinPool-1-worker-1 stdout Time taken in seconds: 47

SELECT * FROM lineitem; (With Rest API)
2024-05-20T06:13:18.182-0600 INFO ForkJoinPool-1-worker-1 stdout Time taken in seconds: 141

Using the Storage API has the overhead of creating a temporary cached table. This approach is similar to how materialized views are implemented in BigQuery.

Release notes

[x] Release notes are required, with the following suggested text:

# Section
* Add Storage API support for reading tables for the `query` function in BigQuery. ({issue}`22432`)

@ebyhr (Member) commented Jun 19, 2024

I don't think exposing an API flag is a good idea. It's an internal implementation detail that users shouldn't care about.

Why not switch between those APIs automatically based on referencedTables, totalBytesProcessed, etc. in com.google.cloud.bigquery.JobStatistics.QueryStatistics? Is it difficult to identify whether the query contains BigQuery-specific syntax?

@krvikash (Contributor Author) commented Jun 20, 2024

Hi @ebyhr, I tried running queries on the lineitem table, which has 60175 rows. Here is what I observed:

  1. With the STORAGE_API approach we get better performance than REST_API when the result set has more than 20000 rows
  2. With the REST_API approach we get better performance than STORAGE_API when the result set has less than 20000 rows

In all cases the TotalBytesProcessed value is the same (8596160), so this value is not helpful for automatically determining which API to use to execute the native query.


[1] nativeQuery=SELECT * FROM tpch.lineitem, queryMode=REST_API
TotalBytesProcessed: 8596160
Total Rows: 60175, Time taken: 25 seconds

[2] nativeQuery=SELECT * FROM tpch.lineitem, queryMode=STORAGE_API
TotalBytesProcessed: 8596160
Total Rows: 60175, Time taken: 14 seconds

[3] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 60000, queryMode=REST_API
TotalBytesProcessed: 8596160
Total Rows: 60175, Time taken: 28 seconds

[4] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 60000, queryMode=STORAGE_API
TotalBytesProcessed: 8596160
Total Rows: 60175, Time taken: 11 seconds

[5] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 50000, queryMode=REST_API
TotalBytesProcessed: 8596160
Total Rows: 50192, Time taken: 23 seconds

[6] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 50000, queryMode=STORAGE_API
TotalBytesProcessed: 8596160
Total Rows: 50192, Time taken: 11 seconds

[7] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 40000, queryMode=REST_API
TotalBytesProcessed: 8596160
Total Rows: 40278, Time taken: 18 seconds

[8] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 40000, queryMode=STORAGE_API
TotalBytesProcessed: 8596160
Total Rows: 40278, Time taken: 11 seconds

[9] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 30000, queryMode=REST_API
TotalBytesProcessed: 8596160
Total Rows: 30209, Time taken: 15 seconds

[10] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 30000, queryMode=STORAGE_API
TotalBytesProcessed: 8596160
Total Rows: 30209, Time taken: 12 seconds

[11] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 25000, queryMode=REST_API
TotalBytesProcessed: 8596160
Total Rows: 25180, Time taken: 13 seconds

[12] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 25000, queryMode=STORAGE_API
TotalBytesProcessed: 8596160
Total Rows: 25180, Time taken: 10 seconds

[13] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 20000, queryMode=REST_API
TotalBytesProcessed: 8596160
Total Rows: 20060, Time taken: 11 seconds

[14] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 20000, queryMode=STORAGE_API
TotalBytesProcessed: 8596160
Total Rows: 20060, Time taken: 11 seconds

[15] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 10000, queryMode=REST_API
TotalBytesProcessed: 8596160
Total Rows: 9965, Time taken: 7 seconds

[16] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 10000, queryMode=STORAGE_API
TotalBytesProcessed: 8596160
Total Rows: 9965, Time taken: 10 seconds

[17] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 5000, queryMode=REST_API
TotalBytesProcessed: 8596160
Total Rows: 5066, Time taken: 5 seconds

[18] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 5000, queryMode=STORAGE_API
TotalBytesProcessed: 8596160
Total Rows: 5066, Time taken: 10 seconds

[19] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 1000, queryMode=REST_API
TotalBytesProcessed: 8596160
Total Rows: 1004, Time taken: 3 seconds

[20] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 1000, queryMode=STORAGE_API
TotalBytesProcessed: 8596160
Total Rows: 1004, Time taken: 10 seconds

[21] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 200, queryMode=REST_API
TotalBytesProcessed: 8596160
Total Rows: 221, Time taken: 4 seconds

[22] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 200, queryMode=STORAGE_API
TotalBytesProcessed: 8596160
Total Rows: 221, Time taken: 8 seconds

[23] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 100, queryMode=REST_API
TotalBytesProcessed: 8596160
Total Rows: 110, Time taken: 3 seconds

[24] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 100, queryMode=STORAGE_API
TotalBytesProcessed: 8596160
Total Rows: 110, Time taken: 9 seconds

[25] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 0, queryMode=REST_API
TotalBytesProcessed: 8596160
Total Rows: 0, Time taken: 2 seconds

[26] nativeQuery=SELECT * FROM tpch.lineitem WHERE orderkey <= 0, queryMode=STORAGE_API
TotalBytesProcessed: 8596160
Total Rows: 0, Time taken: 9 seconds
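For illustration only, the crossover in these measurements (around 20,000 rows) could be captured by a simple threshold heuristic. The class and method names below are hypothetical, not connector code, and, as discussed later in the thread, the result-set size is only known after the BigQuery job has already run, which is exactly the overhead problem:

```java
public class ApiChooser {
    enum QueryApi { REST_API, STORAGE_API }

    // Hypothetical heuristic: prefer the Storage API for large result sets.
    // The ~20,000-row crossover comes from the measurements above.
    static QueryApi chooseApi(long resultRows) {
        return resultRows >= 20_000 ? QueryApi.STORAGE_API : QueryApi.REST_API;
    }

    public static void main(String[] args) {
        System.out.println(chooseApi(60_175)); // large result set
        System.out.println(chooseApi(1_004));  // small result set
    }
}
```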

@krvikash (Contributor Author)

I am exploring the possibility of using QueryStage#shuffleOutputBytes. This value is populated only after the native query has been executed; once it has, we can use it to decide which API should be used to read the data from BigQuery into Trino.

@krvikash krvikash force-pushed the krvikash/bigquery-table-functions branch from 08f7fc8 to da08ab5 Compare June 27, 2024 10:21
@krvikash (Contributor Author)

Hi @ebyhr @Praveen2112, I have removed the additional parameter approach. The Storage API is now used internally for all native queries, except when there are duplicate columns.

In terms of performance, for cases where the result set is small, the native query now takes more time than the existing solution (REST_API).

I tried using QueryStage#shuffleOutputBytes, but to obtain it we have to run the BigQuery job first and only then decide between REST_API and STORAGE_API. This adds overhead for queries that would otherwise run best via REST_API.

@ebyhr (Member) commented Jun 28, 2024

/test-with-secrets sha=da08ab54b8d950b2be3a78ee893833aaa8b58a0f

github-actions bot commented Jun 28, 2024

The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/9710216273

@ebyhr (Member) commented Jun 28, 2024

Could you check the CI failure? It looks related to this change.

@krvikash krvikash force-pushed the krvikash/bigquery-table-functions branch from da08ab5 to 9b1cd6e Compare June 28, 2024 09:57
@ebyhr (Member) commented Jun 28, 2024

/test-with-secrets sha=9b1cd6e6fbfa2244f0bb420e949003b684756b1f


The CI workflow run with tests that require additional secrets has been started: https://github.com/trinodb/trino/actions/runs/9710992320

@ebyhr (Member) commented Jul 2, 2024

Let me remove the syntax-needs-review label as this PR doesn't change syntax.

@krvikash krvikash force-pushed the krvikash/bigquery-table-functions branch from 9b1cd6e to 2c702ae Compare July 3, 2024 07:53
@krvikash (Contributor Author) commented Jul 3, 2024

Thanks @ebyhr for the review. Addressed comments.

@krvikash krvikash force-pushed the krvikash/bigquery-table-functions branch from 2c702ae to 94d6f30 Compare July 4, 2024 10:05
@krvikash krvikash force-pushed the krvikash/bigquery-table-functions branch from 94d6f30 to 1cd3b48 Compare July 4, 2024 10:56
@krvikash (Contributor Author) commented Jul 4, 2024

@Praveen2112 @ebyhr could you please run this PR with secrets?

@ebyhr (Member) commented Jul 4, 2024

/test-with-secrets sha=1cd3b48a7b3550ecee4e9057330e7ec9637ccf0e

github-actions bot commented Jul 4, 2024

The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/9793194786

@krvikash (Contributor Author) commented Jul 4, 2024

Thanks @ebyhr @Praveen2112 for the review. Addressed comments.

@ebyhr (Member) left a comment


Looks good to me except for the logic that builds the BigQuery SELECT statement in the split manager.

```java
List<String> projectedColumnsNames = getProjectedColumnNames(columns);

String query = filter
        .map(whereClause -> "SELECT " + String.join(",", projectedColumnsNames) + " FROM (" + bigQueryQueryRelationHandle.getQuery() + ") WHERE " + whereClause)
```
@ebyhr (Member):

`String.join(",", projectedColumnsNames)`

This may create an invalid query, `SELECT  FROM (...)`, if projectedColumnsNames is empty, right? I'm not sure whether such situations can really happen, though. Can we change it to `*` or throw an exception?

Also, why is projectedColumnsNames unused when filter doesn't exist?
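The failure mode is easy to reproduce in isolation. The sketch below (hypothetical class, not the actual connector code) mirrors the reviewed snippet and shows how joining an empty column list yields an invalid SELECT:

```java
import java.util.List;

public class EmptyProjectionDemo {
    // Mirrors the reviewed snippet: joins projected column names into a SELECT.
    static String buildQuery(List<String> columns, String innerQuery, String whereClause) {
        return "SELECT " + String.join(",", columns) + " FROM (" + innerQuery + ") WHERE " + whereClause;
    }

    public static void main(String[] args) {
        // With no projected columns the generated SQL is invalid: "SELECT  FROM (...)"
        System.out.println(buildQuery(List.of(), "SELECT 1", "true"));
    }
}
```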

@krvikash (Contributor Author):

> I'm not sure if such situations really happen though

For the query `SELECT count(*) FROM TABLE(bigquery.system.query(query => 'SELECT 1'))`, the projected column names list is empty. Thanks for pointing it out; fixing this. I think I can use createEmptyProjection in this case.

@krvikash (Contributor Author):

Using projected column names also fails when the query does not provide a column name and one gets generated internally, e.g. `SELECT * FROM TABLE(bigquery.system.query(query => 'SELECT 1'))`:

Caused by: com.google.cloud.bigquery.BigQueryException: Unrecognized name: f0_ at [1:8]

@krvikash (Contributor Author):

> Also using projected column names fails when the query does not provide a column name and one gets generated internally, e.g. `SELECT * FROM TABLE(bigquery.system.query(query => 'SELECT 1'))`.

For this reason, I am going back to using `*` instead of projectedColumnNames.
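A minimal sketch of the resulting approach, using hypothetical names rather than the actual connector code: wrap the native query and project `*`, so that auto-generated column names such as f0_ are never referenced:

```java
import java.util.Optional;

public class NativeQueryWrapper {
    // Sketch of wrapping a native query with an optional pushed-down filter.
    // Projecting "*" avoids referencing auto-generated column names like f0_.
    static String wrap(String nativeQuery, Optional<String> filter) {
        return filter
                .map(where -> "SELECT * FROM (" + nativeQuery + ") WHERE " + where)
                .orElse(nativeQuery);
    }

    public static void main(String[] args) {
        System.out.println(wrap("SELECT 1", Optional.empty()));
        System.out.println(wrap("SELECT 1", Optional.of("true")));
    }
}
```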


@krvikash krvikash force-pushed the krvikash/bigquery-table-functions branch from 1cd3b48 to 4e1c1f2 Compare July 5, 2024 08:04
@ebyhr (Member) commented Jul 5, 2024

/test-with-secrets sha=4e1c1f2a16112d6297ced970c44de41b4c31b014

github-actions bot commented Jul 5, 2024

The CI workflow run with tests that require additional secrets has been started: https://github.com/trinodb/trino/actions/runs/9805144765

@krvikash krvikash force-pushed the krvikash/bigquery-table-functions branch from 4e1c1f2 to 38499b7 Compare July 8, 2024 06:29
@krvikash (Contributor Author) commented Jul 8, 2024

Thanks @ebyhr. Addressed comments.

@ebyhr (Member) commented Jul 8, 2024

/test-with-secrets sha=38499b7253a36d0ee815b0f64a753bcfe8e1b836

github-actions bot commented Jul 8, 2024

The CI workflow run with tests that require additional secrets has been started: https://github.com/trinodb/trino/actions/runs/9834807023

@ebyhr ebyhr merged commit 6527107 into trinodb:master Jul 8, 2024
17 of 18 checks passed
@github-actions github-actions bot added this to the 452 milestone Jul 8, 2024
@krvikash krvikash deleted the krvikash/bigquery-table-functions branch July 8, 2024 08:49