Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix BucketValidator to validate on data page #22261

Merged
merged 1 commit into from
Jul 29, 2024

Conversation

ljw9111
Copy link
Contributor

@ljw9111 ljw9111 commented Jun 4, 2024

Description

A bug related to BucketValidator was causing select queries on certain column combination on Hive V1 bucketed table failed with HIVE_CURSOR_ERROR while it succeeded in the other column combination.

Minimal example

DDL of bucketed table

CREATE EXTERNAL TABLE `example_table`(
  `input_key` varchar(1), 
  `metrics` struct<gauges:array<struct<name:string,value:bigint>>>)
PARTITIONED BY ( 
  `run_id` varchar(1))
CLUSTERED BY ( 
  input_key) 
INTO 8 BUCKETS
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  '<S3_PATH_TO_TABLE>'
TBLPROPERTIES ( 
  'bucketing_format'='hive', 
  'bucketing_version'='1', 
)

Failed select query with HIVE_CURSOR_ERROR

select run_id, input_key, metrics.gauges from example_table

Succeeded select queries

select run_id, input_key, metrics from example_table
select * from example_table

Root cause

In HivePageSource.java > getNextPage method, BucketValidator and BucketAdapter validate if the data is in the right bucket. (BucketValidator - here, BucketAdapter - here). This validation is done only once by either BucketValidator or BucketAdapter depending on if bucketAdapter is empty.

The root cause of the query failure is that BucketValidator and BucketAdapter were validating different page while they should have validated the same data page. BucketValidator was validating on output page instead of data page in here while BucketAdapter was validating on data page in here.

Output page includes PREFILLED type column which is for partition or synthesized column while data page doesn't include that column type. BucketAdapter and BucketValidator both use the same bucketColumnIndices based on column mapping index which only accounts for regular or interim column type in here.

Thus when BucketValidator calls page.getColumns in validate method in here, in an edge case where in the column order, prefilled(partition) column precedes regular/interim columns, it returns wrong column as bucket column. This is because bucketColumnIndices is based only on regular or interim column while the output page contains blocks from columns including not only regular/interim column, but prefilled one causing mismatch between bucketColumnIndex and the block (indices of blocks got pushed by 1).
The wrong bucket column caused HIVE_CURSOR_ERROR and failed select queries.

BucketAdapter didn't have this issue because both data page being validated and the bucketColumnIndices being using in page.getColumns here both doesn't include prefilled column type.

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive
* Fix potential query failures caused by incorrect bucket column validation in some cases

Copy link

cla-bot bot commented Jun 4, 2024

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

@github-actions github-actions bot added the hive Hive connector label Jun 4, 2024
@pettyjamesm
Copy link
Member

cc: @electrum - does moving the bucket validation to operate on the input dataPage instead of the output page make sense here? It does seem to fix the issue with the incorrect block index reference but I'm having a hard time deducing the interaction between INTERIM columns in this context.

Copy link
Member

@electrum electrum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test for this?

@ljw9111
Copy link
Contributor Author

ljw9111 commented Jun 17, 2024

@electrum Thank you for reviewing this!
I tried to add unit test for this but bucketing related test file (AbstractTestHive.java) which included bucketing tests and helper functions has been deleted in this PR for some reason. So I think if we want to add unit test for this, we need to revive many helper functions used to be in AbstractTestHive.java. Is it okay?
I was thinking modifying testBucketedTableEvolutionWithDifferentReadBucketCount which used to be in AbstractTestHive.java and intentionally make partition column to be at the first of the column order and then run getNextPage method on this to see if it correctly fetches the bucket column.

Copy link

github-actions bot commented Jul 9, 2024

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Jul 9, 2024
@mosabua
Copy link
Member

mosabua commented Jul 22, 2024

Can you help here @electrum ?

@github-actions github-actions bot removed the stale label Jul 23, 2024
@pettyjamesm
Copy link
Member

Thanks for adding the tests, @ljw9111. LGTM

@pettyjamesm pettyjamesm merged commit 94ea76d into trinodb:master Jul 29, 2024
56 checks passed
@github-actions github-actions bot added this to the 454 milestone Jul 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed hive Hive connector
Development

Successfully merging this pull request may close these issues.

None yet

4 participants