Fix BucketValidator to validate on data page #22261
Merged
Description
A bug in BucketValidator caused select queries on certain column combinations of a Hive V1 bucketed table to fail with HIVE_CURSOR_ERROR, while queries on other column combinations succeeded.
Minimal example
DDL of bucketed table
Failed select query with HIVE_CURSOR_ERROR
Succeeded select queries
Root cause
In the getNextPage method of HivePageSource.java, BucketValidator and BucketAdapter check that the data is in the right bucket. This check runs only once, in either BucketValidator or BucketAdapter, depending on whether bucketAdapter is empty.
The root cause of the query failure is that BucketValidator and BucketAdapter validated different pages when they should have validated the same data page. BucketValidator validated the output page, while BucketAdapter validated the data page.
The output page includes PREFILLED columns, which hold partition or synthesized columns, while the data page does not include that column type. BucketAdapter and BucketValidator both use the same bucketColumnIndices, which are derived from the column mapping index and account only for regular and interim columns.
Thus, when BucketValidator calls page.getColumns in its validate method, it gets the wrong column as the bucket column in the edge case where a prefilled (partition) column precedes the regular/interim columns in the column order. bucketColumnIndices counts only regular and interim columns, but the output page contains blocks for prefilled columns as well, so each bucketColumnIndex no longer matches its block (the block indices are pushed by 1).
The wrong bucket column caused HIVE_CURSOR_ERROR and failed select queries.
BucketAdapter did not have this issue because neither the data page being validated nor the bucketColumnIndices used in page.getColumns includes prefilled columns.
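The index mismatch can be illustrated with a minimal, self-contained sketch. It does not use the actual Presto classes: `BucketIndexSketch`, `Column`, `Kind`, `dataPage`, `outputPage`, and `bucketColumnIndex` are hypothetical names invented for this illustration, and pages are modeled as plain lists of column names rather than blocks. The assumption is a table whose partition column `ds` precedes the bucket column `id` in the column order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these types exist in Presto.
final class BucketIndexSketch {
    enum Kind { REGULAR, PREFILLED }

    static final class Column {
        final String name;
        final Kind kind;

        Column(String name, Kind kind) {
            this.name = name;
            this.kind = kind;
        }
    }

    // Data page: holds only the non-prefilled columns, in column order.
    static List<String> dataPage(List<Column> columns) {
        List<String> page = new ArrayList<>();
        for (Column column : columns) {
            if (column.kind != Kind.PREFILLED) {
                page.add(column.name);
            }
        }
        return page;
    }

    // Output page: holds every column, including prefilled ones.
    static List<String> outputPage(List<Column> columns) {
        List<String> page = new ArrayList<>();
        for (Column column : columns) {
            page.add(column.name);
        }
        return page;
    }

    // Bucket column index, counted over non-prefilled columns only,
    // mirroring how the description says bucketColumnIndices is built.
    static int bucketColumnIndex(List<Column> columns, String bucketColumn) {
        int index = 0;
        for (Column column : columns) {
            if (column.kind == Kind.PREFILLED) {
                continue; // prefilled columns do not consume an index
            }
            if (column.name.equals(bucketColumn)) {
                return index;
            }
            index++;
        }
        throw new IllegalArgumentException("bucket column not found: " + bucketColumn);
    }

    public static void main(String[] args) {
        List<Column> columns = List.of(
                new Column("ds", Kind.PREFILLED), // partition column comes first
                new Column("id", Kind.REGULAR),   // bucket column
                new Column("name", Kind.REGULAR));

        int index = bucketColumnIndex(columns, "id");

        // Looking up the index in the data page finds the bucket column.
        System.out.println(dataPage(columns).get(index));   // id
        // Looking up the same index in the output page finds the partition column.
        System.out.println(outputPage(columns).get(index)); // ds
    }
}
```

In this simplified model, validating against the data page resolves index 0 to `id`, while validating against the output page resolves the same index to `ds`, which is the kind of mismatch described above.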
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: