Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix 'ColumnPath not found' error reading Parquet files with nested REPEATED fields #5102

Merged
merged 3 commits into from
Nov 28, 2023

Conversation

mmaitre314
Copy link
Contributor

Which issue does this PR close?

Closes #5064.

Rationale for this change

reader_tree() maintains a stack of field names as it recurses through nested schemas. When it encounters a REPEATED field, it performs two pops for only one push and ends up being off by one. The change ensures the counts of pushes and pops match.

Are there any user-facing changes?

No

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the slow review, this looks good. Left a minor improvement suggestion.

FYI @sunchao as you're likely more familiar with this code than I am

parquet/src/record/reader.rs Outdated Show resolved Hide resolved
@tustvold tustvold merged commit a361ce1 into apache:master Nov 28, 2023
16 checks passed
Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

late LGTM too, thanks @tustvold for the ping.

@@ -274,12 +274,12 @@ impl TreeBuilder {
row_group_reader,
)?;

Reader::RepeatedReader(
return Ok(Reader::RepeatedReader(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: instead of returning early here we could also just remove the path.pop() above

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will that not change the path passed to reader_tree above? Does that not matter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried removing path.pop() early on and the recursive call to reader_tree() started failing because it did not find the fields it expected.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, nvm then 😂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RowGroupReader.get_row_iter() fails with Path ColumnPath not found
4 participants