-
Notifications
You must be signed in to change notification settings - Fork 874
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[REVIEW] Add parquet chunked writing ability for list columns #6831
[REVIEW] Add parquet chunked writing ability for list columns #6831
Conversation
Caused by mistaken merge of temporarily reverted code in rapidsai@e118da0
Construct the schema regardless of first call or not and compare existing echema with constructed schema if not the first call
Please update the changelog in order to start CI tests. View the gpuCI docs here. |
Codecov Report
@@ Coverage Diff @@
## branch-0.17 #6831 +/- ##
===============================================
+ Coverage 81.94% 81.98% +0.04%
===============================================
Files 96 96
Lines 16166 16181 +15
===============================================
+ Hits 13247 13266 +19
+ Misses 2919 2915 -4
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great for the most part, got a bunch of minor suggestions.
@jlowe @revans2 @nvdbaranec , Any objections to this change? When there are no nested columns in the table, the behaviour stays the same. |
This seems OK to me for now. It's backwards-compatible for the cases before nested types and accounts for expressing nullability if nested types are involved. Eventually I would expect this to be replaced with an explicitly specified write schema as discussed in #6862 |
Great! I was about to ask there whether it would be ok for the input schema to take on the responsibility of prescribing nullability as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand all the parquet stuff, but the code looks clean, except for a potential race condition with async memcopies.
…ret/cudf into list-parquet-chunked-write
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pytest LGTM
Closes #6530
Changes:
table_metadata_with_nullability.column_nullable[i]
used to be the nullability of column[i]. Now it contains the flattened nullability of the table e.g. for a table of three columns,int, list<double>, float
, the nullability vector contains the values:write_chunk()
calls. Now the entire schema vector is compared rather than just types.parquet_column_view