[REVIEW] Add parquet chunked writing ability for list columns #6831

Merged: 18 commits into rapidsai:branch-0.17 on Dec 3, 2020

Conversation

devavret
Contributor

@devavret devavret commented Nov 23, 2020

Closes #6530

Changes:

  • Added a way to specify the nullability of list columns. The API change is as follows: table_metadata_with_nullability.column_nullable[i] used to be the nullability of column[i]. It now holds the flattened nullability of the table. For example, for a table of three columns (int, list<double>, float), the nullability vector contains:

    Index | Nullability of
    0     | int column
    1     | Level 0 of list column (the list itself)
    2     | Level 1 of list column (the double values)
    3     | float column

  • Changed how the schema is checked across write_chunk() calls. The entire schema vector is now compared rather than just the types.
  • Fixed a bug introduced in the list writing PR where a non-nested column following a list column would get the wrong value for its definition bits. All places that queried this information from the schema now use parquet_column_view instead.
  • Fixed a regression introduced by a later commit in the list writing PR while adding column_view with offset support to list columns. Changed pinned memory to normal pageable memory.
  • Added missing tests for the chunked writer where the nullability is mismatched across calls, or where nullability is specified in the first call.
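
The flattened nullability order and the whole-schema check described above can be sketched in standalone C++. This is only an illustration, not the libcudf implementation: column_desc, schema_elem, append_levels, flatten, same_schema and example_table are hypothetical names, and the nullable flags in the example table are made up. The sketch assumes a depth-first order (a list level first, then its child), which matches the index table in the description.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a column's type tree; not a libcudf type.
struct column_desc {
  std::string type;                   // "int", "list", "double", ...
  bool nullable;                      // nullability of this level
  std::vector<column_desc> children;  // a list level has exactly one child
};

// One entry of the flattened schema (also hypothetical).
struct schema_elem {
  std::string type;
  bool nullable;
};

bool operator==(schema_elem const& a, schema_elem const& b)
{
  return a.type == b.type && a.nullable == b.nullable;
}

// Depth-first walk: emit this level, then its children.
void append_levels(column_desc const& col, std::vector<schema_elem>& out)
{
  assert(col.type != "list" || col.children.size() == 1);
  out.push_back({col.type, col.nullable});
  for (auto const& child : col.children) { append_levels(child, out); }
}

// Flatten a whole table, column by column.
std::vector<schema_elem> flatten(std::vector<column_desc> const& table)
{
  std::vector<schema_elem> out;
  for (auto const& col : table) { append_levels(col, out); }
  return out;
}

// A later chunk is accepted only if its whole flattened schema matches,
// so a nullability mismatch is caught even when the types agree.
bool same_schema(std::vector<column_desc> const& first,
                 std::vector<column_desc> const& next)
{
  return flatten(first) == flatten(next);
}

// The example table from the description: int, list<double>, float.
// The nullable flags are illustrative assumptions.
std::vector<column_desc> example_table()
{
  return {
    column_desc{"int", false, {}},
    column_desc{"list", true, {column_desc{"double", false, {}}}},
    column_desc{"float", true, {}},
  };
}
```

For the example table, flatten yields four entries (int, list, double, float), one per row of the index table. For a table without nested columns it yields one entry per column, so the old per-column meaning of the nullability vector is preserved.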

@devavret devavret requested a review from a team as a code owner November 23, 2020 12:28
@GPUtester
Collaborator

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@codecov

codecov bot commented Nov 23, 2020

Codecov Report

Merging #6831 (0cc1fff) into branch-0.17 (a2d2726) will increase coverage by 0.04%.
The diff coverage is n/a.


@@               Coverage Diff               @@
##           branch-0.17    #6831      +/-   ##
===============================================
+ Coverage        81.94%   81.98%   +0.04%     
===============================================
  Files               96       96              
  Lines            16166    16181      +15     
===============================================
+ Hits             13247    13266      +19     
+ Misses            2919     2915       -4     
Impacted Files | Coverage Δ
python/cudf/cudf/core/column/string.py | 86.30% <0.00%> (-0.35%) ⬇️
python/cudf/cudf/core/column/numerical.py | 94.53% <0.00%> (+0.03%) ⬆️
python/cudf/cudf/core/column/categorical.py | 93.37% <0.00%> (+0.03%) ⬆️
python/cudf/cudf/core/column/timedelta.py | 89.53% <0.00%> (+0.08%) ⬆️
python/cudf/cudf/core/index.py | 93.25% <0.00%> (+0.11%) ⬆️
python/cudf/cudf/core/column/datetime.py | 88.55% <0.00%> (+0.11%) ⬆️
python/cudf/cudf/core/dataframe.py | 91.14% <0.00%> (+0.15%) ⬆️
python/cudf/cudf/utils/dtypes.py | 89.10% <0.00%> (+0.38%) ⬆️
python/cudf/cudf/utils/utils.py | 85.35% <0.00%> (+1.10%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a2d2726...0cc1fff. Read the comment docs.

@harrism harrism changed the title [REVIEW] Added parquet chunked writing ability for list columns [REVIEW] Add parquet chunked writing ability for list columns Nov 24, 2020
@harrism harrism added the cuIO, 3 - Ready for Review, 4 - Needs cuIO Reviewer, and libcudf labels Nov 24, 2020
@harrism harrism added this to PR-WIP in v0.17 Release via automation Nov 24, 2020
@harrism harrism moved this from PR-WIP to PR-Needs review in v0.17 Release Nov 24, 2020
@harrism harrism requested review from vuule and removed request for jrhemstad November 24, 2020 23:09
Contributor

@vuule vuule left a comment

Looks great for the most part, got a bunch of minor suggestions.

cpp/src/io/parquet/page_enc.cu (outdated, resolved)
cpp/src/io/parquet/parquet_gpu.hpp (outdated, resolved)
cpp/src/io/parquet/page_enc.cu (outdated, resolved)
cpp/src/io/parquet/writer_impl.cu (outdated, resolved)
cpp/src/io/parquet/writer_impl.cu (outdated, resolved)
cpp/src/io/parquet/writer_impl.cu (outdated, resolved)
cpp/tests/io/parquet_test.cpp (outdated, resolved)
cpp/tests/io/parquet_test.cpp (outdated, resolved)
cpp/tests/io/parquet_test.cpp (resolved)
cpp/src/io/parquet/parquet_gpu.hpp (resolved)
@devavret
Contributor Author

devavret commented Nov 30, 2020

> Added a method of specifying the nullability of list columns. The API change is as follows: table_metadata_with_nullability.column_nullable[i] used to be the nullability of column[i]. Now it contains the flattened nullability of the table

@jlowe @revans2 @nvdbaranec, any objections to this change?

When there are no nested columns in the table, the behaviour stays the same.

@jlowe
Member

jlowe commented Dec 1, 2020

> Any objections to this change?

This seems OK to me for now. It's backwards-compatible for the cases before nested types and accounts for expressing nullability if nested types are involved.

Eventually I would expect this to be replaced with an explicitly specified write schema as discussed in #6862

@devavret
Contributor Author

devavret commented Dec 1, 2020

> Eventually I would expect this to be replaced with an explicitly specified write schema as discussed in #6862

Great! I was about to ask there whether it would be ok for the input schema to take on the responsibility of prescribing nullability as well.

@devavret devavret requested a review from a team as a code owner December 3, 2020 01:12
@devavret devavret added the non-breaking and feature request labels Dec 3, 2020
@devavret devavret requested a review from vuule December 3, 2020 01:15
Member

@harrism harrism left a comment

I don't understand all the parquet stuff, but the code looks clean, except for a potential race condition with async memcopies.

cpp/src/io/parquet/page_enc.cu (resolved)
cpp/src/io/parquet/writer_impl.cu (resolved)
cpp/src/io/parquet/writer_impl.cu (resolved)
Collaborator

@kkraus14 kkraus14 left a comment

pytest LGTM

@devavret devavret requested a review from harrism December 3, 2020 20:23
v0.17 Release automation moved this from PR-Needs review to PR-Reviewer approved Dec 3, 2020
@harrism harrism added the 6 - Okay to Auto-Merge and 5 - Ready to Merge labels and removed the 3 - Ready for Review and 4 - Needs cuIO Reviewer labels Dec 3, 2020
@rapids-bot rapids-bot bot merged commit 9fb69a6 into rapidsai:branch-0.17 Dec 3, 2020
v0.17 Release automation moved this from PR-Reviewer approved to Done Dec 3, 2020
Labels
5 - Ready to Merge, cuIO, feature request, libcudf, non-breaking
Projects
No open projects
v0.17 Release
  
Done
Development

Successfully merging this pull request may close these issues.

[FEA] Support chunked Parquet writes for lists
6 participants