
Refactor batch coalesce to be based solely on batch data size #1133

Merged

4 commits merged into NVIDIA:branch-0.3 from coalesce-batch-refactor on Dec 2, 2020

Conversation

@jlowe (Member) commented Nov 16, 2020:

#1116 placed a 2GB limit on the GPU batch target size, which prevents us from trying to coalesce batches that could exceed the 2GB element limit for any single column. This PR refactors batch coalesce so it is based solely on data size rather than also needing to track string sizes. This significantly simplifies the coalesce logic and removes the need for the plugin to traverse the column hierarchy to calculate individual column sizes when the batch is single-buffer based (e.g., a compressed batch or a contiguous_split batch).

Once cudf's ColumnVector#getDeviceMemorySize is updated to account for nested columns, even normal GPU batches will only require the plugin to traverse the top-level columns, with no need to check for string/list types.

Note that the original batch coalesce code checked for batches containing RapidsHostColumnVector, but those column types should never be seen by batch coalesce. Batches of that type exist only between a partition and the shuffle, and there should never be a coalesce in between. This change also removes the implicit size fallback that applied when a column's type was otherwise ignored.
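The size-only coalesce policy described above can be sketched as a simple greedy grouping: accumulate batches until adding the next one would push the group past the target size. This is an illustrative sketch, not the plugin's actual Scala implementation; the function name, the plain list-of-sizes input, and the 2GB constant standing in for the capped target size are all assumptions.

```python
# Hypothetical sketch of size-based batch coalescing: group batches by
# their data size alone, with no per-column or string-size tracking.
# TARGET_SIZE stands in for the 2GB cap introduced by issue #1116.
TARGET_SIZE = 2 << 30  # 2GB

def coalesce_by_size(batch_sizes, target_size=TARGET_SIZE):
    """Group batch sizes so each group's total stays within target_size.

    A batch larger than target_size forms a group by itself, since it
    cannot be combined with anything without exceeding the cap.
    """
    groups = []
    current, current_total = [], 0
    for size in batch_sizes:
        # Close the current group if adding this batch would overflow it.
        if current and current_total + size > target_size:
            groups.append(current)
            current, current_total = [], 0
        current.append(size)
        current_total += size
    if current:
        groups.append(current)
    return groups
```

Because only total data size matters, the same logic applies unchanged to single-buffer batches (compressed or contiguous_split), which is what removes the need to walk the column hierarchy.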

@jlowe jlowe added this to the Nov 9 - Nov 20 milestone Nov 16, 2020
@jlowe jlowe self-assigned this Nov 16, 2020
@jlowe jlowe added this to In progress in Release 0.3 via automation Nov 16, 2020
@jlowe (Member, Author) commented Nov 16, 2020:

build

revans2 previously approved these changes Nov 17, 2020

@revans2 (Collaborator) left a comment:

Looks good and I would be fine with this going in, but I would like to see us be a bit more defensive.

Release 0.3 automation moved this from In progress to Reviewer approved Nov 17, 2020
Release 0.3 automation moved this from Reviewer approved to Review in progress Nov 17, 2020
@jlowe jlowe marked this pull request as draft November 17, 2020 17:29
@jlowe (Member, Author) commented Nov 17, 2020:

build

@jlowe (Member, Author) commented Nov 17, 2020:

I checked with @nvdbaranec on cudf's concatenate, and apparently only one of the two possible code paths properly checks for a row count overflow. I'm marking this as a draft PR until a fix for that goes in, since this PR removes the plugin's explicit string limit check and relies on cudf::concatenate to catch the overflow.
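The overflow the plugin now delegates to cudf (fixed in rapidsai/cudf#6809) is of this general shape: a concatenated column's row count must fit in a 32-bit signed integer. The sketch below is illustrative only; the function name and exception type are assumptions, not cudf's actual API, which performs this check in C++ inside cudf::concatenate.

```python
# Hedged sketch of a row-count overflow guard like the one the plugin
# now relies on cudf::concatenate to perform: cudf column row counts
# are 32-bit signed, so the concatenated total must not exceed INT32_MAX.
INT32_MAX = 2**31 - 1

def check_concat_row_count(row_counts):
    """Return the total row count, or raise if it would overflow int32."""
    total = sum(row_counts)
    if total > INT32_MAX:
        raise OverflowError(
            f"total rows {total} exceeds the {INT32_MAX} column row limit")
    return total
```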

@jlowe (Member, Author) commented Nov 19, 2020:

This is blocked pending rapidsai/cudf#6809

@sameerz sameerz added the feature request New feature or request label Nov 23, 2020
@jlowe jlowe marked this pull request as ready for review December 2, 2020 15:40
@jlowe (Member, Author) commented Dec 2, 2020:

The cudf dependency has been merged to cudf 0.17, so this is finally ready to go.

@jlowe (Member, Author) commented Dec 2, 2020:

build

Release 0.3 automation moved this from Review in progress to Reviewer approved Dec 2, 2020
@jlowe jlowe merged commit ffcfaa1 into NVIDIA:branch-0.3 Dec 2, 2020
Release 0.3 automation moved this from Reviewer approved to Done Dec 2, 2020
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
…#1133)

* Refactor batch coalesce to be based solely on batch data size

Signed-off-by: Jason Lowe <[email protected]>

* Add TargetSize limit check and comments

Signed-off-by: Jason Lowe <[email protected]>
@jlowe jlowe deleted the coalesce-batch-refactor branch September 10, 2021 15:31
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request Nov 30, 2023
Labels: feature request (New feature or request)
Projects: Release 0.3 — Done
4 participants