
Refactor batch coalesce to be based solely on batch data size #1133

Merged

4 commits merged into NVIDIA:branch-0.3 from coalesce-batch-refactor on Dec 2, 2020

Conversation

@jlowe (Member) commented Nov 16, 2020:

#1116 placed a 2GB limit on the GPU batch target size, which prevents us from trying to coalesce batches that could exceed the 2GB element limit for any single column. This PR refactors batch coalesce so it is based solely on data size rather than also needing to track string sizes. This significantly simplifies the coalesce logic and removes the need for the plugin to traverse the column hierarchy to calculate individual column sizes when the batch is single-buffer based (e.g., a compressed batch or a contiguous_split batch).

Once cudf's ColumnVector#getDeviceMemorySize is updated to account for nested columns, even normal GPU batches will only require the plugin to traverse the top-level columns, with no need to check for string/list types.

Note that the original batch coalesce code checked for batches containing RapidsHostColumnVector, but those column types should never be seen by batch coalesce. Batches of that type exist only between a partition and the shuffle, and there should never be a coalesce in between. This change also removes the implicit size fallback that applied when a column's type was otherwise ignored.
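The size-only coalesce policy described above can be sketched as a simple greedy grouping: accumulate batches until adding the next one would push the group past the target size. This is an illustrative sketch, not the plugin's actual Scala implementation; the function name, the plain list-of-sizes input, and the 2GB constant standing in for the capped target size are all assumptions.

```python
# Hypothetical sketch of size-based batch coalescing: group batches by
# their data size alone, with no per-column or string-size tracking.
# TARGET_SIZE stands in for the 2GB cap introduced by issue #1116.
TARGET_SIZE = 2 << 30  # 2GB

def coalesce_by_size(batch_sizes, target_size=TARGET_SIZE):
    """Group batch sizes so each group's total stays within target_size.

    A batch larger than target_size forms a group by itself, since it
    cannot be combined with anything without exceeding the cap.
    """
    groups = []
    current, current_total = [], 0
    for size in batch_sizes:
        # Close the current group if adding this batch would overflow it.
        if current and current_total + size > target_size:
            groups.append(current)
            current, current_total = [], 0
        current.append(size)
        current_total += size
    if current:
        groups.append(current)
    return groups
```

Because only total data size matters, the same logic applies unchanged to single-buffer batches (compressed or contiguous_split), which is what removes the need to walk the column hierarchy.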

@jlowe jlowe added this to the Nov 9 - Nov 20 milestone Nov 16, 2020
@jlowe jlowe self-assigned this Nov 16, 2020
@jlowe jlowe added this to In progress in Release 0.3 via automation Nov 16, 2020
@jlowe (Member, Author) commented Nov 16, 2020:

build

revans2 previously approved these changes Nov 17, 2020

@revans2 (Collaborator) left a comment:

Looks good and I would be fine with this going in, but I would like to see us be a bit more defensive.

Release 0.3 automation moved this from In progress to Reviewer approved Nov 17, 2020
Release 0.3 automation moved this from Reviewer approved to Review in progress Nov 17, 2020
@jlowe jlowe marked this pull request as draft November 17, 2020 17:29
@jlowe (Member, Author) commented Nov 17, 2020:

build

@jlowe (Member, Author) commented Nov 17, 2020:

I checked with @nvdbaranec on cudf's concatenate, and apparently only one of the two possible code paths properly checks for a row count overflow. I'm marking this as a draft PR until a fix for that goes in, since this PR removes the plugin's explicit string limit check and relies on cudf::concatenate to catch the overflow.
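The overflow the plugin now delegates to cudf (fixed in rapidsai/cudf#6809) is of this general shape: a concatenated column's row count must fit in a 32-bit signed integer. The sketch below is illustrative only; the function name and exception type are assumptions, not cudf's actual API, which performs this check in C++ inside cudf::concatenate.

```python
# Hedged sketch of a row-count overflow guard like the one the plugin
# now relies on cudf::concatenate to perform: cudf column row counts
# are 32-bit signed, so the concatenated total must not exceed INT32_MAX.
INT32_MAX = 2**31 - 1

def check_concat_row_count(row_counts):
    """Return the total row count, or raise if it would overflow int32."""
    total = sum(row_counts)
    if total > INT32_MAX:
        raise OverflowError(
            f"total rows {total} exceeds the {INT32_MAX} column row limit")
    return total
```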

@jlowe (Member, Author) commented Nov 19, 2020:

This is blocked pending rapidsai/cudf#6809

@sameerz sameerz added the feature request New feature or request label Nov 23, 2020
@jlowe jlowe marked this pull request as ready for review December 2, 2020 15:40
@jlowe (Member, Author) commented Dec 2, 2020:

The cudf dependency has been merged to cudf 0.17, so this is finally ready to go.

@jlowe (Member, Author) commented Dec 2, 2020:

build

Release 0.3 automation moved this from Review in progress to Reviewer approved Dec 2, 2020
@jlowe jlowe merged commit ffcfaa1 into NVIDIA:branch-0.3 Dec 2, 2020
Release 0.3 automation moved this from Reviewer approved to Done Dec 2, 2020
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
…#1133)

* Refactor batch coalesce to be based solely on batch data size

Signed-off-by: Jason Lowe <[email protected]>

* Add TargetSize limit check and comments

Signed-off-by: Jason Lowe <[email protected]>
@jlowe jlowe deleted the coalesce-batch-refactor branch September 10, 2021 15:31
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request Nov 30, 2023
Labels: feature request (New feature or request)
Projects: Release 0.3 — Done
4 participants