[REVIEW] Move template param to member var to improve compile of hash/groupby.cu #6835
The compile time/size of `cpp/src/groupby/hash/groupby.cu` is one of the top offenders for building libcudf.

Current top 5 slowest compiles:

The two sort.cu files may be improved in a later PR. The `drop_duplicates.cu` file is being addressed in #6822.

The simple change here is to the `compute_single_pass_aggs` functor defined here: `cudf/cpp/src/groupby/hash/groupby_kernels.cuh`, lines 65 to 66 in 591bead.

The `skip_rows_with_nulls` template parameter is set to avoid calling (and inlining) `cudf::bit_is_set()`. This function is minimal compared to the `cudf::detail::aggregate_row` function, which must be inlined twice to accommodate this template parameter. Simply changing the parameter to a member variable means we still do not incur an extra call to `cudf::bit_is_set()` when appropriate, but we also generate half as much device code for this specific function. The `cudf::detail::aggregate_row` code is quite significant.

This change reduces the compile time for `hash/groupby.cu` from 16 minutes to 9 minutes. This moves it out of the top 5 (for now). It also reduces the size of `libcudf_base.so` by ~5 MB.

There are no functional changes to any logic. The `gbenchmark/GROUPBY_BENCH` benchmark shows no change in performance.
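
To illustrate the pattern, here is a minimal host-side C++ sketch, not the actual libcudf code. The names `aggs_templated`, `aggs_member`, `run_templated`, and `run_member` are hypothetical, and `bit_is_set` and `aggregate_row` are tiny stand-ins for `cudf::bit_is_set()` and `cudf::detail::aggregate_row`. It assumes a simple sum-by-key aggregation purely for demonstration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for cudf::bit_is_set(): true if bit i of the mask is set.
inline bool bit_is_set(const std::uint32_t* mask, std::size_t i) {
  return ((mask[i / 32] >> (i % 32)) & 1u) != 0;
}

// Hypothetical stand-in for cudf::detail::aggregate_row(), the large function
// whose inlined copies dominate the generated code in the real functor.
inline void aggregate_row(int* sums, int key, int value) { sums[key] += value; }

// Before: the flag is a template parameter, so the whole functor body
// (including the inlined aggregate_row) is instantiated once per flag value.
template <bool skip_rows_with_nulls>
struct aggs_templated {
  int* sums;
  const int* keys;
  const int* values;
  const std::uint32_t* row_bitmask;

  void operator()(std::size_t i) const {
    if (!skip_rows_with_nulls || bit_is_set(row_bitmask, i)) {
      aggregate_row(sums, keys[i], values[i]);
    }
  }
};

// After: the flag is a runtime member, so only one copy of the body exists.
// Short-circuit evaluation still skips bit_is_set() when the flag is false.
struct aggs_member {
  int* sums;
  const int* keys;
  const int* values;
  const std::uint32_t* row_bitmask;
  bool skip_rows_with_nulls;

  void operator()(std::size_t i) const {
    if (!skip_rows_with_nulls || bit_is_set(row_bitmask, i)) {
      aggregate_row(sums, keys[i], values[i]);
    }
  }
};

// The templated version forces the caller to dispatch to two instantiations.
std::vector<int> run_templated(const std::vector<int>& keys, const std::vector<int>& values,
                               const std::uint32_t* mask, bool skip, std::size_t num_keys) {
  std::vector<int> sums(num_keys, 0);
  if (skip) {
    aggs_templated<true> fn{sums.data(), keys.data(), values.data(), mask};
    for (std::size_t i = 0; i < keys.size(); ++i) fn(i);
  } else {
    aggs_templated<false> fn{sums.data(), keys.data(), values.data(), mask};
    for (std::size_t i = 0; i < keys.size(); ++i) fn(i);
  }
  return sums;
}

// The member version needs no dispatch: the flag is just passed along.
std::vector<int> run_member(const std::vector<int>& keys, const std::vector<int>& values,
                            const std::uint32_t* mask, bool skip, std::size_t num_keys) {
  std::vector<int> sums(num_keys, 0);
  aggs_member fn{sums.data(), keys.data(), values.data(), mask, skip};
  for (std::size_t i = 0; i < keys.size(); ++i) fn(i);
  return sums;
}
```

Both versions produce the same results; the difference is that the templated form compiles the functor body twice, while the member form compiles it once and pays only for a runtime branch on a flag that is uniform across all rows.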