Improve performance of hash_character_ngrams using warp-per-string kernel #16212

Merged: 26 commits merged into rapidsai:branch-24.10 on Aug 1, 2024

Conversation

davidwendt (Contributor) commented on Jul 8, 2024:

Description

Improves the performance of nvtext::hash_character_ngrams using a warp-per-string kernel instead of a string per thread.
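
For illustration, a minimal sketch of the warp-per-string pattern used here (not the PR's actual code; `count_kernel`, `d_chars`, and `d_offsets` are hypothetical names): each warp owns one string, its lanes stride over the string's bytes in parallel, and a warp-level reduction combines the per-lane partial counts.

```cpp
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>

#include <cstdint>

namespace cg = cooperative_groups;

constexpr int warp_size = 32;

// Hypothetical warp-per-string kernel: warp `str_idx` counts the UTF-8
// characters of string `str_idx` and derives its character-ngram count.
__global__ void count_kernel(char const* d_chars,
                             int const* d_offsets,  // byte offsets, size num_strings + 1
                             int* d_counts,         // output: ngram count per string
                             int num_strings,
                             int ngrams)
{
  auto const tid     = static_cast<std::int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  auto const str_idx = static_cast<int>(tid / warp_size);
  if (str_idx >= num_strings) { return; }

  auto const warp     = cg::tiled_partition<warp_size>(cg::this_thread_block());
  auto const lane_idx = warp.thread_rank();

  auto const begin = d_offsets[str_idx];
  auto const end   = d_offsets[str_idx + 1];

  // Each lane strides over the string's bytes, counting UTF-8 lead bytes
  // (any byte that is not a 10xxxxxx continuation byte).
  int count = 0;
  for (auto i = begin + static_cast<int>(lane_idx); i < end; i += warp_size) {
    count += ((d_chars[i] & 0xC0) != 0x80);
  }

  // Combine the per-lane partial counts across the warp; lane 0 writes the result.
  auto const char_count = cg::reduce(warp, count, cg::plus<int>());
  if (lane_idx == 0) { d_counts[str_idx] = max(0, char_count - ngrams + 1); }
}
```

Launching such a kernel requires `num_strings * 32` threads, e.g. a grid of `(num_strings * 32 + 255) / 256` blocks of 256 threads, versus one thread per string in the previous approach.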

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt added the labels 2 - In Progress, libcudf, strings, improvement, and non-breaking on Jul 8, 2024
@davidwendt self-assigned this on Jul 8, 2024
@davidwendt (Contributor, Author) commented:

Benchmark for hash_character_ngrams

## [0] NVIDIA RTX A6000

|  rows  | width | ngrams |   Ref Time |   Cmp Time |           Diff |   %Diff | Speedup |
|--------|-------|--------|------------|------------|----------------|---------|---------|
|  1024  |   128 |   5    | 222.331 us |  48.466 us |    -173.864 us | -78.20% |  4.6x  |
|  4096  |   128 |   5    | 220.053 us |  57.080 us |    -162.974 us | -74.06% |  3.9x  |
|  8192  |   128 |   5    | 215.480 us |  68.233 us |    -147.247 us | -68.33% |  3.2x  |
| 16364  |   128 |   5    | 220.708 us |  93.588 us |    -127.119 us | -57.60% |  2.4x  |
| 32768  |   128 |   5    | 230.986 us | 145.089 us |     -85.898 us | -37.19% |  1.6x  |
| 262144 |   128 |   5    |   2.924 ms | 913.710 us |   -2010.059 us | -68.75% |  3.2x  |
|  1024  |   512 |   5    | 807.259 us |  63.721 us |    -743.538 us | -92.11% | 12.7x  |
|  4096  |   512 |   5    | 819.646 us |  93.860 us |    -725.786 us | -88.55% |  8.7x  |
|  8192  |   512 |   5    | 819.803 us | 132.703 us |    -687.100 us | -83.81% |  6.2x  |
| 16364  |   512 |   5    | 873.233 us | 215.296 us |    -657.937 us | -75.34% |  4.1x  |
| 32768  |   512 |   5    | 930.522 us | 386.793 us |    -543.730 us | -58.43% |  2.4x  |
| 262144 |   512 |   5    |  39.315 ms |   2.810 ms |  -36504.542 us | -92.85% | 14.0x  |
|  1024  |  2048 |   5    |   3.269 ms | 130.677 us |   -3138.038 us | -96.00% | 25.0x  |
|  4096  |  2048 |   5    |   3.595 ms | 252.422 us |   -3342.432 us | -92.98% | 14.2x  |
|  8192  |  2048 |   5    |   3.471 ms | 406.363 us |   -3064.841 us | -88.29% |  8.5x  |
| 16364  |  2048 |   5    |   3.602 ms | 735.013 us |   -2866.837 us | -79.59% |  4.9x  |
| 32768  |  2048 |   5    |   3.776 ms |   1.387 ms |   -2388.923 us | -63.26% |  2.7x  |
| 262144 |  2048 |   5    | 234.571 ms |  10.527 ms | -224043.462 us | -95.51% | 22.3x  |
|  1024  |   128 |  10    | 296.423 us |  49.777 us |    -246.646 us | -83.21% |  6.0x  |
|  4096  |   128 |  10    | 291.558 us |  61.260 us |    -230.297 us | -78.99% |  4.8x  |
|  8192  |   128 |  10    | 286.159 us |  74.268 us |    -211.892 us | -74.05% |  3.9x  |
| 16364  |   128 |  10    | 291.627 us | 104.296 us |    -187.331 us | -64.24% |  2.8x  |
| 32768  |   128 |  10    | 300.675 us | 165.557 us |    -135.118 us | -44.94% |  1.8x  |
| 262144 |   128 |  10    |   2.761 ms |   1.082 ms |   -1678.584 us | -60.81% |  2.6x  |
|  1024  |   512 |  10    |   1.186 ms |  70.354 us |   -1115.780 us | -94.07% | 16.9x  |
|  4096  |   512 |  10    |   1.169 ms | 109.403 us |   -1059.529 us | -90.64% | 10.7x  |
|  8192  |   512 |  10    |   1.190 ms | 157.289 us |   -1032.764 us | -86.78% |  7.6x  |
| 16364  |   512 |  10    |   1.224 ms | 261.513 us |    -962.974 us | -78.64% |  4.7x  |
| 32768  |   512 |  10    |   1.293 ms | 479.156 us |    -814.235 us | -62.95% |  2.7x  |
| 262144 |   512 |  10    |  56.334 ms |   3.501 ms |  -52833.185 us | -93.79% | 16.1x  |
|  1024  |  2048 |  10    |   4.883 ms | 159.409 us |   -4723.640 us | -96.74% | 30.6x  |
|  4096  |  2048 |  10    |   5.148 ms | 315.407 us |   -4832.774 us | -93.87% | 16.3x  |
|  8192  |  2048 |  10    |   4.988 ms | 513.009 us |   -4474.621 us | -89.71% |  9.7x  |
| 16364  |  2048 |  10    |   5.134 ms | 924.567 us |   -4209.696 us | -81.99% |  5.6x  |
| 32768  |  2048 |  10    |   5.383 ms |   1.750 ms |   -3632.743 us | -67.49% |  3.1x  |
| 262144 |  2048 |  10    | 349.209 ms |  13.209 ms | -335999.289 us | -96.22% | 26.4x  |

@davidwendt changed the base branch from branch-24.08 to branch-24.10 on July 23, 2024 16:34
@davidwendt removed the 2 - In Progress label on Jul 26, 2024
@davidwendt added the 3 - Ready for Review label on Jul 26, 2024
@davidwendt marked this pull request as ready for review on July 29, 2024 12:41
@davidwendt requested a review from a team as a code owner on July 29, 2024 12:41
Review threads on cpp/src/text/generate_ngrams.cu (outdated, resolved):
return;
}

namespace cg = cooperative_groups;
Reviewer (Member):

Thanks for accommodating cg changes @davidwendt. Just to confirm, do the new changes affect the performance results from before in any way?

davidwendt (Contributor, Author):

No. It looks like the performance is the same with cooperative groups as with the cub warp-reduce.
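
For context, the two formulations being compared are roughly the following (illustrative sketch only; `count` and `temp_storage` are placeholder names, and the cub variant assumes one warp per block for simplicity):

```cpp
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>

#include <cub/warp/warp_reduce.cuh>

namespace cg = cooperative_groups;

// cooperative-groups form: every lane receives the reduced value.
__device__ int warp_sum_cg(int count)
{
  auto const warp = cg::tiled_partition<32>(cg::this_thread_block());
  return cg::reduce(warp, count, cg::plus<int>());
}

// cub form: only lane 0 receives a valid sum. Real code with multiple
// warps per block would allocate one TempStorage per warp.
__device__ int warp_sum_cub(int count)
{
  using WarpReduce = cub::WarpReduce<int>;
  __shared__ typename WarpReduce::TempStorage temp_storage;
  return WarpReduce(temp_storage).Sum(count);
}
```

Both typically lower to warp shuffle instructions, which is consistent with the observation that the two versions benchmark the same.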

}

namespace cg = cooperative_groups;
auto warp = cg::tiled_partition<cudf::detail::warp_size>(cg::this_thread_block());
Reviewer (Member):

Minor:

Suggested change:
- auto warp = cg::tiled_partition<cudf::detail::warp_size>(cg::this_thread_block());
+ auto const warp = cg::tiled_partition<cudf::detail::warp_size>(cg::this_thread_block());

}
}
auto const char_count = cg::reduce(warp, count, cg::plus<int>());
if (lane_idx == 0) { d_counts[str_idx] = std::max(0, char_count - ngrams + 1); }
Reviewer (Member):

Suggested change:
- if (lane_idx == 0) { d_counts[str_idx] = std::max(0, char_count - ngrams + 1); }
+ if (lane_idx == 0) { d_counts[str_idx] = max(0, char_count - ngrams + 1); }

non-blocking nit: prefer CUDA native functions over STL ones in device code, since the former are hardware accelerated; cuda::std::max also works here.
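
As a tiny illustration of that nit (hypothetical names; the unqualified `max` below is the CUDA device built-in, and per the comment above `cuda::std::max` would also work):

```cpp
// Hypothetical device snippet: prefer the CUDA built-in max over std::max.
__global__ void clamp_ngram_counts(int* d_counts, int const* d_char_counts, int ngrams, int n)
{
  auto const idx = static_cast<int>(blockIdx.x * blockDim.x + threadIdx.x);
  if (idx >= n) { return; }
  d_counts[idx] = max(0, d_char_counts[idx] - ngrams + 1);
}
```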

}));
auto [offsets, total_ngrams] =
cudf::detail::make_offsets_child_column(sizes_itr, sizes_itr + input.size(), stream, mr);
auto [offsets, total_ngrams] = [&] {
Reviewer (Member):

neat!
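
The construct being praised is an immediately-invoked lambda feeding a structured binding, which lets both results be computed conditionally while the bound names stay const. A generic, self-contained sketch of the idiom (unrelated to the PR's actual offsets logic):

```cpp
#include <cstdint>
#include <utility>

std::pair<int, std::int64_t> example(bool wide_strings)
{
  // The lambda is invoked immediately; the structured binding receives
  // whichever pair the chosen branch returns, and both names stay const.
  auto const [offset_width, total_count] = [&]() -> std::pair<int, std::int64_t> {
    if (wide_strings) { return {8, 1'000'000'000LL}; }
    return {4, 100'000LL};
  }();
  return {offset_width, total_count};
}
```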

@davidwendt (Contributor, Author) commented:

/merge

@rapids-bot (bot) merged commit 05745d0 into rapidsai:branch-24.10 on Aug 1, 2024
84 checks passed
@davidwendt deleted the ngram-hash-wide-chars branch on August 1, 2024 21:00
Labels
3 - Ready for Review, improvement, libcudf, non-breaking, strings
Projects
Status: Done
Development

Successfully merging this pull request may close these issues: None yet

4 participants