This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Faster GPU NMS operator #16542

Merged
19 commits merged on Nov 5, 2019

Conversation

@ptrendx (Member) commented Oct 19, 2019

Description

This PR significantly improves the performance of mx.sym.contrib.box_nms on GPU.

@zhreshold @Jerryzcn FYI

Test cases:

input size       topk    old time   new time   speedup
(1, 507, 5)       507    0.78 ms    0.14 ms     5.57
(1, 1875, 5)     1875    2.34 ms    0.22 ms    10.63
(1, 7500, 5)     2400    4.07 ms    0.41 ms     9.93
(1, 30000, 5)    2400    4.17 ms    0.41 ms    10.17
(1, 120000, 5)   2400    5.81 ms    0.47 ms    12.36
(1, 159882, 5)  12001   10.48 ms    1.17 ms     8.96
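For context, box_nms performs greedy non-maximum suppression over scored boxes. A minimal CPU sketch of that algorithm (plain C++ with illustrative names, not the PR's CUDA code; it assumes a [score, x1, y1, x2, y2] box layout):

```cpp
#include <algorithm>
#include <vector>

struct Box { float score, x1, y1, x2, y2; };

// Intersection-over-union of two axis-aligned boxes.
float IoU(const Box& a, const Box& b) {
  float ix = std::max(0.f, std::min(a.x2, b.x2) - std::max(a.x1, b.x1));
  float iy = std::max(0.f, std::min(a.y2, b.y2) - std::max(a.y1, b.y1));
  float inter = ix * iy;
  float area_a = (a.x2 - a.x1) * (a.y2 - a.y1);
  float area_b = (b.x2 - b.x1) * (b.y2 - b.y1);
  return inter / (area_a + area_b - inter);
}

// Greedy NMS: visit boxes in descending score order and keep a box only
// if it does not overlap an already-kept box by more than `thresh`.
std::vector<Box> GreedyNMS(std::vector<Box> boxes, float thresh) {
  std::sort(boxes.begin(), boxes.end(),
            [](const Box& a, const Box& b) { return a.score > b.score; });
  std::vector<Box> kept;
  for (const Box& b : boxes) {
    bool suppressed = false;
    for (const Box& k : kept) {
      if (IoU(b, k) > thresh) { suppressed = true; break; }
    }
    if (!suppressed) kept.push_back(b);
  }
  return kept;
}
```

The quadratic pairwise-IoU loop is what the GPU kernels in this PR parallelize; the sort is the part discussed further down in the conversation.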

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Code is well-documented:
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Comments

  • Currently the new function is called only when no backward pass is required, but it is easily extendable to cover that case as well.

@ptrendx ptrendx changed the title [WIP] Faster GPU NMS optimization Faster GPU NMS operator Oct 24, 2019
@ptrendx (Member, Author) commented Oct 31, 2019

@zhreshold Do you have any other comments?

For the record, I think there are a few outstanding items before the GPU implementation of NMS can be fully marked as "done", namely:

  • handle the output required for backward in the new implementation
  • change the way sorting is done (currently sorting happens batch-size times, which turns out not to be beneficial for large topk; the previous implementation, which always did 2 larger sorts, was better in that regard)

That said, I will not be able to finish those 2 outstanding items before the 1.6 code freeze, and I believe the performance improvements are enough to merge this PR and address those 2 points in a future PR.
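To illustrate the second point, one way to replace batch-size-many small sorts with a single larger one is to fold the batch index into the sort key, so that one global sort leaves every batch segment internally ordered by score. A hypothetical host-side sketch of that idea (illustrative names, not the PR's actual device code):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Tag every score with its batch index, then do ONE sort over all
// elements keyed on (batch ascending, score descending). Each batch's
// segment ends up contiguous and sorted, without per-batch sort launches.
std::vector<std::pair<int, float>>
SortScoresOnce(const std::vector<std::vector<float>>& batches) {
  std::vector<std::pair<int, float>> keyed;
  for (int b = 0; b < static_cast<int>(batches.size()); ++b) {
    for (float s : batches[b]) keyed.emplace_back(b, s);
  }
  std::sort(keyed.begin(), keyed.end(),
            [](const auto& a, const auto& b) {
              return a.first != b.first ? a.first < b.first
                                        : a.second > b.second;
            });
  return keyed;
}
```

On the GPU the same effect is typically achieved with a segmented sort or a composite key, which is presumably what the "2 larger sorts" of the previous implementation amounted to.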


template <typename DType>
__global__ void FilterAndPrepareAuxDataKernel(const DType* data, DType* out, DType* scores,
                                              index_t num_elements_per_batch,
Contributor:

nit: indentation alignment?


template <bool check_topk, bool check_score, typename DType>
__global__ void CompactDataKernel(const index_t* indices, const DType* source,
                                  DType* destination, const index_t topk,
Contributor:

nit: indentation alignment?

@zhreshold (Member) left a comment

Sorry for the delay, I was traveling these days.
The changes LGTM!

@zhreshold zhreshold merged commit 0c5677e into apache:master Nov 5, 2019
yajiedesign pushed a commit to yajiedesign/mxnet that referenced this pull request Nov 6, 2019
* Adding second NMS op

* NMS kernel

* Removing second sort

* Optimization

* Adding out-of-place ability to SortByKey

* Optimization pt2

* Optimizations pt3

* Do not recompute other boxes area every time

* Sort only topk results during second sorting

* Cleaning

* Fixes from rebase

* Fix lint and more fixes from rebase

* Fix typo

* Early exit in Triangle kernel

* Fixes

* Fix sort

* Fix from rebase

* Fix for the mixed naming convention

* Fix the index_t with int comparison

4 participants