-
Notifications
You must be signed in to change notification settings - Fork 88
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cuda and Omp matrix conversion of Csr and Hybrid and Omp count_nonzeros of Ell #310
Cuda and Omp matrix conversion of Csr and Hybrid and Omp count_nonzeros of Ell #310
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM in general, I only have comments for reducing code duplication in the CUDA kernels, and checking quickly the quality of the OpenMP parallelization:
- is using OpenMP with 1 thread creating a significant slowdown compared to reference?
- is using OpenMP with 4 threads (or more) creating a significant slowdown compared to 1 thread?
If that is the case, some internal loop parallelization might be the problems. If there is no such problem, it's good as it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, it looks good, but please get rid of the unsigned long long int
and remove the code duplication.
@tcojean @thoasm Both of you mention the the function converter is like
Do you think it works? or any idea for it. |
@yhmtsai Yes, this should work. I did something similar multiple times already when splitting
I haven't benchmarked these two against each other (especially without doing any splitting), but I think they should both have almost no overhead (I actually hope they have none). |
I check the performance of omp kernel. |
|
6e1631e
to
3e51a52
Compare
Codecov Report
@@ Coverage Diff @@
## develop #310 +/- ##
===========================================
+ Coverage 98.05% 98.17% +0.12%
===========================================
Files 216 215 -1
Lines 16328 16529 +201
===========================================
+ Hits 16010 16227 +217
+ Misses 318 302 -16
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Good job on removing all these duplications.
@yhmtsai for the OpenMP kernels, if they are not much slower than the reference ones, then I think it is fine. It at least shows the idea for parallelizing this. If there is a very important slowdown, then it's better to remove them. |
Hybrid fills all coo values, so it does not need to initial coo array. Csr-> Hybrid
Hybrid -> Csr
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks a lot better now, but there are some parts which I would like to see improved.
Mainly renaming the as
function and adding some documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think after replacing the GKO_ASSERT
in the atomic_add
(I don't think we need it anymore after the fixed static_assert
is in place) and adding a bit more documentation, we should be ready to merge.
Well done!
Regarding the runtime results, I think they are perfectly fine. I don't think it is necessary to change it to just use the reference implementation. The only issue could be both 16 thread cases, but I think there, it was too much overhead compared to the (relatively small) size of the matrix. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
The Ginkgo team is proud to announce the new minor release of Ginkgo version 1.1.0. This release brings several performance improvements, adds Windows support, adds support for factorizations inside Ginkgo and a new ILU preconditioner based on ParILU algorithm, among other things. For detailed information, check the respective issue. Supported systems and requirements: + For all platforms, cmake 3.9+ + Linux and MacOS + gcc: 5.3+, 6.3+, 7.3+, 8.1+ + clang: 3.9+ + Intel compiler: 2017+ + Apple LLVM: 8.0+ + CUDA module: CUDA 9.0+ + Windows + MinGW and CygWin: gcc 5.3+, 6.3+, 7.3+, 8.1+ + Microsoft Visual Studio: VS 2017 15.7+ + CUDA module: CUDA 9.0+, Microsoft Visual Studio + OpenMP module: MinGW or CygWin. The current known issues can be found in the [known issues page](https://github.com/ginkgo-project/ginkgo/wiki/Known-Issues). Additions: + Upper and lower triangular solvers ([#327](#327), [#336](#336), [#341](#341), [#342](#342)) + New factorization support in Ginkgo, and addition of the ParILU algorithm ([#305](#305), [#315](#315), [#319](#319), [#324](#324)) + New ILU preconditioner ([#348](#348), [#353](#353)) + Windows MinGW and Cygwin support ([#347](#347)) + Windows Visual studio support ([#351](#351)) + New example showing how to use ParILU as a preconditioner ([#358](#358)) + New example on using loggers for debugging ([#360](#360)) + Add two new 9pt and 27pt stencil examples ([#300](#300), [#306](#306)) + Allow benchmarking CuSPARSE spmv formats through Ginkgo's benchmarks ([#303](#303)) + New benchmark for sparse matrix format conversions ([#312](https://github.com/ginkgo-project/ginkgo/issues/312)[#317](https://github.com/ginkgo-project/ginkgo/issues/317)) + Add conversions between CSR and Hybrid formats ([#302](#302), [#310](#310)) + Support for sorting rows in the CSR format by column idices ([#322](#322)) + Addition of a CUDA COO SpMM kernel for improved performance ([#345](#345)) + Addition of a LinOp to handle perturbations of the form (identity + scalar * basis * projector) ([#334](#334)) + New sparsity matrix representation format with Reference and OpenMP kernels ([#349](#349), [#350](#350)) Fixes: + Accelerate GMRES solver for CUDA executor ([#363](#363)) + Fix BiCGSTAB solver convergence ([#359](#359)) + Fix CGS logging by reporting the residual for every sub iteration ([#328](#328)) + Fix CSR,Dense->Sellp conversion's memory access violation ([#295](#295)) + Accelerate CSR->Ell,Hybrid conversions on CUDA ([#313](#313), [#318](#318)) + Fixed slowdown of COO SpMV on OpenMP ([#340](#340)) + Fix gcc 6.4.0 internal compiler error ([#316](#316)) + Fix compilation issue on Apple clang++ 10 ([#322](#322)) + Make Ginkgo able to compile on Intel 2017 and above ([#337](#337)) + Make the benchmarks spmv/solver use the same matrix formats ([#366](#366)) + Fix self-written isfinite function ([#348](#348)) + Fix Jacobi issues shown by cuda-memcheck Tools and ecosystem: + Multiple improvements to the CI system and tools ([#296](#296), [#311](#311), [#365](#365)) + Multiple improvements to the Ginkgo containers ([#328](#328), [#361](#361)) + Add sonarqube analysis to Ginkgo ([#304](#304), [#308](#308), [#309](#309)) + Add clang-tidy and iwyu support to Ginkgo ([#298](#298)) + Improve Ginkgo's support of xSDK M12 policy by adding the `TPL_` arguments to CMake ([#300](#300)) + Add support for the xSDK R7 policy ([#325](#325)) + Fix examples in html documentation ([#367](#367))
The Ginkgo team is proud to announce the new minor release of Ginkgo version 1.1.0. This release brings several performance improvements, adds Windows support, adds support for factorizations inside Ginkgo and a new ILU preconditioner based on ParILU algorithm, among other things. For detailed information, check the respective issue. Supported systems and requirements: + For all platforms, cmake 3.9+ + Linux and MacOS + gcc: 5.3+, 6.3+, 7.3+, 8.1+ + clang: 3.9+ + Intel compiler: 2017+ + Apple LLVM: 8.0+ + CUDA module: CUDA 9.0+ + Windows + MinGW and Cygwin: gcc 5.3+, 6.3+, 7.3+, 8.1+ + Microsoft Visual Studio: VS 2017 15.7+ + CUDA module: CUDA 9.0+, Microsoft Visual Studio + OpenMP module: MinGW or Cygwin. The current known issues can be found in the [known issues page](https://github.com/ginkgo-project/ginkgo/wiki/Known-Issues). ### Additions + Upper and lower triangular solvers ([#327](#327), [#336](#336), [#341](#341), [#342](#342)) + New factorization support in Ginkgo, and addition of the ParILU algorithm ([#305](#305), [#315](#315), [#319](#319), [#324](#324)) + New ILU preconditioner ([#348](#348), [#353](#353)) + Windows MinGW and Cygwin support ([#347](#347)) + Windows Visual Studio support ([#351](#351)) + New example showing how to use ParILU as a preconditioner ([#358](#358)) + New example on using loggers for debugging ([#360](#360)) + Add two new 9pt and 27pt stencil examples ([#300](#300), [#306](#306)) + Allow benchmarking CuSPARSE spmv formats through Ginkgo's benchmarks ([#303](#303)) + New benchmark for sparse matrix format conversions ([#312](https://github.com/ginkgo-project/ginkgo/issues/312)[#317](https://github.com/ginkgo-project/ginkgo/issues/317)) + Add conversions between CSR and Hybrid formats ([#302](#302), [#310](#310)) + Support for sorting rows in the CSR format by column idices ([#322](#322)) + Addition of a CUDA COO SpMM kernel for improved performance ([#345](#345)) + Addition of a LinOp to handle perturbations of the form (identity + scalar * basis * projector) ([#334](#334)) + New sparsity matrix representation format with Reference and OpenMP kernels ([#349](#349), [#350](#350)) ### Fixes + Accelerate GMRES solver for CUDA executor ([#363](#363)) + Fix BiCGSTAB solver convergence ([#359](#359)) + Fix CGS logging by reporting the residual for every sub iteration ([#328](#328)) + Fix CSR,Dense->Sellp conversion's memory access violation ([#295](#295)) + Accelerate CSR->Ell,Hybrid conversions on CUDA ([#313](#313), [#318](#318)) + Fixed slowdown of COO SpMV on OpenMP ([#340](#340)) + Fix gcc 6.4.0 internal compiler error ([#316](#316)) + Fix compilation issue on Apple clang++ 10 ([#322](#322)) + Make Ginkgo able to compile on Intel 2017 and above ([#337](#337)) + Make the benchmarks spmv/solver use the same matrix formats ([#366](#366)) + Fix self-written isfinite function ([#348](#348)) + Fix Jacobi issues shown by cuda-memcheck ### Tools and ecosystem improvements + Multiple improvements to the CI system and tools ([#296](#296), [#311](#311), [#365](#365)) + Multiple improvements to the Ginkgo containers ([#328](#328), [#361](#361)) + Add sonarqube analysis to Ginkgo ([#304](#304), [#308](#308), [#309](#309)) + Add clang-tidy and iwyu support to Ginkgo ([#298](#298)) + Improve Ginkgo's support of xSDK M12 policy by adding the `TPL_` arguments to CMake ([#300](#300)) + Add support for the xSDK R7 policy ([#325](#325)) + Fix examples in html documentation ([#367](#367)) Related PR: #370
Cuda and Omp matrix conversion of Csr and Hybrid and Ell count_nonzeros on Omp
Two functions are implemented in header files, so they are still on the host.
I would like to put them in another PR if they are needed on Cuda.
I am not sure whether performance of make_srow is better or not.
The performance of sorting on GPU is better than CPU according to this link, so I think we need it on GPU.