Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cuda kernels for Lower triangular solve #336

Merged
merged 19 commits into from
Sep 6, 2019
Merged

Cuda kernels for Lower triangular solve #336

merged 19 commits into from
Sep 6, 2019

Conversation

pratikvn
Copy link
Member

@pratikvn pratikvn commented Aug 15, 2019

This PR implements the CUDA kernels for the lower triangular solver.

For CUDA versions <=9.1, one can only use csrsm_solve

For CUDA versions >=9.2, one can use csrsm2_solve, and this algorithm is more efficient and the previous csrsm_solve is deprecated from 10.1, so, unfortunately, we have to do a switch using #if defines.

Additionally, CUDA uses different algorithms than the simple algorithms implemented in the reference and omp kernels. Hence I am not sure if it makes sense to compare the kernels as we do for other solvers.

Also, I am not sure how to test the generate kernel properly (or of it is even testable), because all it does it allocate and create from information required for csrsm2_solve calls.

There are additional parameters that can be set for example the algo and the CUSPARSE_SOLVE_POLICY_USE_LEVEL (see documentation), which technically should be exposed to the user, as they may want to tweak it, but I dont do it yet as it can complicate the parameter choices. But I could add these if required.

Note: Currently, only single right hand side version works. I tried but I cannot get the multiple rhs to work. If someone has an idea of how to solve this, I am happy to discuss. The things that I have tried:

  1. Because Cusparse uses col-major rather than row major (as Ginkgo does), I tried (as @yhmtsai suggested) to use CUSPARSE_OPERATION_NON_TRANSPOSE for the rhs. But that does not work as well. I think maybe this is not a problem as the single right hand side seems to work for the CUSPARSE_OPERATION_NON_TRANSPOSE ans as you actually pass in the both dimensions of the rhs the cusparse function can actually figure out and the right thing.

  2. I also tried transposing the rhs and the sol matrix before hand and passing them to the cusparse function with the non-transpose now, but this also has the same problem.

Update: For future references: The multiple RHS as expected now works on CUDA>=9.2, but for CUDA versions <=9.1, multiple rhs solves are handled in a loop with each loop doing 1 rhs solve.

@pratikvn pratikvn self-assigned this Aug 15, 2019
@pratikvn pratikvn added mod:cuda This is related to the CUDA module. is:new-feature A request or implementation of a feature that does not exist yet. type:solver This is related to the solvers 1:ST:WIP This PR is a work in progress. Not ready for review. labels Aug 15, 2019
@codecov
Copy link

codecov bot commented Aug 15, 2019

Codecov Report

Merging #336 into develop will increase coverage by 0.01%.
The diff coverage is 98.62%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #336      +/-   ##
===========================================
+ Coverage    98.22%   98.23%   +0.01%     
===========================================
  Files          237      238       +1     
  Lines        17993    18076      +83     
===========================================
+ Hits         17673    17757      +84     
+ Misses         320      319       -1
Impacted Files Coverage Δ
cuda/base/cusparse_bindings.hpp 100% <ø> (ø) ⬆️
core/test/solver/lower_trs.cpp 100% <ø> (ø) ⬆️
reference/test/solver/lower_trs_kernels.cpp 100% <100%> (ø) ⬆️
reference/test/solver/lower_trs.cpp 100% <100%> (ø) ⬆️
omp/solver/lower_trs_kernels.cpp 100% <100%> (ø) ⬆️
cuda/test/solver/lower_trs_kernels.cpp 100% <100%> (ø)
include/ginkgo/core/solver/lower_trs.hpp 100% <100%> (+9.37%) ⬆️
reference/solver/lower_trs_kernels.cpp 100% <100%> (ø) ⬆️
omp/test/solver/lower_trs_kernels.cpp 100% <100%> (ø) ⬆️
core/solver/lower_trs.cpp 93.54% <88.88%> (-6.46%) ⬇️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 263a455...050c727. Read the comment docs.

@pratikvn pratikvn force-pushed the trs-cuda-kernels branch 3 times, most recently from 012ceda to 6558a53 Compare August 16, 2019 09:54
@pratikvn pratikvn added mod:cuda This is related to the CUDA module. and removed mod:cuda This is related to the CUDA module. labels Aug 19, 2019
Copy link
Member

@yhmtsai yhmtsai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is failed when the b is multiple right hand side or its stride is not equal to 1 because Cuda's dense matrix is col-major but Ginkgo's is row-major.
the possible approach is that
set trans_B = CUSPARSE_OPERATION_TRANSPOSE for cuda version > 9.1
and do the transpose before solving for others

@pratikvn pratikvn added 1:ST:ready-for-review This PR is ready for review and removed 1:ST:WIP This PR is a work in progress. Not ready for review. labels Aug 19, 2019
@pratikvn pratikvn added 1:ST:WIP This PR is a work in progress. Not ready for review. and removed 1:ST:ready-for-review This PR is ready for review labels Aug 20, 2019
@tcojean
Copy link
Member

tcojean commented Aug 22, 2019

If you want both better coverage result and to check your code for Intel compilers (should be fine though) you could rebase to the latest develop.

@pratikvn pratikvn force-pushed the trs-cuda-kernels branch 2 times, most recently from d1a715d to a271679 Compare August 23, 2019 12:58
cuda/solver/lower_trs_kernels.cu Outdated Show resolved Hide resolved
cuda/solver/lower_trs_kernels.cu Outdated Show resolved Hide resolved
@pratikvn pratikvn added 1:ST:ready-for-review This PR is ready for review and removed 1:ST:WIP This PR is a work in progress. Not ready for review. labels Aug 27, 2019
thoasm
thoasm previously requested changes Aug 27, 2019
Copy link
Member

@thoasm thoasm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My first scan over this PR. It is not complete, but I found some parts that I would like to be changed.

omp/solver/lower_trs_kernels.cpp Outdated Show resolved Hide resolved
cuda/solver/lower_trs_kernels.cu Outdated Show resolved Hide resolved
cuda/solver/lower_trs_kernels.cu Outdated Show resolved Hide resolved
cuda/solver/lower_trs_kernels.cu Outdated Show resolved Hide resolved
cuda/solver/lower_trs_kernels.cu Outdated Show resolved Hide resolved
@pratikvn pratikvn force-pushed the trs-cuda-kernels branch 2 times, most recently from ed252d5 to 7992508 Compare August 29, 2019 10:30
core/solver/lower_trs_kernels.hpp Outdated Show resolved Hide resolved
cuda/test/solver/lower_trs_kernels.cpp Outdated Show resolved Hide resolved
cuda/test/solver/lower_trs_kernels.cpp Show resolved Hide resolved
include/ginkgo/core/base/executor.hpp Outdated Show resolved Hide resolved
omp/test/solver/lower_trs_kernels.cpp Show resolved Hide resolved
omp/test/solver/lower_trs_kernels.cpp Show resolved Hide resolved
+ Add a kernel to create and destroy the struct.
+ Remove the now not-needed clear kernel.
+ Add an additional transposability check to allocate
  memory for transpose the temp trans vector, only if needed.
Copy link
Member

@yhmtsai yhmtsai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I only have questions about gtest location.

cuda/test/solver/lower_trs_kernels.cpp Show resolved Hide resolved
Copy link
Member

@tcojean tcojean left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to see a few improvements to the code, but otherwise LGTM.

core/solver/lower_trs_kernels.hpp Outdated Show resolved Hide resolved
cuda/base/cusparse_bindings.hpp Show resolved Hide resolved
cuda/base/cusparse_bindings.hpp Outdated Show resolved Hide resolved
cuda/base/cusparse_bindings.hpp Outdated Show resolved Hide resolved
cuda/base/cusparse_bindings.hpp Show resolved Hide resolved
cuda/base/cusparse_bindings.hpp Outdated Show resolved Hide resolved
cuda/solver/lower_trs_kernels.cu Show resolved Hide resolved
cuda/test/solver/lower_trs_kernels.cpp Show resolved Hide resolved
cuda/test/solver/lower_trs_kernels.cpp Outdated Show resolved Hide resolved
cuda/test/solver/lower_trs_kernels.cpp Outdated Show resolved Hide resolved
+ Review update. Improve the ifdef checking.
Copy link
Member

@thoasm thoasm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but I have some comments.
Especially, I am worried what happens when the number of right hand sides that are specified with with_num_rhs are not the same as the b you give it.

include/ginkgo/core/solver/lower_trs.hpp Outdated Show resolved Hide resolved
include/ginkgo/core/solver/lower_trs.hpp Outdated Show resolved Hide resolved
core/solver/lower_trs_kernels.hpp Outdated Show resolved Hide resolved
cuda/base/cusparse_bindings.hpp Outdated Show resolved Hide resolved
cuda/solver/lower_trs_kernels.cu Outdated Show resolved Hide resolved
cuda/solver/lower_trs_kernels.cu Show resolved Hide resolved
cuda/test/solver/lower_trs_kernels.cpp Outdated Show resolved Hide resolved
core/solver/lower_trs_kernels.hpp Outdated Show resolved Hide resolved
reference/test/solver/lower_trs.cpp Outdated Show resolved Hide resolved
reference/test/solver/lower_trs_kernels.cpp Outdated Show resolved Hide resolved
@pratikvn pratikvn force-pushed the trs-cuda-kernels branch 3 times, most recently from fdb1978 to 0b012e7 Compare September 5, 2019 12:23
Copy link
Member

@thoasm thoasm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One unused function, and some comments on documentation.
Looks good!

core/test/solver/lower_trs.cpp Outdated Show resolved Hide resolved
core/solver/lower_trs.cpp Show resolved Hide resolved
core/solver/lower_trs_kernels.hpp Outdated Show resolved Hide resolved
cuda/base/cusparse_bindings.hpp Outdated Show resolved Hide resolved
cuda/base/cusparse_bindings.hpp Outdated Show resolved Hide resolved
cuda/solver/lower_trs_kernels.cu Outdated Show resolved Hide resolved
include/ginkgo/core/solver/lower_trs.hpp Outdated Show resolved Hide resolved
omp/solver/lower_trs_kernels.cpp Outdated Show resolved Hide resolved
reference/solver/lower_trs_kernels.cpp Outdated Show resolved Hide resolved
reference/test/solver/lower_trs.cpp Outdated Show resolved Hide resolved
Copy link
Member

@thoasm thoasm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small comments.

cuda/solver/lower_trs_kernels.cu Outdated Show resolved Hide resolved
include/ginkgo/core/solver/lower_trs.hpp Outdated Show resolved Hide resolved
@tcojean tcojean self-requested a review September 6, 2019 10:04
Copy link
Member

@tcojean tcojean left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

+ Fix the SolveStruct namespace clarification.
+ Add a proper free for the workspace.
+ Some doc clarifications.
Copy link
Member

@thoasm thoasm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@pratikvn pratikvn merged commit 6d420ff into develop Sep 6, 2019
@pratikvn pratikvn deleted the trs-cuda-kernels branch September 6, 2019 13:24
tcojean added a commit that referenced this pull request Oct 20, 2019
The Ginkgo team is proud to announce the new minor release of Ginkgo version
1.1.0. This release brings several performance improvements, adds Windows support, 
adds support for factorizations inside Ginkgo and a new ILU preconditioner
based on ParILU algorithm, among other things. For detailed information, check the respective issue.

Supported systems and requirements:
+ For all platforms, cmake 3.9+
+ Linux and MacOS
  + gcc: 5.3+, 6.3+, 7.3+, 8.1+
  + clang: 3.9+
  + Intel compiler: 2017+
  + Apple LLVM: 8.0+
  + CUDA module: CUDA 9.0+
+ Windows
  + MinGW and CygWin: gcc 5.3+, 6.3+, 7.3+, 8.1+
  + Microsoft Visual Studio: VS 2017 15.7+
  + CUDA module: CUDA 9.0+, Microsoft Visual Studio
  + OpenMP module: MinGW or CygWin.


The current known issues can be found in the [known issues
page](https://github.com/ginkgo-project/ginkgo/wiki/Known-Issues).


Additions:
+ Upper and lower triangular solvers ([#327](#327), [#336](#336), [#341](#341), [#342](#342)) 
+ New factorization support in Ginkgo, and addition of the ParILU
  algorithm ([#305](#305), [#315](#315), [#319](#319), [#324](#324))
+ New ILU preconditioner ([#348](#348), [#353](#353))
+ Windows MinGW and Cygwin support ([#347](#347))
+ Windows Visual studio support ([#351](#351))
+ New example showing how to use ParILU as a preconditioner ([#358](#358))
+ New example on using loggers for debugging ([#360](#360))
+ Add two new 9pt and 27pt stencil examples ([#300](#300), [#306](#306))
+ Allow benchmarking CuSPARSE spmv formats through Ginkgo's benchmarks ([#303](#303))
+ New benchmark for sparse matrix format conversions ([#312](#312))
+ Add conversions between CSR and Hybrid formats ([#302](#302), [#310](#310))
+ Support for sorting rows in the CSR format by column idices ([#322](#322))
+ Addition of a CUDA COO SpMM kernel for improved performance ([#345](#345))
+ Addition of a LinOp to handle perturbations of the form (identity + scalar *
  basis * projector) ([#334](#334))
+ New sparsity matrix representation format with Reference and OpenMP
  kernels ([#349](#349), [#350](#350))

Fixes:
+ Accelerate GMRES solver for CUDA executor ([#363](#363))
+ Fix BiCGSTAB solver convergence ([#359](#359))
+ Fix CGS logging by reporting the residual for every sub iteration ([#328](#328))
+ Fix CSR,Dense->Sellp conversion's memory access violation ([#295](#295))
+ Accelerate CSR->Ell,Hybrid conversions on CUDA ([#313](#313), [#318](#318))
+ Fixed slowdown of COO SpMV on OpenMP ([#340](#340))
+ Fix gcc 6.4.0 internal compiler error ([#316](#316))
+ Fix compilation issue on Apple clang++ 10 ([#322](#322))
+ Make Ginkgo able to compile on Intel 2017 and above ([#337](#337))
+ Make the benchmarks spmv/solver use the same matrix formats ([#366](#366))
+ Fix self-written isfinite function ([#348](#348))
+ Fix Jacobi issues shown by cuda-memcheck

Tools and ecosystem:
+ Multiple improvements to the CI system and tools ([#296](#296), [#311](#311), [#365](#365))
+ Multiple improvements to the Ginkgo containers ([#328](#328), [#361](#361))
+ Add sonarqube analysis to Ginkgo ([#304](#304), [#308](#308), [#309](#309))
+ Add clang-tidy and iwyu support to Ginkgo ([#298](#298))
+ Improve Ginkgo's support of xSDK M12 policy by adding the `TPL_` arguments
  to CMake ([#300](#300))
+ Add support for the xSDK R7 policy ([#325](#325))
+ Fix examples in html documentation ([#367](#367))
tcojean added a commit that referenced this pull request Oct 21, 2019
The Ginkgo team is proud to announce the new minor release of Ginkgo version
1.1.0. This release brings several performance improvements, adds Windows support,
adds support for factorizations inside Ginkgo and a new ILU preconditioner
based on ParILU algorithm, among other things. For detailed information, check the respective issue.

Supported systems and requirements:
+ For all platforms, cmake 3.9+
+ Linux and MacOS
  + gcc: 5.3+, 6.3+, 7.3+, 8.1+
  + clang: 3.9+
  + Intel compiler: 2017+
  + Apple LLVM: 8.0+
  + CUDA module: CUDA 9.0+
+ Windows
  + MinGW and Cygwin: gcc 5.3+, 6.3+, 7.3+, 8.1+
  + Microsoft Visual Studio: VS 2017 15.7+
  + CUDA module: CUDA 9.0+, Microsoft Visual Studio
  + OpenMP module: MinGW or Cygwin.


The current known issues can be found in the [known issues
page](https://github.com/ginkgo-project/ginkgo/wiki/Known-Issues).


### Additions
+ Upper and lower triangular solvers ([#327](#327), [#336](#336), [#341](#341), [#342](#342)) 
+ New factorization support in Ginkgo, and addition of the ParILU
  algorithm ([#305](#305), [#315](#315), [#319](#319), [#324](#324))
+ New ILU preconditioner ([#348](#348), [#353](#353))
+ Windows MinGW and Cygwin support ([#347](#347))
+ Windows Visual Studio support ([#351](#351))
+ New example showing how to use ParILU as a preconditioner ([#358](#358))
+ New example on using loggers for debugging ([#360](#360))
+ Add two new 9pt and 27pt stencil examples ([#300](#300), [#306](#306))
+ Allow benchmarking CuSPARSE spmv formats through Ginkgo's benchmarks ([#303](#303))
+ New benchmark for sparse matrix format conversions ([#312](#312))
+ Add conversions between CSR and Hybrid formats ([#302](#302), [#310](#310))
+ Support for sorting rows in the CSR format by column idices ([#322](#322))
+ Addition of a CUDA COO SpMM kernel for improved performance ([#345](#345))
+ Addition of a LinOp to handle perturbations of the form (identity + scalar *
  basis * projector) ([#334](#334))
+ New sparsity matrix representation format with Reference and OpenMP
  kernels ([#349](#349), [#350](#350))

### Fixes
+ Accelerate GMRES solver for CUDA executor ([#363](#363))
+ Fix BiCGSTAB solver convergence ([#359](#359))
+ Fix CGS logging by reporting the residual for every sub iteration ([#328](#328))
+ Fix CSR,Dense->Sellp conversion's memory access violation ([#295](#295))
+ Accelerate CSR->Ell,Hybrid conversions on CUDA ([#313](#313), [#318](#318))
+ Fixed slowdown of COO SpMV on OpenMP ([#340](#340))
+ Fix gcc 6.4.0 internal compiler error ([#316](#316))
+ Fix compilation issue on Apple clang++ 10 ([#322](#322))
+ Make Ginkgo able to compile on Intel 2017 and above ([#337](#337))
+ Make the benchmarks spmv/solver use the same matrix formats ([#366](#366))
+ Fix self-written isfinite function ([#348](#348))
+ Fix Jacobi issues shown by cuda-memcheck

### Tools and ecosystem improvements
+ Multiple improvements to the CI system and tools ([#296](#296), [#311](#311), [#365](#365))
+ Multiple improvements to the Ginkgo containers ([#328](#328), [#361](#361))
+ Add sonarqube analysis to Ginkgo ([#304](#304), [#308](#308), [#309](#309))
+ Add clang-tidy and iwyu support to Ginkgo ([#298](#298))
+ Improve Ginkgo's support of xSDK M12 policy by adding the `TPL_` arguments
  to CMake ([#300](#300))
+ Add support for the xSDK R7 policy ([#325](#325))
+ Fix examples in html documentation ([#367](#367))


Related PR: #370
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1:ST:ready-for-review This PR is ready for review is:new-feature A request or implementation of a feature that does not exist yet. mod:cuda This is related to the CUDA module. type:solver This is related to the solvers
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants