Cuda and Omp matrix conversion of Csr and Hybrid and Omp count_nonzeros of Ell #310

yhmtsai · 2019-05-25T19:59:54Z

Cuda and Omp matrix conversion of Csr and Hybrid and Ell count_nonzeros on Omp

Two functions are implemented in header files, so they are still on the host.

make_srow of csr with load_balance algorithm
the function of hybrid which decide the column limit of ell (It needs a sorting kernel)

I would like to put them in another PR if they are needed on Cuda.
I am not sure whether performance of make_srow is better or not.
The performance of sorting on GPU is better than CPU according to this link, so I think we need it on GPU.

tcojean

LGTM in general, I only have comments for reducing code duplication in the CUDA kernels, and checking quickly the quality of the OpenMP parallelization:

is using OpenMP with 1 thread creating a significant slowdown compared to reference?
is using OpenMP with 4 threads (or more) creating a significant slowdown compared to 1 thread?
If that is the case, some internal loop parallelization might be the problems. If there is no such problem, it's good as it is.

cuda/matrix/hybrid_kernels.cu

omp/matrix/csr_kernels.cpp

omp/matrix/hybrid_kernels.cpp

thoasm

Overall, it looks good, but please get rid of the unsigned long long int and remove the code duplication.

cuda/components/zero_array.cu

cuda/matrix/hybrid_kernels.cu

yhmtsai · 2019-05-27T17:27:33Z

@tcojean @thoasm Both of you mention the SizeType (cuda_size_type or unsigned long long int).
I would like to get the exact nnz per row of Coo by spmv which uses atomicAdd.
The answer should be IndexType (int32 or int64) or size_type.
However, the cuda default atomicAdd of integer operator is for int, unsigned int, unsigned long long int.
I guess cuda always thinks unsigned long long int for 64bit. (it is also used in double atomicAdd)
In my environment, the compiler thinks size_type and unsigned long long int are different
Thus, I use unsigned long long int to handle int32, int64 or size_type.
I think I will do the converter for type with 64bit <-> unsigned long long int and implement its atomicAdd, like double atomicAdd using __longlong_as_double and __double_as_longlong in some architecture.

the function converter is like

__device__ size_type ull_as_size_type(ull value) {
    union DataType {
        unsigned long long int ull_val;
        size_type st_val;
    };
    DataType temp;
    temp.ull_val = value;
    return temp.st_val;
}

Do you think it works? or any idea for it.

thoasm · 2019-05-27T17:49:28Z

@yhmtsai Yes, this should work. I did something similar multiple times already when splitting double in various parts.
As far as I know, this (or something very similar with reinterpret_cast) would also work:

__device__ size_type convert(unsigned long long int val) {
    return reinterpret_cast<size_type &>(val);
}

I haven't benchmarked these two against each other (especially without doing any splitting), but I think they should both have almost no overhead (I actually hope they have none).

yhmtsai · 2019-05-29T17:09:42Z

I check the performance of omp kernel.
They are not faster than reference kernel.
I use 10000 x 2019 with at least 30 nonzero for testing.
Should I delete them or use the reference kernel for conversion performance test?

yhmtsai · 2019-05-30T09:18:52Z

format_conversion.hpp is for handling the same function in format_conversion.
convert_coo_row_idxs_to_ptrs is not in coo_kernel.hpp, and it is not directly used by coo.cpp.
I also put it in format_conversion.hpp now.
atomic.cuh should handle 32-bit and 64-bit atomic add now.
if using cuda 10.1 or higher, it can also handle 16-bit atomic add.
if using cuda 10.0 or higher, I bind the cuda function of __half and __half according to architecture.

codecov · 2019-06-03T10:12:34Z

Codecov Report

Merging #310 into develop will increase coverage by 0.12%.
The diff coverage is 100%.

@@             Coverage Diff             @@
##           develop     #310      +/-   ##
===========================================
+ Coverage    98.05%   98.17%   +0.12%     
===========================================
  Files          216      215       -1     
  Lines        16328    16529     +201     
===========================================
+ Hits         16010    16227     +217     
+ Misses         318      302      -16

Impacted Files	Coverage Δ
include/ginkgo/core/matrix/hybrid.hpp	`91.52% <100%> (ø)`	⬆️
omp/matrix/ell_kernels.cpp	`91.66% <100%> (+6.66%)`	⬆️
omp/test/matrix/csr_kernels.cpp	`100% <100%> (ø)`	⬆️
cuda/test/matrix/csr_kernels.cpp	`100% <100%> (ø)`	⬆️
omp/matrix/csr_kernels.cpp	`94.28% <100%> (+7.92%)`	⬆️
omp/test/matrix/ell_kernels.cpp	`100% <100%> (ø)`	⬆️
cuda/test/matrix/hybrid_kernels.cpp	`100% <100%> (ø)`	⬆️
omp/matrix/hybrid_kernels.cpp	`98.82% <100%> (+20.56%)`	⬆️
omp/test/matrix/hybrid_kernels.cpp	`100% <100%> (ø)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update de51ee9...c255b78. Read the comment docs.

tcojean

LGTM. Good job on removing all these duplications.

cuda/matrix/hybrid_kernels.cu

tcojean · 2019-06-03T10:32:18Z

@yhmtsai for the OpenMP kernels, if they are not much slower than the reference ones, then I think it is fine. It at least shows the idea for parallelizing this. If there is a very important slowdown, then it's better to remove them.

yhmtsai · 2019-06-03T17:45:45Z

Hybrid fills all coo values, so it does not need to initial coo array.
I delete the functions of cuda and omp.
The following is omp performance.
The matrices are 10000 x 2019 with at least 30 elements in each row.

Csr-> Hybrid

OMP_NUM_THREADS=1 ./csr_kernels
Ref : 1.36327e+08ns.
Omp : 1.09301e+08ns.
OMP_NUM_THREADS=4 ./csr_kernels
Ref : 1.33657e+08ns.
Omp : 8.19972e+07ns. (Speedup 1.33x over 1-thread)
OMP_NUM_THREADS=8 ./csr_kernels
Ref : 1.32982e+08ns.
Omp : 7.30211e+07ns. (Speedup 1.5x over 1-thread)
OMP_NUM_THREADS=16 ./csr_kernels
Ref : 1.3298e+08ns.
Omp : 1.42636e+08ns. (Speedup 0.77x over 1-thread)

Hybrid -> Csr

OMP_NUM_THREADS=1 ./hybrid_kernels
Ref : 6.49456e+07ns.
Omp : 9.57758e+07ns.
OMP_NUM_THREADS=4 ./hybrid_kernels
Ref : 6.49526e+07ns.
Omp : 1.00031e+08ns. (Speedup 0.96x over 1-thread)
OMP_NUM_THREADS=8 ./hybrid_kernels
Ref : 7.43198e+07ns.
Omp : 1.18494e+08ns. (Speedup 0.81x over 1-thread)
OMP_NUM_THREADS=16 ./hybrid_kernels
Ref : 6.39154e+07ns.
Omp : 6.36856e+08ns. (Speedup 0.15x over 1-thread)

thoasm

It looks a lot better now, but there are some parts which I would like to see improved.
Mainly renaming the as function and adding some documentation.

cuda/components/atomic.cuh

cuda/components/format_conversion.hpp

cuda/components/reduction.cuh

cuda/components/segment_scan.cuh

cuda/components/zero_array.cu

cuda/components/format_conversion.hpp

thoasm

I think after replacing the GKO_ASSERT in the atomic_add (I don't think we need it anymore after the fixed static_assert is in place) and adding a bit more documentation, we should be ready to merge.

Well done!

cuda/components/format_conversion.cuh

thoasm · 2019-06-04T13:25:37Z

Regarding the runtime results, I think they are perfectly fine. I don't think it is necessary to change it to just use the reference implementation. The only issue could be both 16 thread cases, but I think there, it was too much overhead compared to the (relatively small) size of the matrix.

thoasm

LGTM!

The Ginkgo team is proud to announce the new minor release of Ginkgo version 1.1.0. This release brings several performance improvements, adds Windows support, adds support for factorizations inside Ginkgo and a new ILU preconditioner based on ParILU algorithm, among other things. For detailed information, check the respective issue. Supported systems and requirements: + For all platforms, cmake 3.9+ + Linux and MacOS + gcc: 5.3+, 6.3+, 7.3+, 8.1+ + clang: 3.9+ + Intel compiler: 2017+ + Apple LLVM: 8.0+ + CUDA module: CUDA 9.0+ + Windows + MinGW and CygWin: gcc 5.3+, 6.3+, 7.3+, 8.1+ + Microsoft Visual Studio: VS 2017 15.7+ + CUDA module: CUDA 9.0+, Microsoft Visual Studio + OpenMP module: MinGW or CygWin. The current known issues can be found in the [known issues page](https://github.com/ginkgo-project/ginkgo/wiki/Known-Issues). Additions: + Upper and lower triangular solvers ([#327](#327), [#336](#336), [#341](#341), [#342](#342)) + New factorization support in Ginkgo, and addition of the ParILU algorithm ([#305](#305), [#315](#315), [#319](#319), [#324](#324)) + New ILU preconditioner ([#348](#348), [#353](#353)) + Windows MinGW and Cygwin support ([#347](#347)) + Windows Visual studio support ([#351](#351)) + New example showing how to use ParILU as a preconditioner ([#358](#358)) + New example on using loggers for debugging ([#360](#360)) + Add two new 9pt and 27pt stencil examples ([#300](#300), [#306](#306)) + Allow benchmarking CuSPARSE spmv formats through Ginkgo's benchmarks ([#303](#303)) + New benchmark for sparse matrix format conversions ([#312](https://github.com/ginkgo-project/ginkgo/issues/312)[#317](https://github.com/ginkgo-project/ginkgo/issues/317)) + Add conversions between CSR and Hybrid formats ([#302](#302), [#310](#310)) + Support for sorting rows in the CSR format by column idices ([#322](#322)) + Addition of a CUDA COO SpMM kernel for improved performance ([#345](#345)) + Addition of a LinOp to handle perturbations of the form (identity + scalar * basis * projector) ([#334](#334)) + New sparsity matrix representation format with Reference and OpenMP kernels ([#349](#349), [#350](#350)) Fixes: + Accelerate GMRES solver for CUDA executor ([#363](#363)) + Fix BiCGSTAB solver convergence ([#359](#359)) + Fix CGS logging by reporting the residual for every sub iteration ([#328](#328)) + Fix CSR,Dense->Sellp conversion's memory access violation ([#295](#295)) + Accelerate CSR->Ell,Hybrid conversions on CUDA ([#313](#313), [#318](#318)) + Fixed slowdown of COO SpMV on OpenMP ([#340](#340)) + Fix gcc 6.4.0 internal compiler error ([#316](#316)) + Fix compilation issue on Apple clang++ 10 ([#322](#322)) + Make Ginkgo able to compile on Intel 2017 and above ([#337](#337)) + Make the benchmarks spmv/solver use the same matrix formats ([#366](#366)) + Fix self-written isfinite function ([#348](#348)) + Fix Jacobi issues shown by cuda-memcheck Tools and ecosystem: + Multiple improvements to the CI system and tools ([#296](#296), [#311](#311), [#365](#365)) + Multiple improvements to the Ginkgo containers ([#328](#328), [#361](#361)) + Add sonarqube analysis to Ginkgo ([#304](#304), [#308](#308), [#309](#309)) + Add clang-tidy and iwyu support to Ginkgo ([#298](#298)) + Improve Ginkgo's support of xSDK M12 policy by adding the `TPL_` arguments to CMake ([#300](#300)) + Add support for the xSDK R7 policy ([#325](#325)) + Fix examples in html documentation ([#367](#367))

The Ginkgo team is proud to announce the new minor release of Ginkgo version 1.1.0. This release brings several performance improvements, adds Windows support, adds support for factorizations inside Ginkgo and a new ILU preconditioner based on ParILU algorithm, among other things. For detailed information, check the respective issue. Supported systems and requirements: + For all platforms, cmake 3.9+ + Linux and MacOS + gcc: 5.3+, 6.3+, 7.3+, 8.1+ + clang: 3.9+ + Intel compiler: 2017+ + Apple LLVM: 8.0+ + CUDA module: CUDA 9.0+ + Windows + MinGW and Cygwin: gcc 5.3+, 6.3+, 7.3+, 8.1+ + Microsoft Visual Studio: VS 2017 15.7+ + CUDA module: CUDA 9.0+, Microsoft Visual Studio + OpenMP module: MinGW or Cygwin. The current known issues can be found in the [known issues page](https://github.com/ginkgo-project/ginkgo/wiki/Known-Issues). ### Additions + Upper and lower triangular solvers ([#327](#327), [#336](#336), [#341](#341), [#342](#342)) + New factorization support in Ginkgo, and addition of the ParILU algorithm ([#305](#305), [#315](#315), [#319](#319), [#324](#324)) + New ILU preconditioner ([#348](#348), [#353](#353)) + Windows MinGW and Cygwin support ([#347](#347)) + Windows Visual Studio support ([#351](#351)) + New example showing how to use ParILU as a preconditioner ([#358](#358)) + New example on using loggers for debugging ([#360](#360)) + Add two new 9pt and 27pt stencil examples ([#300](#300), [#306](#306)) + Allow benchmarking CuSPARSE spmv formats through Ginkgo's benchmarks ([#303](#303)) + New benchmark for sparse matrix format conversions ([#312](https://github.com/ginkgo-project/ginkgo/issues/312)[#317](https://github.com/ginkgo-project/ginkgo/issues/317)) + Add conversions between CSR and Hybrid formats ([#302](#302), [#310](#310)) + Support for sorting rows in the CSR format by column idices ([#322](#322)) + Addition of a CUDA COO SpMM kernel for improved performance ([#345](#345)) + Addition of a LinOp to handle perturbations of the form (identity + scalar * basis * projector) ([#334](#334)) + New sparsity matrix representation format with Reference and OpenMP kernels ([#349](#349), [#350](#350)) ### Fixes + Accelerate GMRES solver for CUDA executor ([#363](#363)) + Fix BiCGSTAB solver convergence ([#359](#359)) + Fix CGS logging by reporting the residual for every sub iteration ([#328](#328)) + Fix CSR,Dense->Sellp conversion's memory access violation ([#295](#295)) + Accelerate CSR->Ell,Hybrid conversions on CUDA ([#313](#313), [#318](#318)) + Fixed slowdown of COO SpMV on OpenMP ([#340](#340)) + Fix gcc 6.4.0 internal compiler error ([#316](#316)) + Fix compilation issue on Apple clang++ 10 ([#322](#322)) + Make Ginkgo able to compile on Intel 2017 and above ([#337](#337)) + Make the benchmarks spmv/solver use the same matrix formats ([#366](#366)) + Fix self-written isfinite function ([#348](#348)) + Fix Jacobi issues shown by cuda-memcheck ### Tools and ecosystem improvements + Multiple improvements to the CI system and tools ([#296](#296), [#311](#311), [#365](#365)) + Multiple improvements to the Ginkgo containers ([#328](#328), [#361](#361)) + Add sonarqube analysis to Ginkgo ([#304](#304), [#308](#308), [#309](#309)) + Add clang-tidy and iwyu support to Ginkgo ([#298](#298)) + Improve Ginkgo's support of xSDK M12 policy by adding the `TPL_` arguments to CMake ([#300](#300)) + Add support for the xSDK R7 policy ([#325](#325)) + Fix examples in html documentation ([#367](#367)) Related PR: #370

yhmtsai self-assigned this May 26, 2019

yhmtsai added mod:cuda This is related to the CUDA module. type:matrix-format This is related to the Matrix formats mod:openmp This is related to the OpenMP module. labels May 26, 2019

pratikvn added the 1:ST:ready-for-review This PR is ready for review label May 27, 2019

pratikvn requested review from thoasm, tcojean and hartwiganzt May 27, 2019 07:32

tcojean requested changes May 27, 2019

View reviewed changes

thoasm requested changes May 27, 2019

View reviewed changes

hartwiganzt approved these changes May 30, 2019

View reviewed changes

yhmtsai added 6 commits May 30, 2019 17:19

cuda csr->hybrid

d4494bd

cuda hybrid -> csr

f03144d

omp csr -> hybrid

f8559dc

omp hybrid -> csr and ell count_zero

7f39b29

implement 16/32/64 bit atomic_add

1e61c16

delete duplicate function

3e51a52

yhmtsai force-pushed the csr_hybrid_on_cuda_omp branch from 6e1631e to 3e51a52 Compare May 30, 2019 09:20

fix omp typo

89ab7f6

tcojean approved these changes Jun 3, 2019

View reviewed changes

cuda/matrix/hybrid_kernels.cu Outdated Show resolved Hide resolved

delete initial_coo of csr->hyb and modify template

6e60328

thoasm requested changes Jun 3, 2019

View reviewed changes

modify by suggestion

1002488

fix include name

b85e058

thoasm reviewed Jun 4, 2019

View reviewed changes

cuda/components/format_conversion.cuh Show resolved Hide resolved

add description and use static_assert

c255b78

thoasm approved these changes Jun 4, 2019

View reviewed changes

thoasm added 1:ST:ready-to-merge This PR is ready to merge. and removed 1:ST:ready-for-review This PR is ready for review labels Jun 4, 2019

thoasm merged commit 82e6da6 into ginkgo-project:develop Jun 4, 2019

upsj mentioned this pull request Apr 20, 2021

Implement atomic operations for custom value types #65

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cuda and Omp matrix conversion of Csr and Hybrid and Omp count_nonzeros of Ell #310

Cuda and Omp matrix conversion of Csr and Hybrid and Omp count_nonzeros of Ell #310

yhmtsai commented May 25, 2019

tcojean left a comment

thoasm left a comment

yhmtsai commented May 27, 2019

thoasm commented May 27, 2019

yhmtsai commented May 29, 2019

yhmtsai commented May 30, 2019

codecov bot commented Jun 3, 2019 •

edited

Loading

tcojean left a comment

tcojean commented Jun 3, 2019

yhmtsai commented Jun 3, 2019

thoasm left a comment

thoasm left a comment

thoasm commented Jun 4, 2019

thoasm left a comment

Cuda and Omp matrix conversion of Csr and Hybrid and Omp count_nonzeros of Ell #310

Cuda and Omp matrix conversion of Csr and Hybrid and Omp count_nonzeros of Ell #310

Conversation

yhmtsai commented May 25, 2019

tcojean left a comment

Choose a reason for hiding this comment

thoasm left a comment

Choose a reason for hiding this comment

yhmtsai commented May 27, 2019

thoasm commented May 27, 2019

yhmtsai commented May 29, 2019

yhmtsai commented May 30, 2019

codecov bot commented Jun 3, 2019 • edited Loading

Codecov Report

tcojean left a comment

Choose a reason for hiding this comment

tcojean commented Jun 3, 2019

yhmtsai commented Jun 3, 2019

Csr-> Hybrid

Hybrid -> Csr

thoasm left a comment

Choose a reason for hiding this comment

thoasm left a comment

Choose a reason for hiding this comment

thoasm commented Jun 4, 2019

thoasm left a comment

Choose a reason for hiding this comment

codecov bot commented Jun 3, 2019 •

edited

Loading