DPCPP SpGEMM, SpGEAM, Transpose, Sort #799

upsj · 2021-06-18T10:21:18Z

This PR adds SpGEMM, SpGEAM, transpose, sort and is_sorted kernels to DPC++. They don't give great performance, but they work.

SpGEAM uses a simple two-way merge algorithm
SpGEMM uses a binary heap-based multiway merge algorithm like in OpenMP
Sort uses Max-Heapsort, which is the simplest asymptotically optimal in-place sorting algorithm
Transpose uses atomics to count the number of non-zeros per column and assign unique indices in a second pass, followed by sorting
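The transpose steps listed above can be sketched on the host as follows. This is only an illustration under stated assumptions, not the DPC++ kernel from this PR: the `Csr` struct and the `transpose` function are hypothetical stand-ins for Ginkgo's types, `std::atomic` replaces the device atomics, the loops run sequentially, and the final per-row sort copies entries into a temporary buffer and uses `std::make_heap`/`std::sort_heap` (a max-heap based heapsort) instead of an in-place kernel.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical minimal CSR container for illustration (not Ginkgo's matrix::Csr).
struct Csr {
    int num_rows;
    int num_cols;
    std::vector<int> row_ptrs;   // size num_rows + 1
    std::vector<int> col_idxs;   // size nnz
    std::vector<double> values;  // size nnz
};

Csr transpose(const Csr& a)
{
    assert(static_cast<int>(a.row_ptrs.size()) == a.num_rows + 1);
    const int nnz = a.row_ptrs[a.num_rows];
    Csr t{a.num_cols, a.num_rows, std::vector<int>(a.num_cols + 1, 0),
          std::vector<int>(nnz), std::vector<double>(nnz)};

    // Pass 1: count the non-zeros of every input column with atomic increments.
    std::vector<std::atomic<int>> counts(a.num_cols);
    for (auto& c : counts) c.store(0);
    for (int k = 0; k < nnz; ++k) counts[a.col_idxs[k]].fetch_add(1);

    // A prefix sum over the counts gives the output row pointers.
    // The counters are reset so that they can be reused in pass 2.
    for (int c = 0; c < a.num_cols; ++c) {
        t.row_ptrs[c + 1] = t.row_ptrs[c] + counts[c].load();
        counts[c].store(0);
    }

    // Pass 2: fetch_add hands every entry a unique slot inside its output row.
    for (int row = 0; row < a.num_rows; ++row) {
        for (int k = a.row_ptrs[row]; k < a.row_ptrs[row + 1]; ++k) {
            const int col = a.col_idxs[k];
            const int out = t.row_ptrs[col] + counts[col].fetch_add(1);
            t.col_idxs[out] = row;
            t.values[out] = a.values[k];
        }
    }

    // In a parallel run the slot order is arbitrary, so sort every output row
    // by column index. make_heap + sort_heap is a max-heap based heapsort.
    std::vector<std::pair<int, double>> entries;
    for (int row = 0; row < t.num_rows; ++row) {
        const int begin = t.row_ptrs[row];
        const int end = t.row_ptrs[row + 1];
        entries.clear();
        for (int k = begin; k < end; ++k) {
            entries.emplace_back(t.col_idxs[k], t.values[k]);
        }
        auto by_col = [](const auto& x, const auto& y) { return x.first < y.first; };
        std::make_heap(entries.begin(), entries.end(), by_col);
        std::sort_heap(entries.begin(), entries.end(), by_col);
        for (int k = begin; k < end; ++k) {
            t.col_idxs[k] = entries[k - begin].first;
            t.values[k] = entries[k - begin].second;
        }
    }
    return t;
}
```

Calling `transpose` on the 2x3 matrix with rows `[1 0 2]` and `[0 3 0]` yields a 3x2 matrix whose rows each hold a single entry. The sort only matters once pass 2 runs in parallel, because then the order in which entries claim their slots is no longer deterministic.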

codecov · 2021-06-18T13:27:23Z

Codecov Report

Merging #799 (4406ddc) into develop (417df77) will increase coverage by 0.00%.
The diff coverage is n/a.

@@           Coverage Diff            @@
##           develop     #799   +/-   ##
========================================
  Coverage    94.29%   94.29%           
========================================
  Files          408      408           
  Lines        32620    32621    +1     
========================================
+ Hits         30758    30759    +1     
  Misses        1862     1862

Impacted Files	Coverage Δ
omp/reorder/rcm_kernels.cpp	`97.53% <0.00%> (-0.61%)`	⬇️
core/base/extended_float.hpp	`92.23% <0.00%> (+0.97%)`	⬆️


dpcpp/base/executor.dp.cpp

tcojean

LGTM. A few comments or questions.

dpcpp/components/prefix_sum.dp.cpp

tcojean · 2021-06-24T12:49:43Z

dpcpp/matrix/csr_kernels.dp.cpp

+    auto out_row_ptrs = trans->get_row_ptrs();
+    auto out_cols = trans->get_col_idxs();
+    auto out_vals = trans->get_values();
+    components::fill_array(exec, tmp_counts, num_cols, IndexType{});


Note that DPC++/SYCL now also has queue->memset() which we could use in some situations, but I guess in that case this is more fitting?

thoasm · 2021-07-12T19:31:50Z

I think it would be better to use queue->memset() inside components::fill_array for DPC++ and keep the call here. IMO, the current code is more descriptive, and I doubt we would get performance benefits from memset().

yhmtsai · 2021-07-13

memset only sets bytes, right? For fill_array, it would only work for zero (and, for some types, -1).

upsj · 2021-07-13

No, SYCL has a typed memset template.
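The byte-wise versus typed distinction raised in this thread can be illustrated with plain standard C++. The sketch below does not use SYCL at all; `byte_fill` and `typed_fill` are hypothetical helper names. As far as I know, in SYCL 2020 the byte-wise operation is `queue::memset` and the typed counterpart is the `queue::fill<T>` template, but that mapping is an assumption and not something this PR confirms.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Byte-wise fill: every byte of the buffer is set to (unsigned char)byte.
std::vector<std::int32_t> byte_fill(std::size_t n, int byte)
{
    std::vector<std::int32_t> data(n);
    std::memset(data.data(), byte, n * sizeof(std::int32_t));
    return data;
}

// Typed fill: every element is set to the given value.
std::vector<std::int32_t> typed_fill(std::size_t n, std::int32_t value)
{
    std::vector<std::int32_t> data(n);
    std::fill_n(data.begin(), n, value);
    return data;
}
```

With a byte value of `0` or `0xFF`, the byte-wise fill produces the elements `0` and `-1`, which is why those two values happen to work. A byte value of `1` produces `0x01010101` in every element rather than `1`, so a general `fill_array` needs the typed variant.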

tcojean · 2021-06-24T16:31:10Z

dpcpp/components/atomic.dp.hpp

+    struct atomic_helper<                                                  \
+        addressSpace, ValueType,                                           \
+        std::enable_if_t<(sizeof(ValueType) == sizeof(CONVERTER_TYPE))>> { \
+        __dpct_inline__ static ValueType atomic_add(                       \
+            ValueType *__restrict__ addr, ValueType val)                   \


Do we need this? Since they have fetch_add and their stuff is templated everywhere, can't we use the base type directly?
https://intel.github.io/llvm-docs/doxygen/classcl_1_1sycl_1_1atomic.html

yhmtsai · 2021-06-24

IsValidAtomicType does not support complex types, and some of the atomic operations do not support float (see __SYCL_STATIC_ASSERT_NOT_FLOAT).
I did not handle this very elegantly with templates: the complex case goes through the 8-byte type, but a complex value is implemented as two 8-byte parts.
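The idea behind the `sizeof(ValueType) == sizeof(CONVERTER_TYPE)` specialization in the diff can be sketched with standard C++ atomics: when a value type has no native atomic add, its bits are stored in an integer of the same size and updated with a compare-and-swap loop. This is a host-side illustration only, not the DPC++ code; `to_bits`, `from_bits` and this `atomic_add` signature are hypothetical, and `std::atomic<std::uint64_t>` stands in for the SYCL atomic type.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret the bits of a double as a 64-bit integer and back.
inline std::uint64_t to_bits(double v)
{
    std::uint64_t bits;
    std::memcpy(&bits, &v, sizeof(bits));
    return bits;
}

inline double from_bits(std::uint64_t bits)
{
    double v;
    std::memcpy(&v, &bits, sizeof(v));
    return v;
}

// Atomically adds val to the double whose bits are stored in addr and
// returns the previous value.
double atomic_add(std::atomic<std::uint64_t>& addr, double val)
{
    std::uint64_t expected = addr.load();
    // On failure, compare_exchange_weak writes the current content of addr
    // into expected, so the sum is recomputed from fresh data on each retry.
    while (!addr.compare_exchange_weak(expected,
                                       to_bits(from_bits(expected) + val))) {
    }
    return from_bits(expected);
}
```

A `std::complex<double>` does not fit into a single 8-byte converter type. Applying such an update separately to the real and imaginary parts would make each part atomic on its own, but not the complex value as a whole.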

thoasm

LGTM!
I have some nits, but nothing important.

sonarcloud · 2021-07-14T09:24:42Z

SonarCloud Quality Gate failed.

0 Bugs
0 Vulnerabilities
0 Security Hotspots
1 Code Smell

No Coverage information
17.8% Duplication

Ginkgo release 1.4.0

The Ginkgo team is proud to announce the new Ginkgo minor release 1.4.0. This release brings most of the Ginkgo functionality to the Intel DPC++ ecosystem which enables Intel-GPU and CPU execution. The only Ginkgo features which have not been ported yet are some preconditioners.

Ginkgo's mixed-precision support is greatly enhanced thanks to:
1. The new Accessor concept, which allows writing kernels featuring on-the-fly memory compression, among other features. The accessor can be used as header-only, see the [accessor BLAS benchmarks repository](https://github.com/ginkgo-project/accessor-BLAS/tree/develop) as a usage example.
2. All LinOps now transparently support mixed-precision execution. By default, this is done through a temporary copy which may have a performance impact but already allows mixed-precision research.

Native mixed-precision ELL kernels are implemented which do not see this cost. The accessor is also leveraged in a new CB-GMRES solver which allows for performance improvements by compressing the Krylov basis vectors. Many other features have been added to Ginkgo, such as reordering support, a new IDR solver, Incomplete Cholesky preconditioner, matrix assembly support (only CPU for now), machine topology information, and more!

Supported systems and requirements:
+ For all platforms, cmake 3.13+
+ C++14 compliant compiler
+ Linux and MacOS
  + gcc: 5.3+, 6.3+, 7.3+, all versions after 8.1+
  + clang: 3.9+
  + Intel compiler: 2018+
  + Apple LLVM: 8.0+
  + CUDA module: CUDA 9.0+
  + HIP module: ROCm 3.5+
  + DPC++ module: Intel OneAPI 2021.3. Set the CXX compiler to `dpcpp`.
+ Windows
  + MinGW and Cygwin: gcc 5.3+, 6.3+, 7.3+, all versions after 8.1+
  + Microsoft Visual Studio: VS 2019
  + CUDA module: CUDA 9.0+, Microsoft Visual Studio
  + OpenMP module: MinGW or Cygwin.

Algorithm and important feature additions:
+ Add a new DPC++ Executor for SYCL execution and other base utilities #648, #661, #757, #832
+ Port matrix formats, solvers and related kernels to DPC++. For some kernels, also make use of a shared kernel implementation for all executors (except Reference). #710, #799, #779, #733, #844, #843, #789, #845, #849, #855, #856
+ Add accessors which allow multi-precision kernels, among other things. #643, #708
+ Add support for mixed precision operations through apply in all LinOps. #677
+ Add incomplete Cholesky factorizations and preconditioners as well as some improvements to ILU. #672, #837, #846
+ Add an AMGX implementation and kernels on all devices but DPC++. #528, #695, #860
+ Add a new mixed-precision capability solver, Compressed Basis GMRES (CB-GMRES). #693, #763
+ Add the IDR(s) solver. #620
+ Add a new fixed-size block CSR matrix format (for the Reference executor). #671, #730
+ Add native mixed-precision support to the ELL format. #717, #780
+ Add Reverse Cuthill-McKee reordering #500, #649
+ Add matrix assembly support on CPUs. #644
+ Extend ISAI from triangular to general and spd matrices. #690

Other additions:
+ Add the possibility to apply real matrices to complex vectors. #655, #658
+ Add functions to compute the absolute of a matrix format. #636
+ Add symmetric permutation and improve existing permutations. #684, #657, #663
+ Add a MachineTopology class with HWLOC support #554, #697
+ Add an implicit residual norm criterion. #702, #818, #850
+ Row-major accessor is generalized to more than 2 dimensions and a new "block column-major" accessor has been added. #707
+ Add a heat equation example. #698, #706
+ Add ccache support in CMake and CI. #725, #739
+ Allow tuning and benchmarking variables non-intrusively. #692
+ Add triangular solver benchmark #664
+ Add benchmarks for BLAS operations #772, #829
+ Add support for different precisions and consistent index types in benchmarks. #675, #828
+ Add a GitHub bot system to facilitate development and PR management. #667, #674, #689, #853
+ Add Intel (DPC++) CI support and enable CI on HPC systems. #736, #751, #781
+ Add ssh debugging for GitHub Actions CI. #749
+ Add pipeline segmentation for better CI speed. #737

Changes:
+ Add a Scalar Jacobi specialization and kernels. #808, #834, #854
+ Add implicit residual log for solvers and benchmarks. #714
+ Change handling of the conjugate in the dense dot product. #755
+ Improved Dense stride handling. #774
+ Multiple improvements to the OpenMP kernels performance, including COO, an exclusive prefix sum, and more. #703, #765, #740
+ Allow specialization of submatrix and other dense creation functions in solvers. #718
+ Improved Identity constructor and treatment of rectangular matrices. #646
+ Allow CUDA/HIP executors to select allocation mode. #758
+ Check if executors share the same memory. #670
+ Improve test install and smoke testing support. #721
+ Update the JOSS paper citation and add publications in the documentation. #629, #724
+ Improve the version output. #806
+ Add some utilities for dim and span. #821
+ Improved solver and preconditioner benchmarks. #660
+ Improve benchmark timing and output. #669, #791, #801, #812

Fixes:
+ Sorting fix for the Jacobi preconditioner. #659
+ Also log the first residual norm in CGS #735
+ Fix BiCG and HIP CSR to work with complex matrices. #651
+ Fix Coo SpMV on strided vectors. #807
+ Fix segfault of extract_diagonal, add short-and-fat test. #769
+ Fix device_reset issue by moving counter/mutex to device. #810
+ Fix `EnableLogging` superclass. #841
+ Support ROCm 4.1.x and breaking HIP_PLATFORM changes. #726
+ Decreased test size for a few device tests. #742
+ Fix multiple issues with our CMake HIP and RPATH setup. #712, #745, #709
+ Cleanup our CMake installation step. #713
+ Various simplification and fixes to the Windows CMake setup. #720, #785
+ Simplify third-party integration. #786
+ Improve Ginkgo device arch flags management. #696
+ Other fixes and improvements to the CMake setup. #685, #792, #705, #836
+ Clarification of dense norm documentation #784
+ Various development tools fixes and improvements #738, #830, #840
+ Make multiple operators/constructors explicit. #650, #761
+ Fix some issues, memory leaks and warnings found by MSVC. #666, #731
+ Improved solver memory estimates and consistent iteration counts #691
+ Various logger improvements and fixes #728, #743, #754
+ Fix for ForwardIterator requirements in iterator_factory. #665
+ Various benchmark fixes. #647, #673, #722
+ Various CI fixes and improvements. #642, #641, #795, #783, #793, #852

Related PR: #857

Release 1.4.0 to master

The release notes in this commit message repeat the Ginkgo release 1.4.0 notes above verbatim. Related PR: #866

upsj added the 1:ST:ready-for-review label Jun 18, 2021

upsj self-assigned this Jun 18, 2021

upsj force-pushed the dpcpp_spgemm branch 4 times, most recently from 56abb9c to a191fa9 June 18, 2021 10:41

ginkgo-bot added the mod:dpcpp, reg:build, reg:testing and type:matrix-format labels Jun 18, 2021

upsj force-pushed the dpcpp_spgemm branch from a191fa9 to c25639a June 18, 2021 11:44

upsj added this to the Ginkgo 1.4.0 milestone Jun 18, 2021

upsj changed the title ~~Dpcpp spgemm~~ DPCPP SpGEMM, SpGEAM, Transpose, Sort Jun 18, 2021

upsj requested review from tcojean, Slaedr, fritzgoebel, greole, MarcelKoch, pratikvn, thoasm and yhmtsai June 18, 2021 14:43

pratikvn reviewed Jun 18, 2021

dpcpp/base/executor.dp.cpp (outdated, resolved)

tcojean approved these changes Jun 24, 2021

thoasm approved these changes Jul 12, 2021

upsj added 3 commits July 14, 2021 09:22

add DPC++ SPGEMM and SPGEAM kernels

b94c24c

add dpcpp is_sorted_by_col_idxs kernel

ec8a838

add dpcpp csrsort

952a11c

upsj and others added 2 commits July 14, 2021 09:22

add dpcpp transpose kernels

e5eb038

review updates

4406ddc

Co-authored-by: Thomas Grützmacher <[email protected]>

upsj force-pushed the dpcpp_spgemm branch from 693b292 to 4406ddc July 14, 2021 07:29

upsj added the 1:ST:ready-to-merge label and removed the 1:ST:ready-for-review label Jul 14, 2021

upsj merged commit 865d9c4 into develop Jul 14, 2021

upsj deleted the dpcpp_spgemm branch July 14, 2021 11:23

yhmtsai mentioned this pull request Jul 29, 2021

dpcpp porting the rest of matrix format #845

Merged

3 tasks
