Add Compressed Basis GMRES (CB-GMRES) #693

thoasm · 2021-01-23T01:14:52Z

This PR adds Compressed Basis GMRES (CB-GMRES), a GMRES variant based on classical Gram-Schmidt orthogonalization in which the Krylov basis can be stored in lower precision. The user selects the storage precision of the basis through a factory parameter.
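To illustrate the idea of a basis that is stored in lower precision while the arithmetic stays in working precision, here is a minimal, self-contained sketch. It is not Ginkgo's implementation: `CompressedBasis` and `cgs_step` are hypothetical names, the code handles only real-valued types, and it performs a single classical Gram-Schmidt pass without any reorthogonalization.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not Ginkgo's API): basis vectors are stored as
// StorageType, while callers only ever read and write ArithType values.
template <typename ArithType, typename StorageType>
class CompressedBasis {
public:
    explicit CompressedBasis(std::size_t vector_length)
        : length_(vector_length)
    {
        assert(vector_length > 0);
    }

    std::size_t num_vectors() const { return data_.size() / length_; }

    // Compress on write: every entry is rounded to StorageType.
    void append(const std::vector<ArithType>& v)
    {
        assert(v.size() == length_);
        for (ArithType x : v) {
            data_.push_back(static_cast<StorageType>(x));
        }
    }

    // Decompress on read: entry i of basis vector vec as ArithType.
    ArithType at(std::size_t vec, std::size_t i) const
    {
        return static_cast<ArithType>(data_[vec * length_ + i]);
    }

private:
    std::size_t length_;
    std::vector<StorageType> data_;
};

// One classical Gram-Schmidt step in ArithType: orthogonalize w against all
// stored basis vectors, normalize it, and append it to the compressed basis.
// Returns the coefficients h_0..h_{k-1} followed by the norm of the remainder.
template <typename ArithType, typename StorageType>
std::vector<ArithType> cgs_step(CompressedBasis<ArithType, StorageType>& basis,
                                std::vector<ArithType> w)
{
    const std::size_t n = w.size();
    const std::size_t k = basis.num_vectors();
    std::vector<ArithType> h(k + 1, ArithType{0});
    // Classical Gram-Schmidt: all dot products use the unmodified w.
    for (std::size_t j = 0; j < k; ++j) {
        for (std::size_t i = 0; i < n; ++i) {
            h[j] += basis.at(j, i) * w[i];
        }
    }
    // Subtract the projections onto the stored basis vectors.
    for (std::size_t j = 0; j < k; ++j) {
        for (std::size_t i = 0; i < n; ++i) {
            w[i] -= h[j] * basis.at(j, i);
        }
    }
    ArithType norm_sq{0};
    for (ArithType x : w) {
        norm_sq += x * x;
    }
    h[k] = std::sqrt(norm_sq);
    assert(h[k] > ArithType{0});  // a zero remainder would mean breakdown
    for (ArithType& x : w) {
        x /= h[k];
    }
    basis.append(w);  // the new basis vector is rounded to StorageType here
    return h;
}
```

With `CompressedBasis<double, float>`, the basis takes half the memory of a `double` basis, and the stored vectors are orthogonal only up to `float` rounding. Choosing `StorageType` equal to `ArithType` recovers an uncompressed basis.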

This PR is an updated version of #640 because the underlying branch changed (to better reflect the content of the branch).

thoasm · 2021-01-23T01:15:12Z

format!

thoasm · 2021-01-23T01:15:17Z

label!

hartwiganzt · 2021-01-25T07:52:28Z

I think all tests pass (except for the cleanup); only one Windows test fails. Maybe you can already start reviewing this, @tcojean @fritzgoebel @pratikvn @yhmtsai @upsj @Slaedr?

yhmtsai

This is the first part of my review.
I have not gone into the implementation and algorithm of CB-GMRES yet.
You can already start addressing these comments.

BENCHMARKING.md

benchmark/run_all_benchmarks.sh

BENCHMARKING.md

common/components/atomic.hpp.inc

examples/gmres-solver/doc/intro.dox

examples/gmres-solver/doc/tooltip

examples/gmres-solver/gmres-solver.cpp

include/ginkgo/core/solver/cb_gmres.hpp

reference/stop/residual_norm_kernels.cpp

yhmtsai

Some comments and questions about the GPU kernels.

yhmtsai · 2021-01-26T06:51:41Z

common/solver/cb_gmres_kernels.hpp.inc

+ __shared__ UninitializedArray<remove_complex<ValueType>,
+ default_dot_dim *(default_dot_dim + 1)>
+ reduction_helper_array;
+ remove_complex<ValueType> *__restrict__ reduction_helper =
+ reduction_helper_array;


The element type is `remove_complex<ValueType>`, so I think using a plain `__shared__` array directly is fine.

thoasm · 2021-02-02

Would you mind if we keep it like that? It works and prevents a warning from showing up.

yhmtsai · 2021-01-26T06:52:36Z

common/solver/cb_gmres_kernels.hpp.inc

+ // Used that way to get around dynamic initialization warning and
+ // template error when using `reduction_helper_array` directly in `reduce`
+ __shared__ UninitializedArray<remove_complex<ValueType>,
+ default_dot_dim *(default_dot_dim + 1)>
+ reduction_helper_array;
+ remove_complex<ValueType> *__restrict__ reduction_helper =
+ reduction_helper_array;


- Add arnoldi_norm documentation and add it in the tests (was not part of it previously, and needs a fix)
- Add documentation to benchmarking
- Rename namespace and helper for range helper

Co-authored-by: Terry Cojean
Co-authored-by: Yuhsiang M. Tsai


sonarcloud · 2021-02-19T23:04:10Z

SonarCloud Quality Gate failed.

0 Bugs
0 Vulnerabilities
0 Security Hotspots
64 Code Smells

32.9% Coverage
11.6% Duplication

Ginkgo release 1.4.0

The Ginkgo team is proud to announce the new Ginkgo minor release 1.4.0. This release brings most of the Ginkgo functionality to the Intel DPC++ ecosystem which enables Intel-GPU and CPU execution. The only Ginkgo features which have not been ported yet are some preconditioners.

Ginkgo's mixed-precision support is greatly enhanced thanks to:

1. The new Accessor concept, which allows writing kernels featuring on-the-fly memory compression, among other features. The accessor can be used as header-only, see the [accessor BLAS benchmarks repository](https://github.com/ginkgo-project/accessor-BLAS/tree/develop) as a usage example.
2. All LinOps now transparently support mixed-precision execution. By default, this is done through a temporary copy which may have a performance impact but already allows mixed-precision research.

Native mixed-precision ELL kernels are implemented which do not see this cost. The accessor is also leveraged in a new CB-GMRES solver which allows for performance improvements by compressing the Krylov basis vectors. Many other features have been added to Ginkgo, such as reordering support, a new IDR solver, Incomplete Cholesky preconditioner, matrix assembly support (only CPU for now), machine topology information, and more!

Supported systems and requirements:

+ For all platforms, cmake 3.13+
+ C++14 compliant compiler
+ Linux and MacOS
  + gcc: 5.3+, 6.3+, 7.3+, all versions after 8.1+
  + clang: 3.9+
  + Intel compiler: 2018+
  + Apple LLVM: 8.0+
  + CUDA module: CUDA 9.0+
  + HIP module: ROCm 3.5+
  + DPC++ module: Intel OneAPI 2021.3. Set the CXX compiler to `dpcpp`.
+ Windows
  + MinGW and Cygwin: gcc 5.3+, 6.3+, 7.3+, all versions after 8.1+
  + Microsoft Visual Studio: VS 2019
  + CUDA module: CUDA 9.0+, Microsoft Visual Studio
  + OpenMP module: MinGW or Cygwin.

Algorithm and important feature additions:

+ Add a new DPC++ Executor for SYCL execution and other base utilities [#648](#648), [#661](#661), [#757](#757), [#832](#832)
+ Port matrix formats, solvers and related kernels to DPC++. For some kernels, also make use of a shared kernel implementation for all executors (except Reference). [#710](#710), [#799](#799), [#779](#779), [#733](#733), [#844](#844), [#843](#843), [#789](#789), [#845](#845), [#849](#849), [#855](#855), [#856](#856)
+ Add accessors which allow multi-precision kernels, among other things. [#643](#643), [#708](#708)
+ Add support for mixed precision operations through apply in all LinOps. [#677](#677)
+ Add incomplete Cholesky factorizations and preconditioners as well as some improvements to ILU. [#672](#672), [#837](#837), [#846](#846)
+ Add an AMGX implementation and kernels on all devices but DPC++. [#528](#528), [#695](#695), [#860](#860)
+ Add a new mixed-precision capability solver, Compressed Basis GMRES (CB-GMRES). [#693](#693), [#763](#763)
+ Add the IDR(s) solver. [#620](#620)
+ Add a new fixed-size block CSR matrix format (for the Reference executor). [#671](#671), [#730](#730)
+ Add native mixed-precision support to the ELL format. [#717](#717), [#780](#780)
+ Add Reverse Cuthill-McKee reordering [#500](#500), [#649](#649)
+ Add matrix assembly support on CPUs. [#644](#644)
+ Extends ISAI from triangular to general and spd matrices. [#690](#690)

Other additions:

+ Add the possibility to apply real matrices to complex vectors. [#655](#655), [#658](#658)
+ Add functions to compute the absolute of a matrix format. [#636](#636)
+ Add symmetric permutation and improve existing permutations. [#684](#684), [#657](#657), [#663](#663)
+ Add a MachineTopology class with HWLOC support [#554](#554), [#697](#697)
+ Add an implicit residual norm criterion. [#702](#702), [#818](#818), [#850](#850)
+ Row-major accessor is generalized to more than 2 dimensions and a new "block column-major" accessor has been added. [#707](#707)
+ Add an heat equation example. [#698](#698), [#706](#706)
+ Add ccache support in CMake and CI. [#725](#725), [#739](#739)
+ Allow tuning and benchmarking variables non intrusively. [#692](#692)
+ Add triangular solver benchmark [#664](#664)
+ Add benchmarks for BLAS operations [#772](#772), [#829](#829)
+ Add support for different precisions and consistent index types in benchmarks. [#675](#675), [#828](#828)
+ Add a Github bot system to facilitate development and PR management. [#667](#667), [#674](#674), [#689](#689), [#853](#853)
+ Add Intel (DPC++) CI support and enable CI on HPC systems. [#736](#736), [#751](#751), [#781](#781)
+ Add ssh debugging for Github Actions CI. [#749](#749)
+ Add pipeline segmentation for better CI speed. [#737](#737)

Changes:

+ Add a Scalar Jacobi specialization and kernels. [#808](#808), [#834](#834), [#854](#854)
+ Add implicit residual log for solvers and benchmarks. [#714](#714)
+ Change handling of the conjugate in the dense dot product. [#755](#755)
+ Improved Dense stride handling. [#774](#774)
+ Multiple improvements to the OpenMP kernels performance, including COO, an exclusive prefix sum, and more. [#703](#703), [#765](#765), [#740](#740)
+ Allow specialization of submatrix and other dense creation functions in solvers. [#718](#718)
+ Improved Identity constructor and treatment of rectangular matrices. [#646](#646)
+ Allow CUDA/HIP executors to select allocation mode. [#758](#758)
+ Check if executors share the same memory. [#670](#670)
+ Improve test install and smoke testing support. [#721](#721)
+ Update the JOSS paper citation and add publications in the documentation. [#629](#629), [#724](#724)
+ Improve the version output. [#806](#806)
+ Add some utilities for dim and span. [#821](#821)
+ Improved solver and preconditioner benchmarks. [#660](#660)
+ Improve benchmark timing and output. [#669](#669), [#791](#791), [#801](#801), [#812](#812)

Fixes:

+ Sorting fix for the Jacobi preconditioner. [#659](#659)
+ Also log the first residual norm in CGS [#735](#735)
+ Fix BiCG and HIP CSR to work with complex matrices. [#651](#651)
+ Fix Coo SpMV on strided vectors. [#807](#807)
+ Fix segfault of extract_diagonal, add short-and-fat test. [#769](#769)
+ Fix device_reset issue by moving counter/mutex to device. [#810](#810)
+ Fix `EnableLogging` superclass. [#841](#841)
+ Support ROCm 4.1.x and breaking HIP_PLATFORM changes. [#726](#726)
+ Decreased test size for a few device tests. [#742](#742)
+ Fix multiple issues with our CMake HIP and RPATH setup. [#712](#712), [#745](#745), [#709](#709)
+ Cleanup our CMake installation step. [#713](#713)
+ Various simplification and fixes to the Windows CMake setup. [#720](#720), [#785](#785)
+ Simplify third-party integration. [#786](#786)
+ Improve Ginkgo device arch flags management. [#696](#696)
+ Other fixes and improvements to the CMake setup. [#685](#685), [#792](#792), [#705](#705), [#836](#836)
+ Clarification of dense norm documentation [#784](#784)
+ Various development tools fixes and improvements [#738](#738), [#830](#830), [#840](#840)
+ Make multiple operators/constructors explicit. [#650](#650), [#761](#761)
+ Fix some issues, memory leaks and warnings found by MSVC. [#666](#666), [#731](#731)
+ Improved solver memory estimates and consistent iteration counts [#691](#691)
+ Various logger improvements and fixes [#728](#728), [#743](#743), [#754](#754)
+ Fix for ForwardIterator requirements in iterator_factory. [#665](#665)
+ Various benchmark fixes. [#647](#647), [#673](#673), [#722](#722)
+ Various CI fixes and improvements. [#642](#642), [#641](#641), [#795](#795), [#783](#783), [#793](#793), [#852](#852)

Related PR: #857

Release 1.4.0 to master

This commit message repeats the release notes of the "Ginkgo release 1.4.0" commit above verbatim.

Related PR: #866

thoasm requested review from upsj, pratikvn, Slaedr, yhmtsai, hartwiganzt and tcojean January 23, 2021 01:14

thoasm self-assigned this Jan 23, 2021

thoasm assigned josealiaga Jan 23, 2021

thoasm added the 1:ST:ready-for-review This PR is ready for review label Jan 23, 2021

thoasm force-pushed the cb_gmres branch 7 times, most recently from 24c758c to 7f5aa76 Compare January 25, 2021 01:19

yhmtsai mentioned this pull request Jan 25, 2021

Feature requests #639

Open

9 tasks

thoasm force-pushed the cb_gmres branch from 7f5aa76 to 5c879d1 Compare January 25, 2021 13:10

yhmtsai reviewed Jan 25, 2021

yhmtsai reviewed Jan 26, 2021

Thomas Grützmacher and others added 22 commits February 19, 2021 13:44

Update tolerance for one reference CB-GMRES test

6bb09da

This is necessary for the Intel compiler to pass the test `SolvesStencilSystem2`.

Update atomic_max

3655021

- Remove unnecessary code
- Add wrapper function to `atomic_helper` in order to make it easier to implement other atomic operations (atomic_add and atomic_max use this wrapper)
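The commit message does not show the wrapper itself. As a rough host-side analogue, assuming the wrapper is built around a compare-and-swap retry loop (a common way to derive atomic operations that are not available natively), the structure could look like the sketch below. All names (`atomic_apply`, `atomic_add_sketch`, `atomic_max_sketch`) are hypothetical, and `std::atomic` stands in for the device atomics used in the actual kernels.

```cpp
#include <atomic>

// Hypothetical generic wrapper: atomically replaces the stored value with
// op(stored value, operand) and returns the previous value.
template <typename T, typename Op>
T atomic_apply(std::atomic<T>& target, T operand, Op op)
{
    T old_value = target.load();
    // On failure, compare_exchange_weak writes the current value into
    // old_value, so each retry recomputes op with fresh data.
    while (!target.compare_exchange_weak(old_value, op(old_value, operand))) {
    }
    return old_value;
}

// With the wrapper in place, each atomic operation is a one-liner.
template <typename T>
T atomic_add_sketch(std::atomic<T>& target, T value)
{
    return atomic_apply(target, value, [](T a, T b) { return a + b; });
}

template <typename T>
T atomic_max_sketch(std::atomic<T>& target, T value)
{
    return atomic_apply(target, value, [](T a, T b) { return a < b ? b : a; });
}
```

The benefit of such a design is that a further operation, for example an atomic minimum, only needs a new lambda and not a new retry loop.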

Remove unnecessary kernels and properly name them in CB-GMRES

5270478

Co-authored-by: Terry Cojean
Co-authored-by: Yuhsiang M. Tsai

Add Helper INSTANTIATE macro for CB-GMRES

58e1104

Co-authored-by: Terry Cojean

Remove CB-GMRES and GMRES example

c47a4ca

Both examples are removed because the functionality is so similar to simple-solver, so it does not add a lot of value.

Review update

2b587ba

- Add more documentation to CB-GMRES
- Add CB-GMRES to test-install

Co-authored-by: Terry Cojean
Co-authored-by: Yuhsiang M. Tsai

Remove unnecessary includes of iostream and time.h

da8a6b4

Remove circular dependency of compute_norm2 in (CB)-GMRES

c4c1270

Update solver generation in benchmark

00d33b5

Update eta and arnoldi_norms in CB-GMRES

b163893

Remove CUDA 9.0 exception for constexpr parameter

342957a

`--expt-relaxed-constexpr` is now used for every CUDA version

Review Update

370d208

Co-authored-by: Pratik Nayak

Sonarcloud update

1b2071d

The following modifications were done for CB-GMRES:

- Remove unused kernel parameters `num_reorth_steps` and `num_reorth_vectors`
- Remove unused `b_norm`
- Make unused kernel parameters unnamed
- Add some explicit casts to prevent warning

Review update; Improve run_all_benchmarks.sh

b4a6fc9

- Update documentation of GMRES to mention the usage of MGS
- Use reduced precision in CB-GMRES by default

Put storage_precision enum into cb_gmres namespace

599e261

Add CB-GMRES example

5126858

Remove unnecessary included files for CB-GMRES

169040d

Review update

05374fd

Co-authored-by: Pratik Nayak

Review update

389d038

- Extract GPU kernels for CB-GMRES and GMRES into a new file to avoid duplication.
- Adopt the updated GMRES functionality for these kernels for CPU and GPU

Co-authored-by: Pratik Nayak

Update contributors.txt

6d6bbab

Update contributors.txt

4d722f1

thoasm force-pushed the cb_gmres branch from 091b558 to 4d722f1 Compare February 19, 2021 12:55

thoasm added 1:ST:ready-to-merge This PR is ready to merge. and removed 1:ST:ready-for-review This PR is ready for review labels Feb 19, 2021

thoasm merged commit c178258 into develop Feb 21, 2021

thoasm deleted the cb_gmres branch February 21, 2021 07:05
