
Support Mixed precision SpMV and BLAS operations #677

Merged · 5 commits · Mar 31, 2021
Conversation

@upsj upsj commented Dec 17, 2020

This PR adds full mixed precision support to Ginkgo, at least in terms of LinOp compatibility.
By default, this happens through the new make_temporary_conversion helper, which works similarly to make_temporary_clone. It is wrapped in the precision_dispatch function, which applies the correct precision and complex-to-real conversions for SpMV and preconditioner applications. For solvers, a more explicit conversion is necessary.

The temporary conversion wrapper won't give ideal performance, since each apply operation requires additional conversions from and to the input/output vectors, as well as the associated allocations of temporary memory. For solvers, this might still pay off due to the long runtime of a single apply. Mixed precision IR is then almost equivalent to the following:

auto solver =
    gko::solver::Ir<double>::build()
        .with_solver(
            gko::solver::Gmres<float>::build()
                .with_criteria(
                    gko::stop::ResidualNormReduction<float>::build()
                        .with_reduction_factor(inner_reduction_factor)
                        .on(exec),
                    gko::stop::Iteration::build()
                        .with_max_iters(max_inner_iters)
                        .on(exec))
                .on(exec))
        .with_criteria(
            gko::stop::ResidualNormReduction<double>::build()
                .with_reduction_factor(outer_reduction_factor)
                .on(exec),
            gko::stop::Iteration::build()
                .with_max_iters(max_outer_iters)
                .on(exec))
        .on(exec)
        ->generate(give(A));

except that A is always stored, and operated on, in double precision.

TODO:

  • Comprehensive reference tests
  • No need to convert x to ValueType for LinOps where apply_uses_initial_guess() == false (dropped: this would be way too complex and probably overkill, since this path is not meant to provide good performance)
  • Modify MPIR example

@upsj upsj added is:experimental This is an experimental feature/PR/issue/module. mod:core This is related to the core module. labels Dec 17, 2020
@upsj upsj self-assigned this Dec 17, 2020
@ginkgo-bot ginkgo-bot added mod:cuda This is related to the CUDA module. mod:hip This is related to the HIP module. mod:openmp This is related to the OpenMP module. mod:reference This is related to the reference module. reg:benchmarking This is related to benchmarking. type:matrix-format This is related to the Matrix formats type:preconditioner This is related to the preconditioners type:solver This is related to the solvers labels Dec 17, 2020
@upsj upsj added 1:ST:ready-for-review This PR is ready for review and removed is:experimental This is an experimental feature/PR/issue/module. mod:cuda This is related to the CUDA module. mod:hip This is related to the HIP module. mod:openmp This is related to the OpenMP module. mod:reference This is related to the reference module. labels Mar 8, 2021
@upsj upsj mentioned this pull request Mar 8, 2021
1 task
@tcojean tcojean left a comment

There are a few mistakes I could find, seemingly due to copy/paste. Of course, tests are still missing. In addition, I think you should change the mixed precision IR example to reflect the new structure, or add an extra case in the same example.

Another question: can anything be done about ISAI? I think changing the TRS might be enough for ILU and IC, but ISAI might need some more changes, particularly since it stores the approximate inverse as a plain CSR matrix.

core/base/combination.cpp (outdated)
core/base/perturbation.cpp (outdated)
core/base/temporary_conversion.hpp (outdated, 3 threads)
Comment on lines 79 to 86
    } else {
-       GKO_NOT_IMPLEMENTED;
+       precision_dispatch_spmv<ValueType>(
+           [&](auto dense_b, auto dense_x) {
+               exec->run(
+                   diagonal::make_apply_to_dense(this, dense_b, dense_x));
+           },
+           b, x);
    }
Member

In terms of functionality, that should work, since all formats can convert to Dense. But do we want that to happen? What about other formats? We could keep a GKO_NOT_IMPLEMENTED as well.

Member

I guess the new form corresponds more to what we do in other cases, like for CSR, where we assume all other operands to be dense.

@yhmtsai yhmtsai (Member) Mar 9, 2021

I prefer staying with GKO_NOT_IMPLEMENTED. Converting to dense may need a lot of storage.

@upsj upsj (Member Author) Mar 9, 2021

True, I hadn't even considered the sparse -> dense conversions here. So I will have to test against all Dense types instead of ConvertibleTo.

Collaborator

I agree, diagonal to dense might not be something one would want to do.

Member Author

I actually misremembered my implementation: I never try to cast to ConvertibleTo in make_temporary_conversion, only to Dense directly, so this is not an issue.

Member

What I mean here is that when someone passes a Csr type or any other form of LinOp instead of a Dense, we would previously get a GKO_NOT_IMPLEMENTED exception thrown here. Now it is not clear from the code of make_temporary_conversion what would happen, or in what way it would fail.

Exceptions are part of the interface AFAIK?

Member Author

Before my last commit, it would have thrown NotSupported in make_temporary_conversion. With the new changes in place, it throws NotSupported in conversion_helper::convert. Now that I think of it, the previous version might actually be better, together with a make_temporary_conversion/precision_dispatch_nothrow variant that just returns nullptr or something like that in case of an error.

core/preconditioner/jacobi.cpp (outdated)
core/solver/bicg.cpp (outdated)
core/base/precision_dispatch.hpp (outdated)
include/ginkgo/core/solver/idr.hpp
@codecov

codecov bot commented Mar 8, 2021

Codecov Report

Merging #677 (534a02a) into develop (3ab51db) will increase coverage by 0.21%.
The diff coverage is 96.88%.


@@             Coverage Diff             @@
##           develop     #677      +/-   ##
===========================================
+ Coverage    92.56%   92.78%   +0.21%     
===========================================
  Files          389      392       +3     
  Lines        29220    30408    +1188     
===========================================
+ Hits         27047    28213    +1166     
- Misses        2173     2195      +22     
Impacted Files Coverage Δ
include/ginkgo/core/solver/bicg.hpp 100.00% <ø> (ø)
include/ginkgo/core/solver/bicgstab.hpp 100.00% <ø> (ø)
include/ginkgo/core/solver/cb_gmres.hpp 100.00% <ø> (ø)
include/ginkgo/core/solver/cg.hpp 100.00% <ø> (ø)
include/ginkgo/core/solver/cgs.hpp 100.00% <ø> (ø)
include/ginkgo/core/solver/fcg.hpp 100.00% <ø> (ø)
include/ginkgo/core/solver/gmres.hpp 100.00% <ø> (ø)
include/ginkgo/core/solver/idr.hpp 100.00% <ø> (ø)
include/ginkgo/core/solver/ir.hpp 100.00% <ø> (ø)
omp/solver/cb_gmres_kernels.cpp 78.48% <ø> (ø)
... and 64 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3ab51db...534a02a. Read the comment docs.

@upsj upsj (Member Author) commented Mar 20, 2021

@tcojean I am a bit unsure about the MPIR example, since modifying it to use our mixed precision support would not be 100% equivalent: MPIR uses the matrix stored in float for the inner solves, but in double for the outer solves. So either we use the matrix in float, which gives us only float-level precision in the overall solve, or we store it in double, which uses "too precise" SpMVs in the inner solver.

@tcojean tcojean (Member) commented Mar 24, 2021

For the example: if it doesn't have exactly the same effect as the old one, I understand it makes sense to keep the old one. The next question then is whether we need a new example to advertise these new features and help people play with them in a simple fashion. I also like Mike's idea of an apply(Dense, Dense), as that would isolate the dispatching into one place and put the relevant (lengthy) implementation in another. Otherwise, LGTM (minus leftover compilation issues).

@thoasm thoasm mentioned this pull request Mar 26, 2021
4 tasks
@upsj upsj force-pushed the mixed_precision_spmv branch 2 times, most recently from b601292 to 13f9ce0 Compare March 28, 2021 22:05
@pratikvn pratikvn (Member) commented Mar 29, 2021

In terms of performance, can we make sure that the base case of no mixed precision still has the same performance as before (in current develop)? All the applies are now wrapped by precision_dispatch and lambdas, which is a major change. Can you maybe quickly run some benchmarks for some small matrices for all the applies? I know that it shouldn't affect the performance, but I think it is better to be sure.

@upsj upsj (Member Author) commented Mar 29, 2021

@pratikvn Good point, I ran a small benchmark (ani4.mtx with CG on reference) to get some performance numbers:

Before: 0.03631952699999999
After:  0.03605373300000001

So I don't see any overhead at all over 100 repetitions of the solve benchmarks.
Looking at the overall code, this is not too surprising to me: the only substantial changes happen in the code path that would previously fail (wrong Dense value type); everything else is almost equivalent to what we did previously with gko::as.

@tcojean tcojean left a comment

LGTM in general. A minor comment on variable naming.

One more important issue: I think I would really prefer if we used @yhmtsai's idea of apply(Dense*, Dense*), so that all of the dispatch code is isolated from the lengthy implementation. That would, I think, make the code much clearer and also hopefully help reduce the amount of code changes.

reference/test/matrix/coo_kernels.cpp (outdated, 2 threads)
upsj and others added 3 commits March 30, 2021 13:38
* add missing conjugation to CBGMRES
* work around complex accessor issues

Co-authored-by: Thomas Grützmacher <[email protected]>
@tcojean tcojean left a comment

LGTM.

core/solver/cb_gmres.cpp (outdated)
@fritzgoebel fritzgoebel (Collaborator) left a comment

LGTM!

@upsj upsj added 1:ST:ready-to-merge This PR is ready to merge. and removed 1:ST:ready-for-review This PR is ready for review labels Mar 30, 2021
@Slaedr Slaedr (Contributor) left a comment

Great work! That is quite elegant. After seeing the code for the solver apply etc., I realize how good this is.

I think I found some bugs in a couple of tests. I also have a few minor clarifications.

I wonder if it makes sense to write a test to ensure that const arguments are not copied back, for performance reasons. Maybe what you could do is, in one test, write a dummy operator templated on value_type with an apply(const LinOp *x, LinOp *y). Inside the apply, const_cast the x and change it. Outside, ensure that x was not modified. It might be nice to have such a test in case the conversion_helper etc. need to be modified in the future.

Also, it would be nice to add the new tests that you added to one of the objects, maybe the CSR matrix kernels, to the other backends as well. Just the CSR tests would be enough, I think; they would add almost nothing to the testing time, and we could be sure all this works on the other backends too.

include/ginkgo/core/base/temporary_conversion.hpp (outdated)
reference/test/base/combination.cpp (outdated, 2 threads)
reference/test/matrix/sellp_kernels.cpp (outdated, 2 threads)
reference/test/matrix/sparsity_csr_kernels.cpp (outdated, 2 threads)
@upsj upsj (Member Author) commented Mar 30, 2021

@Slaedr Thanks, those are all really good suggestions; I incorporated all of them.

@Slaedr Slaedr (Contributor) left a comment

LGTM! Just one set of small nits about the tolerance.

cuda/test/matrix/dense_kernels.cpp (outdated)
hip/test/matrix/dense_kernels.hip.cpp (outdated)
omp/test/matrix/dense_kernels.cpp (outdated)
* fix missing documentation
* fix mixed precision tests
* add tests for device temporary conversion
* add tests for make_temporary_conversion behavior

Co-authored-by: Aditya Kashi <[email protected]>
@upsj upsj (Member Author) commented Mar 31, 2021

@Slaedr Good catch! This was actually an accident: our implementations seem to be bitwise equivalent, so we could even use zero tolerance. We're not comparing against a "ground truth", but against two equally inexact implementations.

sonarcloud bot commented Mar 31, 2021

Kudos, SonarCloud Quality Gate passed!

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 30 Code Smells

Coverage: 78.8%
Duplication: 0.1%

@upsj upsj merged commit 40de7dc into develop Mar 31, 2021
@upsj upsj deleted the mixed_precision_spmv branch March 31, 2021 17:52
@upsj upsj mentioned this pull request Apr 20, 2021
tcojean added a commit that referenced this pull request Aug 20, 2021
Ginkgo release 1.4.0

The Ginkgo team is proud to announce the new Ginkgo minor release 1.4.0. This
release brings most of the Ginkgo functionality to the Intel DPC++ ecosystem
which enables Intel-GPU and CPU execution. The only Ginkgo features which have
not been ported yet are some preconditioners.

Ginkgo's mixed-precision support is greatly enhanced thanks to:
1. The new Accessor concept, which allows writing kernels featuring on-the-fly
memory compression, among other features. The accessor can be used as
header-only, see the [accessor BLAS benchmarks repository](https://github.com/ginkgo-project/accessor-BLAS/tree/develop) as a usage example.
2. All LinOps now transparently support mixed-precision execution. By default,
this is done through a temporary copy which may have a performance impact but
already allows mixed-precision research.

Native mixed-precision ELL kernels are implemented which do not see this cost.
The accessor is also leveraged in a new CB-GMRES solver which allows for
performance improvements by compressing the Krylov basis vectors. Many other
features have been added to Ginkgo, such as reordering support, a new IDR
solver, Incomplete Cholesky preconditioner, matrix assembly support (only CPU
for now), machine topology information, and more!

Supported systems and requirements:
+ For all platforms, cmake 3.13+
+ C++14 compliant compiler
+ Linux and MacOS
  + gcc: 5.3+, 6.3+, 7.3+, all versions after 8.1+
  + clang: 3.9+
  + Intel compiler: 2018+
  + Apple LLVM: 8.0+
  + CUDA module: CUDA 9.0+
  + HIP module: ROCm 3.5+
  + DPC++ module: Intel OneAPI 2021.3. Set the CXX compiler to `dpcpp`.
+ Windows
  + MinGW and Cygwin: gcc 5.3+, 6.3+, 7.3+, all versions after 8.1+
  + Microsoft Visual Studio: VS 2019
  + CUDA module: CUDA 9.0+, Microsoft Visual Studio
  + OpenMP module: MinGW or Cygwin.


Algorithm and important feature additions:
+ Add a new DPC++ Executor for SYCL execution and other base utilities
  [#648](#648), [#661](#661), [#757](#757), [#832](#832)
+ Port matrix formats, solvers and related kernels to DPC++. For some kernels,
  also make use of a shared kernel implementation for all executors (except
  Reference). [#710](#710), [#799](#799), [#779](#779), [#733](#733), [#844](#844), [#843](#843), [#789](#789), [#845](#845), [#849](#849), [#855](#855), [#856](#856)
+ Add accessors which allow multi-precision kernels, among other things.
  [#643](#643), [#708](#708)
+ Add support for mixed precision operations through apply in all LinOps. [#677](#677)
+ Add incomplete Cholesky factorizations and preconditioners as well as some
  improvements to ILU. [#672](#672), [#837](#837), [#846](#846)
+ Add an AMGX implementation and kernels on all devices but DPC++.
  [#528](#528), [#695](#695), [#860](#860)
+ Add a new mixed-precision capability solver, Compressed Basis GMRES
  (CB-GMRES). [#693](#693), [#763](#763)
+ Add the IDR(s) solver. [#620](#620)
+ Add a new fixed-size block CSR matrix format (for the Reference executor).
  [#671](#671), [#730](#730)
+ Add native mixed-precision support to the ELL format. [#717](#717), [#780](#780)
+ Add Reverse Cuthill-McKee reordering [#500](#500), [#649](#649)
+ Add matrix assembly support on CPUs. [#644](#644)
+ Extend ISAI from triangular to general and SPD matrices. [#690](#690)

Other additions:
+ Add the possibility to apply real matrices to complex vectors.
  [#655](#655), [#658](#658)
+ Add functions to compute the absolute of a matrix format. [#636](#636)
+ Add symmetric permutation and improve existing permutations.
  [#684](#684), [#657](#657), [#663](#663)
+ Add a MachineTopology class with HWLOC support [#554](#554), [#697](#697)
+ Add an implicit residual norm criterion. [#702](#702), [#818](#818), [#850](#850)
+ Row-major accessor is generalized to more than 2 dimensions and a new
  "block column-major" accessor has been added. [#707](#707)
+ Add a heat equation example. [#698](#698), [#706](#706)
+ Add ccache support in CMake and CI. [#725](#725), [#739](#739)
+ Allow tuning and benchmarking variables non-intrusively. [#692](#692)
+ Add triangular solver benchmark [#664](#664)
+ Add benchmarks for BLAS operations [#772](#772), [#829](#829)
+ Add support for different precisions and consistent index types in benchmarks.
  [#675](#675), [#828](#828)
+ Add a Github bot system to facilitate development and PR management.
  [#667](#667), [#674](#674), [#689](#689), [#853](#853)
+ Add Intel (DPC++) CI support and enable CI on HPC systems. [#736](#736), [#751](#751), [#781](#781)
+ Add ssh debugging for Github Actions CI. [#749](#749)
+ Add pipeline segmentation for better CI speed. [#737](#737)


Changes:
+ Add a Scalar Jacobi specialization and kernels. [#808](#808), [#834](#834), [#854](#854)
+ Add implicit residual log for solvers and benchmarks. [#714](#714)
+ Change handling of the conjugate in the dense dot product. [#755](#755)
+ Improved Dense stride handling. [#774](#774)
+ Multiple improvements to the OpenMP kernels performance, including COO,
an exclusive prefix sum, and more. [#703](#703), [#765](#765), [#740](#740)
+ Allow specialization of submatrix and other dense creation functions in solvers. [#718](#718)
+ Improved Identity constructor and treatment of rectangular matrices. [#646](#646)
+ Allow CUDA/HIP executors to select allocation mode. [#758](#758)
+ Check if executors share the same memory. [#670](#670)
+ Improve test install and smoke testing support. [#721](#721)
+ Update the JOSS paper citation and add publications in the documentation.
  [#629](#629), [#724](#724)
+ Improve the version output. [#806](#806)
+ Add some utilities for dim and span. [#821](#821)
+ Improved solver and preconditioner benchmarks. [#660](#660)
+ Improve benchmark timing and output. [#669](#669), [#791](#791), [#801](#801), [#812](#812)


Fixes:
+ Sorting fix for the Jacobi preconditioner. [#659](#659)
+ Also log the first residual norm in CGS [#735](#735)
+ Fix BiCG and HIP CSR to work with complex matrices. [#651](#651)
+ Fix Coo SpMV on strided vectors. [#807](#807)
+ Fix segfault of extract_diagonal, add short-and-fat test. [#769](#769)
+ Fix device_reset issue by moving counter/mutex to device. [#810](#810)
+ Fix `EnableLogging` superclass. [#841](#841)
+ Support ROCm 4.1.x and breaking HIP_PLATFORM changes. [#726](#726)
+ Decreased test size for a few device tests. [#742](#742)
+ Fix multiple issues with our CMake HIP and RPATH setup.
  [#712](#712), [#745](#745), [#709](#709)
+ Cleanup our CMake installation step. [#713](#713)
+ Various simplification and fixes to the Windows CMake setup. [#720](#720), [#785](#785)
+ Simplify third-party integration. [#786](#786)
+ Improve Ginkgo device arch flags management. [#696](#696)
+ Other fixes and improvements to the CMake setup.
  [#685](#685), [#792](#792), [#705](#705), [#836](#836)
+ Clarification of dense norm documentation [#784](#784)
+ Various development tools fixes and improvements [#738](#738), [#830](#830), [#840](#840)
+ Make multiple operators/constructors explicit. [#650](#650), [#761](#761)
+ Fix some issues, memory leaks and warnings found by MSVC.
  [#666](#666), [#731](#731)
+ Improved solver memory estimates and consistent iteration counts [#691](#691)
+ Various logger improvements and fixes [#728](#728), [#743](#743), [#754](#754)
+ Fix for ForwardIterator requirements in iterator_factory. [#665](#665)
+ Various benchmark fixes. [#647](#647), [#673](#673), [#722](#722)
+ Various CI fixes and improvements. [#642](#642), [#641](#641), [#795](#795), [#783](#783), [#793](#793), [#852](#852)


Related PR: #857
tcojean added a commit that referenced this pull request Aug 23, 2021
Release 1.4.0 to master

Related PR: #866
Labels
1:ST:ready-to-merge This PR is ready to merge. mod:core This is related to the core module. reg:benchmarking This is related to benchmarking. type:matrix-format This is related to the Matrix formats type:preconditioner This is related to the preconditioners type:solver This is related to the solvers
8 participants