Add a batch::Bicgstab solver class, core, ref and omp kernels #1438

Merged: pratikvn merged 36 commits into develop from batch-bicgstab on Nov 1, 2023

pratikvn
Member

This PR adds a batch::Bicgstab solver and only the reference kernels for now. Another PR will be created to add the cuda, hip and dpcpp kernels to avoid making this PR too large.

In addition, a general solver, stopping criteria, logger and preconditioner framework is also added. These parts are fairly simple, and I think it helps to review them in the context of the solver itself.

  1. Batch stopping criteria
  2. Simple batch logger
  3. Some batch matrix generation utilities
  4. A basic BatchIdentity matrix class and a corresponding Identity preconditioner to enable unpreconditioned solves.
  5. The batch dispatch mechanism that selects the correct matrix, solver, preconditioner and stopping criteria at runtime and dispatches the correct kernel on the device.
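
The dispatch mechanism in item 5 can be pictured with a minimal sketch: runtime tags are mapped, one `switch` at a time, onto template parameters so that the kernel itself is instantiated with fully known types. All names below (`precond_type`, `stop_type`, `run_kernel`, `dispatch`) are hypothetical and chosen for illustration only; they are not the names used in this PR.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical runtime tags; these are NOT Ginkgo types.
enum class precond_type { identity, jacobi };
enum class stop_type { absolute, relative };

// Hypothetical compile-time policies selected by the tags above.
struct Identity { static const char* name() { return "identity"; } };
struct Jacobi { static const char* name() { return "jacobi"; } };
struct AbsStop { static const char* name() { return "absolute"; } };
struct RelStop { static const char* name() { return "relative"; } };

// Stand-in for the templated kernel launch: every type is fixed at
// compile time, so nothing is decided at runtime inside the kernel.
template <typename Prec, typename Stop>
std::string run_kernel()
{
    return std::string(Prec::name()) + "+" + Stop::name();
}

// Second dispatch level: the preconditioner type is already known.
template <typename Prec>
std::string dispatch_on_stop(stop_type s)
{
    switch (s) {
    case stop_type::absolute:
        return run_kernel<Prec, AbsStop>();
    case stop_type::relative:
        return run_kernel<Prec, RelStop>();
    }
    throw std::invalid_argument("unknown stopping criterion");
}

// First dispatch level: turn the runtime preconditioner tag into a type.
std::string dispatch(precond_type p, stop_type s)
{
    switch (p) {
    case precond_type::identity:
        return dispatch_on_stop<Identity>(s);
    case precond_type::jacobi:
        return dispatch_on_stop<Jacobi>(s);
    }
    throw std::invalid_argument("unknown preconditioner");
}
```

Each additional runtime choice (matrix format, logger) would add one more nesting level in the same style, which is why the number of instantiated kernels grows multiplicatively.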

@pratikvn pratikvn added 1:ST:WIP This PR is a work in progress. Not ready for review. type:batched-functionality This is related to the batched functionality in Ginkgo labels Oct 21, 2023
@pratikvn pratikvn added this to the Release 1.7.0 milestone Oct 21, 2023
@pratikvn pratikvn self-assigned this Oct 21, 2023
@ginkgo-bot ginkgo-bot added reg:build This is related to the build system. reg:testing This is related to testing. type:solver This is related to the solvers type:preconditioner This is related to the preconditioners type:matrix-format This is related to the Matrix formats type:stopping-criteria This is related to the stopping criteria mod:all This touches all Ginkgo modules. labels Oct 21, 2023
@pratikvn pratikvn force-pushed the batch-bicgstab branch 2 times, most recently from 25a894a to 26472b9 Compare October 23, 2023 05:36
Member

@MarcelKoch MarcelKoch left a comment

I think we can use our unified kernels approach for some of these parts. In particular, the logger and stopping criteria don't use any backend specific stuff, except for some function attributes. Those could also be handled uniformly through macros, which we already have.

I think even the identity preconditioner could be handled this way, although that would require some adjustments to our unified kernels, so I think we should postpone that.

@MarcelKoch MarcelKoch self-requested a review October 23, 2023 09:13
Member

@yhmtsai yhmtsai left a comment

first part of my review



/**
* Logs the final residual and iteration count for a batch solver.
Member

Suggested change
- * Logs the final residual and iteration count for a batch solver.
+ * Logs the final actual residual norm and iteration count for a batch solver.

It is for actual residual not implicit residual, right?

Member Author

That depends on the solver, so I would not specify that here.

Member

Is it also applied to the criterion?
If it is, it gives unexpected convergence behavior. Sometimes the user gets a residual that is indeed below the requirement (actual residual), but sometimes gets a higher residual reported as converged, because the check depends on the implicit one.

Member Author

Yes, criterion checks are also always with whatever residual the solver provides.

Member Author

Maybe I should clarify that we always check against the implicit residual within the solvers. In some cases, the implicit residual and the actual residual may be the same, but that depends on the solver.
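
To make the distinction concrete, here is a minimal sketch of how the actual residual norm could be recomputed outside the solver and compared with the value the solver logged. It assumes a dense row-major matrix and uses the hypothetical helper name `true_residual_norm`; it is not code from this PR. The implicit residual is whatever the solver's recurrence carries along, so it can only be obtained from the solver or its logger.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Euclidean norm of the explicitly recomputed residual b - A*x.
// `a` is a dense n x n matrix in row-major order (an assumption made
// for this sketch; batched formats store many such systems).
double true_residual_norm(const std::vector<double>& a,
                          const std::vector<double>& b,
                          const std::vector<double>& x)
{
    const std::size_t n = b.size();
    assert(a.size() == n * n && x.size() == n);
    double sq = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        double r = b[i];
        for (std::size_t j = 0; j < n; ++j) {
            r -= a[i * n + j] * x[j];
        }
        sq += r * r;
    }
    return std::sqrt(sq);
}
```

A test that wants a guarantee on the actual residual would call a helper like this on the returned solution, rather than relying on the logged norm alone.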

Member

@MarcelKoch MarcelKoch left a comment

I think the code can use some of the new core developments. For example, the factory parameter can be unified, or maybe the workspace can be extended to also cover the batched case. But some of those changes (e.g. the workspace) could be done at a later time. So for now I'm focusing on the interface to allow for these changes.
Part 1/n

@MarcelKoch MarcelKoch self-requested a review October 24, 2023 08:04
Member

@MarcelKoch MarcelKoch left a comment

Part 2/n, mostly done with the interface and core stuff (except the test helpers). I think especially on the logger side there are some inconsistencies that I would like to see addressed.

Member

@yhmtsai yhmtsai left a comment

second part

* Sets the input and generates the identity preconditioner.(Nothing needs
* to be actually generated.)
*/
void generate(size_type,
Member

does batch_identity need to be preconditioner?
batch_identity will be passed through the generated_preconditioner or the default preconditioner, right?

Member Author

Essentially, the solver will always have prec.generate(...) and prec_apply(...) calls. As it is templated, in the default case, we need to have the identity preconditioner.
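
A minimal sketch of why a do-nothing preconditioner type is needed in a templated solver: the solver body calls `generate` and `apply` unconditionally, so the unpreconditioned case must still supply a type with those members. The struct names, the signatures and the `precondition` helper below are hypothetical and simplified to a single dense system; they are not the interfaces introduced in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Identity preconditioner: nothing to set up, apply is a plain copy.
struct IdentityPrec {
    void generate(std::size_t /*n*/, const std::vector<double>& /*a*/) {}
    void apply(const std::vector<double>& r, std::vector<double>& z) const
    {
        z = r;
    }
};

// Scalar Jacobi, shown only for contrast: stores the inverted diagonal.
struct JacobiPrec {
    std::vector<double> inv_diag;
    void generate(std::size_t n, const std::vector<double>& a)
    {
        inv_diag.resize(n);
        for (std::size_t i = 0; i < n; ++i) {
            inv_diag[i] = 1.0 / a[i * n + i];
        }
    }
    void apply(const std::vector<double>& r, std::vector<double>& z) const
    {
        for (std::size_t i = 0; i < r.size(); ++i) {
            z[i] = inv_diag[i] * r[i];
        }
    }
};

// Stand-in for the part of a templated solver that touches the
// preconditioner: the same two calls are made for every Prec type.
template <typename Prec>
std::vector<double> precondition(Prec prec, std::size_t n,
                                 const std::vector<double>& a,
                                 const std::vector<double>& r)
{
    std::vector<double> z(n);
    prec.generate(n, a);
    prec.apply(r, z);
    return z;
}
```

Because the choice is a template parameter, the identity case compiles down to a copy with no branching inside the solver loop.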

Comment on lines +266 to +262
initialize(A_entry, b_entry, gko::batch::to_const(x_entry), rho_old_entry,
omega_entry, alpha_entry, r_entry, r_hat_entry, p_entry,
p_hat_entry, v_entry, rhs_norms_entry, res_norms_entry);
Member

The function call is slightly different from core/solver/bicgstab. Is there any benefit to merging b - Ax and r_hat = r into initialize? Keeping them similar to core might make reviewing easier.

Member

I withdraw my comment, because unlike core, the other kernels can already fuse the dot product into this call.
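
As an illustration of the fusion being discussed, the sketch below performs the standard BiCGSTAB setup (r = b - Ax, r_hat = r, p = v = 0, scalars set to 1) together with both norm reductions in a single pass over the rows of one dense system. The `BicgstabState` struct and this `initialize` signature are hypothetical simplifications, not the kernel from this PR.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical per-system workspace; field names follow the call
// quoted in the review, but the layout is an assumption.
struct BicgstabState {
    std::vector<double> r, r_hat, p, v;
    double rho_old, omega, alpha;
    double rhs_norm, res_norm;
};

// One fused pass: r = b - A*x, r_hat = r, p = v = 0, plus ||b|| and ||r||.
// `a` is a dense n x n row-major matrix (assumption for this sketch).
BicgstabState initialize(std::size_t n, const std::vector<double>& a,
                         const std::vector<double>& b,
                         const std::vector<double>& x)
{
    assert(a.size() == n * n && b.size() == n && x.size() == n);
    BicgstabState s{std::vector<double>(n),      std::vector<double>(n),
                    std::vector<double>(n, 0.0), std::vector<double>(n, 0.0),
                    1.0, 1.0, 1.0, 0.0, 0.0};
    double rhs_sq = 0.0;
    double res_sq = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        double ri = b[i];
        for (std::size_t j = 0; j < n; ++j) {
            ri -= a[i * n + j] * x[j];
        }
        s.r[i] = ri;
        s.r_hat[i] = ri;
        rhs_sq += b[i] * b[i];  // reduction for ||b||
        res_sq += ri * ri;      // reduction for ||r||
    }
    s.rhs_norm = std::sqrt(rhs_sq);
    s.res_norm = std::sqrt(res_sq);
    return s;
}
```

On a GPU the same idea avoids separate kernel launches and extra passes over memory for the residual, the copy and the norms.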


template <typename StopType, typename PrecType, typename LogType,
typename BatchMatrixType, typename ValueType>
inline void batch_entry_bicgstab_impl(
Member

I also think the core part can be shared among backends, but I am not focusing on that now.
I assume the kernel is fused from the GPU perspective.

Member Author

Yes, I think we can think about unifying this later.

Member

@MarcelKoch MarcelKoch left a comment

Part 3/3. This mostly concerns the reference/omp kernels and tests. There are only a few notes on the kernels (besides moving parts into common/unified). I think some easy generalizations are possible in the test helpers.

for (size_t i = 0; i < this->num_batch_items; i++) {
ASSERT_LE(res_log_array[i] / this->linear_system.rhs_norm->at(i, 0, 0),
this->solver_settings.residual_tol);
ASSERT_NEAR(res_log_array[i], res.res_norm->get_const_values()[i],
Member

I'm not sure that this is a helpful test. IMO it would be better to compare the solver result to the true solution, or just leave it out. The test above might already be sufficient.

Member

also, it should be equal not near, I think?

Comment on lines +165 to +168
auto iter_array = res.log_data->iter_counts.get_const_data();
for (size_t i = 0; i < num_batch_items; i++) {
ASSERT_EQ(iter_array[i], ref_iters);
}
Member

Does this make the linear system unsolvable? Otherwise, the iteration count might be less than ref_iters.

Member Author

Yes, the tolerance of 0 is not achievable, so it should always hit ref_iters.

Member

Using nan is maybe more general, and it would also fit if we decide to use <= instead of <.

Member Author

Will that work on device as well?

Member

Yes, I think so. It should work if the compiler does not use fast math.

Member Author

In this case, it is still not possible to achieve a tolerance of 0, so I think nan is not necessary.
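
The trade-off in this thread can be sketched with two hypothetical convergence predicates (the names `converged_strict` and `converged_inclusive` are made up for illustration). With a tolerance of 0, a strict `<` check can never succeed, while a `<=` check would succeed only on an exactly zero residual. With a NaN tolerance, both comparisons are false under IEEE 754 semantics, which is why the behaviour depends on the compiler not using fast math.

```cpp
#include <cassert>
#include <limits>

// Hypothetical strict check: converged only if the norm is below tol.
inline bool converged_strict(double res_norm, double tol)
{
    return res_norm < tol;
}

// Hypothetical inclusive check: an exactly matching norm also converges.
inline bool converged_inclusive(double res_norm, double tol)
{
    return res_norm <= tol;
}
```

Under these assumptions, a tolerance of 0 forces the iteration limit to be hit for the strict check, and a NaN tolerance would do so for either comparison.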

auto comp_res_norm =
exec->copy_val_to_host(res.res_norm->get_const_values() + i);
ASSERT_LE(iter_counts->get_const_data()[i], max_iters);
EXPECT_LE(res_norm->get_const_data()[i], comp_tol);
Member

Why does this check need to use 100 * tol instead of tol if the criterion is the absolute residual norm?

Member Author

I think there were issues only on some systems, particularly MSVC. Not sure why.

Member

It might be related to the optimization or to different random input?
The code confuses me about the criterion.
My first thought was that it is an actual residual norm check. That's why I do not think it makes sense for the residual norm not to match the required criterion.

Member Author

I think this code is a bit stale and has since been updated, so it should be correct now. In the updated code, comp_res_norm is the actual residual, while res_norm is the residual from the logger, which in this case is the implicit residual.

Comment on lines 259 to 260
EXPECT_LE(rel_res_norm, res_norm.get_const_data()[i]);
ASSERT_LE(rel_res_norm, tol * 10);
Member

Suggested change
- EXPECT_LE(rel_res_norm, res_norm.get_const_data()[i]);
- ASSERT_LE(rel_res_norm, tol * 10);
+ EXPECT_EQ(rel_res_norm, res_norm.get_const_data()[i]);
+ ASSERT_LE(rel_res_norm, tol);


GKO_ASSERT_BATCH_MTX_NEAR(res.x, linear_system.exact_sol, tol * 50);
for (size_t i = 0; i < num_batch_items; i++) {
ASSERT_LE(res.res_norm->get_const_values()[i], tol * 50);
Member

Suggested change
- ASSERT_LE(res.res_norm->get_const_values()[i], tol * 50);
+ ASSERT_LE(res.res_norm->get_const_values()[i], tol);

Member Author

Both MSVC and NVHPC seem to have issues with even 50.

@MarcelKoch
Member

@pratikvn Do you mind holding off on the rebasing until all reviews are done (unless necessary)? GitHub can't keep track of the new changes otherwise (and VS Code also seems unable to do so).

@pratikvn pratikvn force-pushed the batch-bicgstab branch 2 times, most recently from 82712a3 to e17e58d Compare October 25, 2023 22:54
@pratikvn
Member Author

pratikvn commented Oct 25, 2023

@yhmtsai, the tolerance issue is the same one we have had in other places. Some compilers always seem to need higher tolerance values, so the values of 50, 10 and 100 have been set empirically.

@pratikvn pratikvn added 1:ST:ready-for-review This PR is ready for review and removed 1:ST:WIP This PR is a work in progress. Not ready for review. labels Oct 25, 2023
@pratikvn
Member Author

pratikvn commented Nov 1, 2023

As the discussion of the experimental namespace is independent of this PR and this PR has been reviewed, I will go ahead and merge this now to simplify the other batch PR as our CI seems to be stuck.

@pratikvn pratikvn merged commit 3d8dc38 into develop Nov 1, 2023
10 of 15 checks passed
Batched Ginkgo automation moved this from In progress to Completed Nov 1, 2023
@pratikvn pratikvn deleted the batch-bicgstab branch November 1, 2023 09:06
@tcojean tcojean mentioned this pull request Nov 6, 2023
tcojean added a commit that referenced this pull request Nov 10, 2023
Release 1.7.0 to master

The Ginkgo team is proud to announce the new Ginkgo minor release 1.7.0. This release brings new features such as:
- Complete GPU-resident sparse direct solvers feature set and interfaces,
- Improved Cholesky factorization performance,
- A new MC64 reordering,
- Batched iterative solver support with the BiCGSTAB solver with batched Dense and ELL matrix types,
- MPI support for the SYCL backend,
- Improved ParILU(T)/ParIC(T) preconditioner convergence,
and more!

If you face an issue, please first check our [known issues page](https://github.com/ginkgo-project/ginkgo/wiki/Known-Issues) and the [open issues list](https://github.com/ginkgo-project/ginkgo/issues) and if you do not find a solution, feel free to [open a new issue](https://github.com/ginkgo-project/ginkgo/issues/new/choose) or ask a question using the [github discussions](https://github.com/ginkgo-project/ginkgo/discussions).

Supported systems and requirements:
+ For all platforms, CMake 3.16+
+ C++14 compliant compiler
+ Linux and macOS
  + GCC: 5.5+
  + clang: 3.9+
  + Intel compiler: 2019+
  + Apple Clang: 14.0 is tested. Earlier versions might also work.
  + NVHPC: 22.7+
  + Cray Compiler: 14.0.1+
  + CUDA module: CMake 3.18+, and CUDA 10.1+ or NVHPC 22.7+
  + HIP module: ROCm 4.5+
  + DPC++ module: Intel oneAPI 2022.1+ with oneMKL and oneDPL. Set the CXX compiler to `dpcpp` or `icpx`.
  + MPI: standard version 3.1+, ideally GPU Aware, for best performance
+ Windows
  + MinGW: GCC 5.5+
  + Microsoft Visual Studio: VS 2019+
  + CUDA module: CUDA 10.1+, Microsoft Visual Studio
  + OpenMP module: MinGW.

### Version support changes

+ CUDA 9.2 is no longer supported and 10.0 is untested [#1382](#1382)
+ Ginkgo now requires CMake version 3.16 (and 3.18 for CUDA) [#1368](#1368)

### Interface changes

+ `const` Factory parameters can no longer be modified through `with_*` functions, as this breaks const-correctness [#1336](#1336) [#1439](#1439)

### New Deprecations

+ The `device_reset` parameter of CUDA and HIP executors no longer has an effect, and its `allocation_mode` parameters have been deprecated in favor of the `Allocator` interface. [#1315](#1315)
+ The CMake parameter `GINKGO_BUILD_DPCPP` has been deprecated in favor of `GINKGO_BUILD_SYCL`. [#1350](#1350)
+ The `gko::reorder::Rcm` interface has been deprecated in favor of `gko::experimental::reorder::Rcm` based on `Permutation`. [#1418](#1418)
+ The Permutation class' `permute_mask` functionality. [#1415](#1415)
+ Multiple functions with typos (`set_complex_subpsace()`, range functions such as `conj_operaton` etc). [#1348](#1348)

### Summary of previous deprecations
+ `gko::lend()` is not necessary anymore.
+ The classes `RelativeResidualNorm` and `AbsoluteResidualNorm` are deprecated in favor of `ResidualNorm`.
+ The class `AmgxPgm` is deprecated in favor of `Pgm`.
+ Default constructors for the CSR `load_balance` and `automatical` strategies
+ The PolymorphicObject's move-semantic `copy_from` variant
+ The templated `SolverBase` class.
+ The class `MachineTopology` is deprecated in favor of `machine_topology`.
+ Logger constructors and create functions with the `executor` parameter.
+ The virtual, protected, Dense functions `compute_norm1_impl`, `add_scaled_impl`, etc.
+ Logger events for solvers and criterion without the additional `implicit_tau_sq` parameter.
+ The global `gko::solver::default_krylov_dim`, use instead `gko::solver::gmres_default_krylov_dim`.

### Added features

+ Adds a batch::BatchLinOp class that forms a base class for batched linear operators such as batched matrix formats, solvers and preconditioners [#1379](#1379)
+ Adds a batch::MultiVector class that enables operations such as dot, norm, scale on batched vectors [#1371](#1371)
+ Adds a batch::Dense matrix format that stores batched dense matrices and provides gemv operations for these dense matrices. [#1413](#1413)
+ Adds a batch::Ell matrix format that stores batched Ell matrices and provides spmv operations for these batched Ell matrices. [#1416](#1416) [#1437](#1437)
+ Add a batch::Bicgstab solver (class, core, and reference kernels) that enables iterative solution of batched linear systems [#1438](#1438).
+ Add device kernels (CUDA, HIP, and DPCPP) for batch::Bicgstab solver. [#1443](#1443).
+ New MC64 reordering algorithm which optimizes the diagonal product or sum of a matrix by permuting the rows, and computes additional scaling factors for equilibration [#1120](#1120)
+ New interface for (non-symmetric) permutation and scaled permutation of Dense and Csr matrices [#1415](#1415)
+ LU and Cholesky Factorizations can now be separated into their factors [#1432](#1432)
+ New symbolic LU factorization algorithm that is optimized for matrices with an almost-symmetric sparsity pattern [#1445](#1445)
+ Sorting kernels for SparsityCsr on all backends [#1343](#1343)
+ Allow passing pre-generated local solver as factory parameter for the distributed Schwarz preconditioner [#1426](#1426)
+ Add DPCPP kernels for Partition [#1034](#1034), and CSR's `check_diagonal_entries` and `add_scaled_identity` functionality [#1436](#1436)
+ Adds a helper function to create a partition based on either local sizes, or local ranges [#1227](#1227)
+ Add function to compute arithmetic mean of dense and distributed vectors [#1275](#1275)
+ Adds `icpx` compiler support [#1350](#1350)
+ All backends can be built simultaneously [#1333](#1333)
+ Emits a CMake warning in downstream projects that use different compilers than the installed Ginkgo [#1372](#1372)
+ Reordering algorithms in sparse_blas benchmark [#1354](#1354)
+ Benchmarks gained an `-allocator` parameter to specify device allocators [#1385](#1385)
+ Benchmarks gained an `-input_matrix` parameter that initializes the input JSON based on the filename [#1387](#1387)
+ Benchmark inputs can now be reordered as a preprocessing step [#1408](#1408)


### Improvements

+ Significantly improve Cholesky factorization performance [#1366](#1366)
+ Improve parallel build performance [#1378](#1378)
+ Allow constrained parallel test execution using CTest resources [#1373](#1373)
+ Use arithmetic type more inside mixed precision ELL [#1414](#1414)
+ Most factory parameters of factory type no longer need to be constructed explicitly via `.on(exec)` [#1336](#1336) [#1439](#1439)
+ Improve ParILU(T)/ParIC(T) convergence by using more appropriate atomic operations [#1434](#1434)

### Fixes

+ Fix an over-allocation for OpenMP reductions [#1369](#1369)
+ Fix DPCPP's common-kernel reduction for empty input sizes [#1362](#1362)
+ Fix several typos in the API and documentation [#1348](#1348)
+ Fix inconsistent `Threads` between generations [#1388](#1388)
+ Fix benchmark median condition [#1398](#1398)
+ Fix HIP 5.6.0 compilation [#1411](#1411)
+ Fix missing destruction of rand_generator from cuda/hip [#1417](#1417)
+ Fix PAPI logger destruction order [#1419](#1419)
+ Fix TAU logger compilation [#1422](#1422)
+ Fix relative criterion to not iterate if the residual is already zero [#1079](#1079)
+ Fix memory_order invocations with C++20 changes [#1402](#1402)
+ Fix `check_diagonal_entries_exist` to report correctly when the only missing diagonal values are in the last rows. [#1440](#1440)
+ Fix checking OpenMPI version in cross-compilation settings [#1446](#1446)
+ Fix false-positive deprecation warnings in Ginkgo, especially for the old Rcm (it doesn't emit deprecation warnings anymore as a result but is still considered deprecated) [#1444](#1444)


### Related PR: #1451
tcojean added a commit that referenced this pull request Nov 10, 2023
Release 1.7.0 to develop

### Related PR: #1454
Labels
1:ST:ready-to-merge This PR is ready to merge. mod:all This touches all Ginkgo modules. reg:build This is related to the build system. reg:testing This is related to testing. type:batched-functionality This is related to the batched functionality in Ginkgo type:matrix-format This is related to the Matrix formats type:preconditioner This is related to the preconditioners type:solver This is related to the solvers type:stopping-criteria This is related to the stopping criteria

6 participants