Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reverse Cuthill-McKee #500

Merged
merged 35 commits into from
Oct 16, 2020
Merged

Conversation

lksriemer
Copy link
Contributor

@lksriemer lksriemer commented Apr 8, 2020

Updated

This PR implements the reverse Cuthill-McKee reordering. The initial commits were written by @pratikvn.
This PR also implements the unordered reverse Cuthill-McKee algorithm (see here, algorithm 6) for the OpenMP executor.

This is now ready to review. Some things I already thought about while integrating:

  • Is passing vectors by reference, as is sometimes done here, in line with ginkgos coding guidelines? Apparently yes.
  • Is returning vectors by value, as is done here at least once, in line with ginkgos coding guidelines? Apparently yes.
  • Should there maybe be ... tests for the (main) test? It is one complex function after all. Apparently no.

TODO:

  • Rewrite OMP kernel
  • Add proper documentation.
  • Fix the templated tests.
  • Add an example to examples/ to demonstrate usage.

cuda/reorder/rcm_kernels.cu Outdated Show resolved Hide resolved
omp/reorder/rcm_kernels.cpp Outdated Show resolved Hide resolved
omp/reorder/rcm_kernels.cpp Outdated Show resolved Hide resolved
omp/reorder/rcm_kernels.cpp Outdated Show resolved Hide resolved
@lksriemer
Copy link
Contributor Author

I commented on the omp impl, but actually those remarks are more suited for the reference impl. For the omp impl, I'd try implementing something as described here.

@pratikvn pratikvn added mod:core This is related to the core module. 1:ST:do-not-merge Please do not merge PR this yet. is:new-feature A request or implementation of a feature that does not exist yet. mod:openmp This is related to the OpenMP module. mod:reference This is related to the reference module. type:reordering This is related to the matrix(LinOp) reordering 1:ST:WIP This PR is a work in progress. Not ready for review. labels Apr 8, 2020
omp/reorder/rcm_kernels.cpp Outdated Show resolved Hide resolved
@upsj upsj changed the title [DO NOT MERGE] Reverse Cuthill-McKee Reverse Cuthill-McKee Apr 8, 2020
@lksriemer
Copy link
Contributor Author

I will push some changes soon, mainly to give an insight into the current situation.

I probably violated some conventions and made some unidiomatic choices. Please point them out. An in-depth review of the algorithm isn't neccessary yet. Is my handling of gko::Array correct?

I have decided to forward the strategy to the kernel, this leaves some possibilities for optimizations open. However, when implementing the multithreaded version, I might come to the conclusion this isn't neccessary at all, and some structural switching can be done in the common code. While I'm at it: Is it bad style to treat these functions like "pseudo-kernels", as in passing them the executor?

@pratikvn
Copy link
Member

@lksriemer, just an organizational suggestion: To keep a clear history and commit graph, we only have merge commits into our main branches (develop and master). When you have new commits, please rebase your commits upon the main branch. Can you please do that here ? For example, you did a merge and reverted the merge, which makes the history weird, making the review harder.

Copy link
Member

@pratikvn pratikvn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest doing a soft reset of your new commits over the reordering-rcm branch, which should give you the new diff. You can then add your commits in a logical fashion with as many commits as you feel comfortable and then rebase on develop.

include/ginkgo/core/reorder/rcm.hpp Outdated Show resolved Hide resolved
include/ginkgo/core/reorder/rcm.hpp Show resolved Hide resolved
reference/reorder/rcm_kernels.cpp Outdated Show resolved Hide resolved
reference/reorder/rcm_kernels.cpp Outdated Show resolved Hide resolved
@pratikvn
Copy link
Member

Thanks, if you can do the same for the 7cfc3d0 commit (reset and rebase), that would be great.

@lksriemer
Copy link
Contributor Author

Should have thought about that sooner, sorry.

lksriemer and others added 18 commits October 15, 2020 18:09
The added test for the omp rcm_kernel asserts if a perm is in rcm order.
Change formatting, mostly newlines.

Co-authored-by: Pratik Nayak <[email protected]>
The omp reduction initializer interacted badly with the wrapper ctor.
The copy assignment is also explicitly deleted.
Move operators will be implicitly deleted.
Everything following are changes to the rcm implementation.
Remove unneccessary consts from rcm declaration.
Do not store adjacency matrix and degrees as members in the class.
Replace explicit gko::vector with vector.
replace std::memcpy with std::copy_n.
Various documentation improvments.
Reverse description order of return paramters.
Make IndexType explicit instead of auto.
Use std::min_elelemt
Fix various spelling errors, typos, rewordings.
Removes the 'explicit' from the ExectorAllocators rebinding construtor.
Changes occurences of array<bool> to vector<bool>, to save memory.
Minor cleanup
Remove the unnecessary test for the rcm adjacency matrix.
Add a test case for the correct rcm result.
Rewrite the assert_correct_permutation function to comparing with iota.
Move test matrices to test class.
Improve nested vector initialization.
Allocate degrees inside rcm::generate.
Change from goto to immediately evaluated lambda expression.
Refactor loop body into an inline helper function.
Relace some autos.
Reorganize includes to conform to include order.
Replace autos with IndexType where necessary.
Replace size_type with IndexType for num_vertices.
Refactor for the number of levels to be more explicit.
Make perm signal value -1.
Factor out sort_by_degree in a generalized, fast small sorting function.
Make level_processed a constexpr.
In the small_sort make some types explicit.
Wrap omp locks in a omp_mutex, use RAII guards.

Co-authored-by: Tobias Ribizel <[email protected]>
Includes immintrin.h only if explicitly not intel compiler.
omp_init_lock_with_hint is an OpenMP 5.0 feature.
This is not significant from a performance perspective.
Include types.hpp to make CI happy.
Add include guards to omp-only headers.
Remove an unnecessary #undef.
Refactor count_lelvels, including explicit types, simpler control flow.
Move all responsibility for prefix sum out of count_levels.
Make some constants types explicit.
Remove unnecessary includes.
Rewrite three reductions as non-complex omp reductions.
Make some types related to omp functions explicit.
Update to GKO_FACTORY_PARAMETER_SCALAR.
Minor changes, mostly explicit types and formatting.
Make the comparator macro a function.
Introduce builtin cmpxchg support for all supported compilers.
Manually format omp reductions to 80 chars.
Add documenting comment to omp_mutex.
Remove unnecessary undef of macro in cpp file.
Minor changes to formatting, naming, documentation.
Formatting changes, introduce type aliases v_type, i_type to Rcm.
Remove unnecessary break.
Use gko::share instead of releasing unique_ptr.
Remove autos in rcm tests.
Move comparator to detail namespace.
Rewrite omp reductions as manual reductions.
Fix typo.

Co-authored-by: Thomas Grützmacher <[email protected]>
@lksriemer
Copy link
Contributor Author

Alright, this should now be mergeable.

@hartwiganzt
Copy link
Collaborator

Alright, this should now be mergeable.

Please feel free to merge, @lksriemer !

@sonarcloud
Copy link

sonarcloud bot commented Oct 16, 2020

Kudos, SonarCloud Quality Gate passed!

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities (and Security Hotspot 0 Security Hotspots to review)
Code Smell A 69 Code Smells

86.8% 86.8% Coverage
0.0% 0.0% Duplication

warning The version of Java (1.8.0_121) you have used to run this analysis is deprecated and we will stop accepting it from October 2020. Please update to at least Java 11.
Read more here

@pratikvn pratikvn merged commit a0e1050 into ginkgo-project:develop Oct 16, 2020
@lksriemer lksriemer deleted the reordering-rcm-wip branch October 16, 2020 08:14
@pratikvn
Copy link
Member

Thank you @lksriemer for your high quality contribution! We look forward to your future contributions.

@pratikvn pratikvn added 1:ST:ready-to-merge This PR is ready to merge. and removed 1:ST:ready-for-review This PR is ready for review labels Oct 16, 2020
tcojean added a commit that referenced this pull request Aug 20, 2021
Ginkgo release 1.4.0

The Ginkgo team is proud to announce the new Ginkgo minor release 1.4.0. This
release brings most of the Ginkgo functionality to the Intel DPC++ ecosystem
which enables Intel-GPU and CPU execution. The only Ginkgo features which have
not been ported yet are some preconditioners.

Ginkgo's mixed-precision support is greatly enhanced thanks to:
1. The new Accessor concept, which allows writing kernels featuring on-the-fly
memory compression, among other features. The accessor can be used as
header-only, see the [accessor BLAS benchmarks repository](https://github.com/ginkgo-project/accessor-BLAS/tree/develop) as a usage example.
2. All LinOps now transparently support mixed-precision execution. By default,
this is done through a temporary copy which may have a performance impact but
already allows mixed-precision research.

Native mixed-precision ELL kernels are implemented which do not see this cost.
The accessor is also leveraged in a new CB-GMRES solver which allows for
performance improvements by compressing the Krylov basis vectors. Many other
features have been added to Ginkgo, such as reordering support, a new IDR
solver, Incomplete Cholesky preconditioner, matrix assembly support (only CPU
for now), machine topology information, and more!

Supported systems and requirements:
+ For all platforms, cmake 3.13+
+ C++14 compliant compiler
+ Linux and MacOS
  + gcc: 5.3+, 6.3+, 7.3+, all versions after 8.1+
  + clang: 3.9+
  + Intel compiler: 2018+
  + Apple LLVM: 8.0+
  + CUDA module: CUDA 9.0+
  + HIP module: ROCm 3.5+
  + DPC++ module: Intel OneAPI 2021.3. Set the CXX compiler to `dpcpp`.
+ Windows
  + MinGW and Cygwin: gcc 5.3+, 6.3+, 7.3+, all versions after 8.1+
  + Microsoft Visual Studio: VS 2019
  + CUDA module: CUDA 9.0+, Microsoft Visual Studio
  + OpenMP module: MinGW or Cygwin.


Algorithm and important feature additions:
+ Add a new DPC++ Executor for SYCL execution and other base utilities
  [#648](#648), [#661](#661), [#757](#757), [#832](#832)
+ Port matrix formats, solvers and related kernels to DPC++. For some kernels,
  also make use of a shared kernel implementation for all executors (except
  Reference). [#710](#710), [#799](#799), [#779](#779), [#733](#733), [#844](#844), [#843](#843), [#789](#789), [#845](#845), [#849](#849), [#855](#855), [#856](#856)
+ Add accessors which allow multi-precision kernels, among other things.
  [#643](#643), [#708](#708)
+ Add support for mixed precision operations through apply in all LinOps. [#677](#677)
+ Add incomplete Cholesky factorizations and preconditioners as well as some
  improvements to ILU. [#672](#672), [#837](#837), [#846](#846)
+ Add an AMGX implementation and kernels on all devices but DPC++.
  [#528](#528), [#695](#695), [#860](#860)
+ Add a new mixed-precision capability solver, Compressed Basis GMRES
  (CB-GMRES). [#693](#693), [#763](#763)
+ Add the IDR(s) solver. [#620](#620)
+ Add a new fixed-size block CSR matrix format (for the Reference executor).
  [#671](#671), [#730](#730)
+ Add native mixed-precision support to the ELL format. [#717](#717), [#780](#780)
+ Add Reverse Cuthill-McKee reordering [#500](#500), [#649](#649)
+ Add matrix assembly support on CPUs. [#644](#644)
+ Extends ISAI from triangular to general and spd matrices. [#690](#690)

Other additions:
+ Add the possibility to apply real matrices to complex vectors.
  [#655](#655), [#658](#658)
+ Add functions to compute the absolute of a matrix format. [#636](#636)
+ Add symmetric permutation and improve existing permutations.
  [#684](#684), [#657](#657), [#663](#663)
+ Add a MachineTopology class with HWLOC support [#554](#554), [#697](#697)
+ Add an implicit residual norm criterion. [#702](#702), [#818](#818), [#850](#850)
+ Row-major accessor is generalized to more than 2 dimensions and a new
  "block column-major" accessor has been added. [#707](#707)
+ Add an heat equation example. [#698](#698), [#706](#706)
+ Add ccache support in CMake and CI. [#725](#725), [#739](#739)
+ Allow tuning and benchmarking variables non intrusively. [#692](#692)
+ Add triangular solver benchmark [#664](#664)
+ Add benchmarks for BLAS operations [#772](#772), [#829](#829)
+ Add support for different precisions and consistent index types in benchmarks.
  [#675](#675), [#828](#828)
+ Add a Github bot system to facilitate development and PR management.
  [#667](#667), [#674](#674), [#689](#689), [#853](#853)
+ Add Intel (DPC++) CI support and enable CI on HPC systems. [#736](#736), [#751](#751), [#781](#781)
+ Add ssh debugging for Github Actions CI. [#749](#749)
+ Add pipeline segmentation for better CI speed. [#737](#737)


Changes:
+ Add a Scalar Jacobi specialization and kernels. [#808](#808), [#834](#834), [#854](#854)
+ Add implicit residual log for solvers and benchmarks. [#714](#714)
+ Change handling of the conjugate in the dense dot product. [#755](#755)
+ Improved Dense stride handling. [#774](#774)
+ Multiple improvements to the OpenMP kernels performance, including COO,
an exclusive prefix sum, and more. [#703](#703), [#765](#765), [#740](#740)
+ Allow specialization of submatrix and other dense creation functions in solvers. [#718](#718)
+ Improved Identity constructor and treatment of rectangular matrices. [#646](#646)
+ Allow CUDA/HIP executors to select allocation mode. [#758](#758)
+ Check if executors share the same memory. [#670](#670)
+ Improve test install and smoke testing support. [#721](#721)
+ Update the JOSS paper citation and add publications in the documentation.
  [#629](#629), [#724](#724)
+ Improve the version output. [#806](#806)
+ Add some utilities for dim and span. [#821](#821)
+ Improved solver and preconditioner benchmarks. [#660](#660)
+ Improve benchmark timing and output. [#669](#669), [#791](#791), [#801](#801), [#812](#812)


Fixes:
+ Sorting fix for the Jacobi preconditioner. [#659](#659)
+ Also log the first residual norm in CGS [#735](#735)
+ Fix BiCG and HIP CSR to work with complex matrices. [#651](#651)
+ Fix Coo SpMV on strided vectors. [#807](#807)
+ Fix segfault of extract_diagonal, add short-and-fat test. [#769](#769)
+ Fix device_reset issue by moving counter/mutex to device. [#810](#810)
+ Fix `EnableLogging` superclass. [#841](#841)
+ Support ROCm 4.1.x and breaking HIP_PLATFORM changes. [#726](#726)
+ Decreased test size for a few device tests. [#742](#742)
+ Fix multiple issues with our CMake HIP and RPATH setup.
  [#712](#712), [#745](#745), [#709](#709)
+ Cleanup our CMake installation step. [#713](#713)
+ Various simplification and fixes to the Windows CMake setup. [#720](#720), [#785](#785)
+ Simplify third-party integration. [#786](#786)
+ Improve Ginkgo device arch flags management. [#696](#696)
+ Other fixes and improvements to the CMake setup.
  [#685](#685), [#792](#792), [#705](#705), [#836](#836)
+ Clarification of dense norm documentation [#784](#784)
+ Various development tools fixes and improvements [#738](#738), [#830](#830), [#840](#840)
+ Make multiple operators/constructors explicit. [#650](#650), [#761](#761)
+ Fix some issues, memory leaks and warnings found by MSVC.
  [#666](#666), [#731](#731)
+ Improved solver memory estimates and consistent iteration counts [#691](#691)
+ Various logger improvements and fixes [#728](#728), [#743](#743), [#754](#754)
+ Fix for ForwardIterator requirements in iterator_factory. [#665](#665)
+ Various benchmark fixes. [#647](#647), [#673](#673), [#722](#722)
+ Various CI fixes and improvements. [#642](#642), [#641](#641), [#795](#795), [#783](#783), [#793](#793), [#852](#852)


Related PR: #857
tcojean pushed a commit that referenced this pull request Aug 20, 2021
See also the discussion under #500.
tcojean added a commit that referenced this pull request Aug 23, 2021
Release 1.4.0 to master

The Ginkgo team is proud to announce the new Ginkgo minor release 1.4.0. This
release brings most of the Ginkgo functionality to the Intel DPC++ ecosystem
which enables Intel-GPU and CPU execution. The only Ginkgo features which have
not been ported yet are some preconditioners.

Ginkgo's mixed-precision support is greatly enhanced thanks to:
1. The new Accessor concept, which allows writing kernels featuring on-the-fly
memory compression, among other features. The accessor can be used as
header-only, see the [accessor BLAS benchmarks repository](https://github.com/ginkgo-project/accessor-BLAS/tree/develop) as a usage example.
2. All LinOps now transparently support mixed-precision execution. By default,
this is done through a temporary copy which may have a performance impact but
already allows mixed-precision research.

Native mixed-precision ELL kernels are implemented which do not see this cost.
The accessor is also leveraged in a new CB-GMRES solver which allows for
performance improvements by compressing the Krylov basis vectors. Many other
features have been added to Ginkgo, such as reordering support, a new IDR
solver, Incomplete Cholesky preconditioner, matrix assembly support (only CPU
for now), machine topology information, and more!

Supported systems and requirements:
+ For all platforms, cmake 3.13+
+ C++14 compliant compiler
+ Linux and MacOS
  + gcc: 5.3+, 6.3+, 7.3+, all versions after 8.1+
  + clang: 3.9+
  + Intel compiler: 2018+
  + Apple LLVM: 8.0+
  + CUDA module: CUDA 9.0+
  + HIP module: ROCm 3.5+
  + DPC++ module: Intel OneAPI 2021.3. Set the CXX compiler to `dpcpp`.
+ Windows
  + MinGW and Cygwin: gcc 5.3+, 6.3+, 7.3+, all versions after 8.1+
  + Microsoft Visual Studio: VS 2019
  + CUDA module: CUDA 9.0+, Microsoft Visual Studio
  + OpenMP module: MinGW or Cygwin.


Algorithm and important feature additions:
+ Add a new DPC++ Executor for SYCL execution and other base utilities
  [#648](#648), [#661](#661), [#757](#757), [#832](#832)
+ Port matrix formats, solvers and related kernels to DPC++. For some kernels,
  also make use of a shared kernel implementation for all executors (except
  Reference). [#710](#710), [#799](#799), [#779](#779), [#733](#733), [#844](#844), [#843](#843), [#789](#789), [#845](#845), [#849](#849), [#855](#855), [#856](#856)
+ Add accessors which allow multi-precision kernels, among other things.
  [#643](#643), [#708](#708)
+ Add support for mixed precision operations through apply in all LinOps. [#677](#677)
+ Add incomplete Cholesky factorizations and preconditioners as well as some
  improvements to ILU. [#672](#672), [#837](#837), [#846](#846)
+ Add an AMGX implementation and kernels on all devices but DPC++.
  [#528](#528), [#695](#695), [#860](#860)
+ Add a new mixed-precision capability solver, Compressed Basis GMRES
  (CB-GMRES). [#693](#693), [#763](#763)
+ Add the IDR(s) solver. [#620](#620)
+ Add a new fixed-size block CSR matrix format (for the Reference executor).
  [#671](#671), [#730](#730)
+ Add native mixed-precision support to the ELL format. [#717](#717), [#780](#780)
+ Add Reverse Cuthill-McKee reordering [#500](#500), [#649](#649)
+ Add matrix assembly support on CPUs. [#644](#644)
+ Extends ISAI from triangular to general and spd matrices. [#690](#690)

Other additions:
+ Add the possibility to apply real matrices to complex vectors.
  [#655](#655), [#658](#658)
+ Add functions to compute the absolute of a matrix format. [#636](#636)
+ Add symmetric permutation and improve existing permutations.
  [#684](#684), [#657](#657), [#663](#663)
+ Add a MachineTopology class with HWLOC support [#554](#554), [#697](#697)
+ Add an implicit residual norm criterion. [#702](#702), [#818](#818), [#850](#850)
+ Row-major accessor is generalized to more than 2 dimensions and a new
  "block column-major" accessor has been added. [#707](#707)
+ Add an heat equation example. [#698](#698), [#706](#706)
+ Add ccache support in CMake and CI. [#725](#725), [#739](#739)
+ Allow tuning and benchmarking variables non intrusively. [#692](#692)
+ Add triangular solver benchmark [#664](#664)
+ Add benchmarks for BLAS operations [#772](#772), [#829](#829)
+ Add support for different precisions and consistent index types in benchmarks.
  [#675](#675), [#828](#828)
+ Add a Github bot system to facilitate development and PR management.
  [#667](#667), [#674](#674), [#689](#689), [#853](#853)
+ Add Intel (DPC++) CI support and enable CI on HPC systems. [#736](#736), [#751](#751), [#781](#781)
+ Add ssh debugging for Github Actions CI. [#749](#749)
+ Add pipeline segmentation for better CI speed. [#737](#737)


Changes:
+ Add a Scalar Jacobi specialization and kernels. [#808](#808), [#834](#834), [#854](#854)
+ Add implicit residual log for solvers and benchmarks. [#714](#714)
+ Change handling of the conjugate in the dense dot product. [#755](#755)
+ Improved Dense stride handling. [#774](#774)
+ Multiple improvements to the OpenMP kernels performance, including COO,
an exclusive prefix sum, and more. [#703](#703), [#765](#765), [#740](#740)
+ Allow specialization of submatrix and other dense creation functions in solvers. [#718](#718)
+ Improved Identity constructor and treatment of rectangular matrices. [#646](#646)
+ Allow CUDA/HIP executors to select allocation mode. [#758](#758)
+ Check if executors share the same memory. [#670](#670)
+ Improve test install and smoke testing support. [#721](#721)
+ Update the JOSS paper citation and add publications in the documentation.
  [#629](#629), [#724](#724)
+ Improve the version output. [#806](#806)
+ Add some utilities for dim and span. [#821](#821)
+ Improved solver and preconditioner benchmarks. [#660](#660)
+ Improve benchmark timing and output. [#669](#669), [#791](#791), [#801](#801), [#812](#812)


Fixes:
+ Sorting fix for the Jacobi preconditioner. [#659](#659)
+ Also log the first residual norm in CGS [#735](#735)
+ Fix BiCG and HIP CSR to work with complex matrices. [#651](#651)
+ Fix Coo SpMV on strided vectors. [#807](#807)
+ Fix segfault of extract_diagonal, add short-and-fat test. [#769](#769)
+ Fix device_reset issue by moving counter/mutex to device. [#810](#810)
+ Fix `EnableLogging` superclass. [#841](#841)
+ Support ROCm 4.1.x and breaking HIP_PLATFORM changes. [#726](#726)
+ Decreased test size for a few device tests. [#742](#742)
+ Fix multiple issues with our CMake HIP and RPATH setup.
  [#712](#712), [#745](#745), [#709](#709)
+ Cleanup our CMake installation step. [#713](#713)
+ Various simplification and fixes to the Windows CMake setup. [#720](#720), [#785](#785)
+ Simplify third-party integration. [#786](#786)
+ Improve Ginkgo device arch flags management. [#696](#696)
+ Other fixes and improvements to the CMake setup.
  [#685](#685), [#792](#792), [#705](#705), [#836](#836)
+ Clarification of dense norm documentation [#784](#784)
+ Various development tools fixes and improvements [#738](#738), [#830](#830), [#840](#840)
+ Make multiple operators/constructors explicit. [#650](#650), [#761](#761)
+ Fix some issues, memory leaks and warnings found by MSVC.
  [#666](#666), [#731](#731)
+ Improved solver memory estimates and consistent iteration counts [#691](#691)
+ Various logger improvements and fixes [#728](#728), [#743](#743), [#754](#754)
+ Fix for ForwardIterator requirements in iterator_factory. [#665](#665)
+ Various benchmark fixes. [#647](#647), [#673](#673), [#722](#722)
+ Various CI fixes and improvements. [#642](#642), [#641](#641), [#795](#795), [#783](#783), [#793](#793), [#852](#852)

Related PR: #866
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1:ST:ready-to-merge This PR is ready to merge. is:new-feature A request or implementation of a feature that does not exist yet. mod:core This is related to the core module. mod:openmp This is related to the OpenMP module. mod:reference This is related to the reference module. type:reordering This is related to the matrix(LinOp) reordering
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants