
Replace ginkgo common kernels by Kokkos #12

Closed
wants to merge 20 commits
Conversation

davidscn
Owner

@davidscn davidscn commented Sep 13, 2023

Main changes of this PR

These changes are compatible with the https://github.com/ginkgo-project/ginkgo/tree/kokkos-extension branch (ginkgo-project/ginkgo#1358) and a sufficiently recent Kokkos version; I used version 4.1.

David Schneider added 3 commits June 6, 2024 10:05
Preventing error messages of the form:
error: An extended __host__ __device__ lambda cannot be defined inside a generic lambda expression("operator()").
@davidscn
Owner Author

davidscn commented Jun 6, 2024

My setup:

  • Ginkgo at develop
  • Kokkos at develop
  • rebased preCICE (more or less) on v3.1.1
  • GCC 10.2
  • CUDA/NVCC 11.4.3
  • NVIDIA A100

Current status:
Compiling works now. I had to inline the lambdas, since the CUDA compiler doesn't allow nesting an extended lambda inside a generic lambda (error: An extended __host__ __device__ lambda cannot be defined inside a generic lambda expression("operator()")).

The most fundamental thing that doesn't work in the described setup is the runtime flexibility. After compiling, the runtime check

if (std::dynamic_pointer_cast<const gko::ReferenceExecutor>(exec) ||
    std::dynamic_pointer_cast<const gko::OmpExecutor>(exec)) {

always evaluates to false, even when I use the reference executor. The test suite runs all available Ginkgo configurations. If I permute the indices for column-major access, the reference executor tests pass and work nicely (they would also pass if the dynamic-cast branching worked).

However, the CUDA executor tests unfortunately still fail. This puzzles me a bit at the moment, as the executed kernel (a) should take the correct if branch and (b) should be identical to the reference executor's.

I ran the following test

mpirun -np 4 ./testprecice --run_test=MappingTests/GinkgoRadialBasisFunctionSolver/cuSolver/MapCompactPolynomialC0

and printed the RBF system matrix as well as the polynomial matrix, using the following code snippet:

  {
    auto tmp = gko::clone(_hostExecutor, _rbfSystemMatrix);
    std::cout << "System matrix" << std::endl;
    // get_size() returns unsigned dimensions, so loop with gko::size_type
    for (gko::size_type i = 0; i < tmp->get_size()[0]; ++i) {
      for (gko::size_type j = 0; j < tmp->get_size()[1]; ++j) {
        std::cout << tmp->at(i, j) << "   ";
      }
      std::cout << std::endl;
    }
  }

Comparing against the very first test this suite executes, the polynomial matrix seems to be correct. For the kernel matrix, I get the following. (Edit: see the last commit, CUDA works as well with the correct permutation.)

matrix Q
0   0   1   
1   0   1   
1   1   1   
0   1   1   
matrix V
0   0   1   
System matrix
1   0.0277778   0.0277778   0.0277778   
0.0277778   1   1   1   
0.0277778   1   1   1   
0.0277778   1   1   1   
Output vertices
0   
0   
Evaluation matrix
0.0277778   0.0277778   0.0277778   0.0277778   
~/precice/src/mapping/tests/RadialBasisFctHelper.hpp(88): error: in "MappingTests/GinkgoRadialBasisFunctionSolver/cuSolver/MapCompactPolynomialC0": check value == 1.0 has failed [0.87500000000000178 != 1]. Relative difference exceeds tolerance [0.142857 > 1e-09]
Failure occurred in a following context:
    Test context represents "Serial" and runs on rank 0 out of 1. Initialized: { Events}

whereas for the reference executor (with the permuted indices, i.e. the 'correct' if branch) I get the correct matrices:

matrix Q
0   0   1   
1   0   1   
1   1   1   
0   1   1   
matrix V
0   0   1   
System matrix
1   0.0277778   0   0.0277778   
0.0277778   1   0.0277778   0   
0   0.0277778   1   0.0277778   
0.0277778   0   0.0277778   1   
Output vertices
0   0   
Evaluation matrix
1   0.0277778   0   0.0277778

@MarcelKoch

@davidscn I've looked a bit into the code and got it running on an AMD machine. Here are some notes to get it running with HIP:

  • the CMake setup has to be changed a bit, since CMake understands HIP as a language (I can provide a patch for that)
  • preCICE (and Kokkos) have to be compiled with amdclang
    • I've compiled Ginkgo with GCC, which I honestly didn't expect to work, but it didn't create issues
  • Kokkos has to be compiled with -fPIC (preCICE already does that)
  • other TPLs, e.g. Boost and Eigen, also need to be compiled with amdclang
    • that everything needs the same compiler is not surprising and just good practice anyway

Regarding the runtime check:

if (std::dynamic_pointer_cast<const gko::ReferenceExecutor>(exec) ||
    std::dynamic_pointer_cast<const gko::OmpExecutor>(exec)) {

This will always be false if Kokkos is built with a GPU device enabled. Since you are now using Kokkos, maybe it would make sense to remove the runtime executor choice from the GinkgoParameter struct.

@davidscn
Owner Author

Thanks!

the cmake setup has to be changed a bit, since cmake understands HIP as a language (I can provide a patch for that)

Yes, this was on the radar; we only recently updated our minimum CMake version to a compatible one.

I've compiled ginkgo with gcc, which I honestly didn't expect to work

You mean you compiled Ginkgo with GCC, whereas you compiled preCICE and Kokkos using amdclang, and you are surprised that the overall setup works? At least in your docs you write that GCC is supported for Ginkgo.

other TPL, e.g. boost, eigen, need also be compiled with amdclang

Are you sure about that? Eigen is header-only and compiled into preCICE anyway, so Eigen and preCICE are compatible by construction. Boost data structures don't interface with device memory or similar, so I could imagine that a Boost compiled with another compiler would work. Using the same compiler for Kokkos, Ginkgo, and preCICE makes sense to me.

This will always be false if Kokkos is built with a GPU device enabled.

Hm, I have to think a bit about it. Having the reference executor for testing and validation definitely has value.

@MarcelKoch

You mean you compiled Ginkgo with GCC, whereas you compiled preCICE and Kokkos using amdclang, and you are surprised that the overall setup works? At least in your docs you write that GCC is supported for Ginkgo.

Yes, I've used different compilers for Ginkgo and preCICE+Kokkos. The issue is not using GCC for Ginkgo, but linking something that was compiled with GCC against something that was compiled with clang. In the past I've had some quite annoying linker errors because of that.

Are you sure about that? Eigen is header-only and compiled into preCICE anyway, so Eigen and preCICE are compatible by construction. Boost data structures don't interface with device memory or similar, so I could imagine that a Boost compiled with another compiler would work. Using the same compiler for Kokkos, Ginkgo, and preCICE makes sense to me.

Recompiling Boost (also with amdclang) was necessary, since it is linked against. Regarding Eigen, I'm not sure it was necessary, but since I had to reinstall Boost, I did the same with Eigen (I installed both through Spack).

@davidscn
Owner Author

I see, thanks for the clarification.

Regarding

This will always be false if Kokkos is built with a GPU device enabled.

I also just realized that you already ported the preprocessor variables indicating the supported backend to the Kokkos setup. Running the Ginkgo reference executor would essentially require compiling Kokkos with CPU support only, right? Is there no way to have the reference executor for reference testing alongside another executor?

@MarcelKoch

So I've made both the reference executor and the device executor work at the same time. This basically requires some more templates for the kernels, since the kernels can now run on either the host or the device.
enable_reference_and_device.patch.txt

@davidscn
Owner Author

Thanks a lot, that's more than appreciated! I will integrate this next week. Do you by chance have the HIP patch available as well? Otherwise, I would just adapt it from Ginkgo's main CMakeLists.txt.

@MarcelKoch

Sure, here is the CMake patch:
fix_hip_cmake_setup.patch.txt

BTW, I noticed that the test doesn't succeed in the hipSolver case, i.e. when using the HIP sparse QR solver. I'm looking into it a bit, but it might be better if someone with more knowledge of what is supposed to happen takes a look.