
Replace ginkgo common kernels by Kokkos #12

Closed
wants to merge 20 commits
Conversation

davidscn
Owner

@davidscn davidscn commented Sep 13, 2023

Main changes of this PR

These changes are compatible with the https://github.com/ginkgo-project/ginkgo/tree/kokkos-extension branch (ginkgo-project/ginkgo#1358) and a sufficiently recent Kokkos version; I used version 4.1.

David Schneider added 3 commits June 6, 2024 10:05
Preventing error messages of the form:
error: An extended __host__ __device__ lambda cannot be defined inside a generic lambda expression("operator()").
@davidscn
Owner Author

davidscn commented Jun 6, 2024

My setup:

  • Ginkgo at develop
  • Kokkos at develop
  • rebased preCICE (more or less) on v3.1.1
  • GCC 10.2
  • CUDA/NVCC 11.4.3
  • NVIDIA A100

Current status:
Compiling works now. I had to inline the lambdas, since the CUDA compiler doesn't allow nesting an extended lambda inside a generic lambda (error: An extended __host__ __device__ lambda cannot be defined inside a generic lambda expression("operator()")).

The most fundamental thing that doesn't work in the described setup is the runtime flexibility. After compiling, the runtime check

if (std::dynamic_pointer_cast<const gko::ReferenceExecutor>(exec) ||
    std::dynamic_pointer_cast<const gko::OmpExecutor>(exec)) {

always evaluates to false, even when I use the reference executor. The test suite runs all available Ginkgo configurations. If I permute the indices for column-major access, the reference executor tests pass and work nicely (they would also pass if the dynamic-cast branching worked).

However, the CUDA executor tests unfortunately still fail. This puzzles me a bit at the moment, as the executed kernel (a) should take the correct if branch and (b) should be identical to the reference executor's.

I ran the following test

mpirun -np 4 ./testprecice --run_test=MappingTests/GinkgoRadialBasisFunctionSolver/cuSolver/MapCompactPolynomialC0

and printed the RBF system matrix as well as the polynomial matrix, using the following code snippet:

  {
    auto tmp = gko::clone(_hostExecutor, _rbfSystemMatrix);
    std::cout << "System matrix" << std::endl;
    // get_size() returns unsigned dimensions, so loop with gko::size_type
    for (gko::size_type i = 0; i < tmp->get_size()[0]; ++i) {
      for (gko::size_type j = 0; j < tmp->get_size()[1]; ++j) {
        std::cout << tmp->at(i, j) << "   ";
      }
      std::cout << std::endl;
    }
  }

Comparing against the very first test this suite executes, the polynomial matrix seems to be correct. For the kernel matrix, I get the following. (Edit: see the last commit, CUDA works as well with the correct permutation.)

matrix Q
0   0   1   
1   0   1   
1   1   1   
0   1   1   
matrix V
0   0   1   
System matrix
1   0.0277778   0.0277778   0.0277778   
0.0277778   1   1   1   
0.0277778   1   1   1   
0.0277778   1   1   1   
Output vertices
0   
0   
Evaluation matrix
0.0277778   0.0277778   0.0277778   0.0277778   
~/precice/src/mapping/tests/RadialBasisFctHelper.hpp(88): error: in "MappingTests/GinkgoRadialBasisFunctionSolver/cuSolver/MapCompactPolynomialC0": check value == 1.0 has failed [0.87500000000000178 != 1]. Relative difference exceeds tolerance [0.142857 > 1e-09]
Failure occurred in a following context:
    Test context represents "Serial" and runs on rank 0 out of 1. Initialized: { Events}

whereas for the reference executor (with the permuted indices, i.e. the 'correct' if branch) I get the correct matrices:

matrix Q
0   0   1   
1   0   1   
1   1   1   
0   1   1   
matrix V
0   0   1   
System matrix
1   0.0277778   0   0.0277778   
0.0277778   1   0.0277778   0   
0   0.0277778   1   0.0277778   
0.0277778   0   0.0277778   1   
Output vertices
0   0   
Evaluation matrix
1   0.0277778   0   0.0277778

@MarcelKoch

@davidscn I've looked a bit into the code and got it running on an AMD machine. Here are some notes to get it running with HIP:

  • the CMake setup has to be changed a bit, since CMake understands HIP as a language (I can provide a patch for that)
  • preCICE (and Kokkos) have to be compiled with amdclang
    • I've compiled Ginkgo with GCC, which I honestly didn't expect to work, but it didn't create issues
  • Kokkos has to be compiled with -fPIC (preCICE already does that)
  • other TPLs, e.g. Boost and Eigen, also need to be compiled with amdclang
    • that everything needs the same compiler is not surprising and just good practice anyway

Regarding the runtime check:

if (std::dynamic_pointer_cast<const gko::ReferenceExecutor>(exec) ||
    std::dynamic_pointer_cast<const gko::OmpExecutor>(exec)) {

This will always be false if Kokkos is built with a GPU device enabled. Since you are now using Kokkos, maybe it would make sense to remove the runtime executor choice from the GinkgoParameter struct.

@davidscn
Owner Author

Thanks!

the cmake setup has to be changed a bit, since cmake understands HIP as a language (I can provide a patch for that)

Yes, this was on the radar; we only recently updated our minimum CMake version to a compatible one.

I've compiled ginkgo with gcc, which I honestly didn't expect to work

You mean you compiled Ginkgo with GCC, whereas you compiled preCICE and Kokkos using amdclang, and you are surprised that the overall setup works? At least in your docs you write that GCC is supported for Ginkgo.

other TPL, e.g. boost, eigen, need also be compiled with amdclang

Are you sure about that? Eigen is header-only and compiled into preCICE anyway, so Eigen and preCICE are compatible by construction. Boost data structures don't interface with device memory or similar, so I could imagine that a Boost compiled with another compiler would work. Using the same compiler for Kokkos, Ginkgo, and preCICE makes sense to me.

This will always be false if Kokkos is built with a GPU device enabled.

Hm, I have to think a bit about it. Having the reference executor for testing and validation definitely has value.

@MarcelKoch

You mean you compiled Ginkgo with GCC, whereas you compiled preCICE and Kokkos using amdclang, and you are surprised that the overall setup works? At least in your docs you write that GCC is supported for Ginkgo.

Yes, I've used different compilers for Ginkgo and preCICE+Kokkos. The issue is not using GCC for Ginkgo, but linking something that was compiled with GCC against something that was compiled with clang. In the past I've had some quite annoying linker errors because of that.

Are you sure about that? Eigen is header-only and compiled into preCICE anyway, so Eigen and preCICE are compatible by construction. Boost data structures don't interface with device memory or similar, so I could imagine that a Boost compiled with another compiler would work. Using the same compiler for Kokkos, Ginkgo, and preCICE makes sense to me.

Recompiling Boost (also with amdclang) was necessary, since it is linked against. Regarding Eigen, I'm not sure it was necessary, but since I had to reinstall Boost, I did the same with Eigen (I installed both through Spack).

@davidscn
Owner Author

I see, thanks for the clarification.

Regarding

This will always be false if Kokkos is built with a GPU device enabled.

I also just realized that you already ported the preprocessor variables indicating the supported backend to the Kokkos setup. Running the Ginkgo reference executor would essentially require compiling Kokkos with CPU support only, right? Is there no way to have the reference executor for reference testing alongside another executor?

@MarcelKoch

So I've made both the reference executor and the device executor work at the same time. This basically requires some more templates for the kernels, since the kernels can now run on either the host or the device.
enable_reference_and_device.patch.txt

@davidscn
Owner Author

Thanks a lot, that's more than appreciated! I will integrate this next week. Do you by chance have the HIP patch available as well? Otherwise, I would just adapt it from Ginkgo's main CMakeLists.txt.

@MarcelKoch

Sure, here is the CMake patch:
fix_hip_cmake_setup.patch.txt

BTW, I noticed that the test doesn't succeed in the hipSolver case, i.e. when using the HIP sparse QR solver. I'm looking into it a bit, but it might be better if someone with more knowledge of what is supposed to happen takes a look.