11institutetext: National Advanced Computing Laboratory, National High Technology Center, San José, Costa Rica
11email: {jruzicka, casch, emeneses}@cenat.ac.cr
22institutetext: School of Computing, Costa Rica Institute of Technology, Costa Rica
33institutetext: Max Planck Computing and Data Facility, Max Planck Society, Garching, Germany
33email: {markus.rampp, erwin.laure}@mpcdf.mpg.de

A Study of Performance Portability in Plasma Physics Simulations

Josef Ruzicka 1122 0009-0003-4423-2612    Christian Asch 11 0000-0002-3111-4858    Esteban Meneses 1122 0000-0002-4307-6000    Markus Rampp 33 0000-0001-8177-8698    Erwin Laure 33 0000-0002-9901-9857
Abstract

The high-performance computing (HPC) community has recently seen a substantial diversification of hardware platforms and their associated programming models. From traditional multicore processors to highly specialized accelerators, vendors and tool developers back up the relentless progress of those architectures. In the context of scientific programming, it is fundamental to consider performance portability frameworks, i.e., software tools that allow programmers to write code once and run it on different computer architectures without sacrificing performance. We report here on the benefits and challenges of performance portability using a field-line tracing simulation and a particle-in-cell code, two relevant applications in computational plasma physics with applications to magnetically-confined nuclear-fusion energy research. For these applications we report performance results obtained on four HPC platforms with server-class CPUs from Intel (Xeon) and AMD (EPYC), and high-end GPUs from Nvidia and AMD, including the latest Nvidia H100 GPU and the novel AMD Instinct MI300A APU. Our results show that both Kokkos and OpenMP are powerful tools to achieve performance portability and decent “out-of-the-box” performance, even for the very latest hardware platforms. For our applications, Kokkos provided performance portability to the broadest range of hardware architectures from different vendors.

Keywords:
Parallel Programming Performance Portability Plasma Physics.

1 Introduction

As the energetic and environmental needs of humanity call for a turning away from fossil fuels, clean and renewable energy sources with little effect on global warming, such as nuclear fusion, receive increasing attention in research and society [8]. Fusion energy research, in particular, is well known to require vast amounts of computational power for modeling the complex plasma-physics phenomena spanning large ranges of spatial and temporal scales. Due to the long-term efforts invested in developing the simulation codes, it is highly desirable to write them in a performance-portable way, i.e., allowing the reuse of the same source code on different high-performance computing (HPC) systems and without sacrificing application performance.

Commonly, the “MPI+X” programming paradigm is adopted, which employs the Message Passing Standard (MPI) for handling parallelism across nodes, complemented by some other model (X) for intra-node parallelism (including accelerators), which is typically relying on some sort of shared memory. One of the most relevant models in this respect is OpenMP, a directive-based approach to parallelism that also supports offloading computation to accelerators while maintaining compatibility for shared memory parallelism in the CPU [9], enabling portability across computing architectures. It is a mature standard that requires a compiler capable of interpreting the directives and also provides additional functionality through a library API, allowing deeper optimizations. Another framework that focuses on performance portability is Kokkos [25], which is implemented as a library and enables portability of C++ codes between CPUs and GPUs. It allows to use the same data structures for different devices (e.g. CPU or GPU) by providing a polymorphic data layout that adjusts to the device that is being used, thereby, for example, addressing the often-encountered design decision of array of structures (AoS) versus structure of arrays (SoA), none of which leading to a memory-access pattern that is well-suited for different kinds of device at the same time. These polymorphic data structures are called Views and are described as “a potentially reference counted multidimensional array with compile-time layouts and memory space”. In other words, they account for blocked or coalesced data access patterns based on the specified memory and execution spaces defined for multicore or GPU systems, respectively [23, 6].

In this paper, we evaluate the Kokkos framework using two different plasma-physics simulation codes, a particle-in-cell (PIC) code and a field-line tracer code. We compare the computational performance of the Kokkos code variants with the corresponding baseline implementations on four HPC systems using two different CPU models and five different GPU models.

Previously, our team developed BS-SOLCTRA [13], an OpenMP code for multicore systems for tracing field-lines that is used for simulating different stellarator coil configurations. The data structures for representing the particles were based on the AoS approach. This work was later continued by porting the code to GPUs, still using OpenMP. However, for performance reasons, this required transforming the data layout to one-dimensional arrays, thereby compromising portability (in the sense of a single source code) between architectures, as the new one-dimensional array variant for the GPU showed the expected underperformance on the CPU. This experience specifically motivated us to consider Kokkos with its polymorphic data structures and its promise of maintaining a single-source code by transparently accounting for the most appropriate memory-access pattern on both CPUs and GPUs [14]. Our assessment is complemented and broadened by considering, in addition, a simulation code based on the particle-in-cell (PIC) method, which is one of the most commonly used methods in computational plasma physics.

The contributions of this paper are two-fold. First, we develop performance-portable variants of our field-line tracer code, BS-SOLCTRA, and a representative PIC code. Second, we offer a comparative evaluation of Kokkos and OpenMP for the different variants of the code.

The remainder of this paper is structured as follows: Section 2 describes the numerical methods used in our plasma physics simulations, provides the definitions used for performance portability, and lists some related work on performance portability and plasma physics simulations. In Section 3, we describe our baseline implementations of the two codes and the way we adapt them to Kokkos. Section 4 provides details on the setup, and the results of our performance-evaluation. Finally, Section 5 summarizes insights and key findings of our work.

2 Background

2.1 Basics of the Numerical Models

Refer to caption
(a) Field-line tracer
Refer to caption
(b) Particle-in-cell
Figure 1: 2D plasma physics simulation representations

Fusion energy is generated in form of the released binding energy when light atomic nuclei (e.g. Hydrogen) fuse to form a heavier nucleus (e.g. Helium). This process is controlled in thermonuclear fusion devices such as stellarators, which consist of a helically symmetric toroidal tube encircled by complex solenoidal coils that produce a strong magnetic field which confines the plasma within the vessel. The design, assembly, and actual experimental testing of a stellarator is technologically complex and expensive to carry out. Therefore, different scenarios and configurations are tested and optimized in simulations before they are actually manufactured [17, 21].

Field-line tracer. Magnetic fields can be represented and traced as a series of field-lines that exert a Lorentz force on charged particles in a plasma (visualized in Figure 1(a)), to analyze and gain insights into how the magnetic field influences their trajectories. This is usually done by approximating the differential equations that describe the behavior of the magnetic field in the simulated space. This simulation is particularly useful for studying spaces like those used in plasma confinement, such as stellarators and tokamaks.

Particle-in-cell. In this numerical method (cf. Figure 1(b)), plasma is represented as a collection of charged “macroparticles” of different kinds (typically ions and electrons) within a grid structure. The grid is used to numerically solve Maxwell’s equations [12], given the electric charge and current densities obtained by interpolation from the positions of the particles. The latter, in turn, are evolved in time by the acceleration computed from the (interpolated) electromagnetic forces, as obtained from the solution of the Maxwell equations [12, 7, 10, 17].

2.2 Performance Portability

The performance of an application can be measured by different metrics, for example, by the time-to-solution, the number of floating point operations per second (FLOP/s), or energy consumption, among others. In this paper, we adopt the definitions for performance, portability, and performance portability provided by Pennycook et al. [20]. Performance is “any measurable property of an application’s correct execution of a problem on a platform.” Portability is “the ability of an application to execute a problem correctly on a given set of platforms”. Finally, performance portability is “a measurement of an application’s performance efficiency for a given problem that can be executed correctly on all platforms in a given set”. For our different experiment configurations (platform, simulation, and programming model combination), we shall measure efficiency as their application efficiency, which is defined as the “achieved performance as a fraction of their best observed performance”, for example, the best achieved time-to-solution on a certain platform and simulation combination. Additionally, we use their proposed metric of performance portability \textipa\textqplig, which is given by the harmonic mean of i𝑖iitalic_i experiment configurations’ performance efficiency e𝑒eitalic_e observed across a set of platforms H𝐻Hitalic_H (in our case, HPC systems) when running our applications a𝑎aitalic_a (for this study, a programming model) that solve a problem p𝑝pitalic_p (one of the simulations methods). Note that if a certain experiment configuration is not available for a certain platform in H𝐻Hitalic_H, performance portability is nil. This is given by Equation 1.

\textipa\textqplig(a,p,H)={|H|iH1ei(a,p)if i is supported iH0otherwise\textipa\textqplig𝑎𝑝𝐻cases𝐻subscript𝑖𝐻1subscript𝑒𝑖𝑎𝑝if 𝑖 is supported for-all𝑖𝐻0otherwise\text{\textipa{\textqplig}}(a,p,H)=\begin{cases}\frac{|H|}{\sum_{i\in H}\frac{% 1}{e_{i}(a,p)}}&\text{if }i\text{ is supported }\forall\ i\in H\\ 0&\text{otherwise}\end{cases}( italic_a , italic_p , italic_H ) = { start_ROW start_CELL divide start_ARG | italic_H | end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_i ∈ italic_H end_POSTSUBSCRIPT divide start_ARG 1 end_ARG start_ARG italic_e start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_a , italic_p ) end_ARG end_ARG end_CELL start_CELL if italic_i is supported ∀ italic_i ∈ italic_H end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL otherwise end_CELL end_ROW (1)

Different libraries or programming models have been developed to program performance portable code. We chose Kokkos to adapt our simulations because of the advantages it provides as a popular, well-documented, well-maintained, high-level parallel programming library meant for shared-memory systems.

2.3 Related Work

A number of previous studies in performance portability have employed the metric we defined in Equation 1 as a standard [3, 15, 5, 11, 24, 4, 18, 27]. Deakin et al. [3] obtained their best performance portability results using OpenMP and Kokkos in one of the largest studies of performance portability across different applications, architectures and programming models, on which they note that some overhead should be expected given that Kokkos’ CPU abstraction is implemented using OpenMP. Deakin et al[5] also report their best performance portability results when using Kokkos and OpenMP over various applications and architectures. Kwack et al[4] employed a variety of performance portability metrics to measure various programming models on different architectures across different applications. Their findings highlight that each model has distinct strengths and the optimal choice of model varies depending on the specific application. Muralikrishnan et al. [19] investigated performance portability and scalability in a series of mini-apps related to the PIC scheme, on which Kokkos structures were used with OpenMP and MPI.

Similar observations are noted by Wright et al[27] in their performance portability survey of plasma physics simulations. The findings from these studies collectively indicate that both our chosen performance portability metric and programming model (out of many more options, e.g. RAJA [2], SYCL [22], Alpaka [28], StarPU [1], among others) are reasonable and timely choices.

3 Code Implementation

3.1 Baseline Implementation

Field-line Tracer. The Biot-Savart Solver for Computing and Tracing Magnetic Field-Lines (BS-SOLCTRA) [13] has been used to study plasma confinement in fusion devices. The simulation applies the field-line tracing technique in a 3D vacuum magnetic field and uses Biot-Savart’s law to test distinct modular electromagnetic coil configurations.

The movement of each line, represented by a computational particle, is obtained by updating its current coordinates in each iteration of the simulation by computing the effect of the magnetic field that the set of modular coils generates, ignoring any impact by the interaction between particles. The integration employs a fourth-order Runge-Kutta method. This code was chosen due to its familiarity to the collaborating laboratories and its significance in their research.

The baseline code employed OpenMP to offload computations to the GPU, enabling the calculation of particle positions in parallel through shared memory. This code stores particles as one-dimensional arrays to better fit the GPU architectures, having been previously adapted from a CPU-only implementation that stored particles as an AoS and used SIMD-vectorization to optimize performance. A straightforward MPI implementation let us study the simulation’s behaviour when distributing the particles between different computer nodes and GPUs. The performance of BS-SOLCTRA is measured in terms of time-to-solution, with lower values indicating higher application performance.

Particle-in-cell. The Parallel Research Kernels (PRK) [26] is a suite of computational kernels related to linear algebra, stencil computations, graph algorithms, and particle simulations, among others. PRKs are designed to be tested in an exploratory manner with different HPC programming models, such as Kokkos, and OpenMP. One kernel of particular interest to this study is a PIC code that uses Coulomb’s law to calculate the forces exerted on each particle by the electromagnetic fields generated by the grid charges of its current cell. To induce load imbalance and control the evolution of the particles’ distribution, they can be initialized using four different patterns.

The baseline code uses OpenMP on CPU, and we used it here to incorporate Kokkos. The computational performance of PIC is conventionally measured as the number of particles moved per second, a metric which we adopt in the following, with larger values indicating higher application performance.

Note that the adaptations of the two codes employed in this study were developed in opposite directions, i.e., BS-SOLCTRA was ported from GPU to CPU, and the other way around with PIC. 111The code variants used can be found in this version control repository: https://gitlab.com/CNCA_CeNAT/MPCDF/kokkos-bs-solctra.

3.2 Kokkos Adaptation

Integration of Kokkos into the base codes required converting the data structures used to store the information about the particles, coils, among others, from C-style arrays of elements to Kokkos views. The conversion involved updating the source and header files to accommodate the new data structures, as certain functions in the modularized source code files relied on passing these data structures as parameters. Listing 1 exemplifies how data is handled in the different variants, and Listing 2 shows how the computation kernels were parallelized.

Listing 1: Data Management.
// OpenMP
// Memory allocation
particles = static_cast<double*>(aligned_alloc(alignment_size, sizeof(double)*particles_size));
// Initialize data normally...
// Device data transfer
#pragma omp target enter data map(to:particles[0:particles_size])
// --------
// Kokkos
// Memory space setting and memory allocation
#ifdef KOKKOS_ENABLE_CUDA
#define MemSpace Kokkos::CudaSpace
#endif
#ifdef KOKKOS_ENABLE_HIP
#define MemSpace Kokkos::HIPSpace
#endif
#ifndef MemSpace
#define MemSpace Kokkos::HostSpace
#endif
typedef Kokkos::View<double*, MemSpace> ViewVectorType;
ViewVectorType particles("particles", particles_size);
ViewVectorType::HostMirror h_particles = Kokkos::create_mirror_view(particles);
// Initialize data normally...
// Device data transfer
Kokkos::deep_copy(particles, h_particles);
Listing 2: Kernel Parallelization.
// OpenMP
#pragma omp parallel num_threads(2)
{
#pragma omp single
{
for (int i = 1; i <= steps; i++){
#pragma omp target teams distribute parallel for
for(int p=0; p < particle_count ; p++){
// simulation computation...
}}}}
// --------
// Kokkos
// Execution space and range policy setting
using ExecSpace = MemSpace::execution_space;
using Kokkos::RangePolicy<ExecSpace> range_policy;
Kokkos::parallel_for("run_particles_parallel_reduce", range_policy(0,particle_count), KOKKOS_LAMBDA (int p)
{
// simulation computation...
});

In the PIC OpenMP code, the particles are defined as structs and then dynamically allocated in an array using pointer arithmetic, while the grid structure is created using traditional malloc calls. To integrate Kokkos, these data structures were adapted by keeping the struct definition and storing it within a Kokkos view, and replacing the malloc with a Kokkos view allocation. Finally, keywords like KOKKOS_ARCH = "Ampere80" are used so that Kokkos can hint the compiler to optimize the code for a specific target architecture.

4 Experiments

4.1 Setup

The simulations for this study were conducted on four different HPC systems: Kabré at Costa Rica’s National High Technology Center (CeNAT), Raven at the Max Planck Computing and Data Facility (MPCDF), and benchmark systems at AMD and Nvidia, respectively. The specifications of the machines and the corresponding software configuration can be found in Table 1 and Table 2, respectively. Note that our GPU codes do not make use of the novel “unified” memory architecture connecting the GPU and the CPU parts of the latest Nvidia GH200 and AMD MI300A chips, i.e., for our benchmarks, only the GPU parts of these platforms are relevant.

Table 1: HPC Systems: hardware configuration
Machine Partition Nodes CPU
Cores
per node
SMT GPU
RAM (GB)
per node
Theoretical
FP64 peak TFlop/s
per node
Kabré CPU 8
Intel Xeon
Gold 6354
36 yes - 512 1.9*
GPU 4
Intel Xeon
Silver 4214R
24 - NVIDIA V100 32 7
Raven CPU 1592    
Intel Xeon
Platinum 8360Y
72 yes - 256 3*
GPU 192
Intel Xeon
Platinum 8360Y
72 yes    
NVIDIA
A100-SXM4-40GB
512 9.7
   AMD Accelerator Cloud GPU 1
2x AMD
EPYC 7F52
2x 16 - AMD MI210 512 22.6
GPU 1 AMD Zen 4 24 yes
AMD MI300A
(A0 stepping)
128 81.7
NVIDIA GH200 GPU 1 NVIDIA Grace 72 - NVIDIA H100 480 34
* Reported by Microway [16]
Table 2: HPC Systems: software configuration
Machine OS Compiler Kokkos MPI
Kabré CentOS 7    GCC 9.2 + CUDA 11.6.2 4.3.0-RC -
Raven    SUSE Linux Enterprise Server 15-SP3 NVIDIA HPC-SDK 22 4.3.0-RC    Open MPI 4
   AMD Accelerator Cloud Ubuntu 22.04.2 LTS ROCm 6.1.0    4.3.0-RC -
NVIDIA GH200 Rocky Linux 9.3 NVIDIA HPC-SDK 24.3 4.3.0-RC -

Computational performance for BS-SOLCTRA was measured for three different problem sizes over 1,000 iterations of the simulation, using the same setup as in our previous study [14]. For PIC, four particle initialization patterns were used to induce different load imbalance scenarios. The problem sizes and the initialization patterns are stated in Table 3. The performance metric is time-to-solution in seconds (lower is better) for BS-SOLCTRA, and the rate of particles moved per second (higher is better) for PIC. For each measurement we report the arithmetic mean of 10 executions, with no notable run-to-run variations.

Table 3: Simulation setup
Simulation Architecture Processing unit Code variant Problem size Initialization pattern
   BS-SOLCTRA CPU Xeon Platinum 8360Y     Kokkos, OpenMP 102400, 512000, 1024000 particles Random
GPU    V100, A100, MI210, MI300A, H100
GPU + MPI A100*
PIC CPU
Xeon Gold 6354,
Xeon Platinum 8360Y
   1000000 particles Geometric, Linear, Patch, Sinusoidal
GPU V100, A100
* For our multi-node scaling experiments, 2, 4, 8 & 16 A100 GPUs were used

4.2 Field-line Tracer

GPU results. The Kokkos adaptation of BS-SOLCTRA code resulted in a performance drop for the older V100 and A100 GPUs compared to the original OpenMP-based code variant. On the MI210 GPU, Kokkos performed as well as OpenMP. However, for the very latest generation of H100 GPU and MI300A APU, OpenMP shows an apparent underperformance when compared with the Kokkos results (cf. Figure 2(b)). This may hint at OpenMP compilers (contrary to the platform-native CUDA and ROCM backends employed by Kokkos) being not yet fully tuned for this very latest generation of high-end GPUs. For the MI300A, in addition, all our measured performance numbers have to be interpreted with some caution, as we had only access to a pre-production sample “A0 stepping”) which does not yet meet all the final specifications. At the same time it is remarkable that Kokkos delivers very decent out-of-the-box performance for hardware that was introduced only very recently. Not too unsurprisingly, and consistent with findings reported in the literature cited in Section 2, there is no single combination of programming model and computer architecture that is always optimal for all applications.

Refer to caption
(a) CPU adaptation on Xeon Platinum 8360Y with the OpenMP (dashed, and dashed-dotted lines) and Kokkos variants (solid lines).
Refer to caption
(b) GPU adaptation on different GPU models (colours) with the OpenMP variant (dashed lines) and the Kokkos variant (solid lines).
Figure 2: Time-to-solution for the three different problem sizes on a single CPU/GPU with the OpenMP and Kokkos code variants of BS-SOLCTRA.

When comparing BS-SOLCTRA’s code variants across GPU architectures, the H100 and MI300A GPUs demonstrated a clear performance hierarchy, outperforming the previous-generation A100 and MI210, respectively. In turn, the A100 and MI210 outperformed the older V100 GPU. These results are broadly consistent across code variants and all GPU architectures (except for the case on which the A100 performed better than the MI210 with OpenMP, while they achieved similar performance with Kokkos), and reflect the compute-bound nature of our application in relation to the peak Tflop/s ratings of the different GPUs (cf. Table 1).

The different setups resulted in a near-linear relation of problem size with time-to-solution on the various GPUs tested. The smaller problem sizes likely failed to fully saturate the more powerful GPUs, which explains the comparably better performance of the V100 for the smallest problem size.

We used the metric in Equation 1 to measure the performance portability of our simulation. For each application (code variant), we selected the best performance achieved on each architecture for the largest problem size. We then identified the overall best performance across both applications to calculate application efficiencies shown in Table 4. Considering all the GPUs used, the measured performance portability (Table 6) was 96 %times96percent96\text{\,}\mathrm{\char 37\relax}start_ARG 96 end_ARG start_ARG times end_ARG start_ARG % end_ARG for our Kokkos implementation, and 78 %times78percent78\text{\,}\mathrm{\char 37\relax}start_ARG 78 end_ARG start_ARG times end_ARG start_ARG % end_ARG for OpenMP.

Table 4: BS-SOLCTRA application efficiency for the largest problem size (1024000 particles)
Platform Code variant Time-to-solution Application efficiency
V100 OpenMP 270.39 100 %times100percent100\text{\,}\mathrm{\char 37\relax}start_ARG 100 end_ARG start_ARG times end_ARG start_ARG % end_ARG
Kokkos 300.08 90 %times90percent90\text{\,}\mathrm{\char 37\relax}start_ARG 90 end_ARG start_ARG times end_ARG start_ARG % end_ARG
A100 OpenMP 216.05 100 %times100percent100\text{\,}\mathrm{\char 37\relax}start_ARG 100 end_ARG start_ARG times end_ARG start_ARG % end_ARG
Kokkos 242.91 89 %times89percent89\text{\,}\mathrm{\char 37\relax}start_ARG 89 end_ARG start_ARG times end_ARG start_ARG % end_ARG
MI210 OpenMP 241.53 99 %times99percent99\text{\,}\mathrm{\char 37\relax}start_ARG 99 end_ARG start_ARG times end_ARG start_ARG % end_ARG
Kokkos 241.35 100 %times100percent100\text{\,}\mathrm{\char 37\relax}start_ARG 100 end_ARG start_ARG times end_ARG start_ARG % end_ARG
MI300A OpenMP 166.52 61 %times61percent61\text{\,}\mathrm{\char 37\relax}start_ARG 61 end_ARG start_ARG times end_ARG start_ARG % end_ARG
Kokkos 102.09 100 %times100percent100\text{\,}\mathrm{\char 37\relax}start_ARG 100 end_ARG start_ARG times end_ARG start_ARG % end_ARG
H100 OpenMP 172.87 57 %times57percent57\text{\,}\mathrm{\char 37\relax}start_ARG 57 end_ARG start_ARG times end_ARG start_ARG % end_ARG
Kokkos 98.95 100 %times100percent100\text{\,}\mathrm{\char 37\relax}start_ARG 100 end_ARG start_ARG times end_ARG start_ARG % end_ARG
Xeon Platinum 8630Y OpenMP 7763.10 37 %times37percent37\text{\,}\mathrm{\char 37\relax}start_ARG 37 end_ARG start_ARG times end_ARG start_ARG % end_ARG
Kokkos 2889.40 100 %times100percent100\text{\,}\mathrm{\char 37\relax}start_ARG 100 end_ARG start_ARG times end_ARG start_ARG % end_ARG

CPU results. Using an Intel Xeon Platinum 8360Y CPU node, we compare our Kokkos adaptation of BS-SOLCTRA with the one-dimensional array and AoS OpenMP variants.222In our prior work [14], porting the OpenMP CPU code variant to the GPU favored one-dimensional arrays for performance, while AoS remained preferable for the CPU. For this reason, our AoS OpenMP code variant was not ported to the GPU. The AoS code variant relies on (guided) autovectorization by the compiler, i.e. we apply the #pragma omp simd directive, together with the relevant compiler flags (-mavx512f -mavx512pf -mavx512er -mavx512cd) in order to enable AVX-512 instructions for the Intel Xeon CPUs. The impact of using the different code variants can be observed in Figure 2(a).

Our Kokkos implementation achieved a 2.7x speedup compared to the sub-optimal OpenMP one-dimensional array variant, demonstrating portability benefits. However, it fell short of the performance of the AoS OpenMP variant. While Kokkos exhibited a 2.5x slowdown relative to AoS OpenMP, this is significantly less severe than the 6.7x slowdown observed with one-dimensional array OpenMP. When included in the set of platforms with the GPUs previously used, Kokkos achieved an application efficiency of 100 %times100percent100\text{\,}\mathrm{\char 37\relax}start_ARG 100 end_ARG start_ARG times end_ARG start_ARG % end_ARG and a performance portability of 96 %times96percent96\text{\,}\mathrm{\char 37\relax}start_ARG 96 end_ARG start_ARG times end_ARG start_ARG % end_ARG. By contrast, the one-dimensional array OpenMP variant’s application efficiency of 37 %times37percent37\text{\,}\mathrm{\char 37\relax}start_ARG 37 end_ARG start_ARG times end_ARG start_ARG % end_ARG resulted in a performance portability of 66 %times66percent66\text{\,}\mathrm{\char 37\relax}start_ARG 66 end_ARG start_ARG times end_ARG start_ARG % end_ARG. This comparison demonstrates Kokkos’s ability to deliver more consistent performance across diverse architectures. 333In this measurement, we intentionally used our sub-optimal one-dimensional array OpenMP variant for consistency with the code employed in our GPU runs. The CPU optimized SoA code variant was included to provide a perspective of the impact in loss of performance in the architecture for which the code was originally optimized by programming performance portable code.

Multi-node scalability. For completeness, Figure 3(a) presents strong and weak scaling results on up to 16 A100 GPUs (4 Raven nodes). We note that the code does not involve particle interactions which would introduce communication or synchronization across GPUs and hence ideal parallel scaling should be expected. While this is experimentally confirmed for the weak scaling (Figure 3(b)), the strong scaling speedups from 1 to 16 GPUs are only 12.6 (Kokkos) and 12.1 (OpenMP). This can be explained by a progressive under-utilization of the individual GPUs as the workload per GPU decreases with increasing number of GPUs. Such under-saturation can also explain the performance difference between the Kokkos and OpenMP variants narrowing as more GPUs are used. Overall, measured multi-node performances are consistent with the single-GPU findings discussed earlier, and no unexpected interference of the intra-node performance with inter-node MPI parallelization is found.

Refer to caption
(a) Strong Scaling
Refer to caption
(b) Weak Scaling
Figure 3: BS-SOLCTRA parallel scaling (runtime as a function of employed A100 GPUs) for the Kokkos (solid lines) and the OpenMP (dashed lines) code variants for three different problem sizes (colour coded).

4.3 Particle-in-cell

CPU results. Figure 4(a) shows that, across all initialization patterns, our Kokkos adaptation outperformed the OpenMP variant in terms of particle movement rates. On Kabré’s Intel Xeon Gold 6354 CPU, Kokkos achieved a particle movement rate 1.4 times faster than OpenMP. This advantage increased to 1.6 times on Raven’s Intel Xeon Platinum 8360Y CPUs. These results highlight the effectiveness of Kokkos’ automatic memory allocation layout and its data access patterns optimizations, overcoming the need to implement architecture-specific data structures. This performance difference likely stems from the OpenMP variant storing the simulation grid in a one-dimensional array format, and considering the nature of GPU memory accessing patterns. Overall, the Xeon Platinum CPU reached particle movement rates about 1.7 times higher than the Xeon Gold for both code variants.

Refer to caption
(a) CPU adaptation on Xeon Gold 6354 (blue) and Platinum 8360Y (green).
Refer to caption
(b) GPU adaptation on V100 (blue) and A100 (green).
Figure 4: Performance on a single CPU/GPU for the Kokkos (cross hatch) and OpenMP (solid fill) variants of the PIC code for different initialization patterns. The black error bars indicate the measured run-to-run variations.

As the performance difference between configurations remained consistent across load-balancing scenarios, we arbitrarily chose the geometric initialization results to measure the application’s performance portability. Based on the efficiencies in Table 5, Kokkos achieved 100 %times100percent100\text{\,}\mathrm{\char 37\relax}start_ARG 100 end_ARG start_ARG times end_ARG start_ARG % end_ARG performance portability, while OpenMP reached only 67 %times67percent67\text{\,}\mathrm{\char 37\relax}start_ARG 67 end_ARG start_ARG times end_ARG start_ARG % end_ARG (Table 6).

GPU Results. When running the simulation on a single V100 or A100 node, the particle movement rate, visualized in Figure 4(b), increased by up to 13 times with Kokkos compared to the results previously obtained with the CPU implementation. OpenMP achieved an even higher improvement, reaching up to a 20 times increase in particle movement rate. This is expected for a compute-bound code like PIC, given the significant difference in Tflop/s between CPUs and GPUs.

Table 5: PIC application efficiency for the geometric initialization pattern
Platform Code variant
Particle
movement rate
Application
efficiency
V100 OpenMP 3665.62 96 %times96percent96\text{\,}\mathrm{\char 37\relax}start_ARG 96 end_ARG start_ARG times end_ARG start_ARG % end_ARG
Kokkos 3805.13 100 %times100percent100\text{\,}\mathrm{\char 37\relax}start_ARG 100 end_ARG start_ARG times end_ARG start_ARG % end_ARG
A100 OpenMP 8541.44 100 %times100percent100\text{\,}\mathrm{\char 37\relax}start_ARG 100 end_ARG start_ARG times end_ARG start_ARG % end_ARG
Kokkos 7867.13 92 %times92percent92\text{\,}\mathrm{\char 37\relax}start_ARG 92 end_ARG start_ARG times end_ARG start_ARG % end_ARG
    Xeon Platinum 8630Y    OpenMP 612.49 64 %times64percent64\text{\,}\mathrm{\char 37\relax}start_ARG 64 end_ARG start_ARG times end_ARG start_ARG % end_ARG
Kokkos 960.67 100 %times100percent100\text{\,}\mathrm{\char 37\relax}start_ARG 100 end_ARG start_ARG times end_ARG start_ARG % end_ARG
Xeon Gold 6354 OpenMP 425.41 72 %times72percent72\text{\,}\mathrm{\char 37\relax}start_ARG 72 end_ARG start_ARG times end_ARG start_ARG % end_ARG
Kokkos 593.71 100 %times100percent100\text{\,}\mathrm{\char 37\relax}start_ARG 100 end_ARG start_ARG times end_ARG start_ARG % end_ARG
Table 6: Performance portability measurements
Simulation p𝑝pitalic_p Platform set H𝐻Hitalic_H Code variant a𝑎aitalic_a \textipa\textqplig(a,p,H)𝑎𝑝𝐻(a,p,H)( italic_a , italic_p , italic_H )
   BS-SOLCTRA {V100, A100, MI210, MI300A, H100} OpenMP 78 %times78percent78\text{\,}\mathrm{\char 37\relax}start_ARG 78 end_ARG start_ARG times end_ARG start_ARG % end_ARG
Kokkos 96 %times96percent96\text{\,}\mathrm{\char 37\relax}start_ARG 96 end_ARG start_ARG times end_ARG start_ARG % end_ARG
{Xeon Platinum 8630Y}    OpenMP 37 %times37percent37\text{\,}\mathrm{\char 37\relax}start_ARG 37 end_ARG start_ARG times end_ARG start_ARG % end_ARG
Kokkos 100 %times100percent100\text{\,}\mathrm{\char 37\relax}start_ARG 100 end_ARG start_ARG times end_ARG start_ARG % end_ARG
    {V100, A100, MI210, MI300A, H100, Xeon Platinum 8630Y} OpenMP 66 %times66percent66\text{\,}\mathrm{\char 37\relax}start_ARG 66 end_ARG start_ARG times end_ARG start_ARG % end_ARG
Kokkos 96 %times96percent96\text{\,}\mathrm{\char 37\relax}start_ARG 96 end_ARG start_ARG times end_ARG start_ARG % end_ARG
PIC {Xeon Platinum 8630Y, Xeon Gold 6354} OpenMP 67 %times67percent67\text{\,}\mathrm{\char 37\relax}start_ARG 67 end_ARG start_ARG times end_ARG start_ARG % end_ARG
Kokkos 100 %times100percent100\text{\,}\mathrm{\char 37\relax}start_ARG 100 end_ARG start_ARG times end_ARG start_ARG % end_ARG
{V100, A100} OpenMP 98 %times98percent98\text{\,}\mathrm{\char 37\relax}start_ARG 98 end_ARG start_ARG times end_ARG start_ARG % end_ARG
Kokkos 96 %times96percent96\text{\,}\mathrm{\char 37\relax}start_ARG 96 end_ARG start_ARG times end_ARG start_ARG % end_ARG
{V100, A100, Xeon Platinum 8630Y, Xeon Gold 6354} OpenMP 80 %times80percent80\text{\,}\mathrm{\char 37\relax}start_ARG 80 end_ARG start_ARG times end_ARG start_ARG % end_ARG
Kokkos 98 %times98percent98\text{\,}\mathrm{\char 37\relax}start_ARG 98 end_ARG start_ARG times end_ARG start_ARG % end_ARG

Our experiments revealed contrasting performance trends for the V100 and A100 GPUs. Regardless of the load balancing strategy, Kokkos outperformed OpenMP on the V100. However, the opposite was true for the A100, where OpenMP edged out Kokkos. The performance gap between Kokkos and OpenMP was narrower on both GPUs compared to the CPU versions (where Kokkos excelled). This suggests that Kokkos exhibits better performance portability across architectures. Overall, the A100 produced particle movement rates over 2 times higher than the V100, which is consistent with their Tflop/s difference.

The application efficiencies obtained for these runs were 100 %times100percent100\text{\,}\mathrm{\char 37\relax}start_ARG 100 end_ARG start_ARG times end_ARG start_ARG % end_ARG for Kokkos and 96 %times96percent96\text{\,}\mathrm{\char 37\relax}start_ARG 96 end_ARG start_ARG times end_ARG start_ARG % end_ARG for OpenMP with the V100, and 92 %times92percent92\text{\,}\mathrm{\char 37\relax}start_ARG 92 end_ARG start_ARG times end_ARG start_ARG % end_ARG for Kokkos and 100 %times100percent100\text{\,}\mathrm{\char 37\relax}start_ARG 100 end_ARG start_ARG times end_ARG start_ARG % end_ARG for OpenMP with the A100. The performance portability metric for the set of GPUs resulted in 96 %times96percent96\text{\,}\mathrm{\char 37\relax}start_ARG 96 end_ARG start_ARG times end_ARG start_ARG % end_ARG for Kokkos and 98 %times98percent98\text{\,}\mathrm{\char 37\relax}start_ARG 98 end_ARG start_ARG times end_ARG start_ARG % end_ARG for OpenMP. Integrating these GPU results with the set of CPUs used earlier, the performance portability metrics obtained were 98 %times98percent98\text{\,}\mathrm{\char 37\relax}start_ARG 98 end_ARG start_ARG times end_ARG start_ARG % end_ARG for our Kokkos code and 80 %times80percent80\text{\,}\mathrm{\char 37\relax}start_ARG 80 end_ARG start_ARG times end_ARG start_ARG % end_ARG for our OpenMP code.

4.4 Discussion

Our experiments demonstrate that Kokkos’ ability to adapt memory layouts across architectures significantly enhances performance portability. Kokkos achieved performance portability between 96 %times96percent96\text{\,}\mathrm{\char 37\relax}start_ARG 96 end_ARG start_ARG times end_ARG start_ARG % end_ARG and 100 %times100percent100\text{\,}\mathrm{\char 37\relax}start_ARG 100 end_ARG start_ARG times end_ARG start_ARG % end_ARG, with an average of 98 %times98percent98\text{\,}\mathrm{\char 37\relax}start_ARG 98 end_ARG start_ARG times end_ARG start_ARG % end_ARG. This significantly outperforms OpenMP, whose portability ranged from 37 %times37percent37\text{\,}\mathrm{\char 37\relax}start_ARG 37 end_ARG start_ARG times end_ARG start_ARG % end_ARG and 98 %times98percent98\text{\,}\mathrm{\char 37\relax}start_ARG 98 end_ARG start_ARG times end_ARG start_ARG % end_ARG, with an average of 71 %times71percent71\text{\,}\mathrm{\char 37\relax}start_ARG 71 end_ARG start_ARG times end_ARG start_ARG % end_ARG. While Kokkos achieved good performance portability in BS-SOLCTRA (never falling below 96 %times96percent96\text{\,}\mathrm{\char 37\relax}start_ARG 96 end_ARG start_ARG times end_ARG start_ARG % end_ARG and outperforming OpenMP’s maximum of 78 %times78percent78\text{\,}\mathrm{\char 37\relax}start_ARG 78 end_ARG start_ARG times end_ARG start_ARG % end_ARG), it did introduce a slowdown of up to 2.5 times in some configurations compared to the original, optimized code (which required significant architecture-specific refactoring). This slowdown might be acceptable considering the larger slowdowns observed with OpenMP (up to 6.7x with one-dimensional arrays on CPU). We believe further optimizations, particularly CPU-focused SIMD vectorization, could improve the performance of the Kokkos variant. Similarly, in PIC, Kokkos achieved performance portability measurements that never fell below 96 %times96percent96\text{\,}\mathrm{\char 37\relax}start_ARG 96 end_ARG start_ARG times end_ARG start_ARG % end_ARG and surpassed OpenMP’s in two out of three platform sets. While Kokkos may experience some slowdowns compared to optimized code, its portability benefits outweigh this drawback in many cases.

In Section 2, we discussed the established notion that the best programming model-architecture combination is application-specific. This is evident in our findings, where we observed better efficiencies with OpenMP on some architectures and with Kokkos on others across the two simulations we studied.

The initial BS-SOLCTRA GPU version relied on one-dimensional arrays for all architectures, which limited its portability. This limitation is effectively addressed by Kokkos’ polymorphic data structures, which automatically and transparently adapt the data layout to match the optimal memory access patterns for each architecture. That turned out to be a major portability advantage of the Kokkos approach over OpenMP, as it allows us maintaining a single source BS-SOLCTRA which performs well on both, CPU and GPU platforms, with an acceptable performance penalty. In a different scenario which does not require different data layouts, OpenMP is a very competitive alternative.

5 Final Remarks

In this paper, we presented performance portability measurements for a field-line tracer and a particle-in-cell code, both relevant for plasma physics simulations. We used OpenMP and Kokkos variants of those two codes on CPUs from Intel, and GPUs from NVIDIA and AMD. Our Kokkos implementations achieved overall better performance portability (in terms of the formal definition in Section 2) than OpenMP, never falling below 96 %times96percent96\text{\,}\mathrm{\char 37\relax}start_ARG 96 end_ARG start_ARG times end_ARG start_ARG % end_ARG, as opposed to OpenMP, which obtained a performance portability of 37 %times37percent37\text{\,}\mathrm{\char 37\relax}start_ARG 37 end_ARG start_ARG times end_ARG start_ARG % end_ARG in its worst case. Neither the Kokkos nor the OpenMP variants suffered from significant performance drops when transitioning between processing units (GPUs or CPUs) of the same kind, e.g. from a MI210 GPU to A100 GPU. Nevertheless, Kokkos did seem to make better use of the resources of the newer MI210, MI300A and H100 GPUs, as it achieved similar performance compared to the baseline OpenMP variant of the code with the MI210 and even outperformed the OpenMP variant on the MI300A and H100 GPUs. This contrasts with the results obtained with the older V100 and A100 GPUs on which OpenMP outperformed Kokkos.

Overall, our study adds to the experience mounting in the HPC community that Kokkos and OpenMP (among others) are mature and powerful solutions for achieving performance portability of non-trivial scientific application codes (written in C++) across a broadening range of relevant HPC platforms, and - notably - they are also standing the test of time over multiple hardware generations. However, such studies like this by nature cannot fully address the important question “which one to choose?”, developers of large-scale scientific application codes are faced with. While the application-specific requirements on the portability framework concerning the expressiveness, and flexibility to implement the relevant parallel algorithms, ease-of use, and maturity can be usually assessed a priori, the strategic question about continuing support for future HPC platforms, and thus sustainability, remains somewhat open. Based on our experiences we are inclined to slightly favour open frameworks like Kokkos, as they can ultimately rely on the platform-native programming model for their backends, which in turn can be expected to receive earliest and highest level of support by the vendors. In addition, for large code bases, the actual “application efficiency” cannot be judged a posteriori, as there is no option to implement multiple variants (including platform-native programming models) and to compare relative performances as done here and elsewhere. Instead, comprehensive performance modelling is required to relate the achieved performance with the capabilities of the hardware platform which we consider an important focus for future performance-portability studies.

Acknowledgments. This research was partially funded by the Max Planck Society (MPG) - Costa Rica National Council of Rectors (CONARE) joint research projects framework, and was supported by machine allocations at the Costa Rica National High Technology Center (Kabré), and at the Max Planck Computing and Data Facility (Raven). We are grateful to AMD and Nvidia for providing access and consulting for their benchmark systems.

References

  • [1] Augonnet, C., Thibault, S., Namyst, R., Wacrenier, P.A.: StarPU: a unified platform for task scheduling on heterogeneous multicore architectures. Concurrency and Computation: Practice and Experience 23(2), 187–198 (2011). https://doi.org/10.1002/cpe.1631, https://inria.hal.science/inria-00550877
  • [2] Beckingsale, D.A., Burmark, J., Hornung, R., Jones, H., Killian, W., Kunen, A.J., Pearce, O., Robinson, P., Ryujin, B.S., Scogland, T.R.: Raja: Portable performance for large-scale scientific applications. In: 2019 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC). pp. 71–81 (2019). https://doi.org/10.1109/P3HPC49587.2019.00012
  • [3] Deakin, T., McIntosh-Smith, S., Price, J., Poenaru, A., Atkinson, P., Popa, C., Salmon, J.: Performance portability across diverse computer architectures. In: 2019 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC). pp. 1–13 (2019). https://doi.org/10.1109/P3HPC49587.2019.00006
  • [4] Deakin, T., McIntosh-Smith, S., Price, J., Poenaru, A., Atkinson, P., Popa, C., Salmon, J.: Performance portability across diverse computer architectures. In: 2019 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC). pp. 1–13 (2019)
  • [5] Deakin, T., Poenaru, A., Lin, T., McIntosh-Smith, S.: Tracking performance portability on the yellow brick road to exascale. In: 2020 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC). IEEE (2020). https://doi.org/10.1109/p3hpc51967.2020.00006
  • [6] Edwards, H.C., Trott, C.R., Sunderland, D.: Kokkos: Enabling manycore performance portability through polymorphic memory access patterns. Journal of Parallel and Distributed Computing 74(12), 3202 – 3216 (2014), domain-Specific Languages and High-Level Frameworks for High-Performance Computing
  • [7] Fleisch, D.: A Student’s Guide to Maxwell’s Equations. Student’s Guides, Cambridge University Press (2008)
  • [8] Freidberg, J.P.: Plasma Physics and Fusion Energy. Cambridge University Press, Cambridge, England (2007)
  • [9] Gayatri, R., Yang, C., Kurth, T., Deslippe, J.: A Case Study for Performance Portability Using OpenMP 4.5. In: Chandrasekaran, S., Juckeland, G., Wienke, S. (eds.) Accelerator Programming Using Directives. pp. 75–95. Lecture Notes in Computer Science, Springer International Publishing, Cham (2019)
  • [10] Georganas, E., Wijngaart, R.F.V.D., Mattson, T.G.: Design and implementation of a parallel research kernel for assessing dynamic load-balancing capabilities. In: International Parallel and Distributed Processing Symposium (IPDPS). IEEE (2016)
  • [11] Harrell, S.L., Kitson, J., Bird, R., Pennycook, S.J., Sewall, J., Jacobsen, D., Asanza, D.N., Hsu, A., Carrillo, H.C., Kim, H., Robey, R.: Effective performance portability. In: 2018 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC). pp. 24–36 (2018). https://doi.org/10.1109/P3HPC.2018.00006
  • [12] Jackson, J.D.: Classical electrodynamics. Wiley, New York, NY, 3rd ed. edn. (1999), https://cdsweb.cern.ch/record/490457
  • [13] Jiménez, D., Campos-Duarte, L., Solano-Piedra, R., Araya-Solano, L.A., Meneses, E., Vargas, I.: Bs-solctra: Towards a parallel magnetic plasma confinement simulation framework for modular stellarator devices. In: Crespo-Mariño, J.L., Meneses-Rojas, E. (eds.) High Performance Computing. pp. 33–48. Springer International Publishing, Cham (2020)
  • [14] Jiménez, D., Herrera-Mora, J., Rampp, M., Laure, E., Meneses, E.: Implementing a gpu-portable field line tracing application with openmp offload. In: Navaux, P., Barrios H., C.J., Osthoff, C., Guerrero, G. (eds.) High Performance Computing. pp. 31–46. Springer International Publishing, Cham (2022)
  • [15] Kwack, J., Tramm, J., Bertoni, C., Ghadar, Y., Homerding, B., Rangel, E., Knight, C., Parker, S.: Evaluation of performance portability of applications and mini-apps across amd, intel and nvidia gpus. In: 2021 International Workshop on Performance, Portability and Productivity in HPC (P3HPC). pp. 45–56 (2021). https://doi.org/10.1109/P3HPC54578.2021.00008
  • [16] Microway, Inc.: Detailed specifications of the ice lake-sp intel xeon processor scalable family cpus (2024), https://www.microway.com/knowledge-center-articles/detailed-specifications-of-the-ice-lake-sp-intel-xeon-processor-scalable-family-cpus-2/, accessed: 2024-08-23
  • [17] Miloch, W.: Plasma physics and numerical simulations (2014)
  • [18] Muralikrishnan, S., Frey, M., Vinciguerra, A., Ligotino, M., Cerfon, A.J., Stoyanov, M., Gayatri, R., Adelmann, A.: Scaling and performance portability of the particle-in-cell scheme for plasma physics applications through mini-apps targeting exascale architectures (2022)
  • [19] Muralikrishnan, S., Frey, M., Vinciguerra, A., Ligotino, M., Cerfon, A.J., Stoyanov, M., Gayatri, R., Adelmann, A.: Scaling and performance portability of the particle-in-cell scheme for plasma physics applications through mini-apps targeting exascale architectures (2022)
  • [20] Pennycook, S.J., Sewall, J.D., Lee, V.W.: A metric for performance portability (2016)
  • [21] Picot, W.: Magnetic fusion confinement with tokamaks and stellarators. Organismo Internacional de Energía Atómica Boletin 62(2),  6–7 (2021)
  • [22] Reyes, R., Lomüller, V.: Sycl: Single-source c++ accelerator programming. In: International Conference on Parallel Computing (2015), https://api.semanticscholar.org/CorpusID:6979118
  • [23] National Technology & Engineering Solutions of Sandia, L.N.: Kokkos documentation. https://kokkos.github.io/kokkos-core-wiki/API/core (2014)
  • [24] Sedova, A., Eblen, J.D., Budiardja, R., Tharrington, A., Smith, J.C.: High-performance molecular dynamics simulation for biological and materials sciences: Challenges of performance portability. In: 2018 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC). pp. 1–13 (2018). https://doi.org/10.1109/P3HPC.2018.00004
  • [25] Trott, C.R., Lebrun-Grandié, D., Arndt, D., Ciesko, J., Dang, V., Ellingwood, N., Gayatri, R., Harvey, E., Hollman, D.S., Ibanez, D., Liber, N., Madsen, J., Miles, J., Poliakoff, D., Powell, A., Rajamanickam, S., Simberg, M., Sunderland, D., Turcksin, B., Wilke, J.: Kokkos 3: Programming model extensions for the exascale era. IEEE Transactions on Parallel and Distributed Systems 33(4), 805–817 (2022)
  • [26] der Wijngaart, R.F.V., Mattson, T.G.: The parallel research kernels. In: 2014 IEEE High Performance Extreme Computing Conference (HPEC). IEEE (2014)
  • [27] Wright, S.A., Ridgers, C.P., Mudalige, G.R., Lantra, Z., Williams, J., Sunderland, A., Thorne, H.S., Arter, W.: Developing performance portable plasma edge simulations: A survey. Computer Physics Communications 298, 109123 (2024). https://doi.org/https://doi.org/10.1016/j.cpc.2024.109123
  • [28] Zenker, E., Worpitz, B., Widera, R., Huebl, A., Juckeland, G., Knüpfer, A., Nagel, W., Bussmann, M.: Alpaka - an abstraction library for parallel kernel acceleration (05 2016). https://doi.org/10.1109/IPDPSW.2016.50