¹¹institutetext: National Advanced Computing Laboratory, National High Technology Center, San José, Costa Rica
¹¹email: {jruzicka, casch, emeneses}@cenat.ac.cr
²²institutetext: School of Computing, Costa Rica Institute of Technology, Costa Rica
³³institutetext: Max Planck Computing and Data Facility, Max Planck Society, Garching, Germany
³³email: {markus.rampp, erwin.laure}@mpcdf.mpg.de

A Study of Performance Portability in Plasma Physics Simulations

Josef Ruzicka 1122 0009-0003-4423-2612 Christian Asch 11 0000-0002-3111-4858 Esteban Meneses 1122 0000-0002-4307-6000 Markus Rampp 33 0000-0001-8177-8698 Erwin Laure 33 0000-0002-9901-9857

Abstract

The high-performance computing (HPC) community has recently seen a substantial diversification of hardware platforms and their associated programming models. From traditional multicore processors to highly specialized accelerators, vendors and tool developers back up the relentless progress of those architectures. In the context of scientific programming, it is fundamental to consider performance portability frameworks, i.e., software tools that allow programmers to write code once and run it on different computer architectures without sacrificing performance. We report here on the benefits and challenges of performance portability using a field-line tracing simulation and a particle-in-cell code, two relevant applications in computational plasma physics with applications to magnetically-confined nuclear-fusion energy research. For these applications we report performance results obtained on four HPC platforms with server-class CPUs from Intel (Xeon) and AMD (EPYC), and high-end GPUs from Nvidia and AMD, including the latest Nvidia H100 GPU and the novel AMD Instinct MI300A APU. Our results show that both Kokkos and OpenMP are powerful tools to achieve performance portability and decent “out-of-the-box” performance, even for the very latest hardware platforms. For our applications, Kokkos provided performance portability to the broadest range of hardware architectures from different vendors.

Keywords:

Parallel Programming Performance Portability Plasma Physics.

1 Introduction

As the energetic and environmental needs of humanity call for a turning away from fossil fuels, clean and renewable energy sources with little effect on global warming, such as nuclear fusion, receive increasing attention in research and society [8]. Fusion energy research, in particular, is well known to require vast amounts of computational power for modeling the complex plasma-physics phenomena spanning large ranges of spatial and temporal scales. Due to the long-term efforts invested in developing the simulation codes, it is highly desirable to write them in a performance-portable way, i.e., allowing the reuse of the same source code on different high-performance computing (HPC) systems and without sacrificing application performance.

Commonly, the “MPI+X” programming paradigm is adopted, which employs the Message Passing Standard (MPI) for handling parallelism across nodes, complemented by some other model (X) for intra-node parallelism (including accelerators), which is typically relying on some sort of shared memory. One of the most relevant models in this respect is OpenMP, a directive-based approach to parallelism that also supports offloading computation to accelerators while maintaining compatibility for shared memory parallelism in the CPU [9], enabling portability across computing architectures. It is a mature standard that requires a compiler capable of interpreting the directives and also provides additional functionality through a library API, allowing deeper optimizations. Another framework that focuses on performance portability is Kokkos [25], which is implemented as a library and enables portability of C++ codes between CPUs and GPUs. It allows to use the same data structures for different devices (e.g. CPU or GPU) by providing a polymorphic data layout that adjusts to the device that is being used, thereby, for example, addressing the often-encountered design decision of array of structures (AoS) versus structure of arrays (SoA), none of which leading to a memory-access pattern that is well-suited for different kinds of device at the same time. These polymorphic data structures are called Views and are described as “a potentially reference counted multidimensional array with compile-time layouts and memory space”. In other words, they account for blocked or coalesced data access patterns based on the specified memory and execution spaces defined for multicore or GPU systems, respectively [23, 6].

In this paper, we evaluate the Kokkos framework using two different plasma-physics simulation codes, a particle-in-cell (PIC) code and a field-line tracer code. We compare the computational performance of the Kokkos code variants with the corresponding baseline implementations on four HPC systems using two different CPU models and five different GPU models.

Previously, our team developed BS-SOLCTRA [13], an OpenMP code for multicore systems for tracing field-lines that is used for simulating different stellarator coil configurations. The data structures for representing the particles were based on the AoS approach. This work was later continued by porting the code to GPUs, still using OpenMP. However, for performance reasons, this required transforming the data layout to one-dimensional arrays, thereby compromising portability (in the sense of a single source code) between architectures, as the new one-dimensional array variant for the GPU showed the expected underperformance on the CPU. This experience specifically motivated us to consider Kokkos with its polymorphic data structures and its promise of maintaining a single-source code by transparently accounting for the most appropriate memory-access pattern on both CPUs and GPUs [14]. Our assessment is complemented and broadened by considering, in addition, a simulation code based on the particle-in-cell (PIC) method, which is one of the most commonly used methods in computational plasma physics.

The contributions of this paper are two-fold. First, we develop performance-portable variants of our field-line tracer code, BS-SOLCTRA, and a representative PIC code. Second, we offer a comparative evaluation of Kokkos and OpenMP for the different variants of the code.

The remainder of this paper is structured as follows: Section 2 describes the numerical methods used in our plasma physics simulations, provides the definitions used for performance portability, and lists some related work on performance portability and plasma physics simulations. In Section 3, we describe our baseline implementations of the two codes and the way we adapt them to Kokkos. Section 4 provides details on the setup, and the results of our performance-evaluation. Finally, Section 5 summarizes insights and key findings of our work.

2 Background

2.1 Basics of the Numerical Models

Refer to caption — (a) Field-line tracer

Fusion energy is generated in form of the released binding energy when light atomic nuclei (e.g. Hydrogen) fuse to form a heavier nucleus (e.g. Helium). This process is controlled in thermonuclear fusion devices such as stellarators, which consist of a helically symmetric toroidal tube encircled by complex solenoidal coils that produce a strong magnetic field which confines the plasma within the vessel. The design, assembly, and actual experimental testing of a stellarator is technologically complex and expensive to carry out. Therefore, different scenarios and configurations are tested and optimized in simulations before they are actually manufactured [17, 21].

Field-line tracer. Magnetic fields can be represented and traced as a series of field-lines that exert a Lorentz force on charged particles in a plasma (visualized in Figure 1(a)), to analyze and gain insights into how the magnetic field influences their trajectories. This is usually done by approximating the differential equations that describe the behavior of the magnetic field in the simulated space. This simulation is particularly useful for studying spaces like those used in plasma confinement, such as stellarators and tokamaks.

Particle-in-cell. In this numerical method (cf. Figure 1(b)), plasma is represented as a collection of charged “macroparticles” of different kinds (typically ions and electrons) within a grid structure. The grid is used to numerically solve Maxwell’s equations [12], given the electric charge and current densities obtained by interpolation from the positions of the particles. The latter, in turn, are evolved in time by the acceleration computed from the (interpolated) electromagnetic forces, as obtained from the solution of the Maxwell equations [12, 7, 10, 17].

2.2 Performance Portability

The performance of an application can be measured by different metrics, for example, by the time-to-solution, the number of floating point operations per second (FLOP/s), or energy consumption, among others. In this paper, we adopt the definitions for performance, portability, and performance portability provided by Pennycook et al. [20]. Performance is “any measurable property of an application’s correct execution of a problem on a platform.” Portability is “the ability of an application to execute a problem correctly on a given set of platforms”. Finally, performance portability is “a measurement of an application’s performance efficiency for a given problem that can be executed correctly on all platforms in a given set”. For our different experiment configurations (platform, simulation, and programming model combination), we shall measure efficiency as their application efficiency, which is defined as the “achieved performance as a fraction of their best observed performance”, for example, the best achieved time-to-solution on a certain platform and simulation combination. Additionally, we use their proposed metric of performance portability \textipa\textqplig, which is given by the harmonic mean of $i$ experiment configurations’ performance efficiency $e$ observed across a set of platforms $H$ (in our case, HPC systems) when running our applications $a$ (for this study, a programming model) that solve a problem $p$ (one of the simulations methods). Note that if a certain experiment configuration is not available for a certain platform in $H$ , performance portability is nil. This is given by Equation 1.

\text{\textipa{\textqplig}}(a,p,H)=\begin{cases}\frac{|H|}{\sum_{i\in H}\frac{% 1}{e_{i}(a,p)}}&\text{if }i\text{ is supported }\forall\ i\in H\\ 0&\text{otherwise}\end{cases}

(1)

Different libraries or programming models have been developed to program performance portable code. We chose Kokkos to adapt our simulations because of the advantages it provides as a popular, well-documented, well-maintained, high-level parallel programming library meant for shared-memory systems.

2.3 Related Work

A number of previous studies in performance portability have employed the metric we defined in Equation 1 as a standard [3, 15, 5, 11, 24, 4, 18, 27]. Deakin et al. [3] obtained their best performance portability results using OpenMP and Kokkos in one of the largest studies of performance portability across different applications, architectures and programming models, on which they note that some overhead should be expected given that Kokkos’ CPU abstraction is implemented using OpenMP. Deakin et al. [5] also report their best performance portability results when using Kokkos and OpenMP over various applications and architectures. Kwack et al. [4] employed a variety of performance portability metrics to measure various programming models on different architectures across different applications. Their findings highlight that each model has distinct strengths and the optimal choice of model varies depending on the specific application. Muralikrishnan et al. [19] investigated performance portability and scalability in a series of mini-apps related to the PIC scheme, on which Kokkos structures were used with OpenMP and MPI.

Similar observations are noted by Wright et al. [27] in their performance portability survey of plasma physics simulations. The findings from these studies collectively indicate that both our chosen performance portability metric and programming model (out of many more options, e.g. RAJA [2], SYCL [22], Alpaka [28], StarPU [1], among others) are reasonable and timely choices.

3 Code Implementation

3.1 Baseline Implementation

Field-line Tracer. The Biot-Savart Solver for Computing and Tracing Magnetic Field-Lines (BS-SOLCTRA) [13] has been used to study plasma confinement in fusion devices. The simulation applies the field-line tracing technique in a 3D vacuum magnetic field and uses Biot-Savart’s law to test distinct modular electromagnetic coil configurations.

The movement of each line, represented by a computational particle, is obtained by updating its current coordinates in each iteration of the simulation by computing the effect of the magnetic field that the set of modular coils generates, ignoring any impact by the interaction between particles. The integration employs a fourth-order Runge-Kutta method. This code was chosen due to its familiarity to the collaborating laboratories and its significance in their research.

The baseline code employed OpenMP to offload computations to the GPU, enabling the calculation of particle positions in parallel through shared memory. This code stores particles as one-dimensional arrays to better fit the GPU architectures, having been previously adapted from a CPU-only implementation that stored particles as an AoS and used SIMD-vectorization to optimize performance. A straightforward MPI implementation let us study the simulation’s behaviour when distributing the particles between different computer nodes and GPUs. The performance of BS-SOLCTRA is measured in terms of time-to-solution, with lower values indicating higher application performance.

Particle-in-cell. The Parallel Research Kernels (PRK) [26] is a suite of computational kernels related to linear algebra, stencil computations, graph algorithms, and particle simulations, among others. PRKs are designed to be tested in an exploratory manner with different HPC programming models, such as Kokkos, and OpenMP. One kernel of particular interest to this study is a PIC code that uses Coulomb’s law to calculate the forces exerted on each particle by the electromagnetic fields generated by the grid charges of its current cell. To induce load imbalance and control the evolution of the particles’ distribution, they can be initialized using four different patterns.

The baseline code uses OpenMP on CPU, and we used it here to incorporate Kokkos. The computational performance of PIC is conventionally measured as the number of particles moved per second, a metric which we adopt in the following, with larger values indicating higher application performance.

Note that the adaptations of the two codes employed in this study were developed in opposite directions, i.e., BS-SOLCTRA was ported from GPU to CPU, and the other way around with PIC. ¹¹1The code variants used can be found in this version control repository: https://gitlab.com/CNCA_CeNAT/MPCDF/kokkos-bs-solctra.

3.2 Kokkos Adaptation

Integration of Kokkos into the base codes required converting the data structures used to store the information about the particles, coils, among others, from C-style arrays of elements to Kokkos views. The conversion involved updating the source and header files to accommodate the new data structures, as certain functions in the modularized source code files relied on passing these data structures as parameters. Listing 1 exemplifies how data is handled in the different variants, and Listing 2 shows how the computation kernels were parallelized.

Listing 1: Data Management.

⬇

// OpenMP

// Memory allocation

particles = static_cast<double*>(aligned_alloc(alignment_size, sizeof(double)*particles_size));

// Initialize data normally...

// Device data transfer

#pragma omp target enter data map(to:particles[0:particles_size])

// --------

// Kokkos

// Memory space setting and memory allocation

#ifdef KOKKOS_ENABLE_CUDA

#define MemSpace Kokkos::CudaSpace

#endif

#ifdef KOKKOS_ENABLE_HIP

#define MemSpace Kokkos::HIPSpace

#endif

#ifndef MemSpace

#define MemSpace Kokkos::HostSpace

#endif

typedef Kokkos::View<double*, MemSpace> ViewVectorType;

ViewVectorType particles("particles", particles_size);

ViewVectorType::HostMirror h_particles = Kokkos::create_mirror_view(particles);

// Initialize data normally...

// Device data transfer

Kokkos::deep_copy(particles, h_particles);

Listing 2: Kernel Parallelization.

⬇

// OpenMP

#pragma omp parallel num_threads(2)

{

#pragma omp single

{

for (int i = 1; i <= steps; i++){

#pragma omp target teams distribute parallel for

for(int p=0; p < particle_count ; p++){

// simulation computation...

}}}}

// --------

// Kokkos

// Execution space and range policy setting

using ExecSpace = MemSpace::execution_space;

using Kokkos::RangePolicy<ExecSpace> range_policy;

Kokkos::parallel_for("run_particles_parallel_reduce", range_policy(0,particle_count), KOKKOS_LAMBDA (int p)

{

// simulation computation...

});

In the PIC OpenMP code, the particles are defined as structs and then dynamically allocated in an array using pointer arithmetic, while the grid structure is created using traditional malloc calls. To integrate Kokkos, these data structures were adapted by keeping the struct definition and storing it within a Kokkos view, and replacing the malloc with a Kokkos view allocation. Finally, keywords like KOKKOS_ARCH = "Ampere80" are used so that Kokkos can hint the compiler to optimize the code for a specific target architecture.

4 Experiments

4.1 Setup

The simulations for this study were conducted on four different HPC systems: Kabré at Costa Rica’s National High Technology Center (CeNAT), Raven at the Max Planck Computing and Data Facility (MPCDF), and benchmark systems at AMD and Nvidia, respectively. The specifications of the machines and the corresponding software configuration can be found in Table 1 and Table 2, respectively. Note that our GPU codes do not make use of the novel “unified” memory architecture connecting the GPU and the CPU parts of the latest Nvidia GH200 and AMD MI300A chips, i.e., for our benchmarks, only the GPU parts of these platforms are relevant.

Table 1: HPC Systems: hardware configuration

Machine

Partition

Nodes

CPU

Cores

per node

SMT

GPU

RAM (GB)

per node

Theoretical

FP64 peak TFlop/s

per node

Kabré

CPU

Intel Xeon

Gold 6354

yes

512

1.9*

GPU

Intel Xeon

Silver 4214R

NVIDIA V100

Raven

CPU

1592

Intel Xeon

Platinum 8360Y

yes

256

GPU

192

Intel Xeon

Platinum 8360Y

yes

NVIDIA

A100-SXM4-40GB

512

9.7

AMD Accelerator Cloud

GPU

2x AMD

EPYC 7F52

2x 16

AMD MI210

512

22.6

GPU

AMD Zen 4

yes

AMD MI300A

(A0 stepping)

128

81.7

NVIDIA GH200

GPU

NVIDIA Grace

NVIDIA H100

480

* Reported by Microway [16]

Table 2: HPC Systems: software configuration

Machine	OS	Compiler	Kokkos	MPI
Kabré	CentOS 7	GCC 9.2 + CUDA 11.6.2	4.3.0-RC	-
Raven	SUSE Linux Enterprise Server 15-SP3	NVIDIA HPC-SDK 22	4.3.0-RC	Open MPI 4
AMD Accelerator Cloud	Ubuntu 22.04.2 LTS	ROCm 6.1.0	4.3.0-RC	-
NVIDIA GH200	Rocky Linux 9.3	NVIDIA HPC-SDK 24.3	4.3.0-RC	-

Computational performance for BS-SOLCTRA was measured for three different problem sizes over 1,000 iterations of the simulation, using the same setup as in our previous study [14]. For PIC, four particle initialization patterns were used to induce different load imbalance scenarios. The problem sizes and the initialization patterns are stated in Table 3. The performance metric is time-to-solution in seconds (lower is better) for BS-SOLCTRA, and the rate of particles moved per second (higher is better) for PIC. For each measurement we report the arithmetic mean of 10 executions, with no notable run-to-run variations.

Table 3: Simulation setup

Simulation

Architecture

Processing unit

Code variant

Problem size

Initialization pattern

BS-SOLCTRA

CPU

Xeon Platinum 8360Y

Kokkos, OpenMP

102400, 512000, 1024000 particles

Random

GPU

V100, A100, MI210, MI300A, H100

GPU + MPI

A100*

PIC

CPU

Xeon Gold 6354,

Xeon Platinum 8360Y

1000000 particles

Geometric, Linear, Patch, Sinusoidal

GPU

V100, A100

* For our multi-node scaling experiments, 2, 4, 8 & 16 A100 GPUs were used

4.2 Field-line Tracer

GPU results. The Kokkos adaptation of BS-SOLCTRA code resulted in a performance drop for the older V100 and A100 GPUs compared to the original OpenMP-based code variant. On the MI210 GPU, Kokkos performed as well as OpenMP. However, for the very latest generation of H100 GPU and MI300A APU, OpenMP shows an apparent underperformance when compared with the Kokkos results (cf. Figure 2(b)). This may hint at OpenMP compilers (contrary to the platform-native CUDA and ROCM backends employed by Kokkos) being not yet fully tuned for this very latest generation of high-end GPUs. For the MI300A, in addition, all our measured performance numbers have to be interpreted with some caution, as we had only access to a pre-production sample “A0 stepping”) which does not yet meet all the final specifications. At the same time it is remarkable that Kokkos delivers very decent out-of-the-box performance for hardware that was introduced only very recently. Not too unsurprisingly, and consistent with findings reported in the literature cited in Section 2, there is no single combination of programming model and computer architecture that is always optimal for all applications.

When comparing BS-SOLCTRA’s code variants across GPU architectures, the H100 and MI300A GPUs demonstrated a clear performance hierarchy, outperforming the previous-generation A100 and MI210, respectively. In turn, the A100 and MI210 outperformed the older V100 GPU. These results are broadly consistent across code variants and all GPU architectures (except for the case on which the A100 performed better than the MI210 with OpenMP, while they achieved similar performance with Kokkos), and reflect the compute-bound nature of our application in relation to the peak Tflop/s ratings of the different GPUs (cf. Table 1).

The different setups resulted in a near-linear relation of problem size with time-to-solution on the various GPUs tested. The smaller problem sizes likely failed to fully saturate the more powerful GPUs, which explains the comparably better performance of the V100 for the smallest problem size.

We used the metric in Equation 1 to measure the performance portability of our simulation. For each application (code variant), we selected the best performance achieved on each architecture for the largest problem size. We then identified the overall best performance across both applications to calculate application efficiencies shown in Table 4. Considering all the GPUs used, the measured performance portability (Table 6) was $96\text{\,}\mathrm{\char 37\relax}$ for our Kokkos implementation, and $78\text{\,}\mathrm{\char 37\relax}$ for OpenMP.

Table 4: BS-SOLCTRA application efficiency for the largest problem size (1024000 particles)

Platform	Code variant	Time-to-solution	Application efficiency
V100	OpenMP	270.39	$100\text{\,}\mathrm{\char 37\relax}$
V100	Kokkos	300.08	$90\text{\,}\mathrm{\char 37\relax}$
A100	OpenMP	216.05	$100\text{\,}\mathrm{\char 37\relax}$
A100	Kokkos	242.91	$89\text{\,}\mathrm{\char 37\relax}$
MI210	OpenMP	241.53	$99\text{\,}\mathrm{\char 37\relax}$
MI210	Kokkos	241.35	$100\text{\,}\mathrm{\char 37\relax}$
MI300A	OpenMP	166.52	$61\text{\,}\mathrm{\char 37\relax}$
MI300A	Kokkos	102.09	$100\text{\,}\mathrm{\char 37\relax}$
H100	OpenMP	172.87	$57\text{\,}\mathrm{\char 37\relax}$
H100	Kokkos	98.95	$100\text{\,}\mathrm{\char 37\relax}$
Xeon Platinum 8630Y	OpenMP	7763.10	$37\text{\,}\mathrm{\char 37\relax}$
Xeon Platinum 8630Y	Kokkos	2889.40	$100\text{\,}\mathrm{\char 37\relax}$

CPU results. Using an Intel Xeon Platinum 8360Y CPU node, we compare our Kokkos adaptation of BS-SOLCTRA with the one-dimensional array and AoS OpenMP variants.²²2In our prior work [14], porting the OpenMP CPU code variant to the GPU favored one-dimensional arrays for performance, while AoS remained preferable for the CPU. For this reason, our AoS OpenMP code variant was not ported to the GPU. The AoS code variant relies on (guided) autovectorization by the compiler, i.e. we apply the #pragma omp simd directive, together with the relevant compiler flags (-mavx512f -mavx512pf -mavx512er -mavx512cd) in order to enable AVX-512 instructions for the Intel Xeon CPUs. The impact of using the different code variants can be observed in Figure 2(a).

Our Kokkos implementation achieved a 2.7x speedup compared to the sub-optimal OpenMP one-dimensional array variant, demonstrating portability benefits. However, it fell short of the performance of the AoS OpenMP variant. While Kokkos exhibited a 2.5x slowdown relative to AoS OpenMP, this is significantly less severe than the 6.7x slowdown observed with one-dimensional array OpenMP. When included in the set of platforms with the GPUs previously used, Kokkos achieved an application efficiency of $100\text{\,}\mathrm{\char 37\relax}$ and a performance portability of $96\text{\,}\mathrm{\char 37\relax}$ . By contrast, the one-dimensional array OpenMP variant’s application efficiency of $37\text{\,}\mathrm{\char 37\relax}$ resulted in a performance portability of $66\text{\,}\mathrm{\char 37\relax}$ . This comparison demonstrates Kokkos’s ability to deliver more consistent performance across diverse architectures. ³³3In this measurement, we intentionally used our sub-optimal one-dimensional array OpenMP variant for consistency with the code employed in our GPU runs. The CPU optimized SoA code variant was included to provide a perspective of the impact in loss of performance in the architecture for which the code was originally optimized by programming performance portable code.

Multi-node scalability. For completeness, Figure 3(a) presents strong and weak scaling results on up to 16 A100 GPUs (4 Raven nodes). We note that the code does not involve particle interactions which would introduce communication or synchronization across GPUs and hence ideal parallel scaling should be expected. While this is experimentally confirmed for the weak scaling (Figure 3(b)), the strong scaling speedups from 1 to 16 GPUs are only 12.6 (Kokkos) and 12.1 (OpenMP). This can be explained by a progressive under-utilization of the individual GPUs as the workload per GPU decreases with increasing number of GPUs. Such under-saturation can also explain the performance difference between the Kokkos and OpenMP variants narrowing as more GPUs are used. Overall, measured multi-node performances are consistent with the single-GPU findings discussed earlier, and no unexpected interference of the intra-node performance with inter-node MPI parallelization is found.

4.3 Particle-in-cell

CPU results. Figure 4(a) shows that, across all initialization patterns, our Kokkos adaptation outperformed the OpenMP variant in terms of particle movement rates. On Kabré’s Intel Xeon Gold 6354 CPU, Kokkos achieved a particle movement rate 1.4 times faster than OpenMP. This advantage increased to 1.6 times on Raven’s Intel Xeon Platinum 8360Y CPUs. These results highlight the effectiveness of Kokkos’ automatic memory allocation layout and its data access patterns optimizations, overcoming the need to implement architecture-specific data structures. This performance difference likely stems from the OpenMP variant storing the simulation grid in a one-dimensional array format, and considering the nature of GPU memory accessing patterns. Overall, the Xeon Platinum CPU reached particle movement rates about 1.7 times higher than the Xeon Gold for both code variants.

As the performance difference between configurations remained consistent across load-balancing scenarios, we arbitrarily chose the geometric initialization results to measure the application’s performance portability. Based on the efficiencies in Table 5, Kokkos achieved $100\text{\,}\mathrm{\char 37\relax}$ performance portability, while OpenMP reached only $67\text{\,}\mathrm{\char 37\relax}$ (Table 6).

GPU Results. When running the simulation on a single V100 or A100 node, the particle movement rate, visualized in Figure 4(b), increased by up to 13 times with Kokkos compared to the results previously obtained with the CPU implementation. OpenMP achieved an even higher improvement, reaching up to a 20 times increase in particle movement rate. This is expected for a compute-bound code like PIC, given the significant difference in Tflop/s between CPUs and GPUs.

Table 5: PIC application efficiency for the geometric initialization pattern

Platform

Code variant

Particle

movement rate

Application

efficiency

V100

OpenMP

3665.62

96\text{\,}\mathrm{\char 37\relax}

Kokkos

3805.13

100\text{\,}\mathrm{\char 37\relax}

A100

OpenMP

8541.44

100\text{\,}\mathrm{\char 37\relax}

Kokkos

7867.13

92\text{\,}\mathrm{\char 37\relax}

Xeon Platinum 8630Y

OpenMP

612.49

64\text{\,}\mathrm{\char 37\relax}

Kokkos

960.67

100\text{\,}\mathrm{\char 37\relax}

Xeon Gold 6354

OpenMP

425.41

72\text{\,}\mathrm{\char 37\relax}

Kokkos

593.71

100\text{\,}\mathrm{\char 37\relax}

Table 6: Performance portability measurements

Simulation $p$	Platform set $H$	Code variant $a$	\textipa\textqplig $(a,p,H)$
BS-SOLCTRA	{V100, A100, MI210, MI300A, H100}	OpenMP	$78\text{\,}\mathrm{\char 37\relax}$
	{V100, A100, MI210, MI300A, H100}	Kokkos	$96\text{\,}\mathrm{\char 37\relax}$
	{Xeon Platinum 8630Y}	OpenMP	$37\text{\,}\mathrm{\char 37\relax}$
	{Xeon Platinum 8630Y}	Kokkos	$100\text{\,}\mathrm{\char 37\relax}$
	{V100, A100, MI210, MI300A, H100, Xeon Platinum 8630Y}	OpenMP	$66\text{\,}\mathrm{\char 37\relax}$
	{V100, A100, MI210, MI300A, H100, Xeon Platinum 8630Y}	Kokkos	$96\text{\,}\mathrm{\char 37\relax}$
PIC	{Xeon Platinum 8630Y, Xeon Gold 6354}	OpenMP	$67\text{\,}\mathrm{\char 37\relax}$
	{Xeon Platinum 8630Y, Xeon Gold 6354}	Kokkos	$100\text{\,}\mathrm{\char 37\relax}$
	{V100, A100}	OpenMP	$98\text{\,}\mathrm{\char 37\relax}$
	{V100, A100}	Kokkos	$96\text{\,}\mathrm{\char 37\relax}$
	{V100, A100, Xeon Platinum 8630Y, Xeon Gold 6354}	OpenMP	$80\text{\,}\mathrm{\char 37\relax}$
	{V100, A100, Xeon Platinum 8630Y, Xeon Gold 6354}	Kokkos	$98\text{\,}\mathrm{\char 37\relax}$

Our experiments revealed contrasting performance trends for the V100 and A100 GPUs. Regardless of the load balancing strategy, Kokkos outperformed OpenMP on the V100. However, the opposite was true for the A100, where OpenMP edged out Kokkos. The performance gap between Kokkos and OpenMP was narrower on both GPUs compared to the CPU versions (where Kokkos excelled). This suggests that Kokkos exhibits better performance portability across architectures. Overall, the A100 produced particle movement rates over 2 times higher than the V100, which is consistent with their Tflop/s difference.

The application efficiencies obtained for these runs were $100\text{\,}\mathrm{\char 37\relax}$ for Kokkos and $96\text{\,}\mathrm{\char 37\relax}$ for OpenMP with the V100, and $92\text{\,}\mathrm{\char 37\relax}$ for Kokkos and $100\text{\,}\mathrm{\char 37\relax}$ for OpenMP with the A100. The performance portability metric for the set of GPUs resulted in $96\text{\,}\mathrm{\char 37\relax}$ for Kokkos and $98\text{\,}\mathrm{\char 37\relax}$ for OpenMP. Integrating these GPU results with the set of CPUs used earlier, the performance portability metrics obtained were $98\text{\,}\mathrm{\char 37\relax}$ for our Kokkos code and $80\text{\,}\mathrm{\char 37\relax}$ for our OpenMP code.

4.4 Discussion

Our experiments demonstrate that Kokkos’ ability to adapt memory layouts across architectures significantly enhances performance portability. Kokkos achieved performance portability between $96\text{\,}\mathrm{\char 37\relax}$ and $100\text{\,}\mathrm{\char 37\relax}$ , with an average of $98\text{\,}\mathrm{\char 37\relax}$ . This significantly outperforms OpenMP, whose portability ranged from $37\text{\,}\mathrm{\char 37\relax}$ and $98\text{\,}\mathrm{\char 37\relax}$ , with an average of $71\text{\,}\mathrm{\char 37\relax}$ . While Kokkos achieved good performance portability in BS-SOLCTRA (never falling below $96\text{\,}\mathrm{\char 37\relax}$ and outperforming OpenMP’s maximum of $78\text{\,}\mathrm{\char 37\relax}$ ), it did introduce a slowdown of up to 2.5 times in some configurations compared to the original, optimized code (which required significant architecture-specific refactoring). This slowdown might be acceptable considering the larger slowdowns observed with OpenMP (up to 6.7x with one-dimensional arrays on CPU). We believe further optimizations, particularly CPU-focused SIMD vectorization, could improve the performance of the Kokkos variant. Similarly, in PIC, Kokkos achieved performance portability measurements that never fell below $96\text{\,}\mathrm{\char 37\relax}$ and surpassed OpenMP’s in two out of three platform sets. While Kokkos may experience some slowdowns compared to optimized code, its portability benefits outweigh this drawback in many cases.

In Section 2, we discussed the established notion that the best programming model-architecture combination is application-specific. This is evident in our findings, where we observed better efficiencies with OpenMP on some architectures and with Kokkos on others across the two simulations we studied.

The initial BS-SOLCTRA GPU version relied on one-dimensional arrays for all architectures, which limited its portability. This limitation is effectively addressed by Kokkos’ polymorphic data structures, which automatically and transparently adapt the data layout to match the optimal memory access patterns for each architecture. That turned out to be a major portability advantage of the Kokkos approach over OpenMP, as it allows us maintaining a single source BS-SOLCTRA which performs well on both, CPU and GPU platforms, with an acceptable performance penalty. In a different scenario which does not require different data layouts, OpenMP is a very competitive alternative.

5 Final Remarks

In this paper, we presented performance portability measurements for a field-line tracer and a particle-in-cell code, both relevant for plasma physics simulations. We used OpenMP and Kokkos variants of those two codes on CPUs from Intel, and GPUs from NVIDIA and AMD. Our Kokkos implementations achieved overall better performance portability (in terms of the formal definition in Section 2) than OpenMP, never falling below $96\text{\,}\mathrm{\char 37\relax}$ , as opposed to OpenMP, which obtained a performance portability of $37\text{\,}\mathrm{\char 37\relax}$ in its worst case. Neither the Kokkos nor the OpenMP variants suffered from significant performance drops when transitioning between processing units (GPUs or CPUs) of the same kind, e.g. from a MI210 GPU to A100 GPU. Nevertheless, Kokkos did seem to make better use of the resources of the newer MI210, MI300A and H100 GPUs, as it achieved similar performance compared to the baseline OpenMP variant of the code with the MI210 and even outperformed the OpenMP variant on the MI300A and H100 GPUs. This contrasts with the results obtained with the older V100 and A100 GPUs on which OpenMP outperformed Kokkos.

Overall, our study adds to the experience mounting in the HPC community that Kokkos and OpenMP (among others) are mature and powerful solutions for achieving performance portability of non-trivial scientific application codes (written in C++) across a broadening range of relevant HPC platforms, and - notably - they are also standing the test of time over multiple hardware generations. However, such studies like this by nature cannot fully address the important question “which one to choose?”, developers of large-scale scientific application codes are faced with. While the application-specific requirements on the portability framework concerning the expressiveness, and flexibility to implement the relevant parallel algorithms, ease-of use, and maturity can be usually assessed a priori, the strategic question about continuing support for future HPC platforms, and thus sustainability, remains somewhat open. Based on our experiences we are inclined to slightly favour open frameworks like Kokkos, as they can ultimately rely on the platform-native programming model for their backends, which in turn can be expected to receive earliest and highest level of support by the vendors. In addition, for large code bases, the actual “application efficiency” cannot be judged a posteriori, as there is no option to implement multiple variants (including platform-native programming models) and to compare relative performances as done here and elsewhere. Instead, comprehensive performance modelling is required to relate the achieved performance with the capabilities of the hardware platform which we consider an important focus for future performance-portability studies.

Acknowledgments. This research was partially funded by the Max Planck Society (MPG) - Costa Rica National Council of Rectors (CONARE) joint research projects framework, and was supported by machine allocations at the Costa Rica National High Technology Center (Kabré), and at the Max Planck Computing and Data Facility (Raven). We are grateful to AMD and Nvidia for providing access and consulting for their benchmark systems.

References

[1] Augonnet, C., Thibault, S., Namyst, R., Wacrenier, P.A.: StarPU: a unified platform for task scheduling on heterogeneous multicore architectures. Concurrency and Computation: Practice and Experience 23(2), 187–198 (2011). https://doi.org/10.1002/cpe.1631, https://inria.hal.science/inria-00550877
[2] Beckingsale, D.A., Burmark, J., Hornung, R., Jones, H., Killian, W., Kunen, A.J., Pearce, O., Robinson, P., Ryujin, B.S., Scogland, T.R.: Raja: Portable performance for large-scale scientific applications. In: 2019 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC). pp. 71–81 (2019). https://doi.org/10.1109/P3HPC49587.2019.00012
[3] Deakin, T., McIntosh-Smith, S., Price, J., Poenaru, A., Atkinson, P., Popa, C., Salmon, J.: Performance portability across diverse computer architectures. In: 2019 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC). pp. 1–13 (2019). https://doi.org/10.1109/P3HPC49587.2019.00006
[4] Deakin, T., McIntosh-Smith, S., Price, J., Poenaru, A., Atkinson, P., Popa, C., Salmon, J.: Performance portability across diverse computer architectures. In: 2019 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC). pp. 1–13 (2019)
[5] Deakin, T., Poenaru, A., Lin, T., McIntosh-Smith, S.: Tracking performance portability on the yellow brick road to exascale. In: 2020 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC). IEEE (2020). https://doi.org/10.1109/p3hpc51967.2020.00006
[6] Edwards, H.C., Trott, C.R., Sunderland, D.: Kokkos: Enabling manycore performance portability through polymorphic memory access patterns. Journal of Parallel and Distributed Computing 74(12), 3202 – 3216 (2014), domain-Specific Languages and High-Level Frameworks for High-Performance Computing
[7] Fleisch, D.: A Student’s Guide to Maxwell’s Equations. Student’s Guides, Cambridge University Press (2008)
[8] Freidberg, J.P.: Plasma Physics and Fusion Energy. Cambridge University Press, Cambridge, England (2007)
[9] Gayatri, R., Yang, C., Kurth, T., Deslippe, J.: A Case Study for Performance Portability Using OpenMP 4.5. In: Chandrasekaran, S., Juckeland, G., Wienke, S. (eds.) Accelerator Programming Using Directives. pp. 75–95. Lecture Notes in Computer Science, Springer International Publishing, Cham (2019)
[10] Georganas, E., Wijngaart, R.F.V.D., Mattson, T.G.: Design and implementation of a parallel research kernel for assessing dynamic load-balancing capabilities. In: International Parallel and Distributed Processing Symposium (IPDPS). IEEE (2016)
[11] Harrell, S.L., Kitson, J., Bird, R., Pennycook, S.J., Sewall, J., Jacobsen, D., Asanza, D.N., Hsu, A., Carrillo, H.C., Kim, H., Robey, R.: Effective performance portability. In: 2018 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC). pp. 24–36 (2018). https://doi.org/10.1109/P3HPC.2018.00006
[12] Jackson, J.D.: Classical electrodynamics. Wiley, New York, NY, 3rd ed. edn. (1999), https://cdsweb.cern.ch/record/490457
[13] Jiménez, D., Campos-Duarte, L., Solano-Piedra, R., Araya-Solano, L.A., Meneses, E., Vargas, I.: Bs-solctra: Towards a parallel magnetic plasma confinement simulation framework for modular stellarator devices. In: Crespo-Mariño, J.L., Meneses-Rojas, E. (eds.) High Performance Computing. pp. 33–48. Springer International Publishing, Cham (2020)
[14] Jiménez, D., Herrera-Mora, J., Rampp, M., Laure, E., Meneses, E.: Implementing a gpu-portable field line tracing application with openmp offload. In: Navaux, P., Barrios H., C.J., Osthoff, C., Guerrero, G. (eds.) High Performance Computing. pp. 31–46. Springer International Publishing, Cham (2022)
[15] Kwack, J., Tramm, J., Bertoni, C., Ghadar, Y., Homerding, B., Rangel, E., Knight, C., Parker, S.: Evaluation of performance portability of applications and mini-apps across amd, intel and nvidia gpus. In: 2021 International Workshop on Performance, Portability and Productivity in HPC (P3HPC). pp. 45–56 (2021). https://doi.org/10.1109/P3HPC54578.2021.00008
[16] Microway, Inc.: Detailed specifications of the ice lake-sp intel xeon processor scalable family cpus (2024), https://www.microway.com/knowledge-center-articles/detailed-specifications-of-the-ice-lake-sp-intel-xeon-processor-scalable-family-cpus-2/, accessed: 2024-08-23
[17] Miloch, W.: Plasma physics and numerical simulations (2014)
[18] Muralikrishnan, S., Frey, M., Vinciguerra, A., Ligotino, M., Cerfon, A.J., Stoyanov, M., Gayatri, R., Adelmann, A.: Scaling and performance portability of the particle-in-cell scheme for plasma physics applications through mini-apps targeting exascale architectures (2022)
[19] Muralikrishnan, S., Frey, M., Vinciguerra, A., Ligotino, M., Cerfon, A.J., Stoyanov, M., Gayatri, R., Adelmann, A.: Scaling and performance portability of the particle-in-cell scheme for plasma physics applications through mini-apps targeting exascale architectures (2022)
[20] Pennycook, S.J., Sewall, J.D., Lee, V.W.: A metric for performance portability (2016)
[21] Picot, W.: Magnetic fusion confinement with tokamaks and stellarators. Organismo Internacional de Energía Atómica Boletin 62(2), 6–7 (2021)
[22] Reyes, R., Lomüller, V.: Sycl: Single-source c++ accelerator programming. In: International Conference on Parallel Computing (2015), https://api.semanticscholar.org/CorpusID:6979118
[23] National Technology & Engineering Solutions of Sandia, L.N.: Kokkos documentation. https://kokkos.github.io/kokkos-core-wiki/API/core (2014)
[24] Sedova, A., Eblen, J.D., Budiardja, R., Tharrington, A., Smith, J.C.: High-performance molecular dynamics simulation for biological and materials sciences: Challenges of performance portability. In: 2018 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC). pp. 1–13 (2018). https://doi.org/10.1109/P3HPC.2018.00004
[25] Trott, C.R., Lebrun-Grandié, D., Arndt, D., Ciesko, J., Dang, V., Ellingwood, N., Gayatri, R., Harvey, E., Hollman, D.S., Ibanez, D., Liber, N., Madsen, J., Miles, J., Poliakoff, D., Powell, A., Rajamanickam, S., Simberg, M., Sunderland, D., Turcksin, B., Wilke, J.: Kokkos 3: Programming model extensions for the exascale era. IEEE Transactions on Parallel and Distributed Systems 33(4), 805–817 (2022)
[26] der Wijngaart, R.F.V., Mattson, T.G.: The parallel research kernels. In: 2014 IEEE High Performance Extreme Computing Conference (HPEC). IEEE (2014)
[27] Wright, S.A., Ridgers, C.P., Mudalige, G.R., Lantra, Z., Williams, J., Sunderland, A., Thorne, H.S., Arter, W.: Developing performance portable plasma edge simulations: A survey. Computer Physics Communications 298, 109123 (2024). https://doi.org/https://doi.org/10.1016/j.cpc.2024.109123
[28] Zenker, E., Worpitz, B., Widera, R., Huebl, A., Juckeland, G., Knüpfer, A., Nagel, W., Bussmann, M.: Alpaka - an abstraction library for parallel kernel acceleration (05 2016). https://doi.org/10.1109/IPDPSW.2016.50