👉 Organisation notes:
- 💡 The short course material (available on this repository) was just updated - fetch the latest versions!
- 👉 Make sure to go through Getting started before the start of the course.
- Further interests in solving PDEs with Julia on GPUs❓
- Sign-up (free) for a hands-on workshop at JuliaCon2021
- Check out this online geo-HPC tutorial
This short course covers trendy areas in modern geocomputing with broad geoscientific applications. The physical processes governing natural systems' evolution are often mathematically described as systems of differential equations. A performant numerical implementation of the solving algorithm leveraging modern hardware is key and permits to tackle problems that were technically not possible a decade ago.
- Objectives
- About this repository
- Getting started (not part of the live course)
- 👉 Short course material
- Extras (not part of the live course)
- Further reading
The goal of this short course is to offer an interactive overview on how to solve systems of (partial) differential equations in parallel on GPUs using the Julia language. Julia combines high-level language simplicity and low-level language performance. The resulting codes and applications are fast, short and readable [1, 2, 3].
We will design and implement a numerical algorithm that predicts ice flow dynamics over mountainous topography (Greenland) using GPU computing (e.g. Fig. below). We will discretise the shallow ice approximation (SIA) equations in our ice flow solver to assess Greenland's ice cap evolution as function of a climate scenario.
The online course consists of 2 parts:
- Part 1 - You will learn about the Julia language and iterative PDE solvers.
- Part 2 - You will implement a GPU parallel PDE solver to predict ice flow dynamics on real topography.
By the end of this short course, you will:
- Have an iterative GPU PDE solver that predicts ice-flow;
- Have a Julia code that achieves similar performance than legacy codes (C, CUDA, MPI);
- Know how the Julia language solves the "two-language problem";
- Be able to leverage the computing power of modern GPU accelerated servers and supercomputers.
💡 Disclaimer
- The solvers presented in this short course, based on the pseudo-transient method, enable to solve PDEs iteratively and are well-suited for parallel execution (on GPUs). It is not the purpose of this course to provide an extensive overview of various solution techniques.
- The performance assessment is done using the time / iteration metric which reflects the ability of the algorithm to efficiently exploit the memory bandwidth of the (parallel) hardware. Further performance considerations regarding the metric can be found here.
The course repository lists following folders and items:
- the data folder contains various low resolution Greenland input data (bedrock topography, surface elevation, ice thickness, masks, ...) downscaled from BedMachine Greenland v3 - note the filenames include grid resolution information
(nx, ny)
; - the docs folder contains documentation linked in the README;
- the various output folder will contain the codes output, mainly figures in png format;
- the scripts folder contains the scripts this course is about 🎉
- the extras folder contains supporting course material (not discussed live during the course);
- the
Project.toml
file is a Julia project file, tracking the used packages and enabling a reproducible environment.
👉 This repository is an interactive and dynamic source of information related to the short course.
- Check out the Discussion tab if you have general comments, ideas to share or for Q&A.
- File an Issue if you encounter any technical problems with the distributed codes.
- Interact in a open-minded, respectful and inclusive manner.
⚠️ Due to the time limitation, the short course will not cover the Getting started steps. These are meant to provide directions to the participant willing to actively try out the examples during the short course. It is warmly recommended to perform the Getting started steps before the beginning of the workshop.
Please follow the detailed steps in: docs/getting-started.md
The provided directions will get you started with:
- Installing Julia v1.6
- Running the scripts from the course repository.
👉 Note: This course was developed and tested on Julia v1.6. It should work with any Julia version ≥v1.6. The install configuration were tested on a MacBook Pro running macOS 10.15.7, a Linux GPU server running Ubuntu 20.04 LTS and a Linux GPU server running CentOS 8.
This section lists the material discussed within this 60 min. short course:
💡 In this course we will implement a 2D nonlinear diffusion equation on GPUs in Julia using the finite-difference method and an iterative solving approach. The goal is to resolve the shallow ice approximation (SIA) and predict ice flow over Greenland.
by Mauro Werder
Julia is a modern, general-purpose programming language unifying interactive, high productivity features (like Python, Matlab, etc.) with high performance (like C, Fortran, etc.). This removes the need to have a separate prototype and production languages (the two-language problem).
The main reason to use Julia for scientific computing is:
- high performance & productivity, ditto
- a good package manager (see, e.g., the
Project.toml
of this course) making reproducible science easier - a rapidly expanding number of packages, with many at the forefront of research (e.g. GPU-packages, differential equations, machine learning, automatic differentiation)
- a friendly community with a large number of scientific users
The main reason to use Julia for GPU computing is that it indeed solves the two-language problem in this domain: it works well from prototyping an idea with a simple serial code to massively parallel, multi-node GPU production code.
A short introduction to Julia will be given using the first numerical example of this course (next section). A very short summary of features covered:
- third-party packages can be installed with the package manager (see Package installation)
- use a package with
using XYZ
- run the code in a file with
include("abc.jl")
- index into an array with
[ ]
and starts at 1 - vectorized function application do with the dot-notation, e.g.
sin.(x) ./ y
for vectorsx
andy
- macros do funky stuff with your code (aka code-transformations) and call them with
@
, e.g.@time sin(1)
prints the time it takes to evaluatesin(1)
- for raster-data handling we use GeoData.jl (but other packages also exist)
For more info see https://docs.julialang.org.
Let's start with a 1D linear diffusion example to implement both an explicit and iterative implicit PDE solver:
dH/dt = ∇.(D ∇H)
The diffusion of a quantity H
over time t
can be described as (1a) a diffusive flux, (1b) a flux balance and (1c) an update rule:
qH = -D*dH/dx (1a)
dHdt = -dqH/dx (1b)
dH/dt = dHdt (1c)
The diffusion_1D_expl.jl
code implements an iterative and explicit solution of eq. (1) for an initial Gaussian profile:
H0 = exp(-(x-lx/2.0)^2)
But now, you may ask: can we use an implicit algorithm to side-step the CFL-condition, control the (physically motivated) time steps dt
and keep it "matrix-free" ?
by Ludovic Räss
The diffusion_1D_impl.jl
code implements an iterative, implicit solution of eq. (1). How ? We include the physical time derivative dH/dt=(H-Hold)/dt
in the previous rate of change dHdt
to define the residual ResH
ResH = -(H-Hold)/dt -dqH/dx
and iterate until the values of ResH
(the residual of the eq. (1)) drop below a defined tolerance level tol
.
It works, but the "naive" Picard iteration count seems to be pretty high (niter>7000
). A efficient way to circumvent this is to add "damping" (damp
) to the rate-of-change dHdt
, analogous to add friction enabling faster convergence [4]
dHdt = ResH + damp*dHdt
The diffusion_1D_damp.jl
code implements a damped iterative implicit solution of eq. (1). The iteration count drops to niter~700
. This pseudo-transient approach enables fast as the iteration count scales close to O(N) and not O(N^2).
Performance evaluation is a complex topic as different metrics would lead to different conclusions. Ultimately, efficient algorithms should minimise the time to solution. For iterative algorithms this means:
- Ensure fast iterations (minimise the time per iteration).
- Keep the iteration count as low as possible and, in particular, that iteration count scales aroung O(n) with the numerical resolution n.
We will here report (1) for various implementations on various computer architectures.
Let's move from the simple 1D linear diffusion example to the shallow ice approximation (SIA) equation, a 2D nonlinear diffusion equation for ice thickness H:
dH/dt = ∇.(D ∇S) + M
with surface elevation S the sum of ice thickness H and bed elevation B
S = H + B,
surface mass-balance M, and non-linear diffusion coefficient
D = a Hⁿ⁺² √(∇S.∇S)ⁿ⁻¹
where n is Glen's n (take =3) and a is the ice flow factor (1.5e-24 Pa⁻ⁿ s⁻¹).
Writing the equation in pseudo-code:
qHx = -D*dS/dx (2a)
qHy = -D*dS/dy (2b)
dHdt = -(dqHx/dx + dqHy/dy) + M (2c)
dH/dt = dHdt (2d)
with the nonlinear diffusion coefficient D
with the power-law exponent npow
(aka Glen's exponent):
D = a*H^(npow+2)*sqrt((dS/dx)^2 + (dS/dy)^2)^(npow-1)
The topography data, bedrock elevation and ice thickness, is from the
BedMachine Greenland v3 dataset (Morlighem et al., (2017)) and is loaded via the GeoData.jl package (see helpers.jl
).
We implement climate forcing using a simple mass balance (accumulation and ablation) M
formulation:
M = min(grad_b*(S - z_ELA), b_max)
as function of the surface elevation S
and capped by the maximal accumulation rate b_max
. The mass balance gradient grad_b
is defined as
grad_b = (1.3517 - 0.014158*LAT)/100.0*0.91
where LAT
is the latitude (taken from [5]). The equilibrium line altitude z_ELA
(where accumulation = ablation) is latitude dependent, ranging from 1300m (South) to 1000m (North) as suggested by [5].
The iceflow.jl
code implements the 2D SIA equations using the iterative implicit damped formulation as in diffusion_1D_damp.jl
using the pseudo-transient approach. The calculation of the SIA PDEs resumes in these 13 lines of Julia code:
# mass balance
M .= min.(grad_b.*(S .- z_ELA), b_max)
# compute diffusivity
dSdx .= diff(S, dims=1)/dx
dSdy .= diff(S, dims=2)/dy
gradS .= sqrt.(av_ya(dSdx).^2 .+ av_xa(dSdy).^2)
D .= a*av(H).^(npow+2) .* gradS.^(npow-1)
# compute flux
qHx .= .-av_ya(D).*diff(S[:,2:end-1], dims=1)/dx
qHy .= .-av_xa(D).*diff(S[2:end-1,:], dims=2)/dy
# update ice thickness
dtau .= dtausc*min.(10.0, cfl./(epsi .+ av(D)))
ResH .= .-(diff(qHx, dims=1)/dx .+ diff(qHy, dims=2)/dy) .+ inn(M)
dHdt .= dHdt.*damp .+ ResH
H[2:end-1,2:end-1] .= max.(0.0, inn(H) .+ dtau.*dHdt)
# apply mask
H[Mask.==0] .= 0.0
# update surface
S .= B .+ H
💡 Note that the here discussed SIA codes do not implement any flux limiter scheme to circumvent known accuracy and stability issues. Check out [6, 7] for further references (the
iceflow_bench.jl
script implements the benchmark [6] that reflects this issue).
The model output is the ice surface elevation, the ice thickness, the ice velocity magnitude and the mass balance:
This implementation of the SIA equations solves the steady-state (i.e. the physical time derivative being removed as dt->∞
). The last part of this course (Greenland's ice cap evolution) will show how to achieve an (implicit) ice flow predictions over a specific time span dt
by including the physical time derivative in the ResH
term.
The figure below depicts the time 1000 iterations (or pseudo-time steps) take running the iceflow.jl
code on a Intel Xeon Gold 6244 (1 thread - plain Julia):
So now we have a cool iterative and implicit SIA solver in less than 100 lines of code 🎉. Good enough for low resolution calculations. What if we need higher resolution and faster time to solution ? Parallel and GPU computing makes it possible. Let's start from the iceflow.jl
code and port it to GPU (with some intermediate steps).
The main idea of GPU parallelisation is to calculate each grid point concurently by a different GPU thread (instaed of the more serial CPU execution) as depicted hereafter:
- Extract the flux
qHx, qHy
from the physics calculations iniceflow.jl
:
# [...] skipped lines
# compute flux
qHx .= .-av_ya(D).*diff(S[:,2:end-1], dims=1)/dx
qHy .= .-av_xa(D).*diff(S[2:end-1,:], dims=2)/dy
# [...] skipped lines
- Let's modify these lines into a
compute_flux
function we will then be able to turn into a GPU kernel.
# [...] skipped lines
function compute_flux!(qHx, qHy, S, D, dx, dy, nx, ny)
Threads.@threads for iy=1:ny
for ix=1:nx
if (ix<=nx-1 && 2<=iy<=ny-1) qHx[ix,iy] = -0.5*(D[ix,iy]+D[ix+1,iy])*(S[ix+1,iy]-S[ix,iy])/dx end
if (2<=ix<=nx-1 && iy<=ny-1) qHy[ix,iy] = -0.5*(D[ix,iy]+D[ix,iy+1])*(S[ix,iy+1]-S[ix,iy])/dy end
end
end
return
end
# [...] skipped lines
compute_flux!(qHx, qHy, S, D, dx, dy, nx, ny)
# [...] skipped lines
💡 Julia enables multi-threading capabilities by simply adding
Threads.@threads
to the outermost loop (here overiy
).
- The last step is to replace the (multi-threaded) loops by a vectorised GPU index
ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
iy = (blockIdx().y-1) * blockDim().y + threadIdx().y
specific to GPU execution. Each ix
and iy
are executed concurrently on a different GPU thread:
using CUDA
# [...] skipped lines
compute_flux!(qHx, qHy, S, D, dx, dy, nx, ny)
ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
iy = (blockIdx().y-1) * blockDim().y + threadIdx().y
if (ix<=nx-1 && 2<=iy<=ny-1) qHx[ix,iy] = -0.5*(D[ix,iy]+D[ix+1,iy])*(S[ix+1,iy]-S[ix,iy])/dx end
if (2<=ix<=nx-1 && iy<=ny-1) qHy[ix,iy] = -0.5*(D[ix,iy]+D[ix,iy+1])*(S[ix,iy+1]-S[ix,iy])/dy end
return
end
# [...] skipped lines
@cuda blocks=cublocks threads=cuthreads compute_flux!(qHx, qHy, S, D, dx, dy, nx, ny)
synchronize()
# [...] skipped lines
💡 We use
@cuda blocks=cublocks threads=cuthreads
to launch the GPU function on the appropriate number of threads, i.e. "parallel workers". The numerical grid resolutionnx
andny
must now be chosen accordingly to the number of parallel workers.
The here described porting procedure steps are exemplified for the 1D diffusion equation (Porting the diffusion equation to GPUs) available in the extra material:
- (1)
diffusion_1D_damp.jl
- (2)
diffusion_1D_damp_fun.jl
- (3)
diffusion_1D_damp_gpu.jl
- (4)
diffusion_1D_damp_xpu.jl
Wouldn't it be great to be able to combine the multi-thread CPU and GPU implementations into a single "XPU" code to be able to run on various hardware with only changing a USE_GPU
switch ? Using ParallelStencil.jl enables this, as many more cool features. The iceflow_xpu.jl
script uses ParallelStencil.jl for an XPU implementation on various backends:
const USE_GPU = false
using ParallelStencil
using ParallelStencil.FiniteDifferences2D
@static if USE_GPU
@init_parallel_stencil(CUDA, Float64, 2)
else
@init_parallel_stencil(Threads, Float64, 2)
end
# [...] skipped lines
@parallel function compute_flux!(qHx, qHy, D, S, dx, dy)
@all(qHx) = -@av_ya(D)*@d_xi(S)/dx
@all(qHy) = -@av_xa(D)*@d_yi(S)/dy
return
end
# [...] skipped lines
@parallel compute_flux!(qHx, qHy, D, S, dx, dy)
# [...] skipped lines
💡 Various macros
@(...)
permit to deal with the low-level technicalities and theUSE_GPU
flag enables to switch between CPU and GPU backend.
The resulting code is short and readable and solves the "two-language problem"; development and production code implementations are regrouped into a single code.
The figure below depicts the time 1000 iterations (or pseudo-time steps) take running the iceflow_xpu.jl
code on:
- an Intel Xeon Gold 6244 (4 threads - cores)
- an Nvidia RTX 2070 GPU
We can finally use our XPU ice flow solver to simulate the evolution of Greenland's ice cap for a specific climate scenario. The steps are following:
- Compute the steady-state ice thickness for current climate forcing (mass balance function as described in SIA equation section)
- Apply a time-dependent linear increase of the ELA (the equilibrium line where accumulation = ablation) of 1m/yr over the next 2500yrs. We here assume that an increase in annual mean temperature of 0.6°C is equivalent to a 100m shift of the ELA, thus 1m/yr represents 0.3°C per 50yrs. The
iceflow_xpu_evo.jl
code implements this climate evolution model and produces the following predictions for the Greenland ice cap evolution:
Note that our climate scenario may reflect some extreme warming over the next 2500yrs.
⚠️ Due to the time limitation, the short course will not cover the extras material. This information complements the course's main content and may provide intermediate and/or additional development steps.
Extra material comprises:
This extra example complements the GPU SIA implementation and the XPU SIA implementation.
Let's start from the diffusion_1D_damp.jl
code, a cool iterative and implicit solver in about 30 lines of code 🎉, and port it to GPU (with some intermediate steps).
- Extract the physics calculations from
diffusion_1D_damp.jl
, i.e. the time loop:
# [...] skipped lines
qH .= -D*diff(H)/dx # flux
ResH .= -(H[2:end-1] - Hold[2:end-1])/dt - diff(qH)/dx # residual of the PDE
dHdtau .= ResH + damp*dHdtau # damped rate of change
H[2:end-1] .= H[2:end-1] + dtau*dHdtau # update rule, sets the BC as H[1]=H[end]=0
# [...] skipped lines
- Split the calculations into separate functions (or kernels) and call those functions within the time loop. The
diffusion_1D_damp_fun.jl
implements those modifications:
function compute_flux!(qH, H, D, dx, nx)
Threads.@threads for ix=1:nx
if (ix<=nx-1) qH[ix] = -D*(H[ix+1]-H[ix])/dx end
end
return
end
function compute_rate!(ResH, dHdt, H, Hold, qH, dt, damp, dx, nx)
Threads.@threads for ix=1:nx
if (2<=ix<=nx-1) ResH[ix-1] = -(H[ix] - Hold[ix])/dt -(qH[ix]-qH[ix-1])/dx end
if (2<=ix<=nx-1) dHdt[ix-1] = ResH[ix-1] + damp*dHdt[ix-1] end
end
return
end
function compute_update!(H, dHdt, dtau, nx)
Threads.@threads for ix=1:nx
if (2<=ix<=nx-1) H[ix] = H[ix] + dtau*dHdt[ix-1] end
end
return
end
# [...] skipped lines
compute_flux!(qH, H, D, dx, nx)
compute_rate!(ResH, dHdt, H, Hold, qH, dt, damp, dx, nx)
compute_update!(H, dHdt, dtau, nx)
# [...] skipped lines
💡 Julia enables multi-threading capabilities by simply adding
Threads.@threads
to the outermost loop (here overix
).
- The last step is to replace the (multi-threaded) loop by a vectorised index
ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
specific to GPU execution. Eachix
is executed concurrently on a different GPU thread. Thediffusion_1D_damp_gpu.jl
implements those modifications to run on GPUs:
using CUDA
# [...] skipped lines
function compute_flux!(qH, H, D, dx, nx)
ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
if (ix<=nx-1) qH[ix] = -D*(H[ix+1]-H[ix])/dx end
return
end
function compute_rate!(ResH, dHdt, H, Hold, qH, dt, damp, dx, nx)
ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
if (2<=ix<=nx-1) ResH[ix-1] = -(H[ix] - Hold[ix])/dt -(qH[ix]-qH[ix-1])/dx end
if (2<=ix<=nx-1) dHdt[ix-1] = ResH[ix-1] + damp*dHdt[ix-1] end
return
end
function compute_update!(H, dHdt, dtau, nx)
ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
if (2<=ix<=nx-1) H[ix] = H[ix] + dtau*dHdt[ix-1] end
return
end
# [...] skipped lines
@cuda blocks=cublocks threads=cuthreads compute_flux!(qH, H, D, dx, nx)
synchronize()
@cuda blocks=cublocks threads=cuthreads compute_rate!(ResH, dHdt, H, Hold, qH, dt, damp, dx, nx)
synchronize()
@cuda blocks=cublocks threads=cuthreads compute_update!(H, dHdt, dtau, nx)
synchronize()
# [...] skipped lines
💡 We use
@cuda blocks=cublocks threads=cuthreads
to launch the GPU function on the appropriate number of threads, i.e. "parallel workers". The numerical grid resolutionnx
must now be chosen accordingly to the number of workersnx=cublocks*cuthreads
.
Wouldn't it be great to be able to combine the multi-thread CPU and GPU implementations into a single "XPU" code to be able to run on various hardware with only changing a USE_GPU
switch ? Using ParallelStencil.jl enables this, as well more other cool features. The diffusion_1D_damp_xpu.jl
uses ParallelStencil.jl for an XPU implementation on various backends:
const USE_GPU = false
using ParallelStencil
using ParallelStencil.FiniteDifferences1D
@static if USE_GPU
@init_parallel_stencil(CUDA, Float64, 1)
else
@init_parallel_stencil(Threads, Float64, 1)
end
# [...] skipped lines
@parallel function compute_flux!(qH, H, D, dx)
@all(qH) = -D*@d(H)/dx
return
end
@parallel function compute_rate!(ResH, dHdt, H, Hold, qH, dt, damp, dx)
@all(ResH) = -(@inn(H) - @inn(Hold))/dt -@d(qH)/dx
@all(dHdt) = @all(ResH) + damp*@all(dHdt)
return
end
@parallel function compute_update!(H, dHdt, dtau)
@inn(H) = @inn(H) + dtau*@all(dHdt)
return
end
# [...] skipped lines
@parallel compute_flux!(qH, H, D, dx)
@parallel compute_rate!(ResH, dHdt, H, Hold, qH, dt, damp, dx)
@parallel compute_update!(H, dHdt, dtau)
# [...] skipped lines
Various macros @(...)
permit to deal with the low-level technicalities and the USE_GPU
flag enables to switch between CPU and GPU backend.
Using the inversion approach proposed by [7], our ice flow solver iceflow.jl
can be embedded into an inversion framework to retrieve spatially variable maximum accumulation rate b_max
in order to constrain ice thickness distribution over Greenland. The following animation depicts the evolution of the inversion procedure as function of the 30 inversion steps and was produced using the iceflow_inverse.jl
code. Gam
represents the misfit between the observed Hice
and the calculated H
ice thickness, B_max
represents the spatially variable maximal accumulation.
The ice thickness obtained from the inversion procedure can be further compared to the BedMachine Greenland v3 ice thickness data:
💡 Note that the inversion procedure serves here as proof of concept, as higher resolution and finer tuning may be needed to further improve the misfit minimisation.
Majority of stencil based codes as in this course are memory bounded, meaning the limiting factor in performance is the rate at which memory is transferred from and back between the memory and the arithmetic units. The maximal rate at which the memory transfers occur is the memory copy rate, in the order of 50 GB/s for CPUs and about 1 TB/s for modern GPUs. The effective memory throughput metric (T_eff) measures how good an iterative stencil-based algorithm performs in terms of memory throughput, to be compared to the memory copy rate.
Check out the performance metric section from the ParallelStencil.jl module and this JuliaCon2020 presentation [1].
Check out this material to figure out how combining ImplicitGlobalGrid.jl to ParallelStencil.jl enables efficient distributed memory parallelisation on multiple XPUs.