The Spectral Element Libraries in Fortran is a library written in modern Fortran to help scientists and researchers employ spectral element methods for the solutions to partial differential equations. In the last year, I’ve been working on cleaning up SELF and transitioning the GPU accelerated back-end to a more portable framework.
For the GPU acceleration, I decided to expose AMD’s HIP to Fortran through ISO_C_Binding. This initially led to the development of hip-fortran, which was transitioned into hipfort and is now part of AMD’s ROCm ecosystem. By using hipfort, I am now able to push forward with portable GPU programming for SELF and offer support to end-users for both AMD and Nvidia GPUs.
This year, I hit a milestone for SELF of having the core spectral element routines completely implemented for serial and single GPU platforms. Essentially, these core routines enable users to execute grid interpolation, divergence, gradient, and curl methods for scalars, vectors, and tensors in one, two, and three dimensions. Thanks to the FORD project, documentation for the SELF API is available on the SELF github.io. This version of SELF is encapsulated in the Version 0.0.1-alpha release on Github.
With GPU cycles made available through OS Hackathon’s Spring HIP Hackathon, I was able to get through a complete set of benchmarks on both Nvidia and AMD platforms. In this article, I’ll share some of the benchmarks of the most computationally intensive routines in SELF on AMD and Nvidia GPUs for single and double precision floating point arithmetic. -Brief summary of results-
Vector Divergence in 3-D Mapped Geometries
For my original use case with SELF, I was building a Compressible Navier-Stokes CFD solver. This solver was designed to run on unstructured meshes with isoparametric geometries. Essentially, this provided high order accuracy when working in complex geometries, like oceans with variable topography and irregular coastlines, or cities with terrain and buildings.
When discussing the computational complexity of a DGSEM, we need to be familiar with the following variables
Together, the number of floating point operations and data stores and loads are a measure of the amount of work that needs to be done to execute an algorithm. In this article, I’m going to be expressing the amount of work in terms of N and nEl.
The most expensive part of this solver involved calculating the divergence of momentum, mass, and energy fluxes. In the Compressible Navier-Stokes solver, I was primarily using a particular class of spectral element methods called “Discontinuous Galerkin Spectral Element Methods (DGSEM)”. In DGSEM, calculating the divergence of a vector in complex geometries involves
Boundary Interpolation : Interpolate functions to element boundaries [ O(N3 * nEl) ]
Riemann Solver : Computing approximate mass, momentum, and energy fluxes between elements [ O(N2 * nEl) ]
Contravariant Projection : Project the mass, momentum, and energy flux from physical space to computational space [ O(N3 * nEl) ]
DG Divergence : Compute the divergence in computational space [ O(N4 * nEl) ]
In practice, SEM practitioners will work with N=7 or higher. For this choice, the last two steps typically dominate runtimes. When working with MPI and MPI+GPU implementations, step 2 is primarily where MPI point-to-point communications are handled.
In SELF, Step 3 is called Contravariant Projection. Within each element, for each variable, and for each nodal point, we compute a matrix vector product to project a vector from physical space to computational space. You can take a look at the HIP Kernel for this routine on github.
As with most SEM routines, we often work with deep tightly nested loops that map cleanly onto GPUs. For the contravariant projection in 3-D, each hip block (CUDA block) consists of (N+1)3 threads that operate in SIMD fashion. Each thread concurrently calculates its own 3x3 matrix-vector product and stores the result back in global memory.
In the DG Divergence, within each element, for each variable, and for each nodal point we compute three matrix-vector products ( matrix size is (N+1)2 ) and add element boundary contributions. Just like the Contravariant Projection routine, each hip block (CUDA block) consists of (N+1)3 threads that operate in SIMD fashion. Each thread concurrently calculates its own matrix-vector product and boundary contributions for a nodal point and stores the result back in global memory.
In case you’re interested in seeing the CPU counterparts for these HIP kernels you can check out the following pages of the SELF documentation.
For SELF, I’ve set up a couple CPP variables that can be set at compile time to control features that are baked into the library and the resulting executable.
GPU : Passing -cpp -DGPU to the compiler is used to enable GPU acceleration. When used with the hipfc compiler wrapper and the HIP_PLATFORM environment variable, you can build for CPU-only and AMD and Nvidia GPU accelerated platforms.
DOUBLE_PRECISION: Passing -cpp -DDOUBLE to the compiler can be used to toggle on double precision floating point arithmetic. By default, the code builds with single precision for all routines, except for a few setup routines that are hard-coded to double precision.
To present a clean build interface for selecting target architecture, GPU microarchitecture, precision, and Fortran compiler, I’ve put together a fairly clean Makefile that was inspired by LKedward/focal. Essentially, the various build options can be set by a small set of environment variables.
When the user sets the FC environment variable to “hipfc” (the hipfort compiler wrapper), the Makefile automatically applies the -DGPU CPP flag. Additionally, we expose environment variables that hipfc uses, such as HIPFORT_COMPILER (to set the host-side compiler) and HIPFORT_GPU (to set the GPU microarchitecture). The PREC environment variable can be set to “single” for 32-bit floating point arithmetic and “double” for 64-bit.
With this in mind, a build recipe for SELF typically looks like (for ROCm 4.1.0)