SELF - Single GPU Benchmarks

This work was supported by OS Hackathon, AMD, and Fluid Numerics.

The Spectral Element Libraries in Fortran is a library written in modern Fortran to help scientists and researchers employ spectral element methods for the solutions to partial differential equations. In the last year, I’ve been working on cleaning up SELF and transitioning the GPU accelerated back-end to a more portable framework.

For the GPU acceleration, I decided to expose AMD’s HIP to Fortran through ISO_C_Binding. This initially led to the development of hip-fortran, which was transitioned into hipfort and is now part of AMD’s ROCm ecosystem. By using hipfort, I am now able to push forward with portable GPU programming for SELF and offer support to end-users for both AMD and Nvidia GPUs.

This year, I hit a milestone for SELF of having the core spectral element routines completely implemented for serial and single GPU platforms. Essentially, these core routines enable users to execute grid interpolation, divergence, gradient, and curl methods for scalars, vectors, and tensors in one, two, and three dimensions. Thanks to the FORD project, documentation for the SELF API is available on the SELF github.io. This version of SELF is encapsulated in the Version 0.0.1-alpha release on Github.

With GPU cycles made available through OS Hackathon’s Spring HIP Hackathon, I was able to get through a complete set of benchmarks on both Nvidia and AMD platforms. In this article, I’ll share some of the benchmarks of the most computationally intensive routines in SELF on AMD and Nvidia GPUs for single and double precision floating point arithmetic. -Brief summary of results-

Vector Divergence in 3-D Mapped Geometries

For my original use case with SELF, I was building a Compressible Navier-Stokes CFD solver. This solver was designed to run on unstructured meshes with isoparametric geometries. Essentially, this provided high order accuracy when working in complex geometries, like oceans with variable topography and irregular coastlines, or cities with terrain and buildings.

When discussing the computational complexity of a DGSEM, we need to be familiar with the following variables

N - The polynomial degree of the function approximation
nEl - The number of elements in the mesh

Together, the number of floating point operations and data stores and loads are a measure of the amount of work that needs to be done to execute an algorithm. In this article, I’m going to be expressing the amount of work in terms of N and nEl.

The most expensive part of this solver involved calculating the divergence of momentum, mass, and energy fluxes. In the Compressible Navier-Stokes solver, I was primarily using a particular class of spectral element methods called “Discontinuous Galerkin Spectral Element Methods (DGSEM)”. In DGSEM, calculating the divergence of a vector in complex geometries involves

Boundary Interpolation : Interpolate functions to element boundaries [ O(N3 * nEl) ]
Riemann Solver : Computing approximate mass, momentum, and energy fluxes between elements [ O(N2 * nEl) ]
Contravariant Projection : Project the mass, momentum, and energy flux from physical space to computational space [ O(N3 * nEl) ]
DG Divergence : Compute the divergence in computational space [ O(N4 * nEl) ]

In practice, SEM practitioners will work with N=7 or higher. For this choice, the last two steps typically dominate runtimes. When working with MPI and MPI+GPU implementations, step 2 is primarily where MPI point-to-point communications are handled.

In SELF, Step 3 is called Contravariant Projection. Within each element, for each variable, and for each nodal point, we compute a matrix vector product to project a vector from physical space to computational space. You can take a look at the HIP Kernel for this routine on github.

As with most SEM routines, we often work with deep tightly nested loops that map cleanly onto GPUs. For the contravariant projection in 3-D, each hip block (CUDA block) consists of (N+1)3 threads that operate in SIMD fashion. Each thread concurrently calculates its own 3x3 matrix-vector product and stores the result back in global memory.

In the DG Divergence, within each element, for each variable, and for each nodal point we compute three matrix-vector products ( matrix size is (N+1)2 ) and add element boundary contributions. Just like the Contravariant Projection routine, each hip block (CUDA block) consists of (N+1)3 threads that operate in SIMD fashion. Each thread concurrently calculates its own matrix-vector product and boundary contributions for a nodal point and stores the result back in global memory.

In case you’re interested in seeing the CPU counterparts for these HIP kernels you can check out the following pages of the SELF documentation.

Build System

For SELF, I’ve set up a couple CPP variables that can be set at compile time to control features that are baked into the library and the resulting executable.

GPU : Passing -cpp -DGPU to the compiler is used to enable GPU acceleration. When used with the hipfc compiler wrapper and the HIP_PLATFORM environment variable, you can build for CPU-only and AMD and Nvidia GPU accelerated platforms.
DOUBLE_PRECISION: Passing -cpp -DDOUBLE to the compiler can be used to toggle on double precision floating point arithmetic. By default, the code builds with single precision for all routines, except for a few setup routines that are hard-coded to double precision.

To present a clean build interface for selecting target architecture, GPU microarchitecture, precision, and Fortran compiler, I’ve put together a fairly clean Makefile that was inspired by LKedward/focal. Essentially, the various build options can be set by a small set of environment variables.

When the user sets the FC environment variable to “hipfc” (the hipfort compiler wrapper), the Makefile automatically applies the -DGPU CPP flag. Additionally, we expose environment variables that hipfc uses, such as HIPFORT_COMPILER (to set the host-side compiler) and HIPFORT_GPU (to set the GPU microarchitecture). The PREC environment variable can be set to “single” for 32-bit floating point arithmetic and “double” for 64-bit.

With this in mind, a build recipe for SELF typically looks like (for ROCm 4.1.0)

SELF_PREFIX=$(pwd)/apps/SELF \

PREC=double \

BUILD=release \

FC=hipfc \

HIPFORT_COMPILER=gfortran \

HIPFORT_GPU=gfx908 \

HIP_PLATFORM=amd \

ROCM_DIR=/opt/rocm \

make install

Benchmarks

For the benchmarks presented here, we look at runtimes for the combination of the Contravariant Projection and DG Divergence routines while varying the HIPFORT_GPU and PREC build variables.

For the target GPU platforms, we consider the following

AMD MI50
1. HIPFORT_GPU=gfx906 \
2. HIP_PLATFORM=amd \
AMD MI100
1. HIPFORT_GPU=gfx908 \
2. HIP_PLATFORM=amd \
Nvidia Tesla V100
1. HIPFORT_GPU=sm_72 \
2. HIP_PLATFORM=nvidia \

Benchmarks are run on two separate systems. For AMD GPU Platforms, we use the AMD Accelerator Cloud (AAC). The AAC is a heterogeneous cluster with a mix of compute nodes with MI50 and MI100 GPUs. For Nvidia GPU platforms, I used the OSHPC cluster from OS Hackathon. OSHPC is a cloud-native auto-scaling HPC cluster that runs on Google Cloud Platform. The V100 compute nodes use an n1-standard-8 machine type with a single Nvidia Tesla V100 GPU attached and CUDA 11.1 . For all of the benchmarks, I use ROCm 4.1.0 and the b101f7 commit of hipfort.

The runtimes for the Contravariant Projection and DG Divergence are shown below for each GPU type, single and double precision, and a range of mesh sizes. As expected, 64-bit arithmetic is the most expensive for all mesh sizes and across all hardware. Across all mesh sizes tested, performance is best on the Nvidia Tesla V100 GPU, followed by the MI100, then the MI50.

It's worth pointing out that these benchmarks are obtained without special performance tuning for each target hardware. I expect that with more detailed profiling and kernel optimization we can bring down runtimes.

Next Steps

Outside of the comparison of hardware, I am excited to have SELF operating on multiple platforms while using the same code. This is a clear demonstration of the portability enhancement enabled by ROCm and specifically HIPFort and HIP.

This is simply the first step on a journey that only gets better with the contributions of the community behind the hardware. Having open access to machine code is ideal for working on a roadmap towards optimization that ultimately puts us as close to the hardware as possible while also using it as intended. From here, dependencies largely fall on the relevant, stable software releases that are readily available and obvious enhancements that optimize how the application interfaces with the hardware and systems it runs on.

With the serial and single GPU routines in place, I’m now ready to tackle bringing back in MPI for parallel CPU-only and multi-GPU implementations. When we reach this next milestone, I’ll be back to share how we accomplish the GPU-GPU communication with MPI+HIPFort along with some scaling benchmarks. Stay tuned for future releases of SELF !