
EXESS Mentored Sprint Report

Sprint Participants

| Name | Role | Affiliation |
| --- | --- | --- |
| Giuseppe Barca | Principal Investigator | The Australian National University |
| Mehaboob Nagthe Basapur | Technical POC | Pawsey Supercomputing Center |
| Marco De La Pierre | Technical POC | Pawsey Supercomputing Center |
| Calum Snowdon | Researcher/Developer | The Australian National University |
| Jorge Galvez Vallejo | Researcher/Developer | Iowa State University |
| Joseph Schoonover | Mentor | Fluid Numerics |

Software Description

The EXtreme-scale Electronic Structure System (EXESS) is a GPU-accelerated, HPC-oriented quantum chemistry program. EXESS is a C/C++ application that is version controlled with Git and hosted in a private GitHub repository at https://github.com/EXESS-dev . EXESS is a multi-GPU application that uses an MPI+X programming model and has options to build with multi-threading support via OpenMP. The main algorithmic motifs of the software are recursive integral calculations and matrix diagonalization (eigenmode decomposition) for self-consistent field (SCF) calculations.

Pre-Sprint Software Status

At the start of the sprint, the EXESS team identified two branches of the EXESS-dev repository as the primary focus for porting and optimization efforts:

  1. hip_dev - the branch where portable GPU acceleration for the EXESS application is being developed.
  2. cpu_port - the branch where CPU-only builds of EXESS are being developed and optimized.

Prior to the sprint, the EXESS-dev application had undergone limited testing on the Mulan development cluster at the Pawsey Supercomputing Center. Initially, the hip_dev branch had separate CUDA and HIP build instructions, which were rapidly merged during the pre-sprint planning phase. We initially uncovered issues in the build process for the cpu_port branch on Mulan, related to a threaded BLAS dependency; this issue was rapidly resolved on Mulan.

Prior to beginning the sprint, we developed installation scripts for both the cpu_port and hip_dev branches that support builds on both the Topaz and Mulan platforms. Additionally, Calum identified and implemented a build optimization for the cpu_port branch by splitting the integral kernels into separate files to enable parallel make operations.

Sprint Goals

During the sprint planning phase, we identified the following goals for the hip_dev and cpu_port branches of the repository:

  • hip_dev
      • Finalise the port from CUDA to HIP and update the build system to use hipcc for both Nvidia and AMD GPU architectures.
      • (At least) match performance on MI100 GPUs with V100 GPUs for the “w1”, “w15”, and “w150” benchmarks.
      • Develop roofline models for the MI100 and V100.

  • cpu_port
      • Improve integral kernel performance on Topaz and Mulan through vectorization, either through compiler auto-vectorization, OpenMP SIMD directives, or other explicit vectorization instructions (a minimal sketch follows this list).
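To make the vectorization goal concrete, the sketch below shows the kind of OpenMP SIMD reduction we have in mind. It is a minimal, hypothetical example: the loop body, array names, and the `primitive_pair_sum` function are placeholders and are not taken from the EXESS integral kernels.

```cpp
// Hypothetical sketch only: a contracted-integral-style inner loop with an
// explicit OpenMP SIMD directive. Names and the loop body are placeholders.
#include <cmath>
#include <cstdio>
#include <vector>

// Accumulate a primitive-pair quantity over a contiguous block of exponents.
double primitive_pair_sum(const std::vector<double>& alpha,
                          const std::vector<double>& beta,
                          double r2)
{
    const std::size_t n = alpha.size();
    double sum = 0.0;
    // Ask the compiler to vectorize the reduction explicitly rather than
    // relying on auto-vectorization heuristics.
    #pragma omp simd reduction(+ : sum)
    for (std::size_t i = 0; i < n; ++i) {
        const double p = alpha[i] + beta[i];
        sum += std::exp(-alpha[i] * beta[i] / p * r2) / p;
    }
    return sum;
}

int main()
{
    std::vector<double> alpha(1024, 0.5), beta(1024, 1.5);
    std::printf("sum = %f\n", primitive_pair_sum(alpha, beta, 1.0));
    return 0;
}
```

With GCC or Clang, compiling with `-fopenmp-simd` (or full `-fopenmp`) honours the directive; without it, the pragma is ignored and the loop falls back to whatever auto-vectorization the compiler applies.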

Methods

For the sprint, we primarily focused on running EXESS on both Topaz and Mulan. Each system offers distinct hardware and user software. This required careful assessment of profiling and debugging tools to support porting and optimization efforts.

CPU Profiling

On Topaz and Mulan, ARM Forge was used for profiling of the cpu_port branch. Initially, ARM Forge was not available on Mulan, and we attempted to use the HPCToolkit and AMD uProf profilers instead. These tools were tested for their ability to record the hardware counters necessary for roofline modeling (floating point operation and memory load/store counters). On Mulan, the necessary PAPI metrics were not enabled, which prevented roofline modeling with either tool. For roofline modeling, the following metrics are necessary:

  • Floating point operations
      • PAPI_FP_OPS
      • PAPI_SP_OPS
      • PAPI_DP_OPS
      • PAPI_VEC_SP
      • PAPI_VEC_DP
  • Cycle counters
      • PAPI_TOT_CYC
  • Memory operations
      • PAPI_L*_DCR (L* cache reads)
      • PAPI_L*_DCW (L* cache writes)

A complete list of PAPI metrics and their availability status on Mulan can be obtained by running the `papi_avail` command.
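As a quick run-time check of whether a given preset is actually usable on a node, a small test program can be compiled against the PAPI C API. The sketch below is illustrative only and is not part of EXESS; the counted loop simply stands in for a region of interest such as an integral kernel call.

```cpp
// Minimal sketch: query and read a PAPI preset counter around a region of
// interest. If the preset is not enabled on the system (as we found on
// Mulan), PAPI_add_named_event fails and we report that instead of crashing.
#include <cstdio>
#include <papi.h>

int main()
{
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        std::fprintf(stderr, "PAPI initialisation failed\n");
        return 1;
    }

    int event_set = PAPI_NULL;
    if (PAPI_create_eventset(&event_set) != PAPI_OK) return 1;

    // One of the presets listed above; swap in PAPI_FP_OPS, PAPI_VEC_DP, etc.
    const char* event = "PAPI_DP_OPS";
    if (PAPI_add_named_event(event_set, event) != PAPI_OK) {
        std::fprintf(stderr, "%s is not available on this node\n", event);
        return 1;
    }

    long long count = 0;
    PAPI_start(event_set);

    // Region of interest: stands in for an EXESS integral kernel call.
    volatile double x = 0.0;
    for (int i = 0; i < 1000000; ++i) x += 1.0e-6 * i;

    PAPI_stop(event_set, &count);
    std::printf("%s = %lld\n", event, count);
    return 0;
}
```

Building with `-lpapi` and running on a compute node reports either the counter value or that the preset is unavailable, which should mirror what `papi_avail` shows.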

Despite these hurdles, we were able to leverage HPCToolkit alongside Hatchet, a Python library, for hotspot profiling. Specifically, we developed a script to run EXESS under HPCToolkit to create a profile database. After creating the database, a simple Python script imports the profile database into a Pandas dataframe, where each kernel’s measured “Time” and “Time (inc.)” (inclusive time) are summed and sorted in descending order by “Time”. This provides a flat profile that indicates where EXESS spends most of its time during execution.

GPU Profiling

Initially, we were interested in hotspot, trace, and events profiling on GPU hardware. Hotspot profiles provide the total inclusive runtime for each kernel and for other GPU activity (e.g. memory copies between host and device), so that we can identify where the application spends most of its time. Trace profiles provide a timeline of activity on the GPU that is helpful in identifying synchronization issues, kernel launch and HIP API launch latencies, and other opportunities for optimization related to task-level parallelism. Events profiles align hardware events with specific kernels to inform decisions on optimization strategies for expensive GPU kernels.

On Mulan, GPU profiling was done with rocprof (ROCm 4.5.0). Since EXESS runs with multiple MPI ranks in a “master-worker” configuration, we found it necessary to wrap calls to rocprof within a shell script, so that each MPI rank could write its own profile output. Hotspot profiling with rocprof was enabled using the `--stats` flag, and trace profiling of HIP kernel and HSA activity was enabled using `--sys-trace`. Hotspot profiling produces CSV output listing the total wall time and the number of calls made to each HIP kernel, which can be quickly reviewed to identify the most expensive HIP kernels. Trace profiles are reported in a JSON file compatible with the free online tool Perfetto.

Perfetto is a trace visualization tool that can be used to visualize trace profiles generated by rocprof. In our experience, Perfetto was compatible with the Chrome, Firefox, and Microsoft Edge web browsers and incompatible with Apple Safari. The online interactive tool works well with small trace profiles (file sizes under roughly 2 GB). For larger traces, you can launch a local Perfetto web server, which provides memory and compute resources from your local workstation.

On Topaz, GPU profiling was done with Nvidia profiling tools, namely nvprof and nsys (CUDA 11.1). Hotspot profiling textual output can be obtained with nvprof. The second tool, nsys, produces output for both hotspot and trace profiling; tracing of CUDA kernel and API activity was obtained using `nsys profile -t cuda,osrt`. Output files are in `.qdrep` format, which can be visualised using Nvidia Nsight Systems: in the “Timeline View” (top), hotspot profiling information is accessed via the “Stats System View” (bottom) under “CUDA GPU Kernel Summary”, and trace events can be inspected using the “Events View” (bottom).

Benchmarks (Hartree-Fock energy calculations)

Single-point calculations of the total electronic energy are the building blocks of any quantum-chemistry simulation. Moreover, the Hartree-Fock approximation is the starting point for higher-level, more accurate, and computationally more expensive approximations. Here, Hartree-Fock single-point calculations are used as benchmarks.
Simulations of various sizes have been considered, by changing the number of atoms and the size of the basis set (i.e. the number of basis functions per atom):

  • Water-15: a cluster of 15 water molecules (small problem size)
      • 6-31G basis set (GPU runs): 195 basis functions
      • cc-pVDZ basis set (CPU runs): 375 basis functions
  • Water-150: a cluster of 150 water molecules (medium problem size)
      • STO-3G basis set (GPU runs): 1050 basis functions
      • 6-31G basis set (GPU runs): 1950 basis functions
      • cc-pVDZ basis set (CPU runs): 3750 basis functions

Resource Allocation

Mulan - Single MI100

All jobs were run on Mulan under interactive sessions. Mulan is currently configured with hyperthreading enabled, and each compute node hosts two GPUs. For our purposes, we only need to reserve one GPU for benchmarking. We used the following Slurm command to reserve one GPU and one AMD EPYC 7003 socket:

salloc -n1 -c 32 --threads-per-core=1 --mem=120G --account=pawsey0007 --gres=gpu:1 --partition=workq  -C rome

This allocates 1 task with 32 cores per task, 1 thread per core, 120 GB of memory, and 1 GPU. The -C rome flag ensures that the login node is not used for the allocation.

Mulan - CPU Only

All jobs were run on Mulan under interactive sessions. Mulan is currently configured with hyperthreading enabled. For CPU-only benchmarking, we only need to reserve one CPU socket. We used the following Slurm command to reserve one AMD EPYC 7003 socket:

salloc -n1 -c 32 --threads-per-core=1 --mem=120G --account=pawsey0007 --partition=workq  -C rome 

This allocates 1 task with 32 cores per task, 1 thread per core, and 120 GB of memory. The -C rome flag ensures that the login node is not used for the allocation.

Outcomes

Overall, the sprint pointed the EXESS team in a direction of performance portability. In the process, the team gained knowledge and experience profiling on multiple platforms while uncovering potential avenues for further exploration following the sprint.

Porting to HIP

MAGMA v. ROCSolver

EXESS supports building with either MAGMA or rocSOLVER for the eigenvalue decomposition stage. During the sprint planning phase, we investigated which of these build options to focus on during the sprint. We chose to work with the MAGMA build, given that the rocSOLVER performance during eigenvalue decomposition was orders of magnitude slower than the equivalent MAGMA calls.

| Benchmark | Build | Processing Time (s) |
| --- | --- | --- |
| Water-150 (3 iterations) | MAGMA | 4.8 |
| Water-150 (3 iterations) | rocSOLVER | 1880.5 |

Table 1: Water-150 benchmark (with 3 max iterations) total run time on a single Mulan MI100 GPU for both the MAGMA and rocSOLVER builds of EXESS.

Table 1 summarizes the comparison of the MAGMA and rocSOLVER builds of EXESS for the Water-150 benchmark (with 3 max iterations). Using MAGMA provides significantly improved performance. Hotspot and trace profiles show that the majority of the runtime for the rocSOLVER build is spent within a kernel called `stedc_kernel`. This kernel is responsible for the eigenvalue decomposition within rocSOLVER.

The rocSOLVER GitHub repository currently has an open issue related to this kernel, where a developer on a different project also experiences poor performance of `stedc_kernel`. Of particular note, the user cgmb reports:

“ROCm 4.5 is the first release that includes DSYEVD, and it hasn't seen significant optimization yet. I see that the comment above STEDC notes this is a very basic implementation that will only effectively utilize the GPU for batches of matrices.”

Inspection of the stedc_kernel source code reveals that only block-level parallelism is exposed on the GPU within this method:

rocblas_int bid = hipBlockIdx_x;
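// our reading of the snippet above: each GPU block handles one matrix of the
// batch via bid, so parallelism is exposed only across the batch dimension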

For now, we leave with the recommendation that developers working with algorithms that involve matrix diagonalization opt for MAGMA over rocSOLVER.

MI100 v. MI250x

Figure 1: A screenshot of a trace profile generated by rocprof on Crusher (MI250x) and visualized with Perfetto for the Water-150 benchmark (6-31G basis set). The top half shows a timeline, with time advancing from left to right. HIP kernels and HIP API calls are shown as colored bars that span the start and end of their execution. The lower half shows a hotspot (flat) profile for the highlighted region, which focuses on 5 consecutive integral evaluations.

Figure 1 shows a trace profile for the Water-150 benchmark run on ORNL’s Crusher (MI250x GPU). For comparison, a trace profile for the same benchmark on Mulan (MI100) is shown in Figure 2, covering roughly the same time period shown in Figure 1.

Figure 2: A screenshot of a trace profile generated by rocprof on Mulan (MI100) and visualized with Perfetto for the Water-150 benchmark (6-31G basis set). The top half shows a timeline, with time advancing from left to right. HIP kernels and HIP API calls are shown as colored bars that span the start and end of their execution. The lower half shows a hotspot (flat) profile for the highlighted region, which focuses on 5 consecutive integral evaluations.

Table 2 summarizes the comparison of the kernel runtime and hipMemcpyAsync calls between Crusher and Mulan. Notably, the MI250x architecture executes the integral kernel about 2.6x faster than on the MI100 without any changes to source code or launch parameters.

The hipMemcpyAsync calls were about 1.3x faster on Crusher than on Mulan. The cause of the difference in hipMemcpyAsync performance between the two systems is currently unclear.

| Name | Wall Duration (Mulan; MI100) | Wall Duration (Crusher; MI250x) | Wall Duration (Topaz; V100) |
| --- | --- | --- | --- |
| hipMemcpyAsync | 2.687 ms | 2.012 ms | 1.297 ms* |
| genfock::HIP_GPU::GPU_kernel_1_0_1_0 | 2.193 ms | 0.859 ms | 1.420 ms |

*Computed as ave × N_calls, where ave is the average call time over the entire simulation and N_calls = 230 is the number of calls for 5 kernel executions.

Table 2: Summary of the hipMemcpyAsync and integral kernel evaluation wall times from Figures 1 and 2, aggregated over 5 integral evaluations.

During the sprint, we discussed two potential strategies to reduce the total wall time:

  1. Overlap hipMemcpyAsync calls through the use of GPU streams
  2. Restructure EXESS so that all data can be moved to the GPU before the iterative SCF method

The first method was favored, since less effort is required to adapt EXESS to use multiple streams for kernel evaluation and data movement. Even though the kernel evaluations will likely be serialized, it is possible that the small memory copies from host to device can be overlapped; with two streams, we suspect that the wall time spent in hipMemcpyAsync can be reduced by a factor of two (at most).
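As a rough illustration of the first strategy, the sketch below alternates small hipMemcpyAsync calls and kernel launches across two HIP streams. The buffer sizes, the `dummy_kernel`, and the launch configuration are placeholders rather than EXESS code; pinned host memory is used because asynchronous copies from pageable memory generally do not overlap with other work.

```cpp
// Hypothetical two-stream sketch: issue small host-to-device copies and the
// subsequent kernel launches on alternating streams so that a copy for one
// batch can overlap with work for another. All names are placeholders.
#include <hip/hip_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

#define HIP_CHECK(cmd)                                                        \
    do {                                                                      \
        hipError_t e = (cmd);                                                 \
        if (e != hipSuccess) {                                                \
            std::fprintf(stderr, "HIP error %s at line %d\n",                 \
                         hipGetErrorString(e), __LINE__);                     \
            std::exit(1);                                                     \
        }                                                                     \
    } while (0)

// Stand-in for an integral kernel: scales a small batch of data in place.
__global__ void dummy_kernel(double* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0;
}

int main()
{
    const int n_batches = 8;
    const int n = 1 << 14;                  // small per-batch payload
    const size_t bytes = n * sizeof(double);

    // Pinned host memory so that hipMemcpyAsync can truly overlap.
    double* h_buf = nullptr;
    HIP_CHECK(hipHostMalloc(reinterpret_cast<void**>(&h_buf),
                            n_batches * bytes, hipHostMallocDefault));
    for (int i = 0; i < n_batches * n; ++i) h_buf[i] = 1.0;

    std::vector<double*> d_buf(n_batches);
    for (auto& p : d_buf)
        HIP_CHECK(hipMalloc(reinterpret_cast<void**>(&p), bytes));

    hipStream_t streams[2];
    HIP_CHECK(hipStreamCreate(&streams[0]));
    HIP_CHECK(hipStreamCreate(&streams[1]));

    for (int b = 0; b < n_batches; ++b) {
        hipStream_t s = streams[b % 2];     // alternate between two streams
        HIP_CHECK(hipMemcpyAsync(d_buf[b], h_buf + b * n, bytes,
                                 hipMemcpyHostToDevice, s));
        dummy_kernel<<<(n + 255) / 256, 256, 0, s>>>(d_buf[b], n);
    }

    HIP_CHECK(hipStreamSynchronize(streams[0]));
    HIP_CHECK(hipStreamSynchronize(streams[1]));

    for (auto p : d_buf) HIP_CHECK(hipFree(p));
    HIP_CHECK(hipHostFree(h_buf));
    HIP_CHECK(hipStreamDestroy(streams[0]));
    HIP_CHECK(hipStreamDestroy(streams[1]));
    std::printf("done\n");
    return 0;
}
```

In EXESS, the per-iteration host-to-device copies would take the place of the placeholder buffers, with each integral kernel launched on the same stream as its inputs.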

Mulan Kernel Launch Latency

We noticed in the w15 benchmark that there is a delay between issuing a memcpy call and its execution on the device. This issue is less pronounced for w150 (larger problems). We need to follow up on this latency issue.
In the screenshots below (Figures 3 and 4), we compare the same 5 kernel calls, and corresponding memory copies, between profiles for the Nvidia V100 and the AMD MI100.

  • In the former case (V100), the timeline is split almost evenly between kernel runtime and memory copies, about 1 ms each.
  • In the latter case (MI100), the timeline is split quite unevenly, with about 1 ms for kernel runtime and 5 ms for memcpy calls.
Figure 3: A screenshot of a trace profile generated by nsys on Topaz (V100) and visualized with Nvidia Nsight Systems for the Water-15 benchmark (6-31G basis set). The top half shows a timeline, with time advancing from left to right. CUDA kernels, memory copies, and CUDA API calls are shown as colored bars that span the start and end of their execution.
Figure 4: A screenshot of a trace profile generated by rocprof on Mulan (MI100) and visualized with Perfetto for the Water-15 benchmark (6-31G basis set). The top half shows a timeline, with time advancing from left to right. HIP kernels and HIP API calls are shown as colored bars that span the start and end of their execution. The lower half shows a hotspot (flat) profile for the highlighted region, which focuses on 5 consecutive integral evaluations.

Future Directions

As a result of the sprint, the EXESS team has developed a few directions they plan to follow up on in the near future.

  • Improving memory access patterns (coalesced memory access)
    During the sprint, we discussed the possibility of improving memory access patterns in EXESS. Indirect indexing in the ERI kernels is currently thought to lead to non-coalesced memory addressing, which may cause the achieved GPU DRAM bandwidth to fall below the peak DRAM bandwidth on the target GPUs.

Further evidence via profiling is needed to confirm this notion. Direct measurements would typically include profiler metrics for bytes read/written; the Nvidia profiler nvprof provides such metrics, but they are currently lacking in rocprof and on AMD hardware.
  • Shared memory for the Fock submatrix
    In the GPU_Kernel_* methods, there is an opportunity to move tiles of the Fock matrix to shared memory to improve atomic_* performance. Currently, the atomic_* operations work on the fock_matrix in global memory, but performance can be improved by performing the atomic operations on shared memory. The main hurdle is rewriting the kernel routines to work on submatrices for each GPU block while taking into account multiple levels of indirect indexing. A hedged sketch of this idea is given after this list.

  • Multiple streams to overlap memory copies
    The EXESS workflow currently requires moving a small amount of data from host to device before each iteration. However, all of the data that needs to be moved to the GPU is known a priori, and can be staged at any time before a specific integral evaluation. We discussed either staging all data during the initialization phase of EXESS or using multiple streams to overlap memory copies from host to device. The latter was determined to be the favored approach, given that the code structure lends itself to supporting multiple streams more easily than migrating all data to the GPU up front.
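Below is a hedged sketch of the shared-memory idea described in the second bullet above. The tile size, indexing, and kernel name are hypothetical, the placeholder contributions stand in for ERI terms, and the multiple layers of indirect indexing in the real kernels are not modelled; double-precision atomicAdd also assumes a sufficiently recent GPU (e.g. MI100 or V100).

```cpp
// Hypothetical sketch: each block accumulates its contributions into a small
// Fock tile held in shared memory, then flushes the tile to the global Fock
// matrix with one atomicAdd per element. Names and sizes are placeholders.
#include <hip/hip_runtime.h>
#include <cstdio>

constexpr int TILE = 16;   // assumed tile edge length; the real choice needs tuning

__global__ void fock_tile_accumulate(double* fock, int n, int row0, int col0)
{
    __shared__ double tile[TILE][TILE];

    const int tr = threadIdx.y;   // row within the tile
    const int tc = threadIdx.x;   // column within the tile
    tile[tr][tc] = 0.0;
    __syncthreads();

    // Placeholder contributions. The shared-memory atomic mirrors the scatter
    // pattern of the real kernels; in this toy loop each thread only updates
    // its own tile entry, so the atomic is purely illustrative here.
    for (int k = 0; k < 32; ++k)
        atomicAdd(&tile[tr][tc], 1.0e-3 * (k + 1));
    __syncthreads();

    // One global atomic per tile element instead of one per contribution.
    const int r = row0 + tr;
    const int c = col0 + tc;
    if (r < n && c < n)
        atomicAdd(&fock[r * n + c], tile[tr][tc]);
}

int main()
{
    const int n = 64;
    double* d_fock = nullptr;
    hipMalloc(reinterpret_cast<void**>(&d_fock), n * n * sizeof(double));
    hipMemset(d_fock, 0, n * n * sizeof(double));

    dim3 block(TILE, TILE);
    fock_tile_accumulate<<<1, block>>>(d_fock, n, 0, 0);
    hipDeviceSynchronize();

    double sample = 0.0;
    hipMemcpy(&sample, d_fock, sizeof(double), hipMemcpyDeviceToHost);
    std::printf("fock[0][0] = %f\n", sample);   // expect 0.528 for this toy loop
    hipFree(d_fock);
    return 0;
}
```

The intent is that many cheap shared-memory atomics are traded for a single global atomicAdd per tile element when the tile is flushed.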

General Outcomes

In summary:

  • Better understanding of the merits and limitations of the AMD/ROCm platform and/or the Mulan cluster, e.g.
      • Identified the reason for the slow performance of rocSOLVER
      • Assessed profiling capabilities relative to Nvidia, in particular in relation to the available hardware counters
      • Familiarised ourselves with rocprof and Perfetto
      • AMD uProf CPU hardware counters: currently disabled on Mulan, to be investigated
  • Chance to perform comparative profiling of a production code across three GPU architectures (V100, MI100, and MI250x)