The Weather Research and Forecasting (WRF) Model v4 CONUS Benchmarks on Google Cloud Platform
The Weather Research and Forecasting (WRF) Model is an atmospheric fluid dynamics modeling tool used in both research and operational settings. WRF® development began in the late 1990s through a collaboration between the National Center for Atmospheric Research (NCAR), the National Oceanic and Atmospheric Administration (NOAA), the U.S. Air Force, the Naval Research Laboratory, the University of Oklahoma, and the Federal Aviation Administration. Designed to meet the demands of both atmospheric research and operational forecasting, WRF has built a global community of more than 48,000 users spanning over 160 countries.
Cloud computing is becoming increasingly prevalent across government, academic, and commercial organizations. If you are involved in atmospheric research or operations, WRF may already be part of your computational toolkit. If not, you are likely working with other finite-volume computational fluid dynamics (CFD) or numerical weather prediction (NWP) models that have similar computational and MPI communication patterns. With this in mind, understanding how to squeeze the most performance out of these kinds of applications is critical for managing operational expenses when considering a transition to cloud resources.
In this article, we discuss a benchmarking study of WRF on Google Cloud Platform. We’ll show that optimal performance is currently achieved using the c2 GCE machine types with optimal cost-performance achieved by subscribing MPI ranks to each vCPU (hyperthread). Additionally, we demonstrate clear advantages of using WRF’s asynchronous parallel IO option in combination with a Lustre file system hosted on GCP over single-point NFS file-server solutions. For the WRF CONUS 12km benchmark, we show how disabling hyperthreading produces comparable results to MPI+OpenMP configurations when the minimum domain size per rank is reached. For the WRF CONUS 2.5km benchmark, we’ll present scaling efficiency results out to 7,680 ranks on Google Cloud Platform.
The results of this study include artifacts that allow others to reproduce our findings as well as quickly get started with WRF on Google Cloud. Specifically, this includes a click-to-deploy RCC-WRF solution as well as open-source VM image baking scripts on github.com/fluidnumerics/rcc-apps.
Parallelism in WRF
The numerical weather prediction algorithms in WRF are based on discretizations of 3-D space into a structured mesh of grid cells. A mesh is specified by its latitude and longitude extents, a maximum altitude, and the number of subdivisions in each direction. Within each grid cell, WRF approximates the equations of conservation of mass, momentum, and thermodynamic quantities (temperature and humidity).
To distribute work in parallel, WRF decomposes this mesh into subdomains that are each assigned to an MPI rank. While this allows the end user to spread the computational workload across more compute resources, MPI ranks must communicate their subdomain state on halo cells to MPI ranks with neighboring grid cells. In the schematic, each block of cells is owned by a unique MPI process, and the data associated with each MPI process is private. To compute changes in the atmospheric state, neighboring MPI ranks exchange information in the halo cells, indicated by the blue grid cells.
Although domain decomposition can bring down the overall runtime of a simulation by parallelizing computations, inefficiencies in MPI communications can result in increased simulation costs since the total CPU-hour requirement is higher. Because of this, the scaling efficiency of tightly coupled HPC applications is important for cost-benefit analysis for research or operational projects.
Process Affinity and Hyperthreading
Software applications can be characterized by their computational and communication patterns. When leveraging MPI, the choices of how many cores and which cores to reserve for each rank can influence the overall performance and scalability of an application. Modern compute architectures offer core virtualization, where a single physical core can operate two “hyperthreads” with rapid context switching. This further complicates the discovery process for determining optimal process affinity for HPC applications.
Some applications perform better by taking advantage of hyperthreading, while others do not. A paper from the early aughts (T. Leng et al., 2002) provided empirical evidence that performance gains or degradations due to hyperthreading depend on the application.
Applying Hyper-Threading and doubling the processes that simultaneously run on the cluster will increase the utilization rate of the processors’ execution resources. Therefore, the performance can be improved.
On the other hand, overheads might be introduced in the following ways:
Logical processes may compete for access to the caches, and thus could generate more cache-miss situations
More processes running on the same node may create additional memory contention
More processes on each node increase the communication traffic (message passing) between nodes, which can oversubscribe the communication capacity of the shared memory, the I/O bus or the interconnect networking, and thus create performance bottlenecks.
Whether the performance benefits of Hyper-Threading – better resource utilization – can nullify these overhead conditions depends on the application’s characteristics. (T. Leng et al., 2002)
At many HPC centers, the option to enable or disable hyperthreading on the hardware is left in the hands of system administrators. In cloud environments, users are able to control the visibility of virtual CPUs (hardware threads/hyperthreads) on the guest operating systems. However, the processor mapping from guest OS to host OS on public cloud platforms is not always one-to-one. When running tightly coupled MPI applications, we can leverage process affinity flags to map MPI processes to either physical or virtual CPUs, providing a proxy for “disabling” or “leveraging” hyperthreads.
Controlling Process Affinity
OpenMPI provides flags for specifying how MPI ranks should be mapped onto the underlying hardware. The --bind-to option is used to specify which hardware components an MPI task must reside on for the duration of a program’s execution. Available options are hwthread, core, socket, and node. Binding to a hwthread, for example, keeps an MPI process from migrating between vCPUs during its lifetime. In contrast, binding to a core keeps an MPI process bound to a physical core, but allows the process to migrate between that core's vCPUs (hyperthreads) during execution.
The --map-by option is used to specify the order in which MPI processes are mapped to hardware. Available options are hwthread, core, socket, and node.
In addition to OpenMPI flags, workload managers and job schedulers, like Slurm, can be used to specify affinity constraints. In a batch script's header, you can dictate how many tasks are required for a job and how many tasks are permitted to run on each compute node. As an example, the batch header below specifies that we need to run 60 tasks (equivalent here to MPI processes), with 30 tasks per node and 2 vCPUs per task. This job would require a total of 120 vCPUs; each MPI rank would have access to 2 vCPUs, and only 30 MPI ranks would be allowed on each compute node.
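As a minimal sketch (the shebang is a placeholder; partition and account directives are omitted), such a batch header might look like:

```shell
#!/bin/bash
#SBATCH --ntasks=60            # 60 MPI processes in total
#SBATCH --ntasks-per-node=30   # at most 30 MPI processes per compute node
#SBATCH --cpus-per-task=2      # 2 vCPUs reserved for each task
```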
To give an example of how --map-by and --bind-to can be used together with the Slurm job scheduler to map MPI processes to specific hardware components, consider the c2-standard-60 instances on Google Cloud Platform. Each node has 30 physical cores, numbered 0-29, with two hardware threads per core, numbered 0-1, giving a total of 60 vCPUs. Additionally, each node has two sockets with 15 cores each. This can be seen by running lscpu on a c2-standard-60 VM.
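The relevant portion of the lscpu output for this machine type reads something like the following (abbreviated and illustrative; field formatting varies between lscpu versions):

```
CPU(s):                60
Thread(s) per core:    2
Core(s) per socket:    15
Socket(s):             2
NUMA node0 CPU(s):     0-14,30-44
NUMA node1 CPU(s):     15-29,45-59
```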
Notice that the CPU numbering corresponds to vCPUs. Specifically, vCPUs 0-14 correspond to physical cores 0-14, hardware thread 0; vCPUs 30-44 correspond to physical cores 0-14 hardware thread 1.
Suppose we use the following Slurm batch file
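A minimal sketch of such a batch file (the shebang and executable path are placeholders):

```shell
#!/bin/bash
#SBATCH --ntasks=60            # 60 MPI processes in total
#SBATCH --ntasks-per-node=30   # at most 30 MPI processes per compute node
#SBATCH --cpus-per-task=1      # 1 vCPU reserved for each MPI rank

mpirun -np 60 --bind-to hwthread --map-by core ./wrf.exe
```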
In this case, each MPI rank has access to 1 vCPU and only 30 MPI ranks are allowed on each compute node. When using c2-standard-60 instances, this job requires 2 virtual machines. With the flags --np 60 --bind-to hwthread --map-by core, each MPI process is bound to hardware thread 0 on its own physical core. This scenario is similar to running with hyperthreading disabled.
Alternatively, we can run this job on a single c2-standard-60 instance by setting ntasks-per-node=60.
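A sketch of the corresponding batch header (other directives unchanged from the previous example):

```shell
#SBATCH --ntasks=60
#SBATCH --ntasks-per-node=60   # all 60 ranks on a single c2-standard-60
#SBATCH --cpus-per-task=1
```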
In this example, the flags --np 60 --bind-to hwthread --map-by core assign the ranks as follows:
Rank 0 to physical core 0, hardware thread 0 on node 0 (vCPU 0)
Rank 1 to physical core 0, hardware thread 1 on node 0 (vCPU 30)
Rank 2 to physical core 1, hardware thread 0 on node 0 (vCPU 1)
Rank 3 to physical core 1, hardware thread 1 on node 0 (vCPU 31)
…
Rank 58 to physical core 29, hardware thread 0 on node 0 (vCPU 29)
Rank 59 to physical core 29, hardware thread 1 on node 0 (vCPU 59)
At the outset of this work, it was unknown whether or not WRF benefits from hyperthreading. When discussing benefit, we consider both time-to-solution and computational expense. The time-to-solution is defined as the amount of wall-time needed to complete a WRF simulation, while the computational expense is proportional to the number of compute nodes multiplied by the wall-time.
For this work, we use Version 4.2.1 of WRF. Compared to version 3, fewer performance results have been published for WRF v4. In 2018, a student at NCAR’s Summer Internship in Parallel Computational Science (SIParCS) worked on adapting the WRF v3 benchmarks to v4.0.0 to assess scaling and performance on the Cheyenne supercomputer. The results of this study, along with the scripts used to generate input decks for the CONUS 2.5km, CONUS 12km, and Hurricane Maria benchmarks, are available online at https://github.com/akirakyle/WRF_benchmarks. In our work, we’ve adapted these scripts for building WRF and WPS and for creating the input decks for the CONUS 12km and CONUS 2.5km benchmarks.
To build WRF, we use a VM image baking approach that combines Google Cloud Build, Packer, and Spack. All of the build scripts are publicly available at https://github.com/FluidNumerics/rcc-apps/tree/main/wrf . For the starting VM image, we use the RCC-CentOS VM image. We’ve defined a template spack environment file that is used to concretize the target architecture and chosen compiler at build time. This allows the user to specify build substitution variables to define the target architecture and compiler using Google Cloud Build. Once the rcc-apps repository is cloned, the user can build a WRF VM image using the following command from the root directory of the repository:
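As a sketch, assuming a Cloud Build configuration in the repository's wrf subdirectory (the config path and substitution variable names shown here are illustrative and should be checked against the repository):

```shell
git clone https://github.com/FluidNumerics/rcc-apps.git
cd rcc-apps
gcloud builds submit . --config=wrf/cloudbuild.yaml \
    --project=${GOOGLE_CLOUD_PROJECT} \
    --substitutions=_TARGET_ARCH=cascadelake,_COMPILER=gcc
```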
Initial Conditions, Boundary Conditions, and Forcing
For each simulation, we parse the rsl.out.0000 file that contains timing information for forward stepping WRF, writing output files, and initialization. The total amount of time for each category is accumulated and reported for each simulation. The sum of the time spent in each category (initialization, file IO, and forward-stepping) is an estimate of the total execution time of the simulation.
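A sketch of this parsing step in Python (the helper name is ours, and the exact rsl.out.0000 line format can vary between WRF versions):

```python
import re
from collections import defaultdict

# WRF appends timing lines to rsl.out.0000 that look roughly like:
#   Timing for main: time 2019-11-26_12:01:00 on domain 1: 2.27281 elapsed seconds
#   Timing for Writing wrfout_d01_2019-11-26_12:00:00 for domain 1: 5.12 elapsed seconds
TIMING = re.compile(r"^\s*Timing for (.+?):.*?([\d.]+) elapsed seconds")

def accumulate_timings(lines):
    """Sum elapsed seconds per category: forward stepping, file IO, and
    everything else (initialization/input processing)."""
    totals = defaultdict(float)
    for line in lines:
        m = TIMING.match(line)
        if not m:
            continue
        label, seconds = m.group(1), float(m.group(2))
        if label == "main":
            totals["forward_step"] += seconds
        elif label.startswith("Writing"):
            totals["file_io"] += seconds
        else:
            totals["initialization"] += seconds
    return dict(totals)
```

The sum of the three accumulated categories then estimates the total execution time of the simulation.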
Figure 3.1.1 : The total amount of time spent in forward stepping the WRF CONUS 2.5km benchmark with a 2 hour forecast is shown as a function of the machine type. For each machine type, the best (lowest) compute time observed is shown.
Table 3.2.1 : The results of three CONUS 12km runs where we compare the effects of task affinity. The configuration with the lowest compute time is highlighted in blue and the configuration with the lowest cost is highlighted in green.
The experiments in Table 3.2.1 show that the lowest (optimal) runtime is achieved by providing each MPI rank 2 vCPUs while binding each rank to a hardware thread on its own physical core. However, by leveraging hyperthreading we are able to reduce the compute cost by a factor of 1.8. This reduction in compute cost comes at the price of a simulation that is about 11.6% slower.
To fully utilize all of the hyperthreads on 16 c2-standard-60 instances, we need to expose more parallelism in WRF. The CONUS 12km benchmark has too few grid points to be able to run with double (960) the number of MPI ranks. However, WRF is able to operate in a hybrid MPI+OpenMP configuration. With this setup, each MPI rank can run with a specified number of OpenMP threads.
To run with MPI+OpenMP, WRF is built using the dmpar+smpar option which effectively adds the -fopenmp flag during compilation. At runtime, we bind MPI processes to physical cores and give each rank two OpenMP threads that are in turn bound to hyperthreads. This is specified with the following MPI flags and batch header.
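A sketch of these flags and batch header for the hybrid configuration (the executable path is a placeholder, and the exact affinity flags should be validated for your OpenMPI version):

```shell
#SBATCH --ntasks=480           # 480 MPI ranks across 16 c2-standard-60 nodes
#SBATCH --ntasks-per-node=30   # one MPI rank per physical core
#SBATCH --cpus-per-task=2      # both hyperthreads of each core

export OMP_NUM_THREADS=2       # two OpenMP threads per MPI rank
export OMP_PLACES=threads      # pin OpenMP threads to hyperthreads

mpirun -np 480 --bind-to core --map-by core ./wrf.exe
```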
With this MPI+OpenMP configuration, the compute time is reduced by about 5.4% ( 36.96 s ) in comparison to the MPI only configuration on 16 nodes.
First, we look at three configurations for the process affinity:
Table 3.2.2 : The results of three CONUS 2.5km runs where we compare the effects of task affinity. The configuration with the lowest compute time is highlighted in blue and the configuration with the lowest cost is highlighted in green.
For the CONUS 2.5km benchmark, we find that binding MPI ranks to hardware threads and allowing 60 MPI ranks per node results in the lowest cost simulation. However, this configuration yields the largest wall-time. When providing 2 vCPUs per task and allowing 30 tasks per node, the compute time can be reduced (improved) by about 35.7%. However, the simulation cost has increased by 28.5% due to the fact that this configuration requires twice as many compute nodes.
Figure 3.2.1 : Summary figure showing the estimated simulation cost (left) and measured total wall-time (right) for the CONUS 2.5km benchmark as a function of the number of MPI ranks when binding MPI ranks to cores (blue) and vCPU (green).
Choosing the optimal compiler
Figure 3.3.1 : The relative simulation runtime for CONUS 2.5km at 480 MPI ranks comparing email@example.com, firstname.lastname@example.org, and the Intel OneAPI Compilers (v2021.2.0). The relative runtimes are referenced to the slowest simulation (email@example.com).
Optimizing File IO
Table 3.4.1 below is adapted from the description of namelist variables in WRF’s documentation, focusing on the definitions of the io_form_restart and io_form_history parameters.
For all options except 102, WRF aggregates the data from all MPI ranks to the “root rank” (rank 0), which writes a single file. With these options, file IO can become a major bottleneck that grows more noticeable as the number of MPI ranks increases. There are two causes for this.
In this study, we compare file IO performance of the io_form 11 and io_form 102 options. For the filesystem, we look at two options: a single-VM NFS file server and a Lustre parallel file system.
Figure 3.4.1 : The total time spent in File IO for the CONUS 12km benchmarks compared between simulations using 480 ranks on 8 compute nodes and 16 compute nodes. Both runs use the NFS filesystem and io_form=11.
Figure 3.4.1 shows that spreading the MPI ranks across 16 compute nodes, rather than 8, improves (reduces) the time spent in file IO.
Figure 3.4.2 shows that using io_form=102 significantly reduces the time spent in file IO from 5083.8s to 15.9s ( a factor of about 319 ). With 480 MPI ranks, this corresponds to a scaling efficiency of about 66%. By changing the filesystem to Lustre, we are able to further reduce the time spent in file IO to 0.997s ( another factor of 15.9 ).
We see similar behavior in the CONUS 2.5km benchmark, where we have 960 MPI ranks across 16 nodes. When using io_form=11 on the NFS filesystem, we spend 8004.3s in file IO for the CONUS 2.5km benchmark. Transitioning to io_form=102 provides improved file IO performance and using a Lustre file system provides additional gains.
Figure 3.4.2: The speedup of the File IO, relative to the best io_form=11 CONUS 12km simulation, is shown for io_form=102 on the NFS filesystem and on the Lustre filesystem. By using io_form=102, we observe a 319x improvement in time spent in file IO. By transitioning to the Lustre file system, we observe an additional 15x improvement in file IO time.
Figure 3.4.3: The speedup of the File IO, relative to the io_form=11 CONUS 2.5km (960 rank) simulation, is shown for io_form=102 on the NFS filesystem and on the Lustre filesystem. By using io_form=102, we observe a 192x improvement in time spent in file IO. By transitioning to the Lustre file system, we observe an additional 68x improvement in file IO time.
With the 960 rank CONUS 2.5km benchmark, the scaling efficiency is only 20%. This observation is consistent with the notion that, at 960 ranks, we have exceeded the capacity of the NFS filesystem to keep up with parallel file IO demands. By transitioning to Lustre, however, we are able to gain an additional 68x speedup, bringing the file IO time down to 0.6s.
Scaling CONUS 2.5km
Figure 3.5.1 : The total wall-time (blue), compute/forward stepping time (red), initialization time (yellow), and file IO time (green) are shown for the CONUS 2.5km benchmarks as a function of the number of MPI ranks.
Figure 3.5.1 shows the wall times for the CONUS 2.5km simulation for 480 to 3840 MPI ranks on the c2-standard-60 instances. This figure shows that the majority of the total wall-time can be accounted for by the time spent in forward stepping the model. Generally, the wall-time decreases as the number of MPI ranks increases. However, the scaling is not linear, indicative of scaling inefficiencies that grow as the number of MPI tasks increases.
To quantify this, we compute the scaling efficiency as E(N) = (480 × T480) / (N × TN), where T480 is the wall-time for the 480 rank benchmark, TN is the wall-time for the N rank benchmark, and N is the number of MPI ranks for a given simulation. A scaling efficiency of one represents ideal scaling and no contribution of MPI overhead to the overall runtime. Lower scaling efficiencies indicate an increasing impact of MPI overhead on the runtime.
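The efficiency calculation described above can be sketched as a small helper (the function name is ours):

```python
def scaling_efficiency(t_ref, n_ref, t_n, n):
    """Strong-scaling efficiency of an n-rank run relative to an
    n_ref-rank baseline: E = (n_ref * t_ref) / (n * t_n)."""
    return (n_ref * t_ref) / (n * t_n)

# Doubling the ranks while halving the wall-time is ideal scaling:
scaling_efficiency(100.0, 480, 50.0, 960)   # -> 1.0
```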
Figure 3.5.2 : The scaling efficiency for the CONUS 2.5km benchmarks, relative to the 480 rank run, is shown for the compute/forward stepping time.
Figure 3.5.2 shows the scaling efficiency of the CONUS 2.5km benchmarks, relative to the 480 rank run, as a function of the number of MPI tasks. As the number of MPI tasks increases, the scaling efficiency decreases, reaching about 43% for the total wall-time.