The Weather Research and Forecasting (WRF) Model v4 CONUS Benchmarks on Google Cloud Platform

Introduction

The Weather Research and Forecasting Model (WRF) is an atmospheric fluid dynamics modeling tool used in both research and operational settings. WRF® development began in the late 1990s through a collaboration between the National Center for Atmospheric Research (NCAR), National Oceanic and Atmospheric Administration (NOAA), U.S. Air Force, Naval Research Laboratory, University of Oklahoma, and the Federal Aviation Administration. With the goal of meeting demands in both atmospheric research and in operational forecasting situations, WRF has created a global community of more than 48,000 users spanning over 160 countries.

Cloud computing is becoming increasingly prevalent across gov, edu, and commercial organizations. For those involved in atmospheric research or operations, it’s possible that WRF is a part of your computational toolkit. Otherwise, you’re likely working with other finite-volume based computational fluid dynamics (CFD) / numerical weather prediction (NWP) that have similar computational and MPI communication patterns. With this in mind, understanding how to squeeze the most performance out of these kinds of applications is critical for optimizing operational expenses when considering transitioning to cloud resources.

In this article, we discuss a benchmarking study of WRF on Google Cloud Platform. We’ll show that optimal performance is currently achieved using the c2 GCE machine types with optimal cost-performance achieved by subscribing MPI ranks to each vCPU (hyperthread). Additionally, we demonstrate clear advantages of using WRF’s asynchronous parallel IO option in combination with a Lustre file system hosted on GCP over single-point NFS file-server solutions. For the WRF CONUS 12km benchmark, we show how disabling hyperthreading produces comparable results to MPI+OpenMP configurations when the minimum domain size per rank is reached. For the WRF CONUS 2.5km benchmark, we’ll present scaling efficiency results out to 7,680 ranks on Google Cloud Platform.

The result of this study includes artifacts that allow others to reproduce these results as well as quickly get started with WRF on Google Cloud. Specifically, this includes a click-to-deploy RCC-WRF solution as well as open-source VM image baking scripts on github.com/fluidnumerics/rcc-apps.

Parallelism in WRF

The numerical weather prediction algorithms in WRF are based on discretizations of 3-D space into a structured mesh of grid cells. A mesh is specified by its latitude and longitude extents, a maximum altitude, and the number of subdivisions in each direction. Within each grid cell, WRF approximates the equations of conservation of mass, momentum, and thermodynamic quantities (temperature and humidity).

WRF uses lateral domain decomposition to expose parallelism with MPI. This means that, given a number of available MPI processes, WRF will divide the domain in the latitude and longitude dimensions as equally as possible and distribute the work of each subdivision to an MPI rank.

Image generated using cartopy and Inkscape.

Image generated in Inkscape

While this allows the end user to spread the computational workload across more compute resources, MPI ranks must communicate their subdomain state on halo-cells to MPI ranks with neighboring grid cells. In the schematic, each block of cells is owned by a unique MPI process. Data associated with each MPI process is private. To compute changes in the atmospheric state, neighboring MPI ranks exchange information in the halo cells, indicated by the blue grid cells.

Although domain decomposition can bring down the overall runtime of a simulation by parallelizing computations, inefficiencies in MPI communications can result in increased simulation costs since the total CPU-hour requirement is higher. Because of this, the scaling efficiency of tightly coupled HPC applications is important for cost-benefit analysis for research or operational projects.

Process Affinity and Hyperthreading

Software applications can be characterized by their computational and communication patterns. When leveraging MPI, the choices of how many cores and which cores to reserve for each rank can influence the overall performance and scalability of an application. Modern compute architectures offer core virtualization, where a single physical core can operate two “hyperthreads” with rapid context switching. This further complicates the discovery process for determining optimal process affinity for HPC applications.

Some applications perform better by taking advantage of hyperthreading, while others do not. A paper in the early aughts (T. Leng et. al, 2002) provided empirical evidence that indicated that performance gains or degradations due to hyperthreading are dependent on the application.

Applying Hyper-Threading and doubling the processes that simultaneously run on the cluster will increase the utilization rate of the processors’ execution resources. Therefore, the performance can be improved.

On the other hand, overheads might be introduced in the following ways:

Logical processes may compete for access to the caches, and thus could generate more cache-miss situations
More processes running on the same node may create additional memory contention
More processes on each node increase the communication traffic (message passing) between nodes, which can oversubscribe the communication capacity of the shared memory, the I/O bus or the interconnect networking, and thus create performance bottlenecks.

Whether the performance benefits of Hyper-Threading – better resource utilization – can nullify these overhead conditions depends on the application’s characteristics. T. Leng et. al, 2002

At many HPC centers, the option to enable or disable hyperthreading on the hardware is left in the hands of system administrators. In cloud environments, users are able to control the visibility of virtual CPUs (hardware threads/hyperthreads) on the guest operating systems. However, the processor mapping from guest OS to host OS on public cloud platforms is not always one-to-one. When running tightly coupled MPI applications, we can leverage process affinity flags to map MPI processes to either physical or virtual CPUs, providing a proxy for “disabling” or “leveraging” hyperthreads.

Controlling Process Affinity

OpenMPI provides flags for specifying how MPI ranks should be mapped onto the underlying hardware. The --bind-to option option is used to specify which hardware components an MPI task must reside on for the duration of a program’s execution. Available options are hwthread, core, socket, and node. Binding to a hwthread, for example, keeps an MPI process from migrating between vCPUs during its lifetime. In contrast, binding to a core keeps an MPI process bound to a physical core, but allows the process to migrate between the vCPUs (hyperthreads) during execution.

The --map-by option is used to specify the order in which MPI processes are mapped to hardware. Available options are hwthread, core, socket, and node.

In addition to OpenMPI flags, the workload managers and job schedulers, like Slurm, can be used to specify affinity constraints as well. You can dictate how many tasks are required for a job in addition to how many tasks are permitted to run on each compute node in a batch script’s batch header. As an example, the batch header below, specifies that we need to run 60 tasks (equivalent here to MPI processes), with 30 tasks per node and 2 vCPUs per task. This job would require a total of 120 vCPUs. Each MPI rank in this job would have access to 2 vCPUs and only 30 MPI ranks would be allowed on each compute node.

#!/bin/bash

#SBATCH --ntasks=60

#SBATCH --ntasks-per-node=30

#SBATCH --cpus-per-task=2

To give an example of how --map-by and --bind-to options can be used together with the Slurm job scheduler to map MPI processes to specific hardware components, consider the c2-standard-60 instances on Google Cloud Platform. Each node has 30 physical cores, numbered from 0-29 with two hardware threads per core, numbered 0-1, giving a total of 60 vCPU. Additionally, each node has two sockets with 15 cores each. This can be seen by running lscpu on a c2-standard-60 VM.

$ lscpu

Architecture: x86_64

CPU op-mode(s): 32-bit, 64-bit

Byte Order: Little Endian

CPU(s): 60

On-line CPU(s) list: 0-59

Thread(s) per core: 2

Core(s) per socket: 15

Socket(s): 2

NUMA node(s): 2

Vendor ID: GenuineIntel

CPU family: 6

Model: 85

Model name: Intel(R) Xeon(R) CPU

Stepping: 7

CPU MHz: 3100.308

BogoMIPS: 6200.61

Hypervisor vendor: KVM

Virtualization type: full

L1d cache: 32K

L1i cache: 32K

L2 cache: 1024K

L3 cache: 25344K

NUMA node0 CPU(s): 0-14,30-44

NUMA node1 CPU(s): 15-29,45-59

Notice that the CPU numbering corresponds to vCPUs. Specifically, vCPUs 0-14 correspond to physical cores 0-14, hardware thread 0; vCPUs 30-44 correspond to physical cores 0-14 hardware thread 1.

Suppose we use the following Slurm batch file

#!/bin/bash

#SBATCH --ntasks=60

#SBATCH --ntasks-per-node=30

#SBATCH --cpus-per-task=2

mpirun -np 60 --bind-to hwthread --map-by core ./hello_world

In this case, each MPI rank has access to 1 vCPU and only 30 MPI ranks are allowed on each compute node. When using c2-standard-60 instances, this job requires 2 virtual machines. The flags --np 60 --bind-to hwthread --map-by core assigns

Rank 0 to physical core 0, hardware thread 0 on node 0 (vCPU 0)
Rank 1 to physical core 1, hardware thread 0 on node 0 (vCPU 1)
Rank 2 to physical core 2, hardware thread 0 on node 0 (vCPU 2)
…
Rank 29 to physical core 29, hardware thread 0 on node 0 (vCPU 29)
Rank 30 to physical core 0, hardware thread 0 on node 1 (vCPU 0)
Rank 31 to physical core 1, hardware thread 0 on node 1 (vCPU 1)
…
Rank 59 to physical core 29, hardware thread 0 on node 1 (vCPU 29)

In this case, each MPI process is bound to hardware thread 0 on each physical core. This scenario is similar to running with hyperthreading disabled.

Alternatively, we can run a job with 1 c2-standard-60 instance by setting and ntasks-per-node=60.

#!/bin/bash

#SBATCH --ntasks=60

#SBATCH --ntasks-per-node=60

#SBATCH --cpus-per-task=1

mpirun -np 60 --bind-to hwthread --map-by core ./hello_world

In this example, the flags --np 60 --bind-to hwthread --map-by core

Rank 0 to physical core 0, hardware thread 0 on node 0 (vCPU 0)
Rank 1 to physical core 0, hardware thread 1 on node 0 (vCPU 30)
Rank 2 to physical core 1, hardware thread 0 on node 0 (vCPU 1)
Rank 2 to physical core 1, hardware thread 1 on node 0 (vCPU 31)
…
Rank 58 to physical core 29, hardware thread 0 on node 0 (vCPU 29)
Rank 59 to physical core 29, hardware thread 1 on node 0 (vCPU 59)

On the outset of this work, it is unknown whether or not WRF benefits from hyperthreading. When discussing benefit, we consider both measurements of time-to-solution and computational expense. The time-to-solution is defined as the amount of wall-time needed to complete a WRF simulation. While the computational expense is proportional to the product of the number of compute nodes multiplied by the wall time.

Methods

For this work, we use Version 4.2.1 of WRF. Unlike version 3 of WRF, there are fewer published results on the performance of WRF v4. In 2018, a student at NCAR’s Summer Internship in Parallel Computational Science (SIParCS) worked on adapting the WRF v3 benchmarks for v4.0.0 to assess scaling and performance on the Cheyenne supercomputer. The results of this study are available online at https://github.com/akirakyle/WRF_benchmarks and the scripts used to generate input decks for the CONUS 2.5km, CONUS 12km, and Hurricane Maria benchmarks can be found at https://github.com/akirakyle/WRF_benchmarks. In our work we’ve adapted these scripts for building WRF and WPS and for creating the input decks for the CONUS 12km and CONUS 2.5km benchmarks.

Building WRF

To build WRF, we use a VM image baking approach that combines Google Cloud Build, Packer, and Spack. All of the build scripts are publicly available at https://github.com/FluidNumerics/rcc-apps/tree/main/wrf . For the starting VM image, we use the RCC-CentOS VM image. We’ve defined a template spack environment file that is used to concretize the target architecture and chosen compiler at build time. This allows the user to specify build substitution variables to define the target architecture and compiler using Google Cloud Build. Once the rcc-apps repository is cloned, the user can build a WRF VM image using the following command from the root directory of the repository :

$ gcloud builds submit . –config=wrf/cloudbuild.yaml –substitutions=_TARGET_ARCH=”cascadelake”,_COMPILER=”intel-oneapi-compilers”

Under the hood, this build process leverages Packer to create a VM that executes a set of scripts that install spack (if it is not present) and then use spack to install WRF and its dependencies for the specified architecture with the specified compiler.

In this study, we use this build system to create a suite of VM images from the cross-product of the target architecture and compiler. For the target architectures, we consider the following

Cascadelake - Google Cloud c2 and n2 instances
Broadwell - Google Cloud n1 instances
Zen3 - Google Cloud n2d instances

Benchmarks

Initial Conditions, Boundary Conditions, and Forcing

Initial and boundary conditions are derived from the NCEP GFS 0.25 Degree Global Forecast Grids Historical Archive, using June 17, 2018 forecast hours 0, 3, 6, and 9. High resolution static geographical data are taken from UCAR’s WPS V4 Geographical Static Data Downloads Page. These fields are processed with WPS’s geogrid.exe, metgrid.exe, and real.exe applications to produce the necessary wrfbdy_d01 and wrfinput_d01 files for WRF.

CONUS 12km

The CONUS 12km benchmark uses a grid that spans the Continental United States with 425 x 300 x 35 ( longitude x latitude x altitude; approximately 3.8 million ) grid cells. We use a timestep of 72 seconds and run for 6 hours, starting from June 17, 2018 00Z. With this setup, we execute 300 timesteps. The model state is saved for hours 0 and 6, providing two time levels of model output.

CONUS 2.5km

The CONUS 12km benchmark uses a grid that spans the Continental United States with 1901 x 1301 x 35 ( longitude x latitude x altitude; approximately 86.5 million ) grid cells. We use a timestep of 15 seconds and run for 6 hours, starting from June 17, 2018 00Z. With this setup, we execute 14,400 timesteps. The model state is saved for hours 0 and 6, providing two time levels of model output.

Timing Measurement

The total runtime in WRF is broken down into three contributions :

Initialization time
File IO time
Compute / Forward Stepping time

The initialization time is the amount of time the code spends in “startup” activities, such as allocating memory and reading in initial conditions and static boundary information. The File IO time specifically refers to the amount of time spent writing simulation history and restart files to disk. The compute or forward stepping is the amount of time spent forward stepping the equations of motion to advance the simulation.

For each simulation, we parse the rsl.out.0000 file that contains timing information for forward stepping WRF, writing output files, and initialization. The total amount of time for each category is accumulated and reported for each simulation. The sum of the time spent in each category (initialization, file IO, and forward-stepping) is an estimate of the total execution time of the simulation.

Compute Platform

To run WRF, we use an image build on top of Fluid Numerics’ RCC-CentOS solution le Cloud Platform. The RCC-CentOS solution is an autoscaling HPC cluster with a Slurm job scheduler and CentOS-7 operating system. When this cluster is deployed, two virtual machines are provisioned : a login node and a controller. The login node is the primary point of access (over ssh) for interacting with the Slurm job scheduler. The controller hosts the Slurm job scheduler and the /home directories, which are mounted over NFS.

The RCC-CentOS VM image comes with modifications that mostly follow the HPC Best Practices on Google Cloud Platform, with the exception that hyperthreading is not disabled by default :

When running benchmarks out of the home directories, the file IO performance of WRF is directly impacted by (1) network limitations between VM’s on GCP and (2) the single disk performance. Our controller instance is set up as an n1-standard-8 VM with a 1 TB PD-Standard disk.

We also provision a Lustre file system on GCP using a hybrid MDS/MGS node and 4 OSS servers. The MDS/MGS node is an n2-standard-16 VM with a 1 TB PD-SSD MDT disk. Each OSS node is an n2-standard-16 VM with 4 x 375GB Local-SSD OST disks. We opted to use Local-SSD disks for the OST’s since the peak write bandwidth is approximately 4,000 MB/s, a factor of 5 greater than persistent disks.

Results

Results can be explored in the following interactive DataStudio Dashboard: https://datastudio.google.com/reporting/bca81aa5-4df6-4a50-8209-33ad664eab7c

Compute Platform Comparison

On Google Cloud Platform, there are a variety of machine types available. These machine types differ in the CPU platform available and in terms of the available vCPU and memory for each virtual machine. Here, we compare the machine types :

c2-standard-60 : 60 vCPU (Intel Cascade Lake) and 240 GB RAM per node
n2-standard-80 : 80 vCPU (Intel Cascade Lake) and 320 GB RAM per node
n2d-standard-224 : 224 vCPU (AMD EPYC Rome) and 896 GB RAM per node
n1-standard-96 : 96 vCPU (Intel Broadwell) and 360 GB RAM per node

On each machine type, we run the CONUS 2.5km benchmark with a two hour forecast using 480 MPI ranks. To obtain optimal performance, with respect to the compute / forward-stepping time, we create VM images with WRF pre-installed for each machine type using compiler options as described in Section 2.1. At 480 MPI ranks, binding MPI ranks to hardware threads, for the CONUS2.5km benchmark, provides optimal performance. For 480 ranks, this requires :

8 x c2-standard-60 VMs
6 x n2-standard-80 VMs
3 x n2d-standard-224 VMs
5 x n1-standard-96 VMs

At 480 MPI ranks, with 1 rank per vCPU, each of the machine types can be fully subscribed, except for the n2d-standard-224. For the n2d instances, the best results were observed when evenly distributing 160 MPI ranks across each of the three VMs.

Figure 3.1.1 shows the total amount of time spent in compute / forward stepping for each machine type. For the machine types compared, the c2-standard-60 instances result in the shortest runtime for the 2 hour forecast.

Figure 3.1.1 : The total amount of time spent in forward stepping the WRF CONUS 2.5km benchmark with a 2 hour forecast is shown as a function of the machine type. For each machine type, the best (lowest) compute time observed is shown.

Affinity Testing

From Section 3.1, we found that the c2-standard-60 VM instances yield the best performance, relative to other GCE instance types. In this section, we examine the impacts of the MPI process affinity on the c2-standard-60 instances for both the CONUS 12km and CONUS 2.5km benchmarks. Specifically, we are interested in understanding how the binding of MPI ranks to vCPU’s (hardware threads) or physical cores impacts the computational runtime and the cost.

CONUS 12km

The CONUS 12km benchmark uses a mesh with 425 x 300 x 35 grid points. For this benchmark, we use 480 MPI ranks; WRF assigns most ranks a 22 x 13 x 35 block (some have 21 x 12 x 35 ). With this setup, we are interested in uncovering affinity configurations that results in the lowest compute / forward stepping time and the lowest cost. It should be noted that a single configuration does not necessarily provide the lowest wall-time and the lowest cost.

First, we look at three configurations:

480 MPI ranks are bound to hardware threads, mapped by cores, and provided 2 vCPUs per task with 30 tasks permitted per node.
480 MPI ranks are bound to hardware threads, mapped by cores, and provided 1 vCPU per task with 30 tasks permitted per node.
480 MPI ranks are bound to hardware threads, mapped by cores, and provided 1 vCPU per task with 60 tasks permitted per node.

The SBATCH and MPI flags with the compute forward-stepping time and node-seconds for these configurations are summarized in Table 3.2.1.

Table 3.2.1 : The results of three CONUS 12km runs where we compare the effects of task affinity. The configuration with the lowest compute time is highlighted in blue and the configuration with the lowest cost is highlighted in green.

The experiments shown in Table 3.2.1 show that we can achieve the lowest (optimal) runtime by providing each MPI rank 2 vCPUs, but binding each MPI rank to a hardware thread on their own physical cores. However, by leveraging hyperthreading we are able to obtain a lower compute cost, by a factor of 1.8. This reduction in compute cost comes at the price of a simulation that is about 11.6% slower.

To fully utilize all of the hyperthreads of 16 c2-standard-60 vCPU, we need to expose more parallelism in WRF. The CONUS 12km benchmark has too few grid points to be able to run with double (960) the number of MPI ranks. However, WRF is able to operate in a hybrid MPI+OpenMP configuration. With this setup, each MPI rank can run with a specified number of OpenMP threads.

To run with MPI+OpenMP, WRF is built using the dmpar+smpar option which effectively adds the -fopenmp flag during compilation. At runtime, we bind MPI processes to physical cores and give each rank two OpenMP threads that are in turn bound to hyperthreads. This is specified with the following MPI flags and batch header.

#!/bin/bash

#SBATCH --ntasks=480

#SBATCH --ntasks-per-node=30

#SBATCH --cpus-per-task=2

mpirun -np ${SLURM_NTASKS} --bind-to core \

--map-by core \

-x OMP_NUM_THREADS=2 \

-x OMP_PROC_BIND=true \

-x OMP_PLACES=threads \

./wrf.exe

With this MPI+OpenMP configuration, the compute time is reduced by about 5.4% ( 36.96 s ) in comparison to the MPI only configuration on 16 nodes.

CONUS 2.5km

The CONUS 2.5km benchmark uses a mesh with 1901 x 1301 x 35 grid points. The higher resolution requires a time step that is at least a factor of six smaller than CONUS 12km. For a six hour forecast, this translates to roughly 200x more work than CONUS 12km. With the increased grid size, there is more opportunity for exposing parallelism with MPI. For the runs presented here, we manually changed the forecast duration to 2 hours (479 timesteps).

First, we look at three configurations for the process affinity :

480 MPI ranks are bound to hardware threads, mapped by cores, and provided 2 vCPUs per task with 30 tasks permitted per node.
480 MPI ranks are bound to hardware threads, mapped by cores, and provided 1 vCPU per task with 60 tasks permitted per node.
960 MPI ranks are bound to hardware threads, mapped by cores, and provided 1 vCPU per task with 60 tasks permitted per node.

Table 3.2.2 : The results of three CONUS 2.5km runs where we compare the effects of task affinity. The configuration with the lowest compute time is highlighted in blue and the configuration with the lowest cost is highlighted in green.

For the CONUS 2.5km benchmark, we find that binding MPI ranks to hardware threads and allowing 60 MPI ranks per node results in the lowest cost simulation. However, this configuration yields the largest wall-time. When providing 2 vCPUs per task and allowing 30 tasks per node, the compute time can be reduced (improved) by about 35.7%. However, the simulation cost has increased by 28.5% due to the fact that this configuration requires twice as many compute nodes.

By leveraging the hyperthreads on the 16 node configuration and doubling the number of MPI ranks, we are able to reduce the wall time and the simulation cost. This is consistent with the notion that, for the CONUS 2.5km simulation, WRF benefits from leveraging hyperthreading, both in run-time and simulation cost.

The results of this section can be neatly summarized in Figure 3.2.1, which shows the estimated simulation cost (left) and measured total wall-time (right) for the CONUS 2.5km benchmark as a function of the number of MPI ranks. For a given number of MPI ranks, binding MPI ranks to physical cores (2 vCPU per rank) provides improved performance. However, since binding to cores does not provide a 2x improvement in the simulation runtime, in comparison to binding to vCPU, this increases the overall simulation cost.

Figure 3.2.1 : Summary figure showing the estimated simulation cost (left) and measured total wall-time (right) for the CONUS 2.5km benchmark as a function of the number of MPI ranks when binding MPI ranks to cores (blue) and vCPU (green).

On a per-MPI-rank basis, binding to cores provides better performance; this should be expected. However, on a per-node basis, binding to vCPU provides better performance and cost. Consider the comparison between the 480 MPI rank simulation with ranks bound to cores and the 960 MPI rank simulation with ranks bound to vCPU. Each of these runs uses the same amount of resources on Google Cloud. However, the 960 rank simulation with ranks bound to vCPU’s provides a shorter runtime. Since the same number of nodes is allocated from Google Cloud, this configuration also provides a better cost.

Choosing the optimal compiler

In a separate round of CONUS 2.5km simulations, we compare the impact of the compiler on the simulation runtime. Choosing the right compiler can lead to reductions in simulation time and cloud costs. This argument, of course, is only true if you’re not having to pay for commercially licensed compilers. Because of this, we only focus on the open-source GNU compilers and the free-to-use closed source Intel OneAPI Compilers.

Here, we consider the 480 MPI rank simulation, with the MPI ranks bound to each vCPU across 8x c2-standard-60 instances. Figure 3.3.1 shows the relative simulation runtime for CONUS 2.5km at 480 MPI ranks comparing gcc@10.3.0, gcc@11.2.0, and the Intel OneAPI Compilers (v2021.2.0). By using the Intel OneAPI Compilers, we observe a 32% reduction in wall-time.

Figure 3.3.1 : The relative simulation runtime for CONUS 2.5km at 480 MPI ranks comparing gcc@10.3.0, gcc@11.2.0, and the Intel OneAPI Compilers (v2021.2.0). The relative runtimes are references to the slowest simulation (gcc@10.3.0).

Optimizing File IO

In this section, we examine the impacts of the io_form_history and io_form_restart namelist parameters available in WRF in addition to the impacts of file system choice. Although WRF can be built with Parallel NetCDF support, this does not mean that file IO is distributed across MPI ranks.

The manner in which files are written are controlled by the io_form_* namelist parameters. We are specifically interested in how WRF reads/writes restart files and history files. Restart files are used to pick up a WRF simulation from a point where a previous simulation left off. History files are the main product of running a simulation and contain 3-D gridded data of the atmospheric state. The io_form_restart and io_form_history parameters control how WRF handles file IO.

Table 3.4.1 below is adapted from WRF’s documentation on the description of namelist variables, specifically focusing on the definition of the io_form_restart and io_form_history parameters.

Table 3.4.1 : Adapted from WRF’s documentation on the description of namelist variables to show the available IO options for history and restart files with WRF.

For all options, except for 102, WRF will aggregate all of data from all MPI ranks to the “root rank” (rank 0) and write to a single file. With these options, file IO can become a major bottleneck that becomes more noticeable as the number of MPI ranks increases. There are two causes for this.

Blocking MPI collective communications must occur before each history file or restart file write where all MPI ranks send data to the root rank.
No parallelism is exposed in file IO; restart and history files are written to disk by one rank.

For the io_form option 102, each rank will independently write its own subdomain to a NetCDF file. This option permits “embarrassingly parallel” file IO. In this case, the performance of file IO is limited by the speed of NetCDF and the specifics of the host filesystem. This approach does have the drawback that the NetCDF files often need to be stitched together to be compatible with the post-processing tools available in the WRF ecosystem. The JOINER script from UCAR is the recommended tool to stitch files together after a WRF simulation and should become a part of any post-processing pipeline.

In this study, we compare file IO performance of the io_form 11 and io_form 102 options. For filesystem options, we look at two options

NFS file system hosted on the Slurm Controller
Lustre file system

The deployment details of the NFS filesystem on the controller and the Lustre file system are provided in Section 2.3.

For the CONUS 12km benchmark, we found in Section 3.2 that the maximum number of MPI ranks we could use was 480. However, we showed that those 480 ranks could be distributed across 8 c2-standard-60 instances, using 1 rank per hardware thread, or could be distributed across 16 c2-standard-60 instances, using 1 rank per core with 2 OpenMP threads per rank.

Figure 3.4.1 : The total time spent in File IO for the CONUS 12km benchmarks compared between simulations using 480 ranks on 8 compute nodes and 16 compute nodes. Both runs use the NFS filesystem and io_form=11.

Figure 3.4.1 shows that spreading the MPI ranks across 16 compute nodes, rather than 8, improves (reduces) the time spent in file IO.

Figure 3.4.2 shows that using io_form=102 significantly reduces the time spent in file IO from 5083.8s to 15.9s ( a factor of about 319 ). With 480 MPI ranks, this corresponds to a scaling efficiency of about 66%. By changing the filesystem to Lustre, we are able to further reduce the time spent in file IO to 0.997s ( another factor of 15.9 ).

We see similar behavior in the CONUS 2.5km benchmark, where we have 960 MPI ranks across 16 nodes. When using io_form=11 on the NFS filesystem, we spend 8004.3s in file IO for the CONUS 2.5km benchmark. Transitioning to io_form=102 provides improved file IO performance and using a Lustre file system provides additional gains.

Figure 3.4.2: The speedup of the File IO, relative to the best io_form=11 CONUS 12km simulation, is shown for io_form=102 on the NFS filesystem and on the Lustre filesystem. By using io_form=102, we observe a 319x improvement in time spent in file IO. By transitioning to the Lustre file system, we observe an additional 15x improvement in file IO time.

Figure 3.4.3: The speedup of the File IO, relative to the io_form=11 CONUS 2.5km (960 rank) simulation, is shown for io_form=102 on the NFS filesystem and on the Lustre filesystem. By using io_form=102, we observe a 192x improvement in time spent in file IO. By transitioning to the Lustre file system, we observe an additional 68x improvement in file IO time.

With the 960 rank CONUS 2.5km benchmark, the scaling efficiency is only 20%. This observation is consistent with the notion that, at 960 ranks, we have exceeded the capacity of the NFS filesystem to keep up with parallel file IO demands. By transitioning to Lustre, however, we are able to gain an additional 68x speedup, bringing the file IO time down to 0.6s.

Scaling CONUS 2.5km

With an ideal task affinity and file-system determined, we use these configurations to measure how the CONUS 2.5km benchmark scales as we increase the number of MPI ranks. For these simulations we look at the compute time, initialization time, file IO time, and the total wall time. We start with a simulation with 480 MPI ranks and successively double the number of MPI ranks out to 3840.

Figure 3.5.1 : The total wall-time (blue), compute/forward stepping time (red), initialization time (yellow), and file IO time (green) are shown for the CONUS 2.5km benchmarks as a function of the number of MPI ranks.

Figure 3.5.1 shows the wall times for the CONUS 2.5km simulation for 480 to 3840 MPI ranks on the c2-standard-60 instances. This figure shows that the majority of the total wall-time can be accounted for by the time spent in forward stepping the model. Generally, the wall-time decreases as the number of MPI ranks increases. However, the scaling is not linear, indicative of scaling inefficiencies that grow as the number of MPI tasks increases.

In addition to the wall-times, we calculate a scaling efficiency for each simulation relative to the 480 MPI rank benchmark, using the formula,

Where T480 is the wall-time for the 480 rank benchmark and Nis the number of MPI ranks for a given simulation. A scaling efficiency of one represents ideal scaling and no contribution of MPI overhead to the overall runtime. Lower scaling efficiencies indicate an increasing impact of MPI overhead on the runtime.

Figure 3.5.2 : The scaling efficiency for the CONUS 2.km benchmarks, relative to the 480 rank run, are shown for the compute/forward stepping time.

Figure 3.5.2 shows the scaling efficiency of the CONUS 2.5km benchmarks, relative to the 480 rank run, as a function of the number of MPI tasks. As the number of MPI tasks increases the scaling efficiency decreases, reaching about 43 % for the total wall-time.

Summary

In this article, we shared our progress in optimizing the performance and cost for WRF v4 on Google Cloud. We outlined an image building process that allows us to easily create suites of images for benchmarking to experiment with combinations of compilers, compiler flags, target architectures, and machine types on Google Cloud. In this study, we have concluded that using the Intel OneAPI compilers, the cascadelake target architecture build flag, and c2 instances with MPI ranks bound to vCPU provides the optimal balance of simulation cost and performance.

Looking ahead, we are interested in extending this study to the new c2d (AMD Epyc Milan) instances on Google Cloud and the AOCC compilers from AMD. If you are interested in experimenting with WRF on Google Cloud, you can easily get started with Fluid Numerics’ RCC-WRF click-to-deploy solution on Google Cloud Marketplace or with our Colaboratory walkthrough.