Optimizing Open Source Field Operation and Manipulation(OpenFOAM) on Google Cloud
Overview
The goal of this article is to show weak scaling and performance sensitivity to hardware choices on Google Cloud for interFoam, a multiphase solver within Open Source Field Operation and Manipulation(OpenFOAM). To illustrate these points, we work with a 2M cell dam break simulation as the chosen benchmark. Key takeaways from this work include :
For interFoam, we find good scaling efficiency with >25K cells / MPI rank on both c2 and c2d instances on Google Cloud. When the number of cells per MPI rank is less than 20K, we observe poor scaling efficiency and eventually an increase in simulation wall-time as this ratio is reduced.
Scaling efficiency losses are attributed to MPI_AllReduce calls, which account for ~25% of simulations wall-time at 10K cells/rank, and cannot necessarily be attributed to just cross-VM communication.
The c2d instance type on Google Cloud provides the lowest runtime and cost in the 20K cells/MPI rank regime.
In addition to these results, we share our methods for creating VM images for running OpenFOAM on Google Cloud and our process for benchmarking and profiling using open-source technologies.
Further, for users wishing to quickly get started with OpenFOAM on Google Cloud, without having to create and manage their own VM images, we have published updates to the RCC-CFD (formerly CFD-GCP) Marketplace solution on Google Cloud based on this work. Documentation on using RCC-CFD, including how to leverage the target architecture optimized images can be found at the RCC ReadTheDocs.
Methods
Benchmark Description
For this work, we focus on the interFoam application, included with OpenFOAM, and use a higher resolution version of the Dam Break simulation; the input files are made publicly available on the rcc-apps github repository. The resolution is increased, relative to the Dam Break test case included with OpenFOAM, by modifying the blockMeshDict, directly adding additional cells in each region of the model domain. The resulting mesh has about 2.8 Million grid cells.
The dam break simulation uses the interFoam application, which solves the incompressible Navier-Stokes equations for a two-phase fluid using a Volume-of-Fluid (VOF) method. Relative to single phase Navier-Stokes, the two-phase fluid model also requires an advection model for the volume fraction of one of the fluid phases. The fluid density in this approach is calculated as a weighted average of the phase densities where the averaging weight for each grid cell is given by the volume fraction.
For gradient, divergence, and laplacian calculations linear differencing methods are used, with linear upwinding applied for tracer advective flux divergences. For the volume fraction advection equation, we use the interface compression scheme with van-Leer’s flux limiter. The equation for the pressure is solved using the preconditioned conjugate gradient method with the Diagonal-based Incomplete Cholesky (DIC) preconditioner. To forward step the model, we use first order Forward Euler with a maximum courant number of 1.
The initial conditions consist of a block of a liquid fluid phase occupying the lower left region of the domain. Buoyancy forces accelerate this mass of fluid down and no normal flow leads to a bore that encounters a step in the model domain and develops shear flow instabilities at the fluid interface which promotes mixing.
Building with Packer, Spack, & Google Cloud Build
RCC-Apps is an open-source Git repository that maintains Google Cloud VM image build scripts for commonly used research applications. In general, the RCC-Apps build workflow consists of the following steps:
Provision an imaging node
Install Spack dependencies and Spack
Install the desired compiler
Install a Spack environment
Deprovision imaging node, create VM image from the image node disk, and delete the image node disk.
To achieve this workflow, RCC-Apps uses Hashicorp’s Packer with the googlecompute builder. Packer manages the image node provisioning, deprovisioning, imaging, and error handling during the setup process. RCC-Apps provides a common script that is used to install Spack’s dependencies, Spack, and the desired compiler. Each application defines a Spack environment that can be used during the build process to install the desired application.
To install OpenFOAM, we use the RCC-Apps infrastructure, which provides steps to install Spack’s dependencies, Spack, and the requested compiler as part of the build system. This allows the user to simply provide an installation script or a Spack environment file that can be used to install their desired application. Using Spack also has the advantage that we can simply specify the target architecture to ensure we are using the appropriate compiler flags to enable specific architecture optimizations.
This setup allows for a methodical approach to generating VM images using a matrix of compilers, target architecture, and optimization flags. Additionally, by using a Spack environment file alongside Terraform infrastructure as code, we are able to completely version control the benchmarking process.
Target Architectures
Google Cloud provides access to a variety of compute platforms. The machine types are named using a convention that indicates the machine family and series, vCPU-to-memory ratio, and the number of cores. Since our goal in this work is to understand which combination of compiler and architecture provides optimal performance for OpenFOAM on Google Cloud, we need to take advantage of compiler features that can produce OpenFOAM binaries that are targeted for each specific platform.
When building OpenFOAM from source code, you have the ability to request the compiler attempt to leverage various optimizations. Typically, a suite of optimizations are grouped into the -O options. For example, -O0 instructs the compiler to not attempt any special optimizations, while -O3 is usually the most aggressive. Depending on which compiler you are using (e.g. Intel, GCC, etc.) the various optimization levels implement different strategies. In addition to optimization levels, compilers can also be instructed to generate tuned instructions for specific cpu hardware models and generations.
Spack uses a library called archspec to resolve target architectures and defines the appropriate compiler flags for many commonly used compilers to generate tuned instructions. Although in this work, we did not explore the impacts of other compiler flag settings, Spack also provides the ability to set compiler flags and override the decisions made by a package’s build system. Spack effectively reduces the problem of creating targeted builds with various compilers to simply making statements about which compiler we want to use and what architecture we want to target. The table below shows some of the machine types available on Google Cloud alongside the target architecture name in Spack.
When building OpenFOAM binaries with target architecture specifications, we lose the guarantee of being able to run that build of OpenFOAM on all possible platforms. For example, a cascadelake targeted build will not necessarily run on a c2d (AMD Epyc Milan) instance on Google Cloud. If portability is favored, we can instruct Spack to target the generic x86_64 platform.
With these ideas in mind, during our build process we create the following images :
Compiler = GCC 10.2.0, Target Architecture = x86_64
Compiler = GCC 10.2.0, Target Architecture = cascadelake
Compiler = GCC 10.2.0, Target Architecture = zen3
Compiler = Intel OneAPI Compilers 2022.1.0, Target Architecture = x86_64
Compiler = Intel OneAPI Compilers 2022.1.0, Target Architecture = cascadelake
Profiling with HPC Toolkit
On Google Cloud, we want to determine how well OpenFOAM scales for fixed size problems. Ideally, adding more vCPU to a specific problem and simultaneously increasing the number of MPI ranks will result in speedup that is proportional to the number of vCPU. However, MPI communications are expected to become increasingly expensive as the number of MPI ranks is increased. Additionally, public cloud platforms like Google Cloud leverage “ethernet-like” networking that typically has higher communication latency than on-premises HPC networking (e.g. Infiniband). This latter fact is a source of concern for researchers attempting to tackle problems that require high resolution computational fluid dynamics simulations where clusters of tightly coupled compute nodes work together to simulate the evolution of a fluid.
While bulk runtimes of a benchmark can be measured as a function of the number of MPI ranks to provide an estimate of scaling efficiency, differential profiling can indicate what parts of an application contribute to the loss in scaling efficiency. With differential profiling, flat or hotspot profiles are collected for simulations run with differing degrees of parallelism. Then, the profiles between the two runs are compared to determine what parts of an application contribute to losses in scaling efficiency.
To perform differential profiling of openfoam, we use the DOE Exascale Project’s HPC Toolkit and hatchet, a python library for working with profile databases created by HPC Toolkit. The HPC Toolkit is installed on top of our openfoam images using a Spack environment. To collect hotspot profiles, OpenFOAM is run under the hpcrun binary, which creates an HPC Toolkit measurements directory. The program structure is then recovered using hpcstruct and the profile database, with relevant source code alignment, is created using hpcprof-mpi. The code snippet below provides an example of how the hotspot profile is collected for the interFoam application.
HPCVIEW=${HOME}/view/bin
HPCRUN=$HPCVIEW/hpcrun
HPCSTRUCT=$HPCVIEW/hpcstruct
HPCPROF=$HPCVIEW/hpcprof
# Run interFoam under hpcrun
mpirun -np ${SLURM_NTASKS} $HPCRUN interFoam -parallel
# Recover the program structure
$HPCSTRUCT hpctoolkit-interFoam-measurements-$SLURM_JOBID
# Create the profile database and align measurements with source code
mpirun -np 1 $HPCPROF --metric-db=yes -I $(spack location -i openfoam-org)/src hpctoolkit-interFoam-measurements-$SLURM_JOBID
To keep the profile databases manageable in size, the simulation time is reduced from 1 s, which we use for the bulk wall-time measurements, to 0.1s.
To perform differential analysis, HPC toolkit databases are created for two runs with varying number of MPI ranks. We create flat profiles for each run by reporting the maximum inclusive wall-time across all MPI ranks. We chose to use the inclusive wall-time so that we can easily compare the time associated with each MPI call (and its children) with the main program wall-time. We chose to report the maximum, as opposed to the sum, mean, or minimum, since the total simulation wall-time is dictated by the worst performing MPI rank. The code snippet below shows how we use hatchet in Python to create the flat profile. From this point, the flat profile is exported to a CSV for visualization in DataStudio.
# Read in HPCToolkit database.
gf = ht.GraphFrame.from_hpctoolkit('/path/to/hpctoolkit-interFoam-db')
# Drop all index levels in the DataFrame except ``node``.
gf.drop_index_levels()
# Group DataFrame by ``name`` column, compute max of all rows in each
# group.
grouped = gf.dataframe.groupby('name').max()
Results
When running the benchmarks on Google Cloud, we are looking to find the ideal hardware and software configuration for running OpenFOAM. An ideal configuration, in our view, is one that provides a balance between shortest runtime and lowest cost. When time-to-solution is not a factor, obtaining the minimal cost is an obvious goal for anyone running on public cloud systems. For some applications, the lowest time to solution and the lowest cost configurations are not equivalent to the shortest runtime.
This section covers the behavior of the Dam Break (2.8M) benchmark as we vary the machine type, target architecture, compiler, number of MPI ranks and task affinity. For the machine types, we compare two of the compute optimized instance types available on Google Cloud, the c2 (Intel Cascadelake) and c2d (AMD Epyc Milan). When running on these instances, we use the largest machine type available for each type and use compact placement policies. Wider VMs allow more MPI communications to occur on VM rather than across VM when compared to instances with fewer vCPU per node. Additionally, compact placement helps minimize communication latency between VMs by ensuring VMs are hosted physically close together in Google Cloud datacenters. On the c2 instances, we use the cascadelake targeted builds with Intel and GCC compilers; on the c2d’s, we use the zen3 targeted builds with the GCC compilers.
For tightly coupled CFD applications, like OpenFOAM, simulations can often be sped up through domain decomposition and parallel execution with MPI. In domain decomposition, the model’s physical domain is divided as equally as possible and distributed for execution to multiple MPI processes (or “ranks”). At the boundaries of each rank’s domain, information must be shared regularly between neighboring ranks to ensure correctness of the execution. Additionally, the algorithm used to calculate the pressure field in interFoam uses an implicit solver that also requires all-to-all communication (AllReduce) to verify convergence of the preconditioned conjugate gradient solver.
This additional MPI communication overhead typically grows with the number of MPI ranks and leads to inefficiencies in scaling. Figure 1 shows the runtime for the Dam Break (2.8M) benchmark as a function of the number of MPI ranks and various build and runtime configurations. In all scenarios, we see that the runtime decreases as the number of MPI ranks increases, out to about 120 MPI ranks. Beyond this point, we see that the runtime increases, presumably because the MPI overhead is becoming more prevalent.
Since MPI communication is required when running in parallel, obtaining the best possible runtime for a fixed number of MPI ranks depends (in part) on the placement of the MPI processes; ideally, communications off-VM are minimized. On the other hand, providing memory bound applications, like OpenFOAM, providing each MPI rank more computational real estate can provide more memory bandwidth per rank, potentially improving model runtime. In our study, we looked at binding MPI ranks to hardware threads and to cores. Rank binding prevents the operating system from evicting each MPI rank from the resource it is bound to. Binding a rank to a core provides each MPI rank residency on a physical core and less contention for memory resources. On the other hand, binding a rank to a hardware thread (a vCPU on Google Cloud) allows us to pack more MPI ranks per node and reduce cross-VM communications.
Figure 1 : The runtime for the damBreak (2.8M) benchmark as a function of the number of MPI ranks, build configuration, and task affinity.
In Figure 1, each line corresponds to a set of benchmarks that use the same compiler, target architecture, and task affinity. For the c2-standard-60 instances, independent of compiler, binding to cores clearly provides better performance, but both affinity configurations observe the same slow-down beyond 120 MPI ranks. On c2d-standard-112 instances, binding MPI ranks to cores provides better performance for single VM configurations (56 ranks). However, cross-VM communications in the core-bound c2d run at 112 ranks lead to increases in the model run time. Similar to the c2 scaling, vCPU-bound c2d runtime increases beyond 112 MPI ranks.
Figure 2 : The cost for the damBreak (2.8M) benchmark as a function of the number of MPI ranks, build configuration, and task affinity.
The cost of a simulation can be estimated as the product of the runtime, the cost per compute node, and the number of nodes. Each machine type on Google Cloud has a different per vCPU and per GB memory cost (see Google Pricing). For the simulations shown in Figure 1, we estimate the compute cost using the publicly available GCP price book for the US during April 2022 and Figure 2 shows these results.
For the c2 instances, at 30 ranks (single VM) it is more cost effective to bind MPI ranks to cores. However, when binding to hardware threads at 60 ranks (single VM) provides a better price point than using 30 ranks on a single c2 instance. However, since the core-bound c2 simulations enjoy better scaling efficiency (near ideal) between 60 and 120 ranks, binding to cores on c2’s provides better cost and performance than binding to hardware threads.
The c2d instances provide a better overall cost, mostly due to the significant improvement in runtime over the c2 instances. On c2ds, binding to hardware threads provides the best runtime and cost for single VM instance runs.
Figure 3 : Flat profile depicting main program max inclusive time, across MPI ranks, and the max inclusive time for each MPI call on the c2d-standard-112 instances with compact placement using 224 MPI ranks (2 VM instances; left column) and 112 MPI ranks (1 VM instance; right column).
At 120 core-bound MPI ranks on c2’s there are 4 VMs participating in model execution; for 112 ranks on c2ds, there is only 1 VM participating. The decrease in model runtime up to ~120 ranks and increase beyond this point is consistent with the idea that the scaling efficiency of OpenFOAM is more related to the algorithms in OpenFOAM, than the choice in the number of virtual machines participating in communications. The lone anomaly in this case is the core-bound c2d run at 112 ranks, which requires further investigation.
Since MPI communications are the likely culprit of inefficiencies in scaling, we compared the max (across ranks) total inclusive time of each MPI call in interFoam to the max (across ranks) total runtime for 112 and 224 ranks on c2d’s. Figure 3 shows the main runtime (for the shortened Dam Break (2.8M) at 5.78 seconds compared to less than 0.5s spent cumulatively in MPI calls. At 112 ranks, MPI calls account for about 8% of the simulation runtime. However, at 224 ranks, MPI calls account for about 30% of the simulation runtime. In both cases, AllReduce calls are the dominating MPI communication cost.
Summary
In this article, we looked at the influence of hardware, compiler, and task affinity on simulation runtime and cost for interFoam, a multiphase solver within OpenFOAM, using a high resolution version of the Dam Break benchmark.
We found that
For interFoam, we find good scaling efficiency with >25K cells / MPI rank on both c2 and c2d instances on Google Cloud. When the number of cells per MPI rank is less than 20K, we observe poor scaling efficiency and eventually an increase in simulation wall-time as this ratio is reduced.
Scaling efficiency losses are attributed to MPI_AllReduce calls, which account for ~25% of simulations wall-time at 10K cells/rank, and cannot necessarily be attributed to just cross-VM communication.
The c2d instance type on Google Cloud provides the lowest runtime and cost in the 20K cells/MPI rank regime.
Learn more
As part of this work, Fluid Numerics has updated the RCC-CFD solution on the Google Cloud Marketplace to help you quickly get started with proof of concept work using OpenFOAM. We have also provided a codelab ( Run the NACA0012 Aerofoil OpenFOAM® Benchmark on Google Cloud ) to help you get started with CFD on Google Cloud.
Additionally, RCC-CFD includes a set of publicly available VM images that have OpenFOAM built using the Intel OneAPI compilers and GCC 10.3.0 with target architecture flags enabled for zen3 (c2d) and cascadelake (c2) architectures. You can learn more about how to leverage these images to prepare your CFD solution on Google Cloud at Fluid Numerics’ RCC ReadTheDocs.
Fluid Numerics specializes in deploying research computing and high performance computing applications on Google Cloud. We have backgrounds in applied mathematics, computational fluid dynamics, numerical analysis, and high performance & research computing. Further, we are a Google Workspace and Google Cloud reseller that can offer Google Cloud and Google Workspace support alongside research computing and engineering support services.
Reach out to support@fluidnumerics.com to learn how we can help you in your research computing journey on Google Cloud.