Optimizing Open Source Field Operation and Manipulation (OpenFOAM) on Google Cloud
Further, for users who want to get started quickly with OpenFOAM on Google Cloud, without having to create and manage their own VM images, we have published updates to the RCC-CFD (formerly CFD-GCP) Marketplace solution on Google Cloud based on this work. Documentation on using RCC-CFD, including how to leverage the target-architecture-optimized images, can be found at the RCC ReadTheDocs.
Building with Packer, Spack, & Google Cloud Build
Spack uses a library called archspec to resolve target architectures and defines the appropriate compiler flags, for many commonly used compilers, to generate tuned instructions. Although we did not explore the impact of other compiler flag settings in this work, Spack also provides the ability to set compiler flags and override the decisions made by a package's build system. Spack effectively reduces the problem of creating targeted builds with various compilers to simply stating which compiler we want to use and which architecture we want to target. The table below shows some of the machine types available on Google Cloud alongside the corresponding target architecture name in Spack.
When building OpenFOAM binaries with target architecture specifications, we lose the guarantee that the resulting build will run on all possible platforms. For example, a cascadelake-targeted build will not necessarily run on a c2d (AMD Epyc Milan) instance on Google Cloud. If portability is favored, we can instruct Spack to target the generic x86_64 platform.
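As a concrete illustration, the snippet below sketches Spack specs for a tuned build on each machine family and for a portable fallback. The package name and compiler version shown are example choices, not necessarily the exact specs used in this work.

```
# Hypothetical Spack specs illustrating target selection
# (openfoam-org@8 and gcc@10.3.0 are example choices, not the exact versions used here)

# Tuned for Intel Cascade Lake (e.g., the c2 machine family)
spack install openfoam-org@8 %gcc@10.3.0 target=cascadelake

# Tuned for AMD Epyc Milan (e.g., the c2d machine family)
spack install openfoam-org@8 %gcc@10.3.0 target=zen3

# Portable build that runs on any x86_64 instance
spack install openfoam-org@8 %gcc@10.3.0 target=x86_64
```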
Profiling with HPC Toolkit
To perform differential profiling of OpenFOAM, we use the DOE Exascale Computing Project's HPC Toolkit and Hatchet, a Python library for working with the profile databases created by HPC Toolkit. HPC Toolkit is installed on top of our OpenFOAM images using a Spack environment. To collect hotspot profiles, OpenFOAM is run under the hpcrun binary, which creates an HPC Toolkit measurements directory. The program structure is then recovered using hpcstruct, and the profile database, with the relevant source code alignment, is created using hpcprof-mpi. The code snippet below provides an example of how the hotspot profile is collected for the interFoam application.
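A minimal sketch of that workflow is shown below, assuming an Open MPI launcher and default sampling settings; the rank counts, output paths, and flags are illustrative rather than the exact commands from this work.

```
# Run interFoam under hpcrun to collect a sampling-based hotspot profile
mpirun -np 112 hpcrun -o hpctoolkit-interFoam-measurements \
    interFoam -parallel

# Recover program structure from the interFoam binary
hpcstruct interFoam

# Combine measurements and structure into a profile database with source alignment
mpirun -np 4 hpcprof-mpi -S interFoam.hpcstruct \
    -o hpctoolkit-interFoam-database hpctoolkit-interFoam-measurements
```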
To keep the profile databases manageable in size, the simulation time is reduced from 1 s, which we use for the bulk wall-time measurements, to 0.1 s.
Figure 1: The runtime for the damBreak (2.8M) benchmark as a function of the number of MPI ranks, build configuration, and task affinity.
In Figure 1, each line corresponds to a set of benchmarks that use the same compiler, target architecture, and task affinity. For the c2-standard-60 instances, binding to cores clearly provides better performance regardless of compiler, but both affinity configurations show the same slow-down beyond 120 MPI ranks. On c2d-standard-112 instances, binding MPI ranks to cores provides better performance for single-VM configurations (56 ranks). However, cross-VM communication in the core-bound c2d run at 112 ranks increases the model runtime. Similar to the c2 scaling, the vCPU-bound c2d runtime increases beyond 112 MPI ranks.
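For reference, the two affinity settings compared above correspond to launcher binding options like the following, shown here for Open MPI; these are standard mpirun flags, not necessarily the exact launch lines used in this work.

```
# Bind each MPI rank to a physical core (one rank per core)
mpirun -np 56 --map-by core --bind-to core interFoam -parallel

# Bind each MPI rank to a hardware thread (one rank per vCPU)
mpirun -np 112 --map-by hwthread --bind-to hwthread interFoam -parallel
```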
Figure 2: The cost for the damBreak (2.8M) benchmark as a function of the number of MPI ranks, build configuration, and task affinity.
The cost of a simulation can be estimated as the product of the runtime, the cost per compute node, and the number of nodes. Each machine type on Google Cloud has a different per-vCPU and per-GB memory cost (see Google Cloud pricing). For the simulations shown in Figure 1, we estimate the compute cost using the publicly available GCP price book for the US during April 2022; Figure 2 shows these results.
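As a worked sketch of that estimate, the snippet below computes the on-demand compute cost for a hypothetical two-node c2-standard-60 run. The prices are placeholders for illustration, not the April 2022 price book values.

```
# Estimated cost = runtime * nodes * (vCPUs/node * $/vCPU-hour + GB memory/node * $/GB-hour)
runtime_hours=0.25      # measured wall time, in hours (placeholder)
nodes=2                 # number of c2-standard-60 instances
vcpus_per_node=60
mem_gb_per_node=240
vcpu_price=0.033        # placeholder $ per vCPU-hour
mem_price=0.0044        # placeholder $ per GB-hour

echo "scale=4; $runtime_hours * $nodes * ($vcpus_per_node * $vcpu_price + $mem_gb_per_node * $mem_price)" | bc
# Prints the estimated on-demand compute cost in USD
```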
Figure 3: Flat profile depicting main program max inclusive time, across MPI ranks, and the max inclusive time for each MPI call on the c2d-standard-112 instances with compact placement using 224 MPI ranks (2 VM instances; left column) and 112 MPI ranks (1 VM instance; right column).
At 120 core-bound MPI ranks on c2 instances, 4 VMs participate in model execution; at 112 ranks on c2d instances, only 1 VM participates. The decrease in model runtime up to ~120 ranks and the increase beyond this point are consistent with the idea that the scaling efficiency of OpenFOAM is determined more by the algorithms in OpenFOAM than by the number of virtual machines participating in communications. The lone anomaly in this case is the core-bound c2d run at 112 ranks, which requires further investigation.