MITgcm Benchmarks on Google Cloud Platform
The MIT General Circulation Model (MITgcm) is a computational fluid dynamics tool from (you guessed it!) MIT for modeling oceanographic and atmospheric fluid flows. The MITgcm employs a finite volume discretization with an Arakawa C-Grid staggering and z-level coordinate system to solve the Hydrostatic Primitive Equations (HPEs). You can learn more about the MITgcm from the open-source GitHub repository and the MITgcm documentation.
The HPEs are the workhorse equation set for ocean, atmosphere, and climate modeling on planet Earth. In our case, we're using the MITgcm to develop a better understanding of the dynamics that govern the Gulf Stream separation as part of an NSF-funded project with Dr. William K. Dewar and Dr. Nicolas Wienders at Florida State University. In this project, we are carrying out a series of simulations that successively resolve smaller length scales of motion over successively smaller domains.
These simulations start at a resolution of 3km and cover the Florida Straits to New England and the East Coast of the United States to the New England Seamounts. The next simulation is planned to have a resolution of 1km and "zooms in" on the Gulf Stream north of the Bahamas. The following two experiments focus on a region just north of the Charleston Bump and south of Cape Hatteras, NC with target resolutions of 300m and 90m. As we increase resolution and decrease the domain size, the current plan is to maintain (roughly) the same number of grid cells. In the first domain, we have 550x600x68 (~22 million) grid cells, and in each cell we solve for the fluid velocity, temperature, salinity, and pressure.
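The grid-cell count above is just the product of the three dimensions; a quick sketch makes the arithmetic explicit:

```python
# Grid dimensions for the 3km resolution domain (550x600 horizontal, 68 vertical levels)
nx, ny, nz = 550, 600, 68

# Total number of grid cells in the domain
cells = nx * ny * nz
print(f"{cells:,} grid cells")  # 22,440,000 (~22 million)
```

Each of those ~22.4 million cells carries the velocity, temperature, salinity, and pressure unknowns, which is what makes the memory footprint and compute cost scale directly with the cell count.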
When starting out on this project, we needed to develop an estimate for the total number of CPU-hours required to carry out these simulations. We used this information to submit compute allocation requests wherever we could find CPU-hours. To do this, I decided to run short benchmark simulations on the 3km resolution domain on a number of different machines on Google Cloud Platform and an on-premise system available to our team ("The Comedians").
The on-premise system is a small cluster of compute nodes called "The Comedians"; the nodes are named "Moe", "Curly", "Larry", and "Bennyhill". Each Comedian is a dual-socket Intel Sandy Bridge (Xeon E5-2670) node with 16 cores/node (8 cores/socket), 2 hyperthreads/core, and 126 GB RAM/node.
To build and run the MITgcm from source code, we need a Fortran compiler, an MPI implementation, and NetCDF-Fortran. On the Comedians, I've used Spack to build a GNU 8.2.0 + OpenMPI 3.1.2 software stack with these dependencies.
Google Cloud Platform Compute Resources
To benchmark on Google Cloud Platform, I used the fluid-slurm-gcp-openhpc HPC cluster solution. This system deploys a login node and controller node that hosts the Slurm job scheduler. This solution comes with a GNU 8.2.0 + MPICH 3.3.2 software stack out-of-the-box, so I didn't need to install any additional packages to build the MITgcm.
After launching the cluster, I configured 3 compute partitions, each with a different GCP machine type.
n1-highcpu-64 (32 core, 2 hyperthread/core, Intel Broadwell with 57.6 GB RAM)
n2-highcpu-64 (32 core, 2 hyperthread/core, Intel Cascade Lake with 57.6 GB RAM)
n2d-highcpu-64 (32 core, 2 thread/core, AMD EPYC Rome with 64 GB RAM)
On cloud resources, the cost of running simulations is more readily visible than on-premise. As part of this benchmarking exercise, I decided to keep track of the estimated simulation costs. This chart shows the estimated cost per node-hour for each of these compute node types.
The MITgcm, like many other CFD applications, is data-parallel. The simulation domain is divided into bricks and separate processes execute the same instructions on their own data brick; this is referred to as "domain decomposition". In each iteration of the model, information from each brick is communicated to other processes using the Message Passing Interface (MPI). The amount of time it takes to execute the simulation depends on how many MPI processes (a.k.a. "ranks") are used and how the domain is decomposed.
In our previous studies, we were able to obtain reasonable throughput (about 1 model year per week) on 1 node of the Comedians with 30 MPI ranks. In this benchmarking study, I examine a domain decomposition with 25 ranks. Each rank is responsible for forward-integrating a 110x120x68 subdomain.
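The tiling arithmetic behind that decomposition can be sketched as follows; the 550x600 horizontal grid splits evenly into a 5x5 layout of tiles (the vertical dimension is not decomposed):

```python
# Domain decomposition: split the horizontal grid into tiles, one per MPI rank.
nx, ny, nz = 550, 600, 68     # full 3km-domain grid
tile_nx, tile_ny = 110, 120   # subdomain size per rank (from the text)

ranks_x = nx // tile_nx       # 5 tiles across in x
ranks_y = ny // tile_ny       # 5 tiles across in y
n_ranks = ranks_x * ranks_y   # 25 MPI ranks total

# The tiles must evenly cover the domain for this decomposition to be valid
assert nx % tile_nx == 0 and ny % tile_ny == 0

print(f"{n_ranks} ranks, each integrating a {tile_nx}x{tile_ny}x{nz} subdomain")
```

In each iteration, neighboring tiles exchange halo data over MPI, so the tile shape also affects how much communication each rank does relative to its computation.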
How to estimate CPU-hour requirements
To develop an estimate of the total CPU-hours needed for our production runs, I run short benchmark simulations for 1000 iterations. This gives us a measure of the wall-time needed per iteration. Then, I can multiply by the number of CPUs and the number of iterations needed for each production simulation to estimate how many CPU-hours are needed to carry out our study.
The simulation "wall time" is the amount of time it takes to complete the execution of the benchmark. This is the raw metric that we measure and use to derive other quantities. For the MITgcm, the wall time depends on application specific details (e.g. domain size, domain decomposition, physics packages enabled, equation of state choice, etc.), build specifications (e.g. compiler, MPI implementation, optimization flags), and the system specifications (e.g. hardware, operating system).
In our benchmark study, the application specific details remain the same in all situations while the system and build specifications vary. Overall, the Comedians result in the highest wall-time and the GCP AMD EPYC Rome instance provides the lowest (best) wall-time.
CPU-Hours per model-year
For our project, we are planning to run a 6 year simulation for the 3km resolution domain. To estimate the number of CPU-hours needed to complete a 6 year simulation, we first calculate the number of CPU-hours needed for a single model year. From here, we can multiply by 6.
To obtain the CPU-hours per model year metric, I'm using the following information:
We can use the wall-time for 1,000 iterations to provide a wall-time per iteration
Each iteration advances the model by 180s
There are 365 days in 1 year, giving 175,200 iterations per model year
I then calculate
CPU-Hours/Model-year = (25 CPUs)x(N Hours/Iteration)x(175,200 Iteration/Model-year)
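The steps above can be sketched in Python. The benchmark wall-time used in the example call is a hypothetical placeholder; in practice it comes from the measured 1,000-iteration runs:

```python
# Convert a measured benchmark wall-time into CPU-hours per model year.
TIMESTEP_S = 180                                  # model seconds advanced per iteration
SECONDS_PER_YEAR = 365 * 24 * 3600                # 365-day model year
ITERS_PER_YEAR = SECONDS_PER_YEAR // TIMESTEP_S   # 175,200 iterations/model-year
N_CPUS = 25                                       # MPI ranks in the benchmark decomposition

def cpu_hours_per_model_year(walltime_hours_per_1000_iters):
    """CPU-hours needed to integrate one model year at the measured pace."""
    hours_per_iteration = walltime_hours_per_1000_iters / 1000.0
    return N_CPUS * hours_per_iteration * ITERS_PER_YEAR

# Hypothetical example: 0.5 wall-clock hours per 1,000 iterations
print(cpu_hours_per_model_year(0.5))  # ≈ 2190 CPU-hours per model year
```

Multiplying the result by 6 then gives the estimate for the full 6-year production run.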
As expected from the wall-time measurements, the GCP AMD EPYC Rome instance provides the lowest CPU-Hour cost per model year.
The cost per model year can be readily estimated on Google Cloud Platform by multiplying the cost per node-hour by the wall-time (in hours). For the MITgcm, the new AMD EPYC Rome instance on GCP provides the lowest cost per model year. It's also a bonus that the n2d instances provide the quickest time-to-solution as well.
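That cost estimate is a one-line calculation; here's a sketch using a hypothetical node-hour price and benchmark wall-time (the real values come from the GCP pricing chart and the measured benchmarks):

```python
# Cost per model year = (cost per node-hour) x (node-hours per model year).
ITERS_PER_YEAR = 175_200   # from the 180s timestep and a 365-day model year

def cost_per_model_year(node_hour_cost_usd, walltime_hours_per_1000_iters):
    """Estimated cloud cost (USD) to integrate one model year on one node."""
    node_hours_per_year = (walltime_hours_per_1000_iters / 1000.0) * ITERS_PER_YEAR
    return node_hour_cost_usd * node_hours_per_year

# Hypothetical example: $2.00/node-hour, 0.5 wall-clock hours per 1,000 iterations
print(round(cost_per_model_year(2.00, 0.5), 2))  # ≈ $175.20 per model year
```

Since the benchmark runs fit on a single node, node-hours and wall-clock hours coincide here; a multi-node decomposition would multiply by the node count.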
It's difficult to compare the cost of cloud with the cost of on-premise computing. This is largely because on-premise computing costs at universities are hidden. Most system users don't see the energy bills or the human resources costs that go into maintaining and operating compute systems. At the end of the day, if a research grant has allocated the indirect expenses that the university has asked for, the system cost is essentially "free" to the end-user, and the true cost remains hidden.