Running MITgcm workflows on Cloud CFD

Summary

You may be aware that Fluid Numerics is actively engaged in studying the dynamics of the Gulf Stream. In this work, we are using the MIT general circulation model (MITgcm) to conduct a series of downscaling simulations focused on the eastern seaboard of the United States, particularly around Cape Hatteras, NC, where the Gulf Stream separates from the coast and heads offshore as a baroclinically unstable jet. We started this work by running a few benchmarks on Google Cloud Platform and on systems available at our colleague's department at Florida State University.

Today, I wanted to share some of the progress we've made and talk about how we are using the Cloud CFD solution to run our MITgcm simulations, convert MITgcm output to VTK, post-process and visualize simulation output with Paraview, and monitor simulation diagnostics with BigQuery and Data Studio. Additionally, we've made some progress on estimating the cost per model simulation year for our current configuration.

Project Requirements

Batch Job Submission

To carry out our work, we wanted a compute platform where we could submit MITgcm simulation jobs and operate mostly hands-free. From our initial benchmarks, we expect our runs to take about 10 days of wall-time per model simulation year. Ideally, we should be able to submit one-year simulations and have them run without our intervention.

Application Monitoring

When the MITgcm runs, it can be configured to report global statistics, such as the maximum CFL number and min/max/mean temperature and salinity. These statistics are written to a file called STDOUT.0000, which can be parsed with a few clever grep | awk commands. While the simulation is running, we want to visualize these global statistics as they become available, so that we can catch obvious model configuration mistakes early and avoid wasting compute cycles.
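As an example of the kind of one-liner involved, the following sketch (assuming the monitor statistics appear on lines tagged %MON and that dynstat_theta_max is among the reported fields) pulls the maximum potential temperature out of STDOUT.0000:

grep "%MON dynstat_theta_max" STDOUT.0000 | awk '{print $NF}'   # the value is the last field on the line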

Paraview Server

To support model validation, we typically examine a number of fields that need to be visualized, including sea-surface temperature, sea-surface salinity, sea-surface height, and sea-surface kinetic energy (per unit mass). Additionally, when developing ocean models, numerical simulations can predict erroneous values of temperature and salinity. To help quickly identify the physical regions where this occurs, we want to create 3-D visualizations of grid cells with abnormal temperature and salinity values. Finally, we also want to create time-series output for other diagnostics, such as the average potential temperature and salinity above 300 m depth.

Paraview makes all of this post-processing possible in an easy-to-use interface. Since we anticipate terabytes of data to process, we want compute infrastructure that can run Paraview server (under MPI) and connect to a Paraview client on our workstations. Our team already has a script to convert MITgcm standard output to VTK, a format that Paraview can easily read.

Minimal Setup

We also wanted minimal setup beyond preparing the MITgcm simulation input decks and job submission scripts. At a minimum, the MITgcm requires a Fortran compiler, a C compiler and pre-processor, and an MPI installation. Additionally, we don't want to spend a lot of time installing and testing other tools like Slurm and Paraview.

Infrastructure

Cloud CFD

To meet our project requirements, we chose a combination of Cloud CFD, Filestore, BigQuery, and Data Studio. The Cloud CFD solution on Google Cloud Platform is an auto-scaling HPC cluster with the Slurm job scheduler and workload manager. It comes with GCC@9.2.0 and OpenMPI@4.0.2 pre-installed, meeting the minimum requirements for building the MITgcm. Additionally, Cloud CFD comes with Paraview pre-installed and ready to use at scale on Google Cloud Platform.
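As a quick sanity check after logging in, something like the following confirms the toolchain is in place (the module name matches what we use in our build and job scripts below; treat the exact wrapper output as illustrative):

module load openmpi    # puts the MPI compiler wrappers on the PATH
mpifort --version      # Fortran wrapper around GCC
mpicc --version        # C wrapper around GCC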

Filestore

We chose Filestore to provide 8 TB of storage to the Cloud CFD cluster over NFS. This is sufficient for storing one year of model simulation output in both MITgcm's metadata format and rectilinear VTK format, allowing us to run one-year simulations without having to worry about filling up file space.

SMTP Relay

Because we don't want to constantly babysit our simulation runs, we also set up email notifications using Slurm and Google's SMTP relay service in Google Workspace. This setup required publishing an A record for the Cloud CFD controller in our Google Domains zone and configuring the SMTP relay service in the Google Workspace Admin console. Cloud CFD comes with an easy-to-use CLI called cluster-services that helps configure the SMTP email server Slurm uses to send job status updates.
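With the relay in place, turning notifications on for a run is just a matter of adding the standard Slurm mail directives to the batch script, for example (the address is a placeholder):

#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=someone@example.com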

BigQuery & Data Studio

To provide a dashboard for monitoring the simulation, we set up a crontab on the Cloud CFD login node that runs a script to parse the MITgcm STDOUT.0000 for monitoring metrics. This script writes newline-delimited JSON files containing the parsed metrics, which are then uploaded to BigQuery.
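The upload step can be handled with the bq command-line tool; a minimal sketch, assuming a placeholder dataset and table named mitgcm_monitor.diagnostics and a metrics.json file produced by the parser:

bq load --autodetect --source_format=NEWLINE_DELIMITED_JSON \
    mitgcm_monitor.diagnostics metrics.json    # --autodetect infers the schema on first load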

Finally, the BigQuery table is connected to a Data Studio report that graphically displays the max CFL number, max/mean kinetic energy, min/mean/max potential temperature, and min/mean/max salinity. This report is fully interactive, but we're showing a screenshot below.

Configuration & Setup

Compute Partitions

Based on our benchmarks on Google Cloud, we decided to use n2d-highcpu-96 instances to run the MITgcm. For Paraview, the main concern is ensuring that we don't run out of system memory during post-processing. After some experimentation with our Paraview layouts, we found that about 96 GB of memory was sufficient for data visualization and time-series generation on our model grid (95.6 million grid cells). Finally, we set up a small partition for handling the conversion of MITgcm data to rectilinear VTK output.

The relevant partition definitions in our cluster-configuration file are given below.

- labels:
    goog-dm: gfd-cluster
  machines:
  - disable_hyperthreading: false
    disk_size_gb: 20
    disk_type: pd-standard
    external_ip: false
    machine_type: n2d-highcpu-96
    max_node_count: 10
    name: mitgcm-node
    preemptible_bursting: false
    static_node_count: 0
  max_time: INFINITE
  name: mitgcm
- labels:
    goog-dm: gfd-cluster
  machines:
  - disable_hyperthreading: false
    disk_size_gb: 20
    disk_type: pd-standard
    external_ip: false
    machine_type: n1-highmem-16
    max_node_count: 10
    name: pv-node
    preemptible_bursting: false
    static_node_count: 0
  max_time: INFINITE
  name: paraview
- labels:
    goog-dm: gfd-cluster
  machines:
  - disable_hyperthreading: false
    disk_size_gb: 20
    disk_type: pd-standard
    external_ip: false
    machine_type: n1-highmem-2
    max_node_count: 10
    name: vtk-proc-node
    preemptible_bursting: false
  max_time: INFINITE
  name: vtk-processing

Adding Filestore

To add Filestore to our cluster, we created a directory (/mnt/mitgcm-datastore) and set the group ownership to fluid_users, a built-in Linux group for all users of the Cloud CFD cluster that can be used for shared directories. From there, we created a Filestore instance with 8 TB of Basic HDD capacity. Then, we set up the mount definitions in our cluster-configuration file, similar to what's shown below.

mounts:
- group: fluid_users
  mount_directory: /mnt/mitgcm-datastore
  mount_options: rw,hard,intr
  owner: joe
  permission: '755'
  protocol: nfs
  server_directory: FILESTORE-IP:/mitgcm_data
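For completeness, the one-time directory preparation described above amounts to something like the following (a sketch; the path and group name are the ones from our setup):

sudo mkdir -p /mnt/mitgcm-datastore           # mount point for the Filestore share
sudo chgrp fluid_users /mnt/mitgcm-datastore  # shared group for all cluster users
sudo chmod 755 /mnt/mitgcm-datastore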

Getting the MITgcm Running on Cloud CFD

MITgcm Domain Decomposition

MITgcm executables are created for specific simulation configurations. At compile time, the MITgcm needs to know the model grid layout in addition to the domain decomposition and number of MPI ranks. Domain decomposition in the MITgcm is done only in the longitude and latitude directions. Our model grid has 1280 (longitude) x 996 (latitude) x 75 (vertical) grid cells, and we want to use 96 MPI ranks to fully subscribe an n2d-highcpu-96 instance, so we decompose the domain into 8 (longitude) x 12 (latitude) = 96 tiles, giving each MPI rank a subdomain of 160 x 83 x 75 grid cells. This domain decomposition is encapsulated in the SIZE.h header file.
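A quick sanity check of that arithmetic (illustrative only; the actual values live in SIZE.h):

nPx=8; nPy=12                                                      # ranks in longitude and latitude
echo "$(( 1280 / nPx )) x $(( 996 / nPy )) x 75 cells per rank"    # 160 x 83 x 75
echo "$(( nPx * nPy )) MPI ranks"                                  # 96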

Building the MITgcm

The MITgcm is built by first running the included genmake2 script to pre-process the Fortran 77 source files and create a Makefile for building the application. During this process, we need to tell the build system how to find the MPI compilers and which compiler flags to use. This information is specified in an opt-file that is passed to genmake2. The opt-file that works on the Cloud CFD cluster is shown below.

#!/bin/bash

MPI='true'
FC="/apps/openmpi/bin/mpifort"
CC="/apps/openmpi/bin/mpicc"
F77="/apps/openmpi/bin/mpifort"
DEFINES='-DWORDLENGTH=4'
CPP='/usr/bin/cpp -P -traditional'
EXTENDED_SRC_FLAG='-Mextend'
INCLUDES='-I/apps/openmpi/include'
LIB=''
CFLAGS='-O3'
FFLAGS="$FFLAGS -fconvert=big-endian -fimplicit-none -mcmodel=medium"
FOPTIM='-O3 -funroll-loops'
NOOPTFILES="$NOOPTFILES ini_masks_etc.F"

To build the application, we created a short script (rebuild.sh) that loads the OpenMPI module and runs through the MITgcm build workflow:

module load openmpi    # MPI compiler wrappers referenced in the opt-file
make clean
make CLEAN             # deeper clean of generated files before re-running genmake2
../../tools/genmake2 --mods=../code --optfile=google-cloud_gcc-openmpi
make depend            # generate dependency information
make -j                # parallel build of mitgcmuv
mv mitgcmuv ../run     # stage the executable in the run directory

Running the MITgcm

The MITgcm is a finite volume CFD application that solves the hydrostatic primitive equations on an Arakawa C-grid. In our first round of simulations, we've decided to run with 96 MPI ranks, with each rank mapped to a virtual CPU (hardware thread/hyperthread) on the n2d-highcpu-96 compute nodes. In our Slurm batch header, we explicitly indicate that we are launching a job with 96 tasks and require 80 GB of memory. When launching the MITgcm executable with mpirun, we use the -bind-to hwthread option to bind each MPI rank to a hardware thread.

#!/bin/bash
#SBATCH --account=cfd
#SBATCH --partition=mitgcm
#SBATCH --ntasks=96
#SBATCH --mem=80g
#SBATCH -o mitgcm.out
#SBATCH -e mitgcm.out
# /////////////////////////////////////////// #

module load openmpi

mpirun -np 96 -bind-to hwthread ./mitgcmuv

Visualizing Results

A huge benefit of the Cloud CFD cluster is that Paraview comes pre-installed along with a Paraview server connection (pvsc) file. The pvsc file can be loaded in Paraview to bring up a menu for connecting to the Cloud CFD cluster. When you connect to the Cloud CFD cluster using this pvsc file, a job is submitted to start the Paraview server on auto-scaling compute nodes. When you disconnect from the server in Paraview, the job is automatically cancelled and Cloud CFD takes the compute node down for you. There is an easy-to-follow codelab that walks through each step of connecting your Paraview client to the Paraview server.

Paraview supports a large number of file formats from many common HPC applications; however, the MITgcm's standard output format is not among them. To bridge this gap, we developed a simple Python script that converts MITgcm output to rectilinear VTK, which Paraview does support. When a simulation job is submitted, we also submit a post-processing job to convert the simulation output. This job is submitted with a dependency on the simulation job, so that post-processing only runs when the simulation completes successfully.

#! /bin/bash

# Simulation job; --parsable makes sbatch print just the job ID
simId=$(sbatch --parsable mitgcm.slurm)

# Post-processing job, held until the simulation completes successfully
sbatch --dependency=afterok:$simId mitgcm2vtk.slurm

Once the VTK output is ready (and we receive the email notification from Slurm), we can connect to the Paraview server and begin rendering output.

Simulation Costs

When thinking about the costs of running the MITgcm on Google Cloud, we have to distinguish between two groups of costs: monthly overhead costs and compute costs.

Monthly Overhead Costs

The monthly overhead costs include the cost of running the Cloud CFD controller and login VMs and the Filestore instance. These are considered monthly overhead because we expect these resources to run 24/7 for the duration of our project. To help keep monthly costs low, we opted for n1-standard-4 instances for the controller and login nodes. Using the GCP pricing calculator, we estimate our overhead costs at about $1,835/month.

Compute Costs

While running simulations, we kept track of the wall-time needed to forward-step the model per time step. After a few runs, we found that our configuration of 96 MPI ranks fully subscribing an n2d-highcpu-96 instance requires about 230.76 hours to complete a simulation year. Using the GCP pricing calculator, we estimate that one simulation year costs about $552.90.
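As a back-of-the-envelope check, the implied hourly rate follows directly from those two figures (derived from our numbers above, not a quoted GCP price):

echo "scale=4; 552.90 / 230.76" | bc    # ~2.40 USD per node-hour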

We think we can do better on our simulation costs by leveraging preemptible instances. However, this requires modifying our job submission scripts so that a job can recover from a node preemption (see the sketch after this paragraph). If we did make these changes, we could expect (at best) our one-year simulation cost to drop to about $167.44. Of course, there is likely to be some overhead incurred by preemption; when a VM is preempted, we'd have to restart the MITgcm from the nearest pickup file. The only other room for improvement in simulation costs is to explore different compilers, MPI flavors, build options, and operating system optimizations, which are a bit out of scope for this project but would certainly be interesting to explore.
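A rough idea of what a preemption-tolerant submission might look like (a sketch only; this is not something we have implemented, and pointing the model at the latest pickup file would still need a helper of our own):

#!/bin/bash
#SBATCH --partition=mitgcm
#SBATCH --ntasks=96
#SBATCH --requeue    # allow Slurm to resubmit the job after a preemption

module load openmpi
# Before each (re)start, the run would need to resume from the most recent
# pickup file, e.g. by updating nIter0 in the data namelist (not shown here).
mpirun -np 96 -bind-to hwthread ./mitgcmuv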