OSU Benchmarks on Google Cloud Platform
How does MPI scale on Google Cloud Platform ?
The Message Passing Interface (MPI) is the backbone of most data-parallel high performance and scientific computing applications. MPI enables scientific software developers to write applications in C/C++, Fortran, and Python that can coordinate any number of processes across any number of servers.
In data-parallel applications with MPI, programs are written assuming that individual processes, called MPI Ranks, execute the same set of instructions, but operate on a different, often unique, subset of data. In this approach, each rank occasionally needs to be made aware of another rank's data. This incurs a communication overhead that dictates the application's scalability. An application's ability to scale to larger problems or achieve a faster time-to-solution depends partly on the platform you choose to run on. Check out our article on data-parallelism and scaling to learn why!
While it's a good idea to benchmark and profile your application on existing hardware before transitioning to new platforms, community accepted micro-benchmarks allow developers and system integrators to assess system performance limitations and potential bottlenecks. Micro-benchmarks are designed to exercise individual routines or key hardware components. For example, the stream benchmarks are designed to assess memory bandwidth, which is crucial for modeling and understanding an HPC application's performance, such as in Roofline Modeling.
Ohio State University's Computer Science and Engineering (OSU-CSE) leads the development of MVAPICH, an implementation of MPI for C/C++ and Fortran applications. They have also developed a standard suite of micro-benchmarks for exercising each routine in the MPI standard. The OSU benchmarks include programs for assessing the performance point-to-point, all-to-all, scatter, gather, non-blocking communications, and one-sided communications like puts and gets.
Why point-to-point benchmarks ?
Of the micro-benchmarks available in the OSU benchmarks toolkit, we're interested in understanding point-to-point message passing performance on Google Cloud Platform. This is driven by our HPC application development in Discontinuous Galerkin Spectral Element Methods with explicit time integration. In these methods, MPI ranks communicate only with "nearest-neighbors" and exchange data only with a small subset of the total number of ranks.
Bandwidth and Latency
Additionally, point-to-point benchmarks are useful for assessing network performance and MPI behavior differences on-VM and off-VM. MPI performance is usually characterized by latency and bandwidth. Latency is a measure of the amount of time it takes for a message to be transmitted from one MPI rank to another, and is usually measured in microseconds. Bandwidth is the rate that data can be transmitted from one MPI rank to another, and is usually measured in MB/s or GB/s.
From the OSU Benchmarks site
"The latency tests are carried out in a ping-pong fashion. The sender sends a message with a certain data size to the receiver and waits for a reply from the receiver. The receiver receives the message from the sender and sends back a reply with the same data size. Many iterations of this ping-pong test are carried out and average one-way latency numbers are obtained. Blocking version of MPI functions (MPI_Send and MPI_Recv) are used in the tests."
"The bandwidth tests were carried out by having the sender sending out a fixed number (equal to the window size) of back-to-back messages to the receiver and then waiting for a reply from the receiver. The receiver sends the reply only after receiving all these messages. This process is repeated for several iterations and the bandwidth is calculated based on the elapsed time (from the time sender sends the first message until the time it receives the reply back from the receiver) and the number of bytes sent by the sender. The objective of this bandwidth test is to determine the maximum sustained data rate that can be achieved at the network level. Thus, non-blocking version of MPI functions (MPI_Isend and MPI_Irecv) were used in the test."
I used the Fluid-Slurm-GCP+OpenHPC cluster as a test-bed to carry out the OSU latency and bandwidth benchmarks. Once this system was setup through the GCP marketplace, I added Slurm partitions for
n1-highcpu-64 (64 vCPU Intel Broadwell)
n1-standard-16 (16 vCPU Intel Broadwell)
n2-highcpu-64 (64 vCPU Intel Cascade Lake)
n2d-highcpu-64 (64 vCPU AMD EPYC Rome)
I ran the OSU bandwidth and latency benchmarks on all four partitions using the provided GNU/8.2.0 and MPICH/3.2.1 MPI stack. In one round of tests, benchmarks were executed using two MPI ranks on one VM instance to measure on-VM performance. In the other round of tests, benchmarks were executed on two VM instances with one MPI rank per node.
The average bandwidth measurements from osu_bw are shown in a log-log plot.
Generally, what we observe for all machine types and for on-VM and off-VM communication is that bandwidth increases as the MPI packet size increases, until it begins to plateau beyond about 500 KB. On-VM bandwidth maximum is (roughly) 10 GB/s, whereas off-VM bandwidth maximum is around 2 GB/s.
At the time we ran these benchmarks, the off-VM results were consistent with GCP documentation on network bandwidth limitations that indicated 16 Gbps (2 GB/s) egress network caps. Recently, GCP has increased egress network caps for Skylake, Cascade Lake, and EPYC Rome Instances with 16 vCPU or more to 32 Gbps (4 GB/s).
The average bandwidth measurements from osu_latency are shown in a log-log plot.
Generally, what we observe for all machine types and for on-VM and off-VM communication is that latency increases as the MPI packet size increases. Additionally communication off-VM results in a higher latency than on-VM communication. At small packet sizes, less than 1KB, off-VM latency is about 70x-80x larger (worse performance) than on-VM latency. That perfomance gap narrows for larger packet sizes, where off-VM latency is only 10x-20x larger for packet sizes greater than 1MB.
This is just the start of an OSU benchmark data-set to help HPC developers characterize cloud system performance (including MPI performance) across cloud providers and on-premise resources. As we continue to experiment with GCP and other on-premise resources, we plan to build a more complete data-set to characterize available computing resources for HPC. We plan to include other benchmarks, like the Empirical Roofline Toolkit to help characterize memory and compute performance for CPUs and GPUs.
I have put together a bitbucket repository that you can use to reproduce these benchmark runs on a HPC cluster with a Slurm Job Scheduler. If you end up benchmarking a system we don't have shown in our figures here, reach out to us to share your data!