Seeing Clearly in the Cloud

An erratum

This article responds to a recent Google Blog post that we co-authored with Google.

First, a bit about the domain Fluid Numerics operates in, and about what we do as a service provider and as an advocate for and contributor to the research software engineering community.

Within most domain sciences, including weather research and forecasting, a number of factors affect the scalability of a computing workload or job. Some computers are specifically designed to execute large-scale distributed jobs and some are not. At Fluid Numerics we share time on large privately and publicly owned supercomputers, maintain our own Research Computing Cloud with hyperscale public cloud clusters on Google Cloud for our own research, and manage Research Computing Clouds and Clusters for our customers.

Because our work spans different systems, both synchronously and asynchronously, it is in our best interest to categorize, classify, and define systems precisely. We regularly work with emerging technologies that are considered “cutting edge” and “state-of-the-art”, and we understand the “bleeding edge” of these experimental categories as well. We fully understand how exciting it is to hear about access to new “high performance computing” resources and “ready-to-use” software applications.

Our experience and skills in optimization and performance tuning for research software have helped us identify and understand where the Weather Research and Forecasting (WRF) modeling system performs well, and where it does not, on a variety of infrastructure including Google Cloud.

The resources we offer on Google Cloud are available and stable within the bounds of the terms of service and base images provided by Google Cloud. In high performance computing, we typically prefer to work with hardware with as much freedom as possible, so that we can optimally configure a cluster down to bare metal.

At research computing and supercomputing centers worldwide, hardware is procured, and software stacks that offer ideal programming interfaces are tuned, to fit a variety of scientific applications and needs. For Research Computing Clouds based on Google Cloud, we are able to tune operating systems and third-party software, but we are bounded by the hardware configurations Google Cloud uses to keep its systems operable and reliable.

As we continue to invest in cloud resources for and with the community, we continue to discover and release functional tools and resources that fit the needs of researchers and scientists around the globe. We also profile the strengths and weaknesses of these resources through in-depth performance benchmarking. Many applications have a broad range of use cases, and in many cases a production research application runs on critical infrastructure that has largely been housed in on-premises datacenters and networks.

A couple of clarifications should be made about some statements in this blog post to distinguish our approach to cloud from that of other organizations.

On Performance Comparisons with On-Premises Systems

“To help, weather forecasters can now run the Weather Research and Forecasting (WRF) modeling system easily on Google Cloud using the new WRF VM image from Fluid Numerics, and achieve the performance of an on-premises supercomputer for a fraction of the price”

In a livestream we ran back in May of 2021, we showed performance comparisons between Google Cloud and Cheyenne for the CONUS 2.5km benchmark. For a small number of MPI ranks (480 or fewer), performance on Google Cloud was on par with Cheyenne. We found, however, that scaling on Google Cloud was less than stellar, even when using compact placement.

Since May of 2021, we have done a bit more work on top of the material presented in the livestream. In particular, we looked into using the Intel oneAPI compilers, which were not publicly available when we first started integrating WRF with Google Cloud. Overall, this gave a marked improvement (22%) over the GCC compilers for a given run, but unfortunately it did not improve scaling.

That said, we want to be clear that operating on Google Cloud does have its benefits. Of the systems we work on, we promote Google Cloud as an excellent platform for Research Computing.

While WRF performance on Google Cloud does not meet our classification as “high performance” when compared to an on-premises supercomputer, we use and rely on Google Cloud as a component of our own internal research computing workflows, which prepare us for operations at larger, better interconnected datacenters.

Development and engineering operations for research software sometimes consist of many iterative testing cycles before software updates and upgrades are ready for production scale on a larger system. By integrating a lightweight testbed in the cloud, we avoid allocations, downtime, preemption, and a variety of other impediments that are typically encountered at privately and publicly owned research and supercomputing centers.

Semantics and terminology have been key to marketing new technologies since the datacenter market started developing in the 1970s. As collateral hardware starts taking new directions alongside applications and use cases, we have started to apply specificity when referring to systems as “high performance” or as a “supercomputer”.

At Fluid Numerics, we define “HPC” or “high performance computing” as the practice of developing algorithms and software and selecting the appropriate hardware to obtain the lowest software execution runtimes. Practicing high performance computing typically involves parallel computing, hardware systems design and engineering, and deployment optimizations. We have designated our products as “Research Computing” to appropriately categorize the general use of these systems.

Let’s be clear. There are areas where Google Cloud performs well, and we do consider applications that are not latency-bound, and that may perform at scale on Google Cloud, to be ideal candidates for study. Many high performance computing applications do not require multi-node communications and can perform all operations on a single node. Some applications can hide communication latency well by overlapping communication with computation.
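To make the idea of hiding latency concrete, here is a minimal sketch of a toy cost model for one time step. It is not a measurement of WRF or of any real network: the function name `step_time`, the `overlap` parameter, and the millisecond values are all hypothetical and chosen only for illustration.

```python
def step_time(t_compute: float, t_comm: float, overlap: float = 0.0) -> float:
    """Toy per-step wall time (hypothetical model, not measured data).

    overlap is the fraction of communication time the application tries to
    hide behind computation: 0.0 means fully exposed, 1.0 means as much as
    possible is hidden.
    """
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be between 0 and 1")
    # Communication can only be hidden while there is computation to hide it behind.
    hidden = min(overlap * t_comm, t_compute)
    return t_compute + t_comm - hidden


# Hypothetical step: 8 ms of computation, 2 ms of halo exchange.
exposed = step_time(8.0, 2.0)              # communication fully exposed
overlapped = step_time(8.0, 2.0, 1.0)      # communication fully hidden
# If communication dominates, overlap cannot hide all of it.
comm_bound = step_time(8.0, 12.0, 1.0)
print(exposed, overlapped, comm_bound)
```

In this model, perfect overlap reduces the step time to the larger of the two costs, which is why applications that are dominated by computation tolerate a higher latency network better than applications that are latency-bound.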

Before considering an application High Performance, we must develop confidence that the end results are equal to or better than a commonly accepted, recorded baseline performance specification: a benchmark that provides us with a validated answer against which to compare a system’s “performance”.

On public cloud systems, only a subset of controls is exposed for hardware systems design and engineering. Most of the hardware decisions are made by the public cloud provider, which is interested in serving the needs of a broad user base that is typically not interested in high performance computing. On private clusters and on-premises supercomputers, the owner and operator of an organization’s research computing resource has many more design and engineering decisions to make. Today’s supercomputer can be built to fit a variety of applications and end users.

At Fluid Numerics we typically describe a “Supercomputer” as a homogeneous cluster of servers with compute and networking hardware specifically designed to meet performance goals for end users.

Although public cloud providers do provision access to an enormous amount of compute resources, a public cloud provider’s network does not typically match the low communication latencies often obtained in supercomputers; you can see OSU latency and bandwidth microbenchmarks on Google Cloud in one of our previous journal articles.

Comparing Google Cloud, which uses “ethernet-like” networking, with the Cheyenne supercomputer, which was designed with a low latency InfiniBand network, is a bit unfair. What we can say is that even though Google Cloud networking currently shows the hallmarks of a higher latency network, you don’t have to lay out more than $35 million USD plus additional operating capital to obtain access to a large amount of compute resources. How end users define “performance” is a very important consideration in today’s research computing allocation and procurement decisions.

Further, Cheyenne runs on coal power, which seems counterintuitive when we consider that workloads studying climate science are then technically contributing to the climate crisis. Google Cloud, on the other hand, provides options to select greener compute facilities that operate 100% carbon-neutral with a calculable carbon footprint. In addition, Fluid Numerics has committed at least 1% of annual revenues to environmental efforts. Considering the impacts of computational research and science is necessary in order to take effective action.

The teams we work with at research computing and supercomputing centers around the world rely on unbiased facts when approaching the challenges involved in operating critical supercomputing infrastructure and the scientific research applications that run on it. Our time at Fluid Numerics is split unevenly between high performance and research computing. We focus on enabling research by providing supported integrations that help you through the hardships that an individual or a team might face when getting started with their own systems as an administrator, a researcher, or both.

Clarification on File IO Performance Improvements

“Below, we show the speedup in file IO activities relative to serial IO on an NFS file system. For this example, we are running the CONUS 2.5km benchmark on c2-standard-60 instances with 960 MPI ranks. By changing WRF’s file IO strategy to parallel IO, we accelerate file IO time by a factor of 60.”

Figure: correct time spent in file IO. The file IO speedup metric is quoted incorrectly in the statement above; the speedup factor is 192.5, not 60 as stated in the original text of the blog post.

A couple of items seem to have gotten mixed up here. The figure shows that transitioning from serial IO to parallel IO, simply by changing a setting in WRF’s namelist parameters, reduces the amount of time spent in file IO by a factor of 192 (not 60). We’ll add here that the Slurm controller (n2-standard-80 with 2TB PD-SSD) was used as the NFS host for file IO. By switching the file server to a Lustre file system with 4 n2-standard-16 OSS servers (with local SSDs) and an n2-standard-16 MDS/MGS server, we were able to squeeze a lot more performance (another factor of ~68) out of the file IO, for an overall file IO speedup of about 13,000x over serial.
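As a quick check on how these factors combine, here is a minimal sketch of the arithmetic. It assumes the two improvements apply in sequence, so their speedup factors multiply; the variable names are ours, and the only inputs are the 192.5 and ~68 figures quoted above.

```python
# Speedup factors quoted in the text above.
serial_to_parallel_nfs = 192.5  # serial IO -> parallel IO, both on NFS
nfs_to_lustre = 68.0            # parallel IO on NFS -> parallel IO on Lustre (approximate)

# Assumption: the two changes are applied one after the other,
# so the overall speedup is the product of the two factors.
combined = serial_to_parallel_nfs * nfs_to_lustre
print(f"combined file IO speedup over serial IO on NFS: ~{combined:.0f}x")
```

The product is 13,090, which is where the "about 13,000x" figure comes from.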

There are two key ingredients here:

WRF supports parallel file IO in an embarrassingly parallel way: each MPI rank writes its own NetCDF file.
A parallel file system with higher speed disks (local SSD rather than PD-SSD) and multiple servers to share the IO workload.

Neither of these ingredients is sufficient on its own to obtain these kinds of results. Running serial IO on a large Lustre file system, for example, likely won’t result in a performance gain that justifies the additional expense. Conversely, running with parallel IO over NFS with 960 ranks (in this case) provided a 192x reduction in file IO time (20% scaling efficiency). Further, operating a cluster where the controller serves as the sole mount point would, in this case, likely cause some “laggy” behavior for other users who may be sharing the same cluster.
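The 20% figure can be reproduced with a short sketch. It assumes scaling efficiency here means the measured speedup divided by the ideal speedup, where the ideal is one unit of speedup per writing rank relative to a single serial writer; the helper name `parallel_efficiency` is hypothetical.

```python
def parallel_efficiency(speedup: float, workers: int) -> float:
    """Fraction of ideal (linear) speedup that was actually achieved."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return speedup / workers


# 192.5x measured speedup with 960 MPI ranks each writing their own file.
efficiency = parallel_efficiency(192.5, 960)
print(f"scaling efficiency: {efficiency:.1%}")
```

Under that definition, 960 ranks writing over NFS recover roughly one fifth of the ideal 960x, which suggests the single NFS host is the limiting factor.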

In a complete workflow involving weather simulations, post-processing and analysis of the simulation results are necessary. When working with general circulation model (GCM) data, many of the post-processing tools (Python, MATLAB, ParaView) can easily work with output where one file is provided per time-slice. When using the parallel IO option with WRF, each MPI rank writes its own NetCDF file per time-slice. Because of this, a complete workflow for producing weather simulation data and analysis will typically require stitching all of the “tiles” of model output back together. Alternatively, a post-processing system will be necessary for working with tiled NetCDF files; to our knowledge, such a post-processing system does not exist in the open-source WRF ecosystem, and an end user will need to develop such tooling.
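To give a sense of what stitching involves, here is a minimal sketch in plain Python. It is not WRF tooling: the function `stitch_tiles`, the tile layout, and the sample values are hypothetical. In a real workflow each patch would be read from a per-rank NetCDF file and its position in the global grid would come from that file's metadata; here both are simply supplied by hand.

```python
from typing import Dict, List, Optional, Tuple

Tile = List[List[float]]  # one 2D patch, indexed [south_north][west_east]


def stitch_tiles(tiles: Dict[Tuple[int, int], Tile], ny: int, nx: int) -> Tile:
    """Assemble per-rank patches into a single (ny, nx) field.

    tiles maps the (j0, i0) global offset of each patch's first element
    to the patch data. Offsets are assumed to be known for every patch.
    """
    field: List[List[Optional[float]]] = [[None] * nx for _ in range(ny)]
    for (j0, i0), patch in tiles.items():
        for dj, row in enumerate(patch):
            for di, value in enumerate(row):
                field[j0 + dj][i0 + di] = value
    # Fail loudly if a rank's output is missing rather than return gaps.
    if any(value is None for row in field for value in row):
        raise ValueError("tiles do not cover the full domain")
    return field  # type: ignore[return-value]


# Hypothetical 3x4 domain decomposed over 4 ranks (2x2 layout, uneven patches).
tiles = {
    (0, 0): [[0.0, 1.0], [4.0, 5.0]],
    (0, 2): [[2.0, 3.0], [6.0, 7.0]],
    (2, 0): [[8.0, 9.0]],
    (2, 2): [[10.0, 11.0]],
}
full_field = stitch_tiles(tiles, ny=3, nx=4)
for row in full_field:
    print(row)
```

Even in this toy form, the stitching step has to run for every variable and every time-slice across all ranks, which is why it belongs in any estimate of total workflow time.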

This last consideration is particularly important when considering total cost of ownership. We need to remember that driving down the simulation runtime does not necessarily imply lower production costs for a weather forecasting operation or weather research project. Every decision you make about how you deploy software influences other steps in a complete workflow for producing a useful product or service for your customers.

Summary

At Fluid Numerics, we think critically about how we design and integrate systems to enable customers on Google Cloud. Additionally, we want to help you consider the choices you have available and illuminate the potential paths and pitfalls so that you can safely navigate your journey into research computing on public cloud systems.