In a livestream we ran back in May of 2021, we showed performance comparisons between Google Cloud and Cheyenne for the CONUS 2.5km benchmark. For a small number of MPI ranks (480 or fewer), performance on Google Cloud was on par with Cheyenne. We found, however, that scaling on Google Cloud was less than stellar, even when using compact placement.
Since May of 2021, we have done a bit more work on top of the material presented in the livestream; in particular, we looked into using the Intel oneAPI compilers, which were not publicly available when we first started integrating WRF with Google Cloud. This yielded a marked improvement (roughly 22%) over the GCC compilers for individual runs, but unfortunately it did not improve scaling.
That being said, we want to be clear that operating on Google Cloud does have its benefits. Of the systems we work on, we promote Google Cloud as an excellent platform for Research Computing.
While WRF performance on Google Cloud does not meet our classification of “high performance” when compared to an on-premise supercomputer, we rely on Google Cloud as a component of our own internal research computing workflows, which prepare us for operations at larger, better-interconnected datacenters.
Development and engineering operations for research software sometimes consist of many iterative testing cycles before software updates and upgrades are ready for production scale on a larger system. By integrating a lightweight testbed in the cloud, we avoid allocation limits, downtime, preemption, and a variety of other impediments typically encountered at privately and publicly owned research and supercomputing centers.
Semantics and terminology have been key to marketing new technologies since the datacenter market started developing in the 1970s. As hardware takes new directions alongside applications and use cases, we have become more deliberate about when we refer to a system as “high performance” or as a “supercomputer”.
At Fluid Numerics, we define “HPC” or “high performance computing” as the practice of developing algorithms and software, and selecting the appropriate hardware, to obtain the lowest possible execution times. Practicing high performance computing typically involves parallel computing, hardware systems design and engineering, and deployment optimizations. We have designated our products as “Research Computing” to appropriately categorize the general use of these systems.
Let’s be clear: there are areas where Google Cloud performs well, and we consider applications that are not latency bound, and that may therefore perform at scale on Google Cloud, to be ideal candidates for study. Many high performance computing applications do not require multi-node communication and can perform all operations on a single node. Others can hide communication latency well by overlapping communication with computation, as sketched below.
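To make that last pattern concrete, here is a minimal sketch, in C with MPI, of a halo exchange that overlaps communication with computation: the exchange is started with non-blocking calls, interior work proceeds while messages are in flight, and only the boundary update waits on the network. The array sizes and buffer names here are illustrative assumptions, not code from our WRF runs.

```c
/* Sketch of hiding communication latency by overlapping it with
 * computation, using non-blocking MPI point-to-point calls.
 * Illustrative pattern only; sizes and names are assumptions. */
#include <mpi.h>
#include <stdlib.h>

#define N    1000000
#define HALO 1024

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *field     = calloc(N, sizeof(double));
    double *halo_send = calloc(HALO, sizeof(double));
    double *halo_recv = calloc(HALO, sizeof(double));

    int next = (rank + 1) % size;
    int prev = (rank + size - 1) % size;
    MPI_Request reqs[2];

    /* 1. Start the halo exchange, but do not wait for it yet. */
    MPI_Irecv(halo_recv, HALO, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(halo_send, HALO, MPI_DOUBLE, next, 0, MPI_COMM_WORLD, &reqs[1]);

    /* 2. Update interior points, which do not depend on the halo.
     *    This work proceeds while the messages are in flight. */
    for (int i = HALO; i < N - HALO; i++)
        field[i] = 0.5 * (field[i - 1] + field[i + 1]);

    /* 3. Only now wait on the exchange, then update boundary points. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    for (int i = 0; i < HALO; i++)
        field[i] += halo_recv[i];

    free(field); free(halo_send); free(halo_recv);
    MPI_Finalize();
    return 0;
}
```

When the interior work takes longer than the message transit time, the network latency is effectively free; when it does not, as in strong-scaling runs with many small subdomains, latency shows up directly in the runtime.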
Before considering an application High Performance, we must develop confidence that the end results are equal to or better than a commonly accepted, recorded performance baseline: a benchmark that provides a validated answer against which to compare a system’s “performance”.
On public cloud systems, only a subset of controls are available for hardware systems design and engineering; most hardware decisions are made by the public cloud provider, whose interest is in serving a broad user base that is typically not focused on high performance computing. On private clusters and on-premise supercomputers, the owner and operator of an organization’s research computing resource has many more design and engineering considerations available. Today’s supercomputer can be built to fit a variety of applications and end users.
At Fluid Numerics we typically describe a “Supercomputer” as a homogeneous cluster of servers with compute and networking hardware specifically designed to meet performance goals for end users.
Although public cloud providers do provision access to an enormous amount of compute resources, a public cloud provider’s network does not typically achieve the low communication latencies obtained on supercomputers; you can see OSU latency and bandwidth microbenchmarks on Google Cloud in one of our previous journal articles.
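If you would like to reproduce that kind of measurement yourself, the core of the OSU latency test is a simple ping-pong between two ranks. Below is a minimal sketch in C with MPI, assuming exactly two ranks; the repetition count and message size are illustrative choices, and the actual OSU suite should be used for publishable numbers.

```c
/* Minimal ping-pong sketch of the kind of measurement the OSU
 * latency microbenchmark performs; assumes exactly two MPI ranks. */
#include <mpi.h>
#include <stdio.h>

#define REPS      1000
#define MSG_BYTES 8   /* small message, so latency dominates */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char buf[MSG_BYTES] = {0};
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double t1 = MPI_Wtime();
    if (rank == 0)  /* one-way latency = half the average round trip */
        printf("latency: %.2f us\n", (t1 - t0) / (2.0 * REPS) * 1e6);

    MPI_Finalize();
    return 0;
}
```

Running one rank per node, with and without compact placement, makes the interconnect’s contribution to small-message latency directly visible.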
Comparing Google Cloud, which uses “ethernet-like” networking, with the Cheyenne supercomputer, which was designed with a low latency InfiniBand network, is a bit unfair. What we can say is that even though Google Cloud networking currently bears the imprint of a higher latency network, you don’t have to lay out more than $35 million USD plus additional operating capital to obtain access to a large amount of compute resources. How end users define “performance” is a very important consideration in today’s research computing allocation and procurement decisions.
Further, Cheyenne runs on coal power, which seems counterintuitive when we consider that workloads studying climate science are themselves contributing to the climate crisis. Google Cloud, on the other hand, provides options to select greener compute facilities that operate 100% carbon-neutral with a calculable carbon footprint. In addition, Fluid Numerics has committed at least 1% of annual revenues to environmental efforts. Considering the impacts of computational research and science is necessary in order to take effective action.
The teams we work with at research computing and supercomputing centers around the world rely on unbiased facts when approaching the challenges of operating critical supercomputing infrastructure and the scientific research applications that run on it. Our time at Fluid Numerics is split unevenly between high performance computing and research computing, and we focus on enabling research by providing supported integrations that help you through the hardships you may face when getting started with your own system, where you are now an administrator, a researcher, or both.