Comparing On-premise and Cloud Costs for High Performance Computing
Transitioning to cloud computing systems to support high performance computing (HPC) workloads is becoming an active effort for universities and government labs. In working with these groups, we always wind up discussing cloud computing costs and total cost of ownership (TCO). In cloud-HPC, TCO includes direct compute, network, and storage costs from the cloud provider in addition to software licensing, infrastructure-as-code development and maintenance, support and consulting, user onboarding, and training.
Even with these costs, the cloud provides certain benefits over on-premise solutions, such as variable capacity. Variable capacity has two benefits:
1. Under times of significant usage, users can experience reduced queue times relative to on-prem, and organizations see fewer "missed opportunities" where workloads cannot be scheduled.
2. Under times of no usage, your organization does not need to pay for unused compute resources.
A true cost-benefit comparison of on-premise and cloud-hybrid or cloud-native models for HPC must also weigh factors beyond hardware. Longer queue times, typical of fixed-capacity on-prem systems, can result in irregular workflows for HPC system users that contribute to a difficult-to-quantify loss of productivity in human resource expenses.
In this article, we propose a simple model for understanding and comparing a few cloud solutions with on-premise platforms.
On-premise HPC Costs
Software licensing: operating systems, compilers, and job schedulers
Energy costs are the primary driver of utilities costs for operating an HPC system. Other expenses include networking and ISP charges, but we do not include these here. Some attention has been given to this subject in the peer-reviewed literature. As a simple model for estimating energy costs, we make the following assumptions to approximate reality:
1. A single server consumes energy at a rate of K_s when fully subscribed.
2. A single server consumes energy at a rate of K_i when idle (K_i < K_s).
3. Over a month, the percentage of time nodes spend fully subscribed is S (S < 1), and the percentage of time spent idle is (1-S).
4. When powered on, a percentage of the energy given to the server is converted to heat; the remaining energy is what is actually used for productive operations. Per unit time, the percentage of energy given off as heat is P.
5. A computing facility needs to extract the heat given off by the servers in order to maintain fixed operational temperatures.
6. Heating and cooling systems operate at less than 100% efficiency. We assume the amount of energy required to remove heat H is r*H, where r > 1.
Energy consumption rates, K_s and K_i, are given in units of kilowatts (kW); this is equivalent to thousands of joules per second, where a joule is a measure of energy. Utilities often charge in units of kilowatt-hours (kWh). Because of this, in our equation for monthly energy consumption, we multiply by 730 hours/month.
From these assumptions, we can estimate the kWh for operating an HPC cluster with N servers per month as

E = 730 * N * [ S*K_s + (1-S)*K_i ] * (1 + r*P)
For a complete derivation of this energy equation and an energy calculator, sign up for the Fluid Numerics Journal.
Once we have the energy consumption, we can multiply by the cost per kWh to obtain an estimate of monthly utilities costs.
To illustrate, let's make a back-of-the-envelope estimate of energy costs using some additional assumptions:
1. Over a month, your system stays fully subscribed (S = 1).
2. The energy consumption rate when fully subscribed is K_s = 500 W per node.
3. The energy consumption rate when idle is K_i = 350 W per node.
4. 50% of the energy given to the server is given off as heat (P = 0.5).
5. The HPC cluster consists of N = 50 servers.
6. It takes 25% more energy to remove heat energy (r = 1.25).
With these assumptions, E = 29,656.25 kWh. The national average cost for electricity in the United States is $0.1284/kWh, giving an estimated $3,807.86/month in utilities costs.
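This back-of-the-envelope estimate is easy to reproduce in a few lines of code. The sketch below implements the monthly energy model E = 730 * N * [S*K_s + (1-S)*K_i] * (1 + r*P) with the assumptions listed above; the function and parameter names are our own, not part of any existing calculator.

```python
# Back-of-the-envelope monthly energy cost for an on-prem HPC cluster.
# Implements E = 730 * N * [S*K_s + (1-S)*K_i] * (1 + r*P).

HOURS_PER_MONTH = 730

def monthly_energy_kwh(n_servers, s, k_s_kw, k_i_kw, p_heat, r_cooling):
    """Estimated monthly energy use (kWh), including cooling overhead."""
    compute_kw = n_servers * (s * k_s_kw + (1 - s) * k_i_kw)
    return HOURS_PER_MONTH * compute_kw * (1 + r_cooling * p_heat)

# Assumptions from the worked example: 50 nodes, fully subscribed,
# 500 W subscribed / 350 W idle per node, P = 0.5, r = 1.25.
e = monthly_energy_kwh(n_servers=50, s=1.0, k_s_kw=0.5, k_i_kw=0.35,
                       p_heat=0.5, r_cooling=1.25)
cost = e * 0.1284  # USD, at the US national average of $0.1284/kWh
print(f"{e:,.2f} kWh -> ${cost:,.2f}/month")
```

Running this reproduces the 29,656.25 kWh and $3,807.86/month figures above.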
The chart below shows the monthly estimated energy costs for this 50-node cluster as the percent utilization increases. The idle costs show the money spent to keep the system running that does not result in productivity. The "Productive Capital" shows the dollars spent on compute resources that contribute to productive labor. In this model, when system capacity used is greater than 35-40%, productive expenses exceed idle expenses.
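If the chart's split is taken to be exactly the S*K_s (productive) and (1-S)*K_i (idle) terms of the energy equation, the crossover utilization can be computed in closed form: setting S*K_s = (1-S)*K_i gives S* = K_i / (K_s + K_i). This is a sketch under that simplifying assumption, not the exact calculation behind the chart; with the power draws above it lands just over 40%.

```python
# Utilization at which "productive" energy spending (S * K_s) first exceeds
# "idle" spending ((1 - S) * K_i). Solving S*K_s = (1-S)*K_i for S gives
# S* = K_i / (K_s + K_i). The cooling factor (1 + r*P) multiplies both
# terms equally, so it drops out of the break-even point.

def breakeven_utilization(k_s_kw, k_i_kw):
    return k_i_kw / (k_s_kw + k_i_kw)

s_star = breakeven_utilization(k_s_kw=0.5, k_i_kw=0.35)
print(f"break-even utilization: {s_star:.1%}")  # roughly 41% with these power draws
```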
Node failures contribute to reduced system capacity that can cause interruptions in service for users. This, in turn, can translate to lost productivity due to variability in user workflows. Additionally, unexpected node failures divert system administration human resources.
Cloud HPC Costs
Software licensing: operating systems, compilers, and job schedulers
Google Cloud and Slurm-GCP Pricing Model
At its core, Google Cloud provides compute, network, and storage resources. Compute resources are charged per virtual CPU (vCPU) or GPU per unit time ( e.g. USD/vCPU/hour ) and per GB memory per unit time. Google Cloud has a variety of vCPU models available, classified as "n1", "n2", "m1", "c2", or "e2" instances. The classifiers distinguish between various models of high performance and commodity CPU chips; for more information on the CPU platforms available on GCP, see the CPU Platforms documentation. Each class of instances has predefined machines that have various core/memory ratios. A unique feature of Google Cloud Platform, relative to other cloud providers, is the ability to create custom instances with more fine-tuned core/memory ratios.
On GCP, network resources are only charged for egress out of a GCP region. The cost is determined by the quantity of data (GiB) that is transferred out of a GCP region. Traffic that stays within a GCP region, such as MPI traffic within a VPC subnet, is free.
Storage resources, like persistent disks, are charged per GB per unit time. For persistent disks on GCP, you can choose between high-bandwidth persistent SSDs and persistent HDD disks. As you may have guessed, SSDs cost more per GB.
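To show how these three charges combine, here is a minimal sketch of a monthly cost estimate for a statically provisioned pool of cloud resources. All rates below are hypothetical placeholders chosen for illustration, not current GCP prices; consult the provider's pricing pages for real figures.

```python
# Rough monthly cloud cost estimate combining the three charge types
# described above: per-vCPU and per-GB-memory compute charges, persistent
# disk storage, and network egress out of the region.
# All rates are hypothetical placeholders, NOT actual GCP prices.

HOURS_PER_MONTH = 730

def monthly_cloud_cost(n_vcpus, mem_gb, disk_gb, egress_gib,
                       vcpu_hr=0.03, mem_gb_hr=0.004,
                       disk_gb_month=0.04, egress_gib_rate=0.12):
    compute = HOURS_PER_MONTH * (n_vcpus * vcpu_hr + mem_gb * mem_gb_hr)
    storage = disk_gb * disk_gb_month
    network = egress_gib * egress_gib_rate  # intra-region traffic is free
    return compute + storage + network

# Example: a 32-vCPU/128 GB node pool with 500 GB of persistent disk
# and 50 GiB of monthly egress.
print(f"${monthly_cloud_cost(n_vcpus=32, mem_gb=128, disk_gb=500, egress_gib=50):,.2f}/month")
```

Note how the compute term dominates for an always-on deployment; this is what ephemeral compute capacity (below) is designed to reduce.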
Fluid Numerics' Slurm-GCP consists of a number of static login nodes, a controller node (that hosts the Slurm job scheduler), static compute nodes, and ephemeral compute nodes. All login and controller nodes have an attached boot disk.
Cloud computing resources provide an ecosystem where compute resources can be added and removed dynamically over time. Public cloud providers, like Google Cloud Platform, require resource quotas to be submitted and approved in order to establish your available capacity. With a quota in place, infrastructure can be provisioned using pre-built marketplace solutions. Alternatively, you can build more customized cloud infrastructure using infrastructure-as-code tools such as Google Cloud Deployment Manager or HashiCorp's Terraform, along with suitable scripting languages (e.g. Python).
Solutions like Fluid Numerics' Slurm-GCP maintain a static login node and controller that expose the ability to leverage dynamic compute capacity using a traditional HPC job scheduler. Ephemeral compute nodes are created when HPC workloads are submitted to the Slurm job scheduler. When compute nodes become idle for 5 minutes or more, the Slurm-GCP system automatically deletes the instances. This setup ensures that you are paying primarily for the compute costs that result in productive labor.
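Under the hood, this create-on-demand, delete-when-idle behavior maps onto Slurm's power saving hooks. A minimal sketch of the relevant slurm.conf settings is shown below; the values and script paths are illustrative placeholders (Slurm-GCP ships its own resume/suspend scripts):

```
# Illustrative slurm.conf power-saving settings (paths and values are placeholders)
SuspendTime=300                                # act on nodes idle for 5 minutes
SuspendProgram=/opt/slurm/scripts/suspend.py   # script that deletes idle instances
ResumeProgram=/opt/slurm/scripts/resume.py     # script that creates instances on demand
ResumeTimeout=300                              # seconds to wait for a node to boot
```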
Questions we have
Now that we've shared our take on cloud and on-premise costs for HPC, we've got a few questions for you! Reach out to firstname.lastname@example.org to join our discussion group!
What indexes/measures do you have for productivity?
How would you quantify unproductive labor costs that result from system down-times and long queue times?
How valuable is access to new and heterogeneous hardware for your organization?
Is our on-prem cost model applicable to your organization? Let us assess your current HPC operating costs and help you understand if cloud HPC is the right move for you!
Do you have any comments, criticisms, or feedback about this article?
We're interested in seeking out the truth and building consensus. Please contribute by letting us know what we're missing!