Expose Google Storage Buckets as a Globus Endpoint with GCSFuse

Introduction

In high performance computing, moving data between systems is at best an annoyance and at worst a major bottleneck. If you're one of the lucky few with direct access to HPC resources within your own organization and you're able to keep your data on one system, this struggle may feel foreign. The rest of us usually cobble together HPC resource allocations through programs like XSEDE or NCAR's CHAP. We typically have local workstations and perhaps a small departmental cluster for housing data, but not a whole lot of compute power.

As an example, in our work with FSU on the Topographic Dynamics of the Gulf Stream, we're working with a combination of resources from Google Cloud Platform, NCAR's Cheyenne, and a small cluster of servers at FSU called "The Comedians" (each server is named after a famous comedian: "Moe", "Larry", "Curly", "Bennyhill", etc.). In this project, we start by designing our simulations on The Comedians. This includes pre-processing topography data and the atmospheric fields that drive the ocean circulation, and generating boundary conditions to help us model the Gulf Stream along the U.S. East Coast. Once these input decks are ready, they are shipped to a supercomputer for running the MITgcm. After running the simulations, we post-process the model output to create VTK files for visualization and compute bulk time series statistics for model validation. This data is further used to prepare our next round of higher-resolution simulations through downscaling, and NSF ultimately requires that the data be made publicly available.


Syncing Data with GCS and GCSFuse

Typical routes for shifting terabytes of data between systems include tools like scp or rsync. Many supercomputing platforms, though, limit how long you can stay connected over these kinds of transfers. While rsync can help you recover quickly, maintaining data consistency across multiple systems becomes a tedious "data-entry gig" that can derail an otherwise productive day.

To get around the multiple file-systems issue, I've been wanting to sync our files to Google Cloud Storage (GCS) and mount the GCS bucket on all of our systems so the whole team has easy access. GCSFuse is a user-space file system for mounting Google Cloud Storage buckets (object storage) on your workstations, servers, and even GCE instances. Once installed, we can use the gcsfuse CLI to mount the bucket on the fly, or we can add an /etc/fstab entry so the bucket is mounted automatically.
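As a minimal sketch of both approaches (the bucket name "my-project-data" and mount point /mnt/gcs are placeholders; check the gcsfuse mounting docs for the options that fit your setup):

    # Mount the bucket on the fly with the gcsfuse CLI
    mkdir -p /mnt/gcs
    gcsfuse my-project-data /mnt/gcs

    # ...or add an entry like this to /etc/fstab to mount it automatically
    # (uid/gid should match the account that owns the mount point):
    #
    #   my-project-data /mnt/gcs gcsfuse rw,_netdev,allow_other,uid=1001,gid=1001

    # Unmount when finished
    fusermount -u /mnt/gcs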

An added benefit of this approach is that making data publicly available is straightforward: we just grant read access on the bucket to "allAuthenticatedUsers" or "allUsers".
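With the gsutil CLI, the bucket-level version of that grant looks something like this (a sketch; "my-project-data" is again a placeholder, and per-object ACLs are an alternative if the bucket uses fine-grained access control):

    # Grant read access on the bucket's objects to anyone on the internet
    gsutil iam ch allUsers:objectViewer gs://my-project-data

    # Or restrict read access to anyone signed in with a Google account
    gsutil iam ch allAuthenticatedUsers:objectViewer gs://my-project-data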


Working with Globus

While a universally mounted GCS bucket can, in theory, solve our data synchronization problems, we don't have the necessary permissions on every system to mount our GCS bucket where it would be needed.

Globus is a great tool for keeping your data synchronized across multiple systems. It works by exposing filesystems through "endpoints" that you can transfer files to and from. On Cheyenne, NCAR provides Globus endpoints for the /glade directories where all of our files are written. Initially, I looked into setting up a Globus endpoint for GCS buckets directly, but that type of service requires a Globus subscription. For us, a small company supporting non-profit, NSF-funded research, the subscriptions are cost prohibitive.

However, Globus's free tier allows you to set up a personal endpoint with Globus Connect Personal. By combining a free-tier personal endpoint with a GCSFuse-mounted bucket, we can manage data transfers to and from GCS with minimal effort and without the overhead of a subscription for a service we only use on a per-project basis.
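As a rough sketch of that setup on the machine hosting the GCSFuse mount (the mount point /mnt/gcs is a placeholder, and the exact prompts depend on your Globus Connect Personal version; see the Globus documentation for current instructions):

    # Download and unpack Globus Connect Personal for Linux
    wget https://downloads.globus.org/globus-connect-personal/linux/stable/globusconnectpersonal-latest.tgz
    tar xzf globusconnectpersonal-latest.tgz
    cd globusconnectpersonal-*

    # One-time setup: registers this machine as a personal endpoint
    # (you will be prompted to log in and name the endpoint)
    ./globusconnectpersonal -setup

    # Make the GCSFuse mount visible to the endpoint
    # (config-paths format: <path>,<sharing flag>,<read/write flag>)
    echo "/mnt/gcs,0,1" >> ~/.globusonline/lcp/config-paths

    # Start the endpoint
    ./globusconnectpersonal -start &

Once the endpoint is running, it appears in the Globus web interface alongside the NCAR endpoints, and transfers into and out of the GCSFuse-mounted bucket can be scheduled from there.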