Enabling High‐Performance Cloud Computing for Earth Science Modeling on Over a Thousand Cores: Application to the GEOS‐Chem Atmospheric Chemistry Model

Cloud computing platforms can facilitate the use of Earth science models by providing immediate access to fully configured software, massive computing power, and large input data sets. However, slow internode communication performance has previously discouraged the use of cloud platforms for massively parallel simulations. Here we show that recent advances in the network performance on the Amazon Web Services cloud enable efficient model simulations with over a thousand cores. The choices of Message Passing Interface library configuration and internode communication protocol are critical to this success. Application to the Goddard Earth Observing System (GEOS)‐Chem global 3‐D chemical transport model at 50‐km horizontal resolution shows efficient scaling up to at least 1,152 cores, with performance and cost comparable to the National Aeronautics and Space Administration Pleiades supercomputing cluster.


Introduction
Cloud computing platforms can provide scientists immediate access to complex Earth science models and large data sets, thus greatly facilitating scientific research and collaboration. Past work has ported a number of Earth science models to the cloud (Vance et al., 2016), including the Weather Research and Forecasting (WRF) Model (Molthan et al., 2015), the Community Earth System Model (Chen et al., 2017), the Model for Prediction Across Scales-Ocean (Coffrin et al., 2019), and the Goddard Earth Observing System (GEOS)-Chem chemical transport model. Beyond proof-of-concept demonstrations, organizations are also starting to provide operational support for models on the cloud. The National Center for Atmospheric Research is now exploiting cloud computing to better support WRF (Werner et al., 2020), and the National Oceanic and Atmospheric Administration plans to build a cloud-based, collaborative model development environment for next-generation U.S. national weather models (Cikanek et al., 2018). However, existing applications have been mainly limited to a single compute node or a small cluster with fewer than 100 cores, because high-performance computing (HPC) applications with intensive internode communication did not scale well to a large cluster on the cloud (Chang et al., 2018). Beyond massively parallel simulations, which are the focus of this paper, cloud computing offers a number of other benefits for Earth science modeling, summarized below.

Providing access to big Earth science data sets. Major cloud platforms host large public Earth science data sets through programs such as the AWS Open Data Program (https://aws.amazon.com/opendata/), the Azure Open Datasets (https://azure.microsoft.com/en-us/services/open-datasets/), and the Google Cloud Public Datasets (https://cloud.google.com/public-datasets/). For example, 70 TB of the Community Earth System Model Large Ensemble data set (Kay et al., 2015) has recently been made available via the AWS Open Data Program (de La Beaujardière et al., 2019). One hundred terabytes of Coupled Model Intercomparison Project Phase 6 data are now available in the Google Cloud Public Datasets program. The NASA Earth Observing System Data and Information System plans to move petabytes of Earth observation data to AWS cloud storage (Behnke et al., 2019; Lynnes et al., 2017; Ramachandran et al., 2018). The cloud further offers massive computing power to analyze big data. For example, the Pangeo big data framework (Robinson et al., 2019) provides a user-friendly way to process massive geoscientific data (in NetCDF or other formats) on cloud platforms or local HPC clusters, leveraging open-source Python libraries such as Xarray (Hoyer & Hamman, 2017), Dask (Rocklin, 2015), and Jupyter notebooks (Perkel, 2018).
Enabling reproducible research. Access to code and data is increasingly required as part of a scientific publication to foster reproducible research (Irving, 2016; NAS, 2019). However, it may be impractical to reproduce a research project even if its source code and data are published online, due to the difficulty of ensuring the exact same software environment on different machines and the generally slow data transfer from one institution to another. Public cloud platforms can guarantee a consistent environment and the same data source for different users, leading to a high degree of reproducibility (de Oliveira et al., 2017; Howe, 2012; Perkel, 2019). An example is MyBinder (Jupyter et al., 2018, https://mybinder.org), a free, online, cloud-based service for reproducible execution of Jupyter notebooks.
Continuous integration (CI) for scientific code. Online cloud-based CI services such as TravisCI (https://travis-ci.com/) and Azure Pipelines (Microsoft, 2019) automatically build the software source code and run the unit tests at every code commit and report any potential errors during the build or test stage. CI is a standard practice in software engineering (Duvall et al., 2007) and is also useful for scientific data analysis (Wessel et al., 2019) and HPC application development (Sampedro et al., 2018). Without CI, software developers would need to manually find and track software issues. CI services can also run in parallel with multiple software environments using the "build matrix" feature (e.g., https://docs.travis-ci.com/user/build-matrix/); this is particularly useful for ensuring that a complicated HPC codebase works with a wide variety of library versions.
Providing special hardware like GPUs (graphical processing units). Next-generation weather and climate models are likely to be based on GPUs or other special hardware instead of general-purpose CPUs (Lawrence et al., 2018). However, configuring a local GPU environment can be a challenge for scientific users. The cloud offers high-end, HPC-oriented GPU types with preconfigured GPU libraries (Amazon, 2017a), allowing users to prototype and test GPU code without upfront investment. Other special hardware types such as FPGAs (field programmable gate arrays) (Düben et al., 2015) and tensor processing units (Jouppi et al., 2017) are also available via cloud services (Amazon, 2017b;Google, 2018a).
Accelerating "embarrassingly parallel" computations. Certain problems like sensitivity study and ensemble forecasting require a large number of independent model simulations that can run on individual compute nodes ("embarrassingly parallel"). With the large resource pool on public cloud platforms, independent jobs can be executed simultaneously and finish much faster (Monajemi et al., 2019). For example, the AWS cloud has provided 40,000 compute nodes for one industrial HPC use case (Amazon, 2019g). The NSF IceCube Experiment utilizes 51,500 GPUs across three cloud vendors (Amazon, 2019b) via the HTCondor software tool (Thain et al., 2005). As long as the parallelization efficiency is near 100% (no overhead from internode communication), users can launch any numbers of compute nodes without increasing the total CPU time and total cost.
Machine learning enhanced modeling. A recent research trend is to incorporate big data and machine learning into Earth science models (Reichstein et al., 2019;Schneider et al., 2017), by having coarse-resolution global models continuously learn from observation data or high-resolution local simulations. Machine learning workflows can require massive resources for training data generation, data preprocessing, and machine learning model training. Cloud platforms provide abundant data storage and compute resources for such workflows, including the readily available Earth science data sets as potential training data, and GPUs and tensor processing units for fast training of neural networks.
"Cloud bursting" for local clusters. Local clusters can be connected to cloud platforms to handle temporary surges in user demand or to provide extra hardware types like GPUs. Such usage is called "bursting into the cloud" and is supported by popular job schedulers like Slurm (Amazon, 2019d). For example, the NASA Pleiades supercomputer is now interfaced with the AWS cloud, allowing Pleiades users to submit jobs to AWS compute services and upload files to AWS cloud storage (NASA HECC, 2019). Such cloud bursting capability can also simplify the funding and cost management issues for commercial cloud resources, as the cloud charges can be managed by the local cluster administrators and potentially be absorbed by research grant allocations.

HPC Workflow on the AWS Cloud
Here we describe the research workflow in an HPC cluster environment on the AWS cloud. Similar concepts and services apply to other cloud platforms such as Microsoft Azure and Google Cloud (Google, 2018b). We refer the reader to a step-by-step online tutorial (Zhuang, 2019a) to supplement the conceptual overview presented here.

Single-Node Workflow
For novice users, it is important to first get familiar with the single-node workflow, which is often sufficient for early-stage model testing and data analysis tasks. Figure 1a shows the single-node workflow. It involves two core services on AWS: Elastic Compute Cloud (EC2) for computation and Simple Storage Service (S3) for data storage. The user launches a virtual machine (an "EC2 instance" in AWS terminology) to perform any computation tasks (e.g., perform a GEOS-Chem simulation or run a data analysis script) and terminates the instance after finishing the computation. The EC2 instance behaves like a normal Linux server that users can log into via Secure Shell (SSH) from their terminal. The hardware aspects (CPUs, memory, and network speed) of the instance are defined by the "instance type" (Amazon, 2018a), and the software aspects (operating system, preinstalled software) are defined by an Amazon Machine Image (AMI). The EC2 instance uses Elastic Block Store (EBS) volumes as temporary disks to perform direct I/O during computation. S3 is used for persistent data storage.

Figure 1. Single-node and multinode workflows on the AWS cloud. The single-node workflow (a) is a simplified version of Figure 2 in , where the Amazon Machine Image (AMI) contains the software environment for the virtual machine (EC2 instance). The multinode workflow (b) adds the autoscaling group for compute nodes. The software environment for the master node and all compute nodes is still created from an AMI, but the creation of nodes is generally handled by a high-level framework, not manually by the user, so the AMI is not shown in the workflow (see main text in section 3.2). The master node runs a Network File System (NFS) server to allow all compute nodes to access the EBS volume. If a Lustre parallel file system is further enabled (not shown in the figure), it gives direct I/O access to all compute nodes and avoids any I/O bottlenecks from the NFS.
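To make the workflow concrete, the sketch below shows the same launch-compute-archive-terminate cycle using the boto3 Python SDK (the AWS web console or command line interface can achieve the same); the AMI ID, key pair, bucket, and file names are placeholders, not values used in this work.

```python
# Minimal sketch of the single-node workflow using the boto3 Python SDK.
# The AMI ID, SSH key pair, bucket, and file names are placeholders.
import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")
s3 = boto3.client("s3", region_name="us-east-1")

# 1. Launch a virtual machine (EC2 instance) from an AMI that contains
#    the preconfigured software environment.
instance = ec2.create_instances(
    ImageId="ami-xxxxxxxx",      # placeholder AMI with the model environment
    InstanceType="c5.4xlarge",   # hardware configuration ("instance type")
    KeyName="my-ssh-key",        # placeholder SSH key pair for login
    MinCount=1,
    MaxCount=1,
)[0]
instance.wait_until_running()
instance.reload()                # refresh metadata to obtain the public IP
print("Log in with: ssh ec2-user@" + instance.public_ip_address)

# 2. ... run the simulation or analysis on the instance via SSH ...

# 3. Archive results to S3 for persistent storage, then terminate the instance.
s3.upload_file("output/results.nc", "my-bucket", "results.nc")
instance.terminate()
```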

Multinode Workflow
A multinode cluster on the AWS cloud contains an ensemble of EC2 instances (each is a compute node), connected via SSH, allowing an MPI program to run across multiple nodes. A special master node handles administration tasks. It is also useful to install a job scheduler such as Slurm (Yoo et al., 2003) that is commonly used on local HPC clusters. The scheduler serves three purposes: (1) correctly map MPI processes (also called MPI ranks) to different compute nodes and cores, (2) manage the initialization and finalization of each simulation job, and (3) potentially integrate with the Auto Scaling capability (Amazon, 2018c) to request or terminate compute nodes as needed and thus avoid the cost of idle compute resources. Unlike job schedulers on local shared HPC clusters, the scheduler here is generally not used for resource sharing between users, because any user can create a new cluster for exclusive use.
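As a quick sanity check of the scheduler's process placement, a minimal MPI "hello world" (sketched below with mpi4py, which is assumed to be installed on the cluster; any MPI-enabled language works equally well) prints which compute node hosts each MPI rank, confirming that ranks are spread across nodes rather than packed onto one.

```python
# check_ranks.py -- minimal sketch (assumes mpi4py is installed on the cluster).
# Submitted through the scheduler (e.g., `srun python check_ranks.py`), it
# reports the compute node hosting each MPI rank.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()           # ID of this MPI process
size = comm.Get_size()           # total number of MPI processes
host = MPI.Get_processor_name()  # hostname of the node running this rank

# Gather (rank, host) pairs on rank 0 and print a summary there.
pairs = comm.gather((rank, host), root=0)
if rank == 0:
    for r, h in sorted(pairs):
        print(f"rank {r:4d} of {size} on {h}")
```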
Besides compute facilities, one needs a shared file system to host data files, software libraries, and model executables that can be accessed from all nodes. A simple choice is the Network File System (NFS) that runs on the master node and serves data to all compute nodes. The NFS architecture may cause an I/O bottleneck, and I/O-intensive models can benefit from a high-performance parallel file system such as Lustre (Schwan, 2003). The Amazon FSx for Lustre service (Amazon, 2018b) provides a fully managed Lustre file system that can be mounted to multiple compute nodes. Figure 1b illustrates the basic form of a multinode cluster, using a single EBS volume as a shared disk for all nodes, without a parallel file system. The software environment for the master node and all compute nodes is still created from an AMI, but the launch of nodes is generally handled by a high-level framework (see section 3.2), not manually by the user. The user logs into the master node via SSH and submits jobs to compute nodes via a job scheduler. The compute nodes form an Auto Scaling group (Amazon, 2018c) that automatically adjusts the number of nodes based on the jobs in the scheduler queue. After finishing the computation, the user archives important data to S3 storage and then terminates the entire cluster.
Building a multinode HPC cluster environment on the cloud used to be difficult, prompting the development of "HPC-as-a-Service" (Baun et al., 2011; Church et al., 2015; Huang, 2014; Section 2.3 of Netto et al., 2017; Wong & Goscinski, 2013), where scientists can access a preconfigured HPC environment, often through a graphical web portal (Calegari et al., 2019), with no need to understand the underlying infrastructure. However, such a black-box service has several drawbacks for research computing: (1) the available software libraries and applications are determined by the HPC service provider and are not easy to extend to custom research code, and (2) continuous maintenance by the service provider is required to keep software and hardware up to date. Recent developments in software tools have made it much easier for scientists to create their own HPC environment on the cloud for custom research needs, without requiring specific support from system administrators. On the AWS cloud, the user can now quickly launch a multinode HPC cluster using the AWS ParallelCluster framework (https://github.com/aws/aws-parallelcluster) and then easily install the desired software libraries (e.g., specific versions of compiler, MPI, and NetCDF) using the Spack HPC package manager (Gamblin et al., 2015, https://github.com/spack/spack). Appendix A reviews other approaches and tools for deploying HPC cluster infrastructure on the cloud. Appendix B shows the advantages of Spack compared to other software installation methods. If desired, it is then possible to deploy HPC-as-a-Service for other users on top of such a custom HPC environment, using auxiliary AWS functionalities (Amazon, 2019f, 2019a, 2019c) and multiuser serving frameworks like JupyterHub (Glick & Mache, 2019; Milligan, 2018; Prout et al., 2017; Sarajlic et al., 2018); this can be useful for classes and workshops where the customization of the software environment is not important.

Benchmarking Internode Communication Performance

We evaluate the internode communication performance of the cloud HPC cluster using standard microbenchmarks, such as the OSU Micro-Benchmarks (OMB) and the Intel MPI Benchmarks (https://github.com/intel/mpi-benchmarks). Compared to actual model benchmarks, microbenchmarks provide an application-agnostic assessment of system performance, help attribute actual model performance issues to specific hardware components (e.g., very slow networks), and catch potential hardware/system problems at an early stage. For example, we were able to catch a critical performance issue regarding OpenMPI and EFA on the AWS cloud using microbenchmarks (reported at https://github.com/aws/aws-parallelcluster/issues/1143). If we had skipped the microbenchmarks and directly tested the actual model, such a performance problem would have been much harder to identify.
Early microbenchmark results for cloud platforms have largely become obsolete (e.g., Evangelinos & Hill, 2008;Iosup et al., 2011;Jackson et al., 2010;Sadooghi et al., 2015), due to rapid hardware innovations. Breuer et al. (2019) conducted an up-to-date, systematic study of the CPU, memory, and network performance of the latest AWS EC2 "c5n.18xlarge" instance, and concluded that the CPU performance is "within 95% of the efficiency of the bare-metal system," and the memory performance also "behaves similar to the bare-metal machine." This is in part due to the extremely low overhead of the AWS Nitro Hypervisor (Gregg, 2017). However, Breuer et al. (2019) did not test AWS EFA, which should further improve network performance. Here we do not repeat the microbenchmarks on CPU and memory and instead only focus on internode communication, which is known to be the most important difference between cloud platforms and local HPC clusters.
We compare the performance of the AWS HPC cluster to that of the NASA Pleiades supercomputer, here with OMB as a generic internode communication benchmark and in the next section (section 5) with application to GEOS-Chem. On AWS, we use the largest "c5n.18xlarge" EC2 instance type (36 physical cores). We use Spack to install the Intel compiler, OpenMPI (Gabriel et al., 2004), and NetCDF-Fortran on top of the AWS ParallelCluster environment. The cluster also contains a preinstalled Intel-MPI library optimized for AWS EFA. On Pleiades, we use the latest available software modules. Table 1 summarizes the hardware and software.

Network Performance Factors
We measure the internode network performance with OMB Version 5.6.1, originally described by Liu et al. (2003) and available for download at MVAPICH (http://mvapich.cse.ohio-state.edu/benchmarks/). Network performance includes two factors: bandwidth and latency (Chapter 4.5.1 of Hager & Wellein, 2010). The total time T for passing a message can be computed as T = T_l + N/B, where T_l is the latency (μs), B is the bandwidth (MB/s), and N is the message size (bytes). Latency indicates the initial connection overhead, and bandwidth measures the sustained data transfer rate after the connection is established. Transferring small messages is latency limited (T_l ≫ N/B), while transferring large messages is bandwidth limited (T_l ≪ N/B). The actual size of messages is application dependent and can be measured by MPI profiling tools, as detailed in section 5.3. In some literature, as well as in the OMB code, "latency" may also refer to the total time to send messages (T) instead of just the initial overhead (T_l). Here we strictly use the term "latency" for T_l and "time to send messages" for T.

The network latency and bandwidth are determined by both network hardware (physical fabric and network adaptor) and software (network driver and communication protocol). On the hardware side, high-end supercomputing clusters typically use InfiniBand (Grun, 2010), while low-end clusters and most cloud platforms use Ethernet (Chapter 3 of Gavrilovska, 2009). On the software side, InfiniBand clusters can communicate via Remote Direct Memory Access (RDMA), an HPC communication technology with ultralow overhead and submicrosecond latency, while Ethernet clusters generally rely on the Transmission Control Protocol (TCP), a general-purpose network protocol with much higher overhead and longer latency. The technical differences between TCP and RDMA are further explained in Appendix C. TCP is known to be inefficient for HPC applications, and various non-TCP mechanisms have been developed to improve the communication performance even on Ethernet clusters (Gavrilovska, 2009), such as RDMA over Converged Ethernet (RoCE; Guo et al., 2016). Although AWS EC2 does not yet support RDMA, the newly introduced AWS EFA provides a high-performance, non-TCP communication mechanism supporting some RDMA-like features (Appendix C). EFA uses the open-source Libfabric library (Grun et al., 2015) to interface between the low-level AWS network protocols and the high-level MPI libraries. Because the change is made at the MPI library level, there is no need to modify the application code (e.g., the GEOS-Chem model code in this work) in order to utilize EFA. When conducting this work, we found that only the Intel-MPI library could efficiently utilize EFA; the OpenMPI library still had to use TCP for internode communication.

Figure 2 shows the OSU point-to-point, one-pair latency and bandwidth benchmark results (the "osu_latency" and "osu_bw" programs in the OMB code). The benchmarks involve two compute nodes passing messages to each other ("point-to-point"), with only one MPI process running on each node ("one-pair"). The time and bandwidth are shown as a function of message size, following the standard practice for visualizing network performance (Liu et al., 2003). Latency is the asymptote for small messages in Figure 2a. Between two EC2 c5n.18xlarge instances, the latency is about 30 μs with TCP and 16 μs with EFA. The NASA Pleiades cluster with InfiniBand network has a latency of 2 μs with the native RDMA mechanism.
To demonstrate the important effect of communication protocol (in addition to the network hardware), we also force TCP over InfiniBand on Pleiades (disabling the native RDMA capability following Section 9.1.4 of Yelick et al., 2011) and find a much higher latency of 11 μs.

Network Benchmark Results
For bandwidth, we focus on the maximum bandwidth achieved by relatively large messages (right limit in Figure 2b). The EC2 c5n.18xlarge bandwidth with EFA is about 8,000 MB/s, 7 times higher than with TCP (1,200 MB/s), and exceeds the bandwidth of the NASA Pleiades InfiniBand (5,000 MB/s). The bandwidth measured here is the "unidirectional bandwidth," as messages flow in only one direction. We also measure the "bidirectional bandwidth" (the "osu_bibw" program in the OMB code), which allows two MPI processes to simultaneously send messages to each other (not plotted). The bidirectional bandwidth on EC2 is the same as the unidirectional value, while the bidirectional bandwidth on Pleiades doubles its unidirectional bandwidth, as expected for an InfiniBand interconnect (Figure 4 in Lockwood et al., 2014).

Figure 2. OSU point-to-point, one-pair latency (a) and bandwidth (b) between two compute nodes, as a function of message size. Latency is the asymptote for very small messages, that is, the left limit of panel (a). The maximum bandwidth is achieved by relatively large messages, that is, the right limit of panel (b).
One would expect the 100-Gigabit Ethernet on the EC2 c5n.18xlarge to deliver 12.5 GB/s of bandwidth (8 gigabits = 1 gigabyte). However, TCP only delivers one tenth of the theoretical bandwidth, because data transfer with TCP is limited by the CPU processing power rather than by the network speed (Appendix C). The full bandwidth can be obtained by a multipair benchmark (the "osu_mbw_mr" program in the OMB code), which uses many MPI processes to simultaneously pass messages between two compute nodes and thus makes use of all CPU cores (detailed instructions in Zhuang, 2019b). EFA and RDMA have much smaller CPU overhead and can achieve near-full bandwidth with only one pair of MPI processes.
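To illustrate how latency and bandwidth combine, the back-of-the-envelope script below applies T = T_l + N/B to the measured values quoted above (rounded); it is an illustrative sketch, not part of the benchmark suite.

```python
# Back-of-the-envelope estimate of message-passing time T = T_l + N/B,
# using one-pair latency and maximum-bandwidth values quoted in the text.
configs = {
    # name:                  (latency T_l in seconds, bandwidth B in bytes/s)
    "EC2 c5n.18xlarge, TCP": (30e-6, 1.2e9),
    "EC2 c5n.18xlarge, EFA": (16e-6, 8.0e9),
    "Pleiades InfiniBand":   (2e-6,  5.0e9),
}

for n_bytes in (8 * 1024, 8 * 1024 * 1024):  # an 8-KB and an 8-MB message
    print(f"\nMessage size: {n_bytes / 1024:.0f} KB")
    for name, (t_l, bw) in configs.items():
        t = t_l + n_bytes / bw  # total time in seconds
        print(f"  {name:24s}: {t * 1e6:10.1f} microseconds")
# Small messages are dominated by latency; large messages by bandwidth.
```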
Overall, the current network performance on AWS EC2 shows 100 times improvement in bandwidth and 5 times improvement in latency over the old EC2 instance types with only 80-MB/s bandwidth and 80-μs latency (Evangelinos & Hill, 2008;Hill & Humphrey, 2009). Such improvement comes from the continuous updates in the AWS networking technology over the past decade, such as the support of Cluster Placement Group (to ensure full bandwidth between nodes; Amazon, 2010), Enhanced Networking (also known as SR-IOV, Lockwood et al., 2014;Chapter 13.3.2 of Foster & Gannon, 2017), and most recently EFA.
We also benchmark collective MPI functions (often called "collectives") that perform complex message transfer across many compute nodes. We find that the choice of MPI libraries and communication protocols has a major impact on the performance of MPI collectives, especially at large core counts. Figure 3 shows the performance of "MPI_Bcast," one of the most frequently used collective functions (Table 3 of Parker et al., 2018). For example, GEOS-Chem uses "MPI_Bcast" to send input data from the master MPI process to all MPI processes (shown in section 5.3). For small messages (left limit of Figure 3a), NASA Pleiades InfiniBand is 6 times faster than EC2 with TCP (either with Intel-MPI or OpenMPI) and 3 times faster than EC2 with EFA. This performance difference between EC2 and Pleiades is mainly due to the higher latency on EC2, which mostly affects small messages (Figure 2a). For large messages (right limit of Figure 3a), EC2 with Intel-MPI (either with TCP or EFA) is as fast as Pleiades InfiniBand, while EC2 with OpenMPI is 4 times slower than Pleiades. Broadcasting with OpenMPI does not scale well with the number of cores, compared to the excellent scalability of Intel-MPI (Figure 3b). We also compare the "MPI_Allreduce" function, with a similar finding that OpenMPI is 5 times slower than Intel-MPI at 1,152 cores. This performance difference between OpenMPI and Intel-MPI is mainly due to their underlying collective algorithms (Chan & Heimlich, 2007; Patarasuk & Yuan, 2009; Pješivac-Grbović et al., 2007).

Application to GEOS-Chem

In the following sections we apply these findings to the GEOS-Chem model. Its high-performance, MPI-based configuration consists of (1) the "column operators" (Long et al., 2015) for local or column-based computations such as chemical kinetics and convection, (2) the 3-D advection component using the GFDL-FV3 model (Putman & Lin, 2007), and (3) an infrastructure layer (the Earth System Modeling Framework, ESMF, and the MAPL library) that handles the coupling between components, the internode MPI communication, and the I/O.

Performance and Scalability
We benchmark a 7-day global simulation at C180 cubed-sphere horizontal resolution (≈50 km) and 72 vertical layers. The model reads 150 GB of input data during the simulation and writes the global 3-D daily concentration fields of 163 chemical species, resulting in 60 GB of output data (8.6 GB per day × 7 days). All data are hosted on a single throughput-optimized hard disk drive (HDD, the "st1" EBS volume type) with a theoretical throughput of 500 MB/s, mounted to all compute nodes via NFS. We do not enable the AWS FSx for Lustre service for parallel I/O, because the model version used here mostly relies on serial I/O.

Figure 4 shows the benchmark results from 144 to 1,152 cores. We achieve good performance and scalability on AWS EC2 with Intel-MPI and EFA, consistently faster than on the NASA Pleiades cluster at the same number of cores (Figure 4a). This is in part due to the newer processor generation (Intel Skylake) available on EC2, as opposed to the older generation on Pleiades (Intel Haswell). With OpenMPI and TCP, however, the model cannot scale beyond 576 cores, due to a major slowdown in reading input data (Figure 4c) and a minor slowdown in advection (Figure 4b). When the model reads input data, a single master MPI process handles most of the disk read operation and broadcasts the data to the rest of the MPI processes. While the disk operation time stays roughly constant, the broadcasting takes longer with more cores, as each core runs an MPI process that has to receive a copy of the data. The exact same I/O problem was also found in early results with the Community Atmosphere Model on the AWS cloud (Section 9.1.4 of Yelick et al., 2011). Writing output data is fast compared to input, as each core can write its own domain without data exchange. The slow performance of "MPI_Bcast" with OpenMPI (section 4.2 and Figure 3) is the major cause of the I/O bottleneck here, as further verified in the next section (section 5.3). Importantly, we find that serial I/O does not limit performance up to at least 1,152 cores when using Intel-MPI and EFA (Figure 4).

Figure 4. GEOS-Chem benchmark results on the AWS EC2 cluster and the NASA Pleiades cluster (see Table 1). The AWS cluster uses EC2 c5n.18xlarge instances with either Intel-MPI and EFA or OpenMPI and TCP for internode communication. Gray dashed lines indicate perfect scaling. Note difference in ordinate scales between panels.
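The serial-read-then-broadcast pattern described above can be sketched as follows (using mpi4py and NumPy purely for illustration; this is not the actual GEOS-Chem I/O code, and the file name and size are placeholders): the disk read happens once on the master rank, whereas every rank must receive a full copy through "MPI_Bcast," which is why broadcast cost, not disk cost, grows with core count.

```python
# Illustrative sketch (mpi4py + NumPy), NOT the actual GEOS-Chem I/O code:
# the master rank reads an input field from disk and broadcasts it to all
# other ranks, mimicking the serial-I/O pattern described in the text.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

nbytes = 100 * 1024 * 1024              # pretend 100-MB input field
field = np.empty(nbytes, dtype=np.uint8)

t0 = MPI.Wtime()
if rank == 0:
    # Disk read happens only on the master rank (roughly constant cost).
    field[:] = np.fromfile("input_field.bin", dtype=np.uint8, count=nbytes)
t1 = MPI.Wtime()

# Every rank must receive a full copy; this is the MPI_Bcast whose cost
# grows with core count when the broadcast algorithm scales poorly.
comm.Bcast(field, root=0)
t2 = MPI.Wtime()

if rank == 0:
    print(f"disk read: {t1 - t0:.2f} s, broadcast: {t2 - t1:.2f} s")
```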
While our benchmark focused on the model simulation time, the job queuing time is also relevant for a real research workflow, especially for early-stage experiments that require frequent debugging and resubmission of jobs. On AWS EC2, we were able to request a thousand-core cluster in less than 5 min; on the NASA Pleiades cluster, the queuing time to start a thousand-core job ranged from an hour to a day, depending on the cluster utilization.

MPI Profiling
The weaker performance of OpenMPI compared to Intel-MPI is worth investigating further, because OpenMPI otherwise has the advantage of being popular and open-source (Intel-MPI is free to use but closed-source). We profile the MPI function calls in GEOS-Chem with the Integrated Performance Monitoring (IPM) tool. It is sufficient to use the lightweight IPM profiler here, instead of more powerful HPC profilers such as TAU (Shende & Malony, 2006) and HPCToolkit (Adhianto et al., 2010), which can capture all function calls. Here we are only interested in MPI function calls (a tiny subset of all function calls), because any potential network performance issue on AWS EC2 will slow down MPI.

Figure 5 shows the results of IPM profiling applied to the GEOS-Chem simulations presented in section 5.2. The most time-consuming functions are "MPI_Barrier" (all MPI processes need to wait for the slowest process) and "MPI_Bcast" (after reading data, the master process broadcasts the data to all processes), followed by "MPI_Wait" (waiting for asynchronous functions like "MPI_Isend" to finish) and "MPI_Allreduce" (computing the global masses of chemical species to check mass conservation). With more cores, the time spent in "MPI_Bcast" grows rapidly when using OpenMPI but stays constant when using Intel-MPI and EFA (Figure 5b). This is consistent with the "MPI_Bcast" microbenchmarks in section 4.2 and Figure 3. The idling time in "MPI_Barrier" is relatively insensitive to the MPI configuration and is not specific to the cloud. It reflects a load imbalance in the parallelization, and further investigation shows that it is caused by slow integration of the chemical kinetics near sunrise/sunset (Appendix D).

We found MPI profiling to be a critical step in the early stage of this work for identifying and resolving the model performance bottleneck. Initially, we only built the model with OpenMPI 3.1.4, which is the standard library configuration used for the official GEOS-Chem benchmarks conducted on Harvard's local cluster (GEOS-Chem, 2019). However, this software configuration does not work efficiently in the AWS cloud environment. By profiling with IPM, we were able to narrow down the performance issue to a single "MPI_Bcast" function and then focus on improving "MPI_Bcast" performance using the OSU microbenchmarks with different MPI configurations, leading to success when using Intel-MPI with EFA. Without such further investigation, we might have erroneously concluded that the AWS cloud is inefficient for HPC, following the body of literature cited in section 1.

Cost Analysis
AWS EC2 charges per computing second and per unit of storage. For our application, the cost of storage is small compared to the cost of computing. There are several pricing options for computing, including the standard "on-demand" pricing and the cheaper "spot" pricing (about a factor of 3 cheaper). The difference between the two is that spot instances may be interrupted, but this happens rarely (Amazon, 2018d). Here we used spot pricing and did not experience any interruptions. The spot price varies depending on time and region. When conducting this work in the AWS "us-east-1" region, one EC2 "c5n.18xlarge" node (36 cores) cost 1.17 USD per hour. The cost would have been 0.72 USD per hour in the "us-east-2" region. To compare these costs to the NASA Pleiades cluster, we use the Standard Billing Unit cost model to estimate the true hourly cost, following the calculations in Chang et al. (2018). One Pleiades Haswell node (24 cores) costs 0.53 USD per hour in this cost model.

Figure 6 shows the cost versus the completion time of the 7-day global GEOS-Chem simulation at cubed-sphere C180 (≈50 km) resolution, for the AWS EC2 and the NASA Pleiades clusters. The AWS EC2 time and cost are based on Intel-MPI with EFA and assume "us-east-1" spot pricing. There is a trade-off between time and cost: using more CPU cores reduces the time to finish the simulation but increases the total CPU hours (and thus the cost). Even though Figure 4 shows a relatively successful parallelization of GEOS-Chem on both AWS EC2 and Pleiades in going from 144 to 1,152 cores, there is still a 42% departure from perfect scaling with AWS EC2 and 35% with Pleiades, as indicated by the increase in total cost (perfect scaling means zero increase in total CPU hours and cost). If the turnaround time required for simulation completion can be longer than 5 hr, then AWS EC2 is cheaper than Pleiades; if it must be less than 5 hr, then Pleiades is cheaper than AWS EC2. If the amount of money available for the simulation is less than 50 USD, then AWS EC2 is the faster choice; if it is more than 50 USD, then Pleiades is the faster choice. Obviously, the trade-offs will depend on the model configuration and resolution. Overall, the cost difference between AWS EC2 and Pleiades is within 20% for the same turnaround time, for our application and cost model. For other applications, the cost comparison could be either in favor of the cloud or of local clusters, depending on the use case, the computational efficiency, and the choice of cost model (Section 2.1 of Netto et al., 2017; Roloff et al., 2017; Siuta et al., 2016; Thackston & Fortenberry, 2015). Our results contrast with some previous studies concluding that the AWS cloud was not cost-efficient compared to local HPC clusters (e.g., Chang et al., 2018; Emeras et al., 2017), partly because those studies did not take advantage of recent advances in cloud networking.

Figure 6. Cost of a 7-day global GEOS-Chem simulation of tropospheric-stratospheric chemistry at cubed-sphere C180 (≈50 km) resolution. The time required to finish the simulation (x axis) varies with the number of cores (indicated as text next to the data point), which in turn affects the total cost because the scalability is less than 100%. The cost of the AWS EC2 cluster is compared to that of the NASA Pleiades cluster. The AWS EC2 time and cost are based on Intel-MPI and EFA performance and assume "us-east-1" spot pricing. The NASA Pleiades cost is based on the Standard Billing Unit (SBU) model.
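As a concrete illustration of the time-cost trade-off, the sketch below turns the per-node-hour prices quoted above into a per-simulation cost; the wall-clock times are placeholders to be taken from a benchmark such as Figure 4, not results of this paper.

```python
# Illustrative cost arithmetic for one simulation, using the per-node-hour
# prices quoted in the text. The wall-clock times below are placeholders;
# in practice they come from a benchmark such as Figure 4.
def cluster_cost(n_cores, wall_hours, cores_per_node, price_per_node_hour):
    """Total cost (USD) = number of nodes x wall-clock hours x node-hour price."""
    n_nodes = n_cores / cores_per_node
    return n_nodes * wall_hours * price_per_node_hour

# AWS EC2 c5n.18xlarge: 36 cores/node, 1.17 USD/hr spot price ("us-east-1").
# NASA Pleiades Haswell: 24 cores/node, 0.53 USD/hr (SBU-based estimate).
example_runs = [
    ("AWS EC2, 288 cores",  288, 4.0, 36, 1.17),  # placeholder wall time (hr)
    ("Pleiades, 288 cores", 288, 5.0, 24, 0.53),  # placeholder wall time (hr)
]
for name, cores, hours, cpn, price in example_runs:
    print(f"{name}: {cluster_cost(cores, hours, cpn, price):7.2f} USD")
```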

Conclusions
We have shown that the AWS cloud can be used for massively parallel Earth science model simulations with performance comparable to local supercomputing clusters. This is enabled by recent improvements in the cloud internode network performance. We present a complete research workflow in a custom HPC environment on the AWS cloud, using the AWS ParallelCluster framework and the Spack HPC package manager. The workflow can be easily applied to any Earth science model. Online instructions (see the Acknowledgments section) and tutorials (Zhuang, 2019a) are available for reproducing the workflow. Making models readily available in an HPC cloud environment can save researchers considerable time that would otherwise be spent in configuring complicated software and downloading input data.
We demonstrated the HPC-on-cloud capability with the GEOS-Chem global 3-D chemical transport model and showed that it scales efficiently up to at least 1,152 cores on the AWS cloud, with performance and cost comparable to the NASA Pleiades supercomputer. This is in sharp contrast to previous studies finding that cloud platforms were much slower and more costly than local HPC clusters. The AWS cloud has recently become much better suited for large-scale HPC applications, owing to various hardware and software improvements, including the new "c5n.18xlarge" EC2 instance type with much higher network bandwidth, the new EFA for more efficient internode communication, and the fast MPI collective algorithms in the latest Intel-MPI library. We demonstrated these improvements in internode communication performance using microbenchmarks applicable to general HPC applications and with MPI profiling applied to the GEOS-Chem model.
An open question is how to fund researchers for utilizing commercial cloud resources. Researchers are used to having their computing costs subsidized through their institution or covered through research grants. For this work, we were able to obtain free credits from the AWS Cloud Credit for Research Program. In the long term, however, there must be coordination between funding agencies, cloud vendors, and universities to build a more cloud-friendly funding procedure. For example, the NSF Cloud Access Program (NSF, 2019) aims to support research and education on cloud computing platforms. For Earth science specifically, "Jupyter meets the Earth" (Pérez et al., 2019) is a recent NSF-funded project that focuses on geoscientific software development for cloud platforms and HPC centers.

Appendix A: Approaches to Deploy HPC Cluster Infrastructure on the Cloud
To deploy the cluster architecture described in section 3.2, there are two main approaches: bottom-up and top-down. In the bottom-up approach, the user launches individual resources (EC2 instances and EBS volumes) and manually glues them together. For connecting EC2 instances, this just means copying the SSH key to each instance so that the instances can access each other. A step-by-step tutorial is given by Lockwood (2013). This is easy to do even for a novice AWS user. Other tasks, such as deploying an auto-scaling-enabled job scheduler and configuring a shared file system, involve more complicated steps. This bottom-up deployment can be done by manually clicking through the AWS web console (Appendix A of Jung et al., 2017) or be partly automated by custom scripts (Chapter 9 of Yelick et al., 2011; Chang et al., 2018) based on the AWS Command Line Interface.
A better approach is top-down, or "Infrastructure as Code" (IaC) in cloud computing terminology (Chapter 1 of Brikman, 2018; Morris, 2016; Chapter 4 of Wittig & Wittig, 2018). Unlike the bottom-up approach that specifies the exact operations to build a complicated cluster infrastructure (a "procedural" approach), here the user simply describes the desired architecture in a text file (a "declarative" approach). The text file, often written in JSON or YAML format, contains the complete details of the cluster, including the EC2 instance type for master and compute nodes, the number of compute nodes, the size and type of EBS volumes, how the volumes are mounted to the instances, and the network configurations between the instances. This description file is then digested by a cloud management service (CloudFormation on AWS, Azure Resource Manager on Azure, and Deployment Manager on Google Cloud) to produce the full architecture without human intervention. A key difference from the bottom-up approach is that IaC handles intercomponent dependencies much more robustly. In the bottom-up approach, one AWS Command Line Interface command typically only handles one component ("launching an EC2 instance," "creating an EBS volume," "mounting the volume to the instance," etc.), but there can be complicated dependencies between each instruction. For example, before mounting an EBS volume to an EC2 instance, one needs to validate that both the instance and the volume are running correctly. However, executing a single EC2 launch command does not guarantee a ready-to-use instance, due to the warm-up time of the instance and even potential errors during instance creation. IaC frameworks ensure that all prerequisites are met before executing the next stage. When deleting the cluster after the model computation is done, IaC also helps resolve the dependency problem in the reverse order.
Compared to the bottom-up approach of manually assembling a cluster, the top-down IaC approach has important advantages: The cluster architecture and deployment steps are self-documented by the description file, making it easy to reproduce the same architecture and share it with other people. The cluster architecture as a text file can even be version-controlled by standard version control software like Git. The manual approach, in contrast, is vulnerable to human errors and unclear documentation. For frequent users, the manual deployment task will become repetitive and time-consuming, while the IaC approach only involves executing a single command on the existing configuration file. The IaC approach also scales better. For example, suppose that one wishes to increase the number of compute nodes from 2 to 16. The top-down approach only involves modifying a single number in the configuration file and using a single command to update the cluster; the bottom-up approach requires manually launching 14 more instances and connecting them to the existing cluster, a complicated and error-prone process. Due to those advantages, IaC has become the standard practice in modern IT management (Morris, 2016). Thus, we advise scientific users to adopt the top-down approach for production deployment. The bottom-up approach is still a useful technical exercise for novice AWS users to better understand the cluster infrastructure (Lockwood, 2013).
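The snippet below illustrates the declarative pattern with the boto3 SDK: a single description file creates the whole stack, and scaling the cluster is a one-parameter update. This is only a sketch of the underlying CloudFormation calls; AWS ParallelCluster wraps these steps behind a simpler interface, and the template file and parameter names are placeholders.

```python
# Illustrative Infrastructure-as-Code sketch with the boto3 SDK.
# The template file and the "ComputeNodeCount" parameter are placeholders.
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

with open("hpc_cluster_template.yaml") as f:  # declarative cluster description
    template_body = f.read()

# Create the whole cluster (nodes, volumes, network) from the description.
cfn.create_stack(
    StackName="my-hpc-cluster",
    TemplateBody=template_body,
    Parameters=[{"ParameterKey": "ComputeNodeCount", "ParameterValue": "2"}],
    Capabilities=["CAPABILITY_IAM"],  # the stack creates IAM roles
)
cfn.get_waiter("stack_create_complete").wait(StackName="my-hpc-cluster")

# Scaling from 2 to 16 compute nodes is a one-parameter update, not a
# manual launch of 14 more instances.
cfn.update_stack(
    StackName="my-hpc-cluster",
    UsePreviousTemplate=True,
    Parameters=[{"ParameterKey": "ComputeNodeCount", "ParameterValue": "16"}],
    Capabilities=["CAPABILITY_IAM"],
)
```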
Although it is possible to create an HPC cluster on AWS using the standard AWS CloudFormation IaC service, scientific users should use higher-level IaC tools that are designed specifically for managing HPC clusters. Those cloud-HPC tools are built on top of general IaC frameworks like CloudFormation but have a simpler interface and a smoother learning curve. An example is AWS ParallelCluster, an open-source HPC cluster management tool developed and supported by AWS. It is the successor of the CfnCluster tool used in previous literature (Freniere et al., 2016;Madhyastha et al., 2017). There are other third-party tools with different strengths that might be of interest to scientists. AlcesFlight (http://docs.alces-flight.com) provides a comprehensive, scientist-friendly documentation and supports both AWS and Azure cloud; Slurm-GCP (https://github.com/SchedMD/slurm-gcp) provides an easy way to launch Slurm-enabled clusters on Google Cloud; ElasticCluster (https://github.com/elasticluster/elasticluster) is designed to support multiple cloud vendors; Ronin (https://ronin.cloud/) provides a user-friendly web portal on top of command-line-based tools like AWS ParallelCluster. StarCluster (http://star.mit.edu/cluster) was once a popular cloud-HPC tool and still used in recent literature (Chen et al., 2017), but we do not recommend it due to the stalled development and the lack of new feature support (e.g., new instance types and networking capabilities). We recommend AWS ParallelCluster for the AWS cloud environment, because it supports the latest AWS HPC features (e.g., FSx for Lustre and EFA) that are not available in other tools.
In practice, the cluster involves additional AWS concepts and services that are not shown in Figure 1b, including virtual private cloud and subnet for network address configuration; EC2 Security Groups for network permission; EC2 Placement Groups for network topology configuration; Identity and Access Management for cross-service permission; and messaging services (Amazon SNS, SQS) to monitor the cluster state. When using a high-level framework like AWS ParallelCluster, these additional components can be handled automatically.
Appendix B: Approaches to Install HPC Software Libraries on the Cloud

After the cluster infrastructure is deployed, the next step is to install software libraries such as MPI and NetCDF. Building scientific software libraries is notoriously difficult, due to complicated interlibrary dependencies (Dubais et al., 2003). Traditionally, users need to choose between two extreme ways of software installation: using default package managers or building from source code. Most operating systems contain a default package manager (apt for Debian/Ubuntu and yum for RedHat/CentOS) that can install prebuilt binary files ("binaries") in a few seconds and ensure correct interlibrary dependencies. However, the available prebuilt binaries are limited to very few versions and compile settings, while HPC codebases with legacy software modules often require specific library versions to compile correctly (e.g., the case with the GEOS model in Chang et al., 2018). Furthermore, precompiled binaries are built for general environments and are not optimized for a particular computer system and compiler. Therefore, HPC model developers generally have to build each library from its source code and manually handle interlibrary dependencies, a slow, laborious, and error-prone process.
To solve this problem, we use the Spack supercomputing package manager (Gamblin et al., 2015) to install the required libraries on the AWS cloud. Spack was originally developed at the Lawrence Livermore National Laboratory and is now used in many HPC centers. To install a particular library, Spack first builds a directed acyclic graph (DAG) representing the interlibrary dependencies and solves the DAG bottom-up to ensure that all dependencies are met. This DAG approach is similar to how default package managers (apt/yum) resolve dependencies. Unlike default package managers, however, Spack builds libraries from source code instead of downloading precompiled binaries. This allows users to compile arbitrary software versions and even insert custom compile flags to optimize software performance. Compiling from source code takes a relatively long time, but the build process for the entire DAG is automated and does not require human intervention. Other HPC-oriented package managers like EasyBuild (Geimer et al., 2014) and OpenHPC (Schulz et al., 2016) serve a similar purpose as Spack.
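As a toy illustration of the DAG idea (not Spack's actual implementation), the script below topologically sorts a small, hypothetical dependency graph for a NetCDF-Fortran stack, producing the bottom-up build order that a package manager would follow.

```python
# Toy illustration of dependency resolution as a DAG (not Spack's actual code).
# Each library lists its direct dependencies; a depth-first topological sort
# yields a bottom-up build order in which every dependency is built first.
deps = {
    "netcdf-fortran": ["netcdf-c"],
    "netcdf-c": ["hdf5"],
    "hdf5": ["zlib", "mpi"],
    "zlib": [],
    "mpi": [],
}

def build_order(target, deps, done=None, order=None):
    done = set() if done is None else done
    order = [] if order is None else order
    for dep in deps[target]:
        if dep not in done:
            build_order(dep, deps, done, order)
    if target not in done:
        done.add(target)
        order.append(target)
    return order

print(build_order("netcdf-fortran", deps))
# ['zlib', 'mpi', 'hdf5', 'netcdf-c', 'netcdf-fortran']
```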
Spack provides advanced software management utilities (McLay et al., 2011) beyond library installation. It allows multiple environments to coexist and provides an easy way to switch between them (via the "spack env" command). Such utility is particularly valuable for model development that often requires testing many combinations of libraries. Spack has been chosen by U.S. DOE's Exascale Computing Project to deliver the entire Exascale Computing Project software stack (Heroux et al., 2019), which contains many cutting-edge HPC development tools that can be useful for Earth science model developers.
After the software libraries and models are correctly built on the cloud, other model users can simply reuse the environment and do not need to repeat the installation. On a single EC2 instance, the environment can be shared through an AMI. For a multinode cloud cluster, the most straightforward way is to archive the entire Spack software directory in persistent storage such as AWS S3 and let users pull the directory onto their clusters' shared disks to replicate the same software environment. An alternative way to share prebuilt models is to modify the compute node's AMI to include custom software, but this is discouraged by the AWS ParallelCluster official documentation (https://docs.aws.amazon.com/parallelcluster/latest/ug/tutorials_02_ami_customization.html), because "after you build your own AMI, you no longer receive updates or bug fixes with future releases of AWS ParallelCluster."

Containers like Docker, Singularity (Kurtzer et al., 2017), and CharlieCloud (Priedhorsky et al., 2017) are additional ways to simplify software deployment. Spack and containers serve different purposes and may be used together. Containers help the software delivery process once the software has been correctly installed, while Spack helps the installation process itself. To create a container image, one can use any of the approaches reviewed above: pull prebuilt binaries, build from source manually, or use Spack. The advantage of containers is the guarantee of the exact same software environment on different systems. In the native noncontainer environment, in contrast, even if the user installs exactly the same software versions with Spack, there is still a chance for operating-system-specific errors to occur, especially for legacy software modules that are not extensively tested on multiple platforms. Despite some pioneering examples of using containers within multinode MPI environments (Younge et al., 2017; Zhang et al., 2017), such usage can still be challenging in practice, as stated in the Singularity container documentation (https://sylabs.io/guides/3.3/user-guide/mpi.html): "the MPI in the container must be compatible with the version of MPI available on the host" and "the configuration of the MPI implementation in the container must be configured for optimal use of the hardware if performance is critical." In this work, we use the native system without containers. The Spack installation will be portable to a different cloud platform as long as the same operating system image (CentOS 7 in this work) is used.

Appendix C: Details of Network Protocols for Internode Communication
TCP's large overhead mainly comes from excessive data copying (Chapter 7.1 of Donahoo & Calvert, 2009). One would expect that sending data from one compute node to another would require a single copying operation from source memory to destination memory. Although RDMA does offer such direct-copy capability, TCP requires multiple copying operations under the hood, as illustrated in Figure C1. First, the data residing in the user application's memory are copied to the TCP send buffer residing in the kernel space. Next, the kernel sends the data over the network via the Network Interface Card. The destination node then receives the data from the network and puts them into the TCP receive buffer. Finally, the data are copied from the receive buffer to the actual program's memory. In fact, even more copying operations can happen inside the network protocol stack (Mellanox Technologies, 2014). Those copying operations degrade the performance of MPI programs in three different ways (Chapters 2 and 3 of Gavrilovska, 2009):

1. Copying from one memory location to another is limited by memory bandwidth. In the early days, memory bandwidth was much higher than network bandwidth, so such extra memory copying did not incur much performance penalty. However, modern networks have bandwidth comparable to memory, so extra memory copying can be a communication bottleneck.

2. The CPU is responsible for the copying operation and protocol processing (e.g., the "checksum" to verify message integrity), and during this time the CPU cannot perform useful numerical computations. The CPU processing speed can even become the bottleneck for very fast message transfer. This explains why a single pair of MPI processes (which only utilizes a single pair of CPU cores) cannot drive the full 12.5-GB/s bandwidth on c5n.18xlarge, and why a much higher bandwidth is achieved by adding more MPI pairs (which utilize more CPU cores), as shown in section 4.2.

3. Copying from user space to kernel space requires a Linux "system call," which further adds latency and disturbs the user program. A system call switches a Linux program between "user mode" and "kernel mode." A program typically runs in the user mode with limited permissions; accessing external devices (e.g., disk I/O or network communication) requires the kernel mode with higher privileges.
In contrast, InfiniBand RDMA offers low-overhead communication features. Its three main features, each addressing one of the above TCP limitations, are (Chapter 8 of Gavrilovska, 2009):

1. Zero copying. The data in the user program are directly sent over the network and directly placed in the receiver program's memory. This removes the memory copy overhead present in TCP.

2. Protocol offloading. The protocol processing workload is offloaded from the CPU to the network device (RNIC in Figure C1b). This allows the CPU to do more useful numerical computations (Squyres, 2009).

3. Kernel bypassing. The communication does not invoke the operating system kernel, so the program can keep running in user mode without being disturbed. This is also called operating system bypassing or user-space networking (Squyres, 2015).
AWS EFA supports some RDMA-like features such as kernel bypassing, to reduce CPU overhead and increase network performance.

Appendix D: MPI Barrier Caused by Load Imbalance in the Chemical Integrator
The MPI profiling in section 5.3 reveals a GEOS-Chem performance bottleneck with the "MPI_Barrier" function (Figure 5). This bottleneck is not caused by internode communication or by any cloud-platform-specific limitation. It comes instead from the load imbalance of the chemical kinetics solver (the Rosenbrock algorithm of Sandu & Sander, 2006) when applied globally to the GEOS-Chem domain. Figure D1 shows the number of internal time steps required for integrating the chemical kinetics over one 20-min external time step, in a GEOS-Chem simulation at 4° × 5° resolution using the standard tropospheric-stratospheric chemistry. The internal time steps are selected adaptively by the solver to achieve a certain error tolerance. The numbers are summed over the 72 vertical layers for each model grid column. The solver takes many more internal steps around sunrise/sunset, when photolysis frequencies change rapidly. We find here that 93% of model columns use 700-1,100 internal steps, while the sunrise/sunset columns use nearly 1,650 internal steps. Thus, different MPI processes may require very different times to complete the chemical integration, depending on their number of sunrise/sunset columns. We verified that this was the problem by rerunning the model with chemistry turned off and observing a much shorter "MPI_Barrier" time. A similar pattern was shown in Figure 1 of Alvanos and Christoudias (2017) for their global atmospheric chemistry model. The code and further discussion of this problem are available at https://github.com/geoschem/gchp/issues/44 and https://github.com/geoschem/geos-chem/issues/77.

Figure D1. Number of internal steps taken by the Rosenbrock solver for integrating chemical kinetics in GEOS-Chem over a 20-min external time step, sampled at 01 July 2016 00:00 UTC for a global simulation at 4° × 5° resolution. Numbers are summed over the 72 vertical layers for each column of the model grid.
One might attempt to correct this load imbalance by parallelizing the global domain by latitude bands, so that each compute node has a comparable number of sunrise/sunset grid boxes. But the advection component of the model would then require more internode communication and thus scale less efficiently. Alternatively, one might seek to speed up the chemical integration at sunrise/sunset. There has been interest within the GEOS-Chem community in developing adaptive chemical solvers to speed up the chemical integration (Santillana et al., 2010;Shen et al., 2019). Our results indicate that particular attention should be placed on sunrise/sunset in order to accelerate massively parallel computation.
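For a rough sense of the magnitude of this imbalance, the short calculation below uses the internal step counts quoted above and assumes that chemistry cost scales with the step count; the column fractions assigned to each hypothetical process are illustrative, not measured.

```python
# Rough estimate of chemistry load imbalance between MPI processes, assuming
# chemistry cost scales with the number of internal solver steps (values
# quoted in the text: typical columns ~700-1,100 steps, sunrise/sunset ~1,650).
typical_steps = 900     # mid-range of the steps used by 93% of columns
terminator_steps = 1650  # columns near sunrise/sunset

def mean_steps(frac_terminator):
    """Average steps per column for a process with the given sunrise/sunset share."""
    return frac_terminator * terminator_steps + (1 - frac_terminator) * typical_steps

fast = mean_steps(0.0)  # hypothetical process with no sunrise/sunset columns
slow = mean_steps(0.5)  # hypothetical process with half its columns at sunrise/sunset

print(f"slow/fast chemistry time ratio ~ {slow / fast:.2f}")
# Every other process waits in MPI_Barrier for the slowest one, so this
# ratio translates directly into idle time.
```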