Accelerating an Adaptive Mesh Refinement Code for Depth‐Averaged Flows Using GPUs

Solving the shallow water equations efficiently is critical to the study of natural hazards induced by tsunamis and storm surge, since it provides more response time in an early warning system and allows more runs to be done for probabilistic assessment, where thousands of runs may be required. Adaptive mesh refinement speeds up the process by greatly reducing computational demands, while accelerating the code on a graphics processing unit (GPU) does so through faster hardware. Combining both, we present an efficient CUDA implementation of GeoClaw, an open source Godunov-type high-resolution finite volume numerical scheme on adaptive grids for the shallow water equations with varying topography. The use of adaptive mesh refinement and spherical coordinates allows modeling of transoceanic tsunami propagation. Numerical experiments on the 2011 Japan tsunami and a local tsunami triggered by a hypothetical Mw 7.3 earthquake on the Seattle Fault illustrate the correctness and efficiency of the code, which implements a simplified dimensionally split version of the algorithms. Both numerical simulations are conducted on subregions of a sphere with adaptive grids that adequately resolve the propagating waves. The implementation is shown to be accurate and faster than the original code running on Central Processing Units (CPUs) alone. The GPU implementation, when running on a single GPU, is observed to be 3.6 to 6.4 times faster than the original model running in parallel on a 16-core CPU. Three metrics are proposed to evaluate the performance of the model, which show efficient usage of hardware resources.


Introduction
Tsunamis are among the most dangerous natural disasters and have recently resulted in hundreds of thousands of fatalities and significant infrastructure damage (e.g., the 2004 Indian Ocean tsunami, the 2010 Chile tsunami, the 2011 Japan tsunami, and more recently the 2018 Indonesia tsunami). A wide range of research has been conducted to simulate tsunami phenomena. These studies are essential for evacuation planning (e.g., Lämmel et al., 2010; Scheer et al., 2012), designing vertical evacuation structures (e.g., Ash, 2015; González et al., 2013), risk assessment (e.g., Annaka et al., 2007; Geist & Parsons, 2006; González et al., 2014), and early warning systems (e.g., Liu et al., 2009; Taubenböck et al., 2009). The latter two, in particular, can benefit from fast simulators. For instance, Probabilistic Tsunami Hazard Assessment often requires thousands of simulations to generate tsunami hazard curves and maps, and some early warning systems depend on rapid simulation results to allow longer evacuation time.
In this paper we present a GPU-accelerated version of the open source GeoClaw code distributed as part of Clawpack (Clawpack Development Team, 2017), which is widely used for modeling tsunamis and underwent a benchmarking process in 2011 (Horrillo et al., 2014) as part of the National Tsunami Hazard Mitigation Program, allowing its use on tsunami hazard modeling projects supported by this program (e.g., González et al., 2013; Titov et al., 2018). It has been used for many other projects involving tsunami hazard assessment and/or the study of historic or paleotsunamis (e.g., Amir et al., 2013; Arcos & LeVeque, 2014; Cienfuegos et al., 2018; Galanti et al., 2011; Hayes & Furlong, 2010; Johnstone & Lence, 2009; MacInnes et al., 2013; Ren et al., 2013). GeoClaw has also been used in modeling storm surge (e.g., Mandli & Dawson, 2014) and dam failures (e.g., George, 2010; Turzewski et al., 2019), and so the GPU-accelerated version should prove useful beyond tsunami modeling.
The use of AMR adds challenges to the implementation, including dynamic memory structure creation and manipulation, balanced distribution of computing loads between the CPU and the GPU, and optimizations to minimize global memory access and maximize arithmetic efficiency in the GPU kernels. To achieve good speedup, some simplifications have been made to the algorithms of GeoClaw, as described below. The tests presented in section 6 show speedups from 3.6 to 6.4 when compared to the original model running in parallel on a 16-core CPU, while the effect of the algorithmic simplifications on the computed results has been found to be negligible on these test problems and on other tests.
In sections 3 and 4 we summarize the AMR and finite volume methods implemented in GeoClaw to the extent needed to explain the GPU implementation, which is described in section 5. Many more details can be found in previous publications; in particular, LeVeque et al. (2011) give an extensive discussion of the algorithms, along with an overview of tsunami modeling applications, while Berger et al. (2011) discuss more details of the software, with additional test problems. More references are summarized in Table 1.

Related Work and Other Approaches
The numerical modeling of tsunamis often includes modeling multiple phases with very different scales, including tsunami generation from the source (e.g., Nosov, 2014; Okada, 1985), long-distance propagation (e.g., Choi et al., 2003; George & LeVeque, 2006; Titov & Gonzalez, 1997), local inundation of coastal regions (e.g., Park et al., 2013; Qin et al., 2017), and its interaction with coastal structures (e.g., Motley et al., 2015; Winter et al., 2017). Some computer codes use separate models for different phases of tsunamis, while others integrate the modeling of multiple phases into a single simulation (e.g., Macías et al., 2016; Zhang & Baptista, 2008), facing the computational challenges induced by very different scales (from thousands of kilometers to tens of meters) in the problem. Often multiple grid resolutions are used, with finer cells near the coastal region of interest and coarser cells offshore. In particular, the MOST code (MOST: Method of Splitting Tsunami model, 2019; Titov & Gonzalez, 1997), which is used both at the NOAA Center for Tsunami Research (NCTR) and as part of the U.S. tsunami warning system, uses nested rectangular grids at different resolutions about each coastal location of interest. Other codes use triangular meshes where the triangle size can vary dramatically. However, static refinement is typically used, in which the grid resolutions are set a priori and do not vary as the simulation evolves.
AMR algorithms have been shown to effectively reduce computational cost in the numerical simulation of multiscale problems in many application areas, allowing one to track features much smaller than the overall scale of the problem and adjust the computational grid during the simulation. These algorithms have been implemented in several frameworks, which can be categorized into three major variants.

The first one is often referred to as patch-based or structured AMR, as originally proposed in Berger and Oliger (1984) and Berger and Colella (1989), and is the approach used in GeoClaw. It allows rectangular grid patches of arbitrary size and any integer refinement ratios between two levels of grid patches (e.g., Adams et al., 2015; Bryan et al., 2014; Hornung & Kohn, 2002; Zhang et al., 2016). Another variant is cell-based AMR, which refines individual cells and often uses a quadtree or octree data structure to store the grid patch information. The last variant is a combination of the first two, often referred to as block-based AMR. Unlike patch-based AMR, which stores the multiresolution grid hierarchy as overlapping and nested grid patches, this approach stores the grid hierarchy as nonoverlapping fixed-size grid patches, each of which is stored as a leaf in a forest of quadtrees or octrees (e.g., Burstedde et al., 2011, 2014; Fryxell et al., 2000; Liang & Borthwick, 2009; Liang et al., 2004; MacNeice et al., 2000; Meister et al., 2017; Popinet, 2012). Other multiresolution approaches, such as adaptive wavelet methods, have also been applied to the shallow water equations (e.g., Aechtner et al., 2015).
GeoClaw, like many numerical codes that are commonly used for tsunami hazard assessment and forecasting, currently implements only explicit methods for the first-order shallow water equations described in section 3. Shallow water equations are also frequently used to model storm surge and overland flooding due to dam breaks, and GeoClaw has been used in these application areas as well (e.g., George, 2011; Mandli & Dawson, 2014; Turzewski et al., 2019). Dispersive effects are very important to model for some short-wavelength tsunamis, for example, those generated by landslides (e.g., Watts et al., 2003), and can be important in some situations for long-wavelength (e.g., earthquake generated) tsunamis. However, including dispersive effects gives rise to higher-order PDEs for which implicit methods are generally needed. This adds considerable complication to AMR algorithms, as discussed by Popinet (2015), who presents an adaptive algorithm. GPU acceleration has been applied to tsunami models in the past, although primarily on grids with fixed spatial resolution, including those of Acuña and Aoki (2009), Smith and Liang (2013), Amouzgar et al. (2016), and De la Asunción et al. (2016). A recent application to an AMR code is presented by de la Asunción and Castro (2017) for a landslide-generated tsunami, and GPU acceleration of other AMR algorithms has been pursued in other communities, particularly in astrophysics (e.g., Schive et al., 2010; Wang et al., 2010).

Shallow Water Equations With Variable Topography
The shallow water equations have been used broadly by many researchers in the modeling of tsunamis, storm surge, and flooding. They can be written as a nonlinear system of hyperbolic conservation laws for water depth and momentum,

$$h_t + (hu)_x + (hv)_y = 0,$$
$$(hu)_t + \left(hu^2 + \tfrac{1}{2} g h^2\right)_x + (huv)_y = -g h B_x - D\,hu,$$
$$(hv)_t + (huv)_x + \left(hv^2 + \tfrac{1}{2} g h^2\right)_y = -g h B_y - D\,hv, \qquad (1)$$

where $h(x, y, t)$ is the water depth, $u(x, y, t)$ and $v(x, y, t)$ are the depth-averaged velocities in the two horizontal directions, $g$ is the gravitational acceleration, $B(x, y, t)$ is the bathymetry/topography, and $D = D(h, u, v)$ is the drag coefficient. The subscript $t$ represents a time derivative, while the subscripts $x$ and $y$ represent spatial derivatives in the two horizontal directions. The value of $B(x, y, t)$ is positive for topography above sea level and negative for bathymetry. Coriolis terms can also be added to the momentum equations but are generally negligible for tsunami problems and are not used here. The drag coefficient used in the current implementation is

$$D(h, u, v) = \frac{g n^2 \sqrt{u^2 + v^2}}{h^{4/3}}, \qquad (2)$$

where $n$ is the Manning coefficient and depends on the roughness of the ground. A constant value of $n = 0.025$ is often used for tsunami modeling, and this value is used for all benchmark problems in this study. The above equations are written in Cartesian coordinates, and rectangular grids in these coordinates can be used for modeling tsunamis in small regions. Logically rectangular grids mapped to spherical coordinates are also supported for transoceanic tsunami propagation.


Finite Volume Methods and Wave Propagation Algorithm
A one-dimensional homogeneous system of hyperbolic equations in nonconservative (quasilinear) form can be written as

$$q_t + A(q)\, q_x = 0. \qquad (3)$$

For this nonconservative form, the wave propagation algorithm (LeVeque, 1997, 2002) can be used to update the solution,

$$Q^{n+1}_i = Q^n_i - \frac{\Delta t}{\Delta x}\left(\mathcal{A}^{+}\Delta Q_{i-1/2} + \mathcal{A}^{-}\Delta Q_{i+1/2}\right), \qquad (4)$$

where $\mathcal{A}^{+}\Delta Q_{i-1/2}$ is the net effect of all right-going waves propagating into cell $\mathcal{C}_i$ from its left boundary and $\mathcal{A}^{-}\Delta Q_{i+1/2}$ is the net effect of all left-going waves propagating into cell $\mathcal{C}_i$ from its right boundary. Namely,

$$\mathcal{A}^{+}\Delta Q_{i-1/2} = \sum_{p=1}^{m} \left(s^p_{i-1/2}\right)^{+} \mathcal{W}^p_{i-1/2}, \qquad
\mathcal{A}^{-}\Delta Q_{i+1/2} = \sum_{p=1}^{m} \left(s^p_{i+1/2}\right)^{-} \mathcal{W}^p_{i+1/2},$$

where $m$ is the total number of waves, $\mathcal{W}^p$ is the $p$th wave from the Riemann problem, $s^p$ is the wave speed of the $p$th wave, and $(s)^{+} = \max(s, 0)$, $(s)^{-} = \min(s, 0)$. The notation here is motivated by the linear case, where the flux is $f(q) = Aq$. In that case the waves are simply a decomposition of the initial jump onto the eigenvectors of the coefficient matrix $A$, propagating at speeds given by the eigenvalues,

$$q_r - q_l = \sum_{p=1}^{m} \alpha^p r^p, \qquad \mathcal{W}^p = \alpha^p r^p,$$

where $q_r$ and $q_l$ are the right and left states of the Riemann problem, $r^p$ is the $p$th eigenvector of the matrix $A$, and $\alpha^p$ is the coordinate of the jump in the direction of $r^p$.
The wave propagation form of Godunov's method (equation (4)) is only first-order accurate and introduces a large amount of numerical diffusion into the solution. This often smears out the steep gradients in the solution, which are common in the surface elevation near the shoreline in tsunami simulations. To obtain second-order resolution and maintain steep gradients, additional correction terms are added to equation (4),

$$Q^{n+1}_i = Q^n_i - \frac{\Delta t}{\Delta x}\left(\mathcal{A}^{+}\Delta Q_{i-1/2} + \mathcal{A}^{-}\Delta Q_{i+1/2}\right) - \frac{\Delta t}{\Delta x}\left(\tilde{F}_{i+1/2} - \tilde{F}_{i-1/2}\right).$$

The second-order correction fluxes are computed as

$$\tilde{F}_{i-1/2} = \frac{1}{2} \sum_{p=1}^{m} \left|s^p_{i-1/2}\right| \left(1 - \frac{\Delta t}{\Delta x}\left|s^p_{i-1/2}\right|\right) \widetilde{\mathcal{W}}^p_{i-1/2},$$

where the time step index $n$ is dropped and the superscript $p$ refers to the wave family. The limited wave is

$$\widetilde{\mathcal{W}}^p_{i-1/2} = \Phi\!\left(\theta^p_{i-1/2}\right)\, \mathcal{W}^p_{i-1/2}, \qquad
\theta^p_{i-1/2} = \frac{\mathcal{W}^p_{I-1/2} \cdot \mathcal{W}^p_{i-1/2}}{\mathcal{W}^p_{i-1/2} \cdot \mathcal{W}^p_{i-1/2}},$$

where the index $I$ represents the interface on the upwind side of interface $x_{i-1/2}$, and $\Phi(\theta)$ is a limiter function that gives values near 1 where the solution is smooth and close to 0 near discontinuities. This property of the limiter function preserves second-order accuracy in regions where the solution is smooth while avoiding nonphysical oscillations near discontinuities. Note that computing limited waves adds complexity to the parallel algorithm implemented on the GPU, since the limited waves at one interface cannot be computed until the waves at the neighboring interfaces have been solved for. This is detailed in section 5.
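As a concrete example, a limiter such as the monotonized centered (MC) limiter used for the benchmarks in section 6 can be evaluated on the GPU with a small device function of the following form. The function names and the vector form of $\theta$ as a ratio of wave inner products are written out here only for illustration; this is a sketch, not the exact GeoClaw routine.

```cuda
// Sketch of limiting one wave family at interface i-1/2. The wave vectors
// have meqn components; w_upwind is the wave at the upwind interface I-1/2.
__device__ double mc_limiter(double theta) {
    // Monotonized central-difference limiter:
    // phi(theta) = max(0, min((1 + theta)/2, 2, 2*theta))
    double phi = fmin(0.5 * (1.0 + theta), fmin(2.0, 2.0 * theta));
    return fmax(0.0, phi);
}

__device__ void limit_wave(const double *w_upwind, double *w, int meqn) {
    // Form theta from inner products of the upwind wave and this wave.
    double dot = 0.0, norm2 = 0.0;
    for (int m = 0; m < meqn; ++m) {
        dot   += w_upwind[m] * w[m];
        norm2 += w[m] * w[m];
    }
    if (norm2 == 0.0) return;                 // zero-strength wave: nothing to limit
    double phi = mc_limiter(dot / norm2);
    for (int m = 0; m < meqn; ++m) w[m] *= phi;   // limited wave = phi * W
}
```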
A two-dimensional hyperbolic system in nonconservative form,

$$q_t + A(q)\, q_x + B(q)\, q_y = 0,$$

is the natural extension of the one-dimensional hyperbolic system (equation (3)) to two space dimensions. The Godunov-type finite volume algorithms discussed above can be extended to two dimensions by dimensional splitting, which splits the two-dimensional problem into a sequence of one-dimensional problems. The wave propagation algorithm now becomes

$$Q^{n+1}_{ij} = Q^n_{ij}
- \frac{\Delta t}{\Delta x}\left(\mathcal{A}^{+}\Delta Q_{i-1/2,j} + \mathcal{A}^{-}\Delta Q_{i+1/2,j}\right)
- \frac{\Delta t}{\Delta x}\left(\tilde{F}^n_{i+1/2,j} - \tilde{F}^n_{i-1/2,j}\right)
- \frac{\Delta t}{\Delta y}\left(\mathcal{B}^{+}\Delta Q_{i,j-1/2} + \mathcal{B}^{-}\Delta Q_{i,j+1/2}\right)
- \frac{\Delta t}{\Delta y}\left(\tilde{G}^n_{i,j+1/2} - \tilde{G}^n_{i,j-1/2}\right),$$

where $\mathcal{A}^{+}\Delta Q_{i-1/2,j}$ and $\mathcal{A}^{-}\Delta Q_{i+1/2,j}$ are the net effects of all waves propagating into cell $\mathcal{C}_{ij}$ from its left and right edges, while $\tilde{F}^n_{i-1/2,j}$ and $\tilde{F}^n_{i+1/2,j}$ are the second-order correction fluxes through its left and right edges. Similarly, $\mathcal{B}^{+}\Delta Q_{i,j-1/2}$ and $\mathcal{B}^{-}\Delta Q_{i,j+1/2}$ are the net effects of all waves propagating into cell $\mathcal{C}_{ij}$ from its bottom and top edges, while $\tilde{G}^n_{i,j-1/2}$ and $\tilde{G}^n_{i,j+1/2}$ are the second-order correction fluxes through its bottom and top edges. In GeoClaw, the shallow water equations are written in nonconservative form, which augments the system by introducing equations for the topography and the momentum flux. An approximate Riemann solver has been implemented by George (2008) to solve the Riemann problems at each cell interface for this augmented system. This Riemann solver has some nice properties for tsunami modeling, including the capability of preserving the steady state of the ocean, handling dry states in the Riemann problem, and maintaining nonnegative depth in the solution. As a result, the coastline does not have to be treated as a domain boundary but is modeled with the wetting and drying process. This gives a simple rectangular computational domain (mapped to spherical coordinates if needed) and easily imposed boundary conditions at the boundary of the computational domain, where zero-order extrapolation of all variables is used. For Godunov-type methods this effectively gives an absorbing boundary condition for outgoing waves.
The time step size $\Delta t$ for time integration must be chosen and adapted carefully at each time step if a variable time step is used, which is typical for tsunami modeling. The Courant–Friedrichs–Lewy (CFL) condition implies that the time step size for a certain AMR level must satisfy

$$\Delta t \le \nu \, \frac{\min(\Delta x, \Delta y)}{s},$$

where $\nu \le 1$ is the CFL number and $s$ is the maximum wave speed seen at that AMR level.
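In code, this amounts to nothing more than the following helper (a sketch; the variable and function names are illustrative). In the implementation the maximum wave speed is obtained from a reduction over the wave speeds returned by the Riemann solves on the GPU.

```cpp
// Adaptive time step choice on one AMR level from the maximum wave speed
// seen during the previous step. cfl_desired plays the role of nu (e.g., 0.9).
double next_time_step(double s_max, double dx, double dy, double cfl_desired) {
    double h = dx < dy ? dx : dy;      // most restrictive cell dimension
    return cfl_desired * h / s_max;    // Delta t <= nu * min(dx, dy) / s
}
```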

AMR
The block-structured AMR algorithm implemented in Clawpack is described in numerous papers, including Berger and Oliger (1984) and Berger and Colella (1989) and is only briefly summarized here.
A collection of rectangular grid patches is used to store the solution. Grid patches at different levels have different cell sizes. The coarsest grid patches (level 1) cover the entire domain. Grid patches at level l + 1 are finer than the level l grid patches by integer refinement ratios $r^l_x$ and $r^l_y$ in the two spatial directions, $\Delta x^{l+1} = \Delta x^{l} / r^l_x$ and $\Delta y^{l+1} = \Delta y^{l} / r^l_y$, and cover subregions of the level l grid patches. In this study, the refinement ratios in the two spatial directions are always taken to be equal, $r^l_x = r^l_y$. Typically, the time step size is also refined by the same factor for level l + 1 grid patches, $\Delta t^{l+1} = \Delta t^{l} / r^l_t$, with $r^l_t = r^l_x = r^l_y$. The finer-level grid patches are regenerated every K time steps so that they move with features in the solution. When level l + 1 grid patches need to be regenerated, some cells at level l are flagged for refinement based on some criterion; in GeoClaw, cells are typically flagged where the amplitude of the wave is above some specified tolerance, or in specified regions where higher refinement is required, for example, near the target coastal location or where the wave will affect the solution at the destination during the time interval of interest, as indicated by the backward adjoint solution (Davis & LeVeque, 2016). The flagged cells are then clustered into new rectangular grid patches, which usually include some cells that are not flagged as well, using an algorithm proposed by Berger and Rigoutsos (1991). The algorithm tries to keep a balance between minimizing the number of grid patches and minimizing the number of unflagged cells that are included in the resulting rectangular grid patches. The newly generated level l + 1 grid patches get their initial solution either by copying data from existing old level l + 1 grid patches or, if no such grid patch exists, by interpolating from level l grid patches. We say the level l + 1 patch cells are "on top" of the level l cells that cover the same spatial region. Note that the algorithm described below integrates the underlying level l grid patches before the level l + 1 patches. After the finer patches have been advanced, any level l cells under level l + 1 cells have their values updated to the average of the level l + 1 cell values. After each regridding step, the new level l + 1 patches need not cover the same level l cells as previously.

Figure 1. Advancing the coarsest level by one time step, for an adaptive mesh refinement hierarchy with three levels of grid patches. The refinement ratio is 2 for both level l = 1 and level l = 2. Each black horizontal arrow in a solid line represents taking one time step on a specific level. Each blue vertical arrow in a dashed line represents one updating process that averages the solution from a fine level to a coarse level and a refluxing process that preserves global conservation. The numbers (1) to (10) describe the order in which all operations are taken.

Time Integration
Each grid patch in the AMR grid hierarchy, despite having a different resolution, can be integrated in time with the wave propagation form of Godunov's method described in the previous section. Specifically, the following steps are applied recursively, starting from the coarsest grid patches at level l = 1, as illustrated for a simple case in Figure 1 (a code sketch of this recursion is given below):

1. Advance the solution in all level l grid patches at $t_n$ by one step of length $\Delta t^l$ to get the solution at $t_n + \Delta t^l$.
2. Fill the ghost cells for all level l + 1 grid patches, by either copying cell values from adjacent level l + 1 grid patches if any exist, or interpolating in space and time from the cell values at level l at $t_n$ and $t_n + \Delta t^l$ if no adjacent level l + 1 grid patch exists. Note that interpolation in time is generally required because the finer grids are integrated with smaller time steps.
3. Advance the solution at level l + 1 for $r^l_t$ time steps so that the solution at level l + 1 reaches the same time as the solution at level l. Each time level l + 1 is advanced, this entire algorithm (steps 1-5) is applied recursively to the next finer level (with l replaced by l + 1 in these steps) if additional levels exist.
4. For any grid cell at level l that is covered by level l + 1 grid cells, the solution Q in that cell is replaced with an appropriate weighted average of the values from the $r^l_x r^l_y$ level l + 1 cells on top. This is referred to below as the updating process.
5. For any grid cell at level l that is adjacent to level l + 1 grid cells, the solution Q in that cell is adjusted to replace the value originally computed using fluxes found on level l with the potentially more accurate value obtained by using level l + 1 fluxes on the side of this cell adjacent to the level l + 1 patch. This step also preserves conservation for certain problems and is referred to below as the refluxing process. (This step is dropped in our implementation; see below.)

Step 5 is important for some problems where exact conservation is expected, for example, of a conserved tracer or for strong shock waves in nonlinear problems, and is necessary in this case to avoid the use of different numerical fluxes at the same interface on the side of the fine patch and the side where it abuts a coarser grid. However, this step requires storing additional flux information at every time step and communicating this information between levels, and it was found to have a large negative effect on the ability to speed up the code on the GPU due to extra overhead, which depends on the machine being used as well as the actual problem being simulated, since sizes and numbers of AMR grid patches vary significantly for different problems. We also found that this refluxing step has very little effect on the numerical results obtained for tsunami modeling (as shown in section 6.1). In this application we do not expect conservation of momentum at any rate (due to the topographic and friction source terms), and even conservation of mass is sacrificed when AMR is applied to a cell near the coast (LeVeque et al., 2011). Near the coast, cells may vary between dry and wet states. The interpolation and averaging strategies during the regridding process near the coast are chosen to maintain the steady state of the ocean at rest, since refinement generally occurs before the tsunami waves arrive in an undisturbed area of the ocean. In some circumstances, this goal cannot be achieved without sacrificing the conservation of mass. Even with this potential loss of conservation, the GeoClaw code gives results that have been validated extensively by comparing simulation results with other models and field observations (e.g., LeVeque et al., 2011; González et al., 2011). For these reasons we omit Step 5 in the GPU implementation. This greatly helps to optimize the logistics of the code and achieve very impressive performance in the benchmark problems while introducing only negligible changes to the solution. Note that for all the benchmarks in the current study, the GPU code has been compared with the original GeoClaw code configured to turn off Step 5.
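To make the recursion concrete, the following host-side sketch mirrors steps 1-4 above (step 5, the refluxing process, is omitted, as in the GPU implementation). AmrHierarchy, Patch, and the helper routines are illustrative placeholders, declared here only so the structure of the recursion is complete; they are not the actual GeoClaw data structures or subroutines.

```cpp
// Structural sketch of the recursive AMR time stepping (refluxing omitted).
#include <vector>

struct Patch;
struct AmrHierarchy {
    int finest_level() const;
    int refine_ratio(int l) const;                 // r^l (equal in x, y, and t)
    std::vector<Patch *> &patches(int l);
};
void step_patch(Patch &p, double t, double dt);    // one Godunov step (on the GPU)
void fill_ghost_cells(int level, double t, AmrHierarchy &h);
void average_down(int fine, int coarse, AmrHierarchy &h);

void advance_level(int l, double t, double dt, AmrHierarchy &h) {
    for (Patch *p : h.patches(l)) step_patch(*p, t, dt);    // step 1
    if (l + 1 > h.finest_level()) return;

    int r = h.refine_ratio(l);
    double dt_fine = dt / r;
    for (int k = 0; k < r; ++k) {
        fill_ghost_cells(l + 1, t + k * dt_fine, h);        // step 2
        advance_level(l + 1, t + k * dt_fine, dt_fine, h);  // step 3 (recursion)
    }
    average_down(l + 1, l, h);                              // step 4: updating
}
```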

Regridding
Every time a level is advanced by b time steps, regridding based on this level is conducted (except on the finest allowed level). A larger b results in less frequent regridding, which reduces the time spent on the regridding process. However, in order to ensure that the waves in the solution do not propagate beyond the refined region before the next regridding, when cells are flagged for refinement an extra layer of b cells surrounding the original flagged cells is usually also flagged. This makes each grid patch 2b cells wider in each of the two horizontal dimensions and thus introduces more cells, which increases the computational time of the time integration. As both the regridding process and the time integration contribute to the total running time, a trade-off must be made when choosing a good value for b in order to lower the total running time. It is in general hard to determine the "sweet spot" for b, as the running time of the regridding process depends on the flagging criteria chosen, while the running time of the time integration depends on the equations being solved. Typically, b is chosen empirically as 2-4 in tsunami modeling.
In the regridding process, cells must be flagged before they are clustered into new grid patches. A variety of flagging criteria have been implemented, including flagging based on the slope of the sea surface, the sea surface elevation, or adjoint methods. For all the benchmarks in this study, the sea surface elevation is used for flagging. In addition, some spatiotemporal regions can be specified to enforce flagging in those regions, which is very useful for problems where both the transoceanic propagation and the local inundation of a tsunami must be modeled and thus require grid cells that are O(10^3) to O(10^4) times finer than the coarsest resolution in some near-shore regions such as a harbor or bay.
During regridding, if a newly generated grid cell cannot copy values from old cells at the same level, its initial value must be interpolated from coarser levels. In the updating process, coarse cell values are updated with an appropriate average of the fine grid cells on top. An important requirement for both the interpolation and averaging strategies in tsunami modeling is to maintain the steady state of the ocean at rest, since refinement generally occurs before the tsunami waves arrive in an undisturbed area of the ocean. For areas far from the shoreline, the interpolation can be simple linear interpolation, and the averaging strategy is to average the surface elevation of the fine cells arithmetically and then compute the depth in the coarse cell from the topography and the averaged surface elevation. However, near the shoreline, where one or more cells may be dry, it is in some circumstances impossible to maintain conservation of mass and also preserve the flat sea surface during the interpolation or averaging. Details of the strategies can be found in LeVeque et al. (2011).
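As an illustration of the averaging strategy away from the shoreline, the following sketch (a hypothetical helper, not the actual GeoClaw routine) coarsens a block of wet fine cells by averaging the surface elevation and then recovering the depth from the coarse-cell topography, so that a flat sea surface at rest remains flat after coarsening.

```cpp
// Average a block of n_fine wet fine cells into one coarse cell by averaging
// eta = h + B and recovering the coarse depth from the coarse topography.
// Dry or partially dry blocks need the special cases described in
// LeVeque et al. (2011) and are not handled here.
double coarsen_depth(const double *h_fine, const double *B_fine,
                     int n_fine, double B_coarse) {
    double eta_sum = 0.0;
    for (int k = 0; k < n_fine; ++k)
        eta_sum += h_fine[k] + B_fine[k];        // surface elevation per fine cell
    double eta_avg = eta_sum / n_fine;
    double h_coarse = eta_avg - B_coarse;        // depth from coarse topography
    return h_coarse > 0.0 ? h_coarse : 0.0;      // clip to nonnegative depth
}
```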

A Hybrid CPU/GPU Implementation
One very basic question in designing a GPU implementation of an existing code is which parts of the program should be done by the CPU and which by the GPU. In the current implementation we have put the Riemann solvers, wave limiter, and CFL reduction on the GPU, while letting the CPU take care of the rest, including the updating process, the regridding process, filling ghost cells, and updating gauge values (finding the best grid patch from which to sample quantities of interest, interpolating from cell values, and output). Since the GPU and the CPU considered in this study have separate physical memory, this design requires transferring the solution data on each grid patch back and forth between the GPU and CPU memory through a PCI Express 2.0 interface, which has relatively low bandwidth compared to the main memory of the CPU and the GPU. However, as we show later in the benchmark results, the extra time introduced by such data transfers accounts for less than 5% of the total running time, since these operations can be largely hidden by performing other operations concurrently.
If we instead put all procedures on the GPU, although it might save time by eliminating much of the data transfer, the code might suffer from tiny GPU kernels that add significant overhead to the running time and from inefficient GPU kernels that can be even slower than their CPU counterparts. One example is filling ghost cells on the GPU. Each GPU kernel that fills ghost cells for a two-dimensional grid patch of a by a cells can have parallelism of O(a) at most (2a ghost cells to fill on each side), which is much less than the parallelism exposed in time integration of the same grid patch, which has parallelism of O(a^2) and often involves much more computation. The overhead of launching a GPU kernel is almost a fixed amount of time regardless of the actual execution time of the kernel. This overhead is often much longer than the execution time of a tiny GPU kernel like the one that would be needed for filling ghost cells. As a result, the total cost (including kernel launch overhead and kernel execution) of doing such operations on the GPU can be even higher than the cost of doing so on the CPU in some cases. In addition to the kernel launch overhead, code that puts all procedures on the GPU wastes CPU computational resources. Since a typical machine considered in this study consists of a multicore CPU and a GPU, ideally the workload of the entire program would be distributed between the two as evenly as possible, such that both the CPU and the GPU stay busy as much as possible during the entire execution. In section 6, two metrics are proposed to measure such characteristics of a code in numerical experiments.

Procedure Dependencies and Concurrent Execution
Figure 2 shows the major procedures of the code that are hereafter referred to as the non-AMR portion of the code. (Omitted are the regridding process and the updating process, essential components of AMR.) An arrow from procedure A to procedure B indicates that procedure A must be finished before procedure B can start. The color indicates the type of hardware resource a procedure needs. Four colors represent the four major types of hardware resource involved in the execution of the code: a blue block uses one CPU core, a green block uses the GPU Streaming Multiprocessors, a red block uses the memory transfer engine that transfers data from the CPU memory to the GPU memory, and a purple block uses the memory transfer engine that transfers data in the opposite direction. Note that these are separate transfer engines, but there is only one of each. Any two procedures without a dependency can be done concurrently as long as the relevant hardware resources are available. The dependencies in the current implementation are enforced through a combination of rearrangement of CPU procedures and GPU kernel launches, use of OpenMP directives and CUDA streams, and proper synchronization between CPU threads and between the CPU and the GPU.

Figure 3 shows an example of these procedures being processed concurrently by the four types of hardware on a machine with a 3-core CPU. Procedures follow the dependencies specified in Figure 2. Procedures that use the same hardware resource must wait in a queue for the hardware to become available. Note that procedures that need CPU cores can use any available CPU core, so procedures processed by different CPU cores can be executed concurrently.

In Figure 3, during the entire time period when the red block for grid patch 12 is processed by one of the two memory transfer engines, the GPU Streaming Multiprocessors are processing the green block for grid patch 14. In this case, transferring the solution data of grid patch 12 from the CPU memory to the GPU memory does not induce any extra cost. During some middle time period of the red block for grid patch 14, however, no GPU computation is conducted, so the GPU Streaming Multiprocessors are idle, which can be caused by unavailability of data on the GPU, for instance. This time segment does induce extra cost due to transferring data between the CPU memory and the GPU memory. Later, in section 6, this extra cost will be quantified to reveal the influence of transferring data between the CPU memory and the GPU memory on the performance of the code. Two additional metrics will also be defined and measured in section 6: the proportion of time during which the CPU has some work to do instead of waiting for the GPU to finish, and the proportion of time during which the GPU Streaming Multiprocessors are doing computations.
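A schematic of how this overlap might be orchestrated is sketched below, with one CUDA stream per OpenMP thread so that host-to-device copies, kernel launches, and device-to-host copies for independent grid patches can be queued concurrently. Patch and the helper routines are placeholders rather than the actual GeoClaw code, and the host buffers are assumed to be pinned (page-locked) so the asynchronous copies can actually overlap with computation.

```cpp
#include <cuda_runtime.h>
#include <omp.h>
#include <vector>

struct Patch {
    double *q_host;   // solution data in (pinned) CPU memory
    double *q_dev;    // solution data in GPU memory
    size_t bytes;     // size of the solution array
};

void fill_ghost_cells(Patch &p);                               // CPU work
void launch_godunov_step(Patch &p, double dt, cudaStream_t s); // GPU kernels
void post_process_on_cpu(Patch &p);                            // e.g., gauge updates

void advance_patches(std::vector<Patch> &patches, double dt) {
    #pragma omp parallel
    {
        cudaStream_t stream;
        cudaStreamCreate(&stream);        // one stream per OpenMP thread

        #pragma omp for schedule(dynamic)
        for (int i = 0; i < (int)patches.size(); ++i) {
            Patch &p = patches[i];
            fill_ghost_cells(p);                                  // CPU core
            cudaMemcpyAsync(p.q_dev, p.q_host, p.bytes,
                            cudaMemcpyHostToDevice, stream);      // H2D engine
            launch_godunov_step(p, dt, stream);                   // GPU SMs
            cudaMemcpyAsync(p.q_host, p.q_dev, p.bytes,
                            cudaMemcpyDeviceToHost, stream);      // D2H engine
            cudaStreamSynchronize(stream);   // wait before the CPU touches the data
            post_process_on_cpu(p);                               // CPU core
        }
        cudaStreamDestroy(stream);
    }
}
```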

Memory Pool
During regridding, new grid patches are generated and old ones are destroyed. New memory must be allocated on both the CPU and the GPU to store solution data and auxiliary data for the new grid patches, while old memory for removed grid patches must be freed. The total overhead of calling the CUDA runtime library to conduct these frequent memory operations cannot be neglected and can even dominate the execution of the code when grid patches are so small that the overhead is expensive relative to the time spent advancing the solution on the patches. To avoid the cost of such frequent memory operations, a memory pool is implemented, which requests a large chunk of memory from the system by calling the CUDA runtime library at initialization, keeps it until the end of execution, and obtains more chunks when needed. All memory allocation and deallocation requests from the code then go through this memory pool at much lower cost, with no need to actually allocate or free system memory.
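The following is a minimal sketch of such a pool for device memory, assuming reuse of freed blocks through size-keyed free lists; the real pool also grows by requesting additional chunks and must respect alignment, both of which are omitted here.

```cpp
// One large cudaMalloc up front; sub-allocations served by advancing an
// offset; freed blocks remembered by size so later patches of the same
// size can reuse them without touching the CUDA runtime.
#include <cuda_runtime.h>
#include <cstddef>
#include <map>
#include <vector>

class DevicePool {
public:
    explicit DevicePool(size_t bytes) : capacity_(bytes) {
        cudaMalloc(&base_, capacity_);     // the only real allocation
    }
    ~DevicePool() { cudaFree(base_); }     // the only real deallocation

    void *allocate(size_t bytes) {
        std::vector<void *> &reusable = free_blocks_[bytes];
        if (!reusable.empty()) {           // reuse a freed block of this size
            void *p = reusable.back();
            reusable.pop_back();
            return p;
        }
        if (offset_ + bytes > capacity_) return nullptr;   // real pool: grab a new chunk
        void *p = static_cast<char *>(base_) + offset_;
        offset_ += bytes;
        return p;
    }

    void release(void *p, size_t bytes) {  // no cudaFree: just remember the block
        free_blocks_[bytes].push_back(p);
    }

private:
    void *base_ = nullptr;
    size_t capacity_ = 0;
    size_t offset_ = 0;
    std::map<size_t, std::vector<void *>> free_blocks_;
};
```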

Efficient Design of the Solver Kernels

The CUDA Programming Model
The current implementation is based on the CUDA programming model and targets Nvidia GPUs. The architecture of Nvidia GPUs and an explanation of the CUDA programming model are detailed in the Nvidia CUDA C programming guide (NVIDIA, 2018). Here only a brief review is given to provide sufficient background for understanding the implementation details introduced in this section.
In the CUDA programming model, each function that is written to run on the GPU is called a CUDA kernel (or GPU kernel). The code in a CUDA kernel specifies a set of instructions to be executed by multiple CUDA threads in parallel. The code can specify that some of the instructions should be executed by a certain group of threads but not the others. All threads assigned to execute a CUDA kernel are grouped into CUDA blocks. All such CUDA blocks then form a CUDA grid. CUDA blocks are independent of each other, can be sent to different Streaming Multiprocessors, and run concurrently. During the execution of a CUDA kernel, each thread is provided with information regarding which CUDA block and which thread within that block it is. Based on this information, each thread can perform its own set of instructions on a specific portion of the data.
The GPU has many different types of hardware for data storage. The three relevant types here are registers, shared memory, and main memory. The registers are the fastest storage and have very low access latency, but each Streaming Multiprocessor has a very limited number of registers. Each thread is assigned its own registers and thus can only access its own registers (unless special instructions are used).
The shared memory has lower bandwidth and longer access latency than the registers, but its bandwidth is still much higher than that of the main memory and its access latency is much shorter. Each CUDA block is assigned a specified amount of shared memory, which is accessible by all CUDA threads in the block. The quantities of registers and shared memory in a Streaming Multiprocessor are limited and fixed. As a result, the number of CUDA blocks that can reside in a Streaming Multiprocessor at the same time is limited by the total number of registers and the amount of shared memory these CUDA blocks request. If too few CUDA blocks can reside in a Streaming Multiprocessor at the same time, the Streaming Multiprocessor has low occupancy and thus runs the CUDA kernel less efficiently. For this reason, it is important to minimize the number of registers used by each thread and the amount of shared memory used by each CUDA block when a CUDA kernel is designed.
The main memory is located farthest from the chip and thus has the lowest bandwidth and the longest access latency. In principle, a CUDA kernel should have a minimal number of reads from and writes to main memory, especially given that stencil computations for partial differential equations are often memory bandwidth bound. A simple idea in CUDA kernel design is to load all data as efficiently as possible from the main memory into the shared memory, conduct all computations using the shared memory as a buffer to avoid unnecessary accesses of main memory, and then write new data back to the main memory. However, this often requires too much shared memory for each CUDA block and results in very inefficient execution of the CUDA kernel.

Data Layout
Every 32 threads within a CUDA block are grouped as a warp, which executes the same instruction at the same time, including memory load and store operations. The hardware can service a memory request from all threads in a warp most efficiently if they access a contiguous piece of memory. This is called a coalesced access. This characteristic of the GPU hardware makes Structures-of-Arrays (SoA) preferable to Arrays-of-Structures (AoS). With the SoA layout, the same state variable, for example, the water depth h, for the entire grid patch is stored contiguously in memory, whereas with the AoS format, all state variables within the same grid cell are stored contiguously in memory. The AoS layout results in strided access to the GPU memory: consecutive CUDA threads access memory locations that are not consecutive. This can greatly reduce the effective memory bandwidth since memory accesses cannot be coalesced.
The current implementation contributes to the Clawpack ecosystem (Mandli et al., 2016), which uses an AoS layout since Fortran arrays are dimensioned so that q(m, i, j) is the mth component (depth or momenta) in the (i, j) grid cell. However, many applications within the Clawpack ecosystem will be affected if the AoS data layout is changed. Thus, we continue to use the AoS layout in the current implementation. In the first half of the dimensional splitting method, the CUDA kernel that solves the equation in the x direction reads in data in AoS but writes intermediate solution data in SoA, which is coalesced. The CUDA kernel that solves the equation in the y direction then reads in data in SoA layout in a coalesced manner and writes new solution back in AoS layout.
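The two layouts differ only in index arithmetic. The following helpers (illustrative only, with ghost cells ignored) show the AoS indexing used when reading q, matching the Fortran q(m, i, j) ordering, and the SoA indexing used when writing the intermediate state in the x sweep, so that consecutive threads write consecutive addresses.

```cuda
// Index helpers for a patch with meqn state variables and mx*my cells.
__host__ __device__ inline int aos_index(int m, int i, int j,
                                         int meqn, int mx) {
    return m + meqn * (i + mx * j);   // components of one cell are contiguous
}
__host__ __device__ inline int soa_index(int m, int i, int j,
                                         int mx, int my) {
    return i + mx * (j + my * m);     // one component over all cells is contiguous
}
```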

CUDA Kernel Implementations
In designing a CUDA kernel, one essential goal is to assign computational tasks to each thread. To perform time integration on grid patches with Godunov-type wave propagation methods, the goal is to decide how to distribute to each thread the tasks of solving the Riemann problems at each cell edge, limiting waves, and updating cell values. Figure 4 shows a one-dimensional slice of a grid patch along the x direction when the first step of the dimensional splitting method is conducted to get the intermediate state Q * . Updating cell C j depends on the two sets of waves at cell edges x j−1/2 and x j+1/2 . When waves are limited, the waves at cell edge x j−1/2 depend on waves at cell edges x j−3/2 and x j+1/2 , while the waves at cell edge x j+1/2 depend on waves at cell edges x j+3/2 and x j−1/2 . Any of these waves depends on the two cell values around them, respectively. As a result, the solution at cell C j depends on four neighboring sets of waves, which depend on the five cell values around cell C j (including itself).
If each CUDA thread is assigned to update a cell in this one-dimensional slice, it needs to solve the four Riemann problems the cell depends on. Redundant work is performed, since some Riemann problems are solved and some waves are limited by neighboring CUDA threads as well. On the other hand, if each thread is assigned to solve a Riemann problem at one cell edge in this one-dimensional slice, limit the waves, and then update the two neighboring cells with left- and right-going waves, the code must carefully avoid data races since each cell is updated by two CUDA threads. This typically involves using a synchronization mechanism such as a lock, which decreases the execution efficiency of the CUDA kernel.
In the current implementation, a combination of the two ideas above is implemented. Figure 5 shows how CUDA threads are first assigned to cell edges for solving Riemann problems and limiting waves and then reassigned to grid cells for updating the solution. Each solid arrow denotes assigning one CUDA thread on a cell edge. The thin arrows show the initial assignment to edges, while the thick arrows show the final assignment to cells.
In the first stage, each CUDA thread is assigned to a cell edge to solve the Riemann problem there. The left and right states for the Riemann problem for a thread are loaded from the main memory, while resulting waves from the Riemann problem are written into the shared memory. In the second stage, each CUDA thread limits its waves to get correction fluxes at the cell edge it is assigned to. This requires reading waves from the two neighboring edges, which were produced by the two neighboring CUDA threads and stored in the shared memory. Each thread then writes the limited waves (correction fluxes) back to the shared memory.
In the last stage, each CUDA thread is assigned to update a cell (thick arrows). At this point, each thread already has the left-going waves and correction fluxes at the right edge of its cell in its registers, which can be directly applied to update the cell value. The right-going waves and correction fluxes at the left edge of the same cell were produced by its left neighboring thread and were stored in the shared memory in the previous stage. Thus, each thread reads these waves and fluxes from the shared memory and applies them to update the cell it is assigned to. Each thread then writes the updated value Q* back to the main memory. A similar kernel is then run for the second step of the dimensional splitting method to get the new state Q^{n+1}. This implementation requires a kernel to read solution data from the main memory only once at the beginning of the kernel execution and write the updated solution back to the main memory only once at the end of the kernel execution. This is achieved by using only a reasonable number of GPU registers for each thread and a reasonable amount of shared memory for each CUDA block: the registers of each CUDA thread hold only the state variables for the left and right states of one Riemann problem, the waves and wave speeds from that Riemann problem, plus any intermediate variables created while solving it, while the shared memory of a CUDA block holds only the waves from one Riemann problem per thread.
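The following kernel skeleton illustrates the edge-then-cell thread assignment and the use of shared memory. To keep it self-contained, a first-order scalar advection Riemann problem with constant speed u stands in for the augmented shallow water solver and the limiting stage is omitted; only the thread assignment and the shared-memory pattern reflect the scheme described above. The row has mx interior cells plus one ghost cell on each side, and the kernel would be launched with mx + 1 threads in a single block, for example sweep_x_row<<<1, mx + 1>>>(q_row, qstar_row, mx, u, dt/dx).

```cuda
#define MAX_EDGES 130   // mx + 1 must not exceed this

__global__ void sweep_x_row(const double *q, double *q_star,
                            int mx, double u, double dtdx) {
    __shared__ double apdq[MAX_EDGES];  // right-going fluctuation per edge

    int t = threadIdx.x;                // thread t owns edge t+1/2
    bool active = (t <= mx);
    double amdq = 0.0;                  // left-going fluctuation (kept in registers)

    // Stage 1: solve the Riemann problem at this thread's edge and share
    // the right-going part with the neighboring thread via shared memory.
    if (active) {
        double wave = q[t + 1] - q[t];          // jump between the two cells
        amdq    = fmin(u, 0.0) * wave;
        apdq[t] = fmax(u, 0.0) * wave;
    }
    __syncthreads();

    // Stage 2: reassign this thread to cell t and update it. The left-going
    // part at its right edge is already in registers; the right-going part
    // at its left edge was written to shared memory by thread t - 1.
    if (active && t >= 1)
        q_star[t] = q[t] - dtdx * (apdq[t - 1] + amdq);
}
```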

Numerical Results
This section is focused on evaluating the performance of the GPU implementation. As test problems we have chosen two realistic tsunami modeling problems from recent validation studies. More details of the problems and additional comparisons of the results to those obtained using the MOST code, in simulations performed by the NOAA NCTR, are available in the project reports cited below. The first problem uses AMR to model waves propagating across the ocean from the 2011 Tohoku event, along with fine grid modeling around the tide gauge at Crescent City, CA. The second problem is a local tsunami from the Seattle Fault, with AMR used to focus on inundation in a coastal region very close to the fault, so the finest grid levels are active immediately. We chose these two problems as typical of different scenarios requiring AMR with potentially differing overhead.
Two machines for benchmarking the original CPU implementation and two machines for benchmarking the current GPU implementation are listed below:
1. A single Nvidia Kepler K20x GPU with a 16-core AMD Opteron 6274 CPU running at 2.2 GHz as the host;
2. A single Nvidia TITAN X (Pascal) GPU with a 20-core Intel E5-2698 CPU running at 2.2 GHz as the host (but only 16 CPU threads are used for fair comparison with the others);
3. A single 16-core AMD Opteron 6274 CPU running at 2.2 GHz; and
4. A single 16-core Intel Xeon E-2650 CPU running at 2.0 GHz.

Figure 7. The quantity defined in equation (19) at 5.5 and 9.5 hr after the Japan 2011 earthquake.

As shown in the previous sections, the GPU implementation consists of jobs that are done by the CPU and jobs that are done by the GPU. In the benchmarks, the CPU implementation always runs in parallel with 16 OpenMP threads, using 16 CPU cores. The CPU part of the GPU implementation is also always processed in parallel by 16 OpenMP threads, using 16 CPU cores. The GPU implementation solves the benchmark problems on machines 1 and 2, while the CPU implementation solves the same benchmark problems on machines 3 and 4. Note that machine 1 and machine 3 have the same AMD CPU, while machine 2 and machine 4 have similar Intel CPUs. Thus, in this section, all speedups are computed by comparing results on machine 1 to results on machine 3 and results on machine 2 to results on machine 4. All numerical experiments in this section are conducted with double-precision floating point operations, on both the CPU and the GPU.

We propose three metrics that measure the absolute performance of the current GPU implementation and do not require comparing a GPU implementation to a CPU implementation. We first define some quantities used in the definition of the three metrics. Along the program execution time line $t \in \mathbb{R}^{+}$, define

$$E^{\mathrm{GPU}}_i = \left[t^{\mathrm{GPU,start}}_i,\; t^{\mathrm{GPU,end}}_i\right]$$

as the time interval during which the $i$th GPU computation event (e.g., one of the green blocks in Figure 3) happens; $E^{\mathrm{GPU}}_i$ is essentially the set of all moments at which the $i$th GPU computation event is happening. Analogous intervals $E^{\mathrm{CPU}}_i$, $E^{\mathrm{h2d}}_i$, and $E^{\mathrm{d2h}}_i$ are defined for the $i$th CPU computation, the $i$th memory transfer from the CPU to the GPU memory, and the $i$th memory transfer from the GPU to the CPU memory, respectively. Then all time intervals during which the GPU is doing computation can be represented as

$$\Omega_{\mathrm{GPU}} = \bigcup_{i=1}^{N_{\mathrm{GPU}}} E^{\mathrm{GPU}}_i,$$

where $\cup$ is the union operation for sets and $N_{\mathrm{GPU}}$ is the total number of GPU computation events. Similarly, we define another three sets of intervals for the other three types of events: all time intervals during which the CPU is doing computation, $\Omega_{\mathrm{CPU}} = \bigcup_{i=1}^{N_{\mathrm{CPU}}} E^{\mathrm{CPU}}_i$, where $N_{\mathrm{CPU}}$ is the total number of CPU computation events; all time intervals during which a memory transfer from the CPU memory to the GPU memory is happening, $\Omega_{\mathrm{h2d}} = \bigcup_{i=1}^{N_{\mathrm{h2d}}} E^{\mathrm{h2d}}_i$, where $N_{\mathrm{h2d}}$ is the total number of such memory transfers; and all time intervals during which a memory transfer from the GPU memory to the CPU memory is happening, $\Omega_{\mathrm{d2h}} = \bigcup_{i=1}^{N_{\mathrm{d2h}}} E^{\mathrm{d2h}}_i$, where $N_{\mathrm{d2h}}$ is the total number of such memory transfers. Lastly, define $E_{\mathrm{total}} = [t_{\mathrm{start}}, t_{\mathrm{end}}]$ as the time interval over which the entire program runs.

The first metric measures the proportion of time during which the GPU is doing computation,

$$P_1 = \frac{|\Omega_{\mathrm{GPU}}|}{|E_{\mathrm{total}}|},$$

where $|\Omega|$ represents the measure of the set $\Omega$, which here is the total length of all time intervals in $\Omega$. Similarly, the second metric is defined as

$$P_2 = \frac{|\Omega_{\mathrm{CPU}}|}{|E_{\mathrm{total}}|},$$

which measures the proportion of time during which the CPU is doing computation. The last metric measures the proportion of extra time introduced by transferring data between the CPU memory and the GPU memory,

$$P_3 = \frac{\left|\left(\Omega_{\mathrm{h2d}} \cup \Omega_{\mathrm{d2h}}\right) - \Omega_{\mathrm{GPU}}\right|}{|E_{\mathrm{total}}|},$$

where $-$ is the subtraction operation for sets: for two sets A and B, A − B denotes all elements in A but not in B.
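In practice these metrics can be computed from recorded event time stamps (e.g., CUDA events for kernels and transfers and host timers for CPU work) by merging the intervals of each event type; a small helper of the following form (illustrative, not part of the released code) suffices.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Total length |Omega| of a union of (possibly overlapping) intervals.
double union_length(std::vector<std::pair<double, double>> events) {
    if (events.empty()) return 0.0;
    std::sort(events.begin(), events.end());
    double total = 0.0, lo = events[0].first, hi = events[0].second;
    for (const auto &e : events) {
        if (e.first > hi) {               // disjoint: close the current run
            total += hi - lo;
            lo = e.first;
            hi = e.second;
        } else {
            hi = std::max(hi, e.second);  // overlapping: extend the current run
        }
    }
    return total + (hi - lo);
}

// Example: P1 = |Omega_GPU| / |E_total| from the recorded GPU kernel intervals.
double metric_p1(const std::vector<std::pair<double, double>> &gpu_events,
                 double t_start, double t_end) {
    return union_length(gpu_events) / (t_end - t_start);
}
```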

Problem Setup
The first benchmark problem is the 2011 Japan tsunami, which was triggered by a magnitude 9.0-9.1 earthquake off the Pacific coast of Tohoku that occurred at 14:46 JST (05:46 UTC) on Friday, 11 March 2011. The earthquake source deformation files were obtained from NCTR and were converted to deformation information on uniform latitude-longitude grids for use in our implementation. This test problem was studied as part of a project funded by NCTR to validate GeoClaw for potential future use in the U.S. Tsunami Warning Centers, and more details of this simulation, along with comparisons for several other tide gauge locations and for other historical tsunamis, can be found in the project report.
The computational domain is from longitude −240 to −100 and latitude −31 to 65 in spherical coordinates, with structured quadrilateral grid cells. Three levels of refinement are set across the ocean and around the source region (before getting close to the destination). Starting from the coarsest level (Level 1), which has a resolution of 2°, the refinement ratios are 5 and 6, giving a resolution of 24 min on Level 2 and 4 min on Level 3. A refinement tolerance parameter can be specified to guide the mesh refinement: the smaller this parameter is set, the more likely the grid is to be refined to the highest level allowed in a particular region. The flagging criterion is based on the wave amplitude, with the refinement tolerance set to 0.005 m. Thus, if a region is allowed to use any of the choices above (2°, 24 min, 4 min), the region will be refined up to a maximum of Level 3 when the amplitude of a wave is higher than 0.005 m.

In addition to specifying a tolerance for flagging individual cells, regions of the domain can be specified so that all cells in the region, over some time interval also specified, will be refined to at least some level and at most some level. In the simulation, the three refinement levels mentioned above are allowed in the entire region, with other constraints in specific subregions. In the first 7 hr after the earthquake, a 4-min resolution (Level 3) is enforced in the region from longitude −231 to −170 and from latitude 18 to 62. This is reverted to the choices of 2° or 24 min after 7 hr, when the wave amplitudes are below the tolerance. Then, moving onward toward the destination, a 4-min resolution is enforced in the region from longitude −170 to −120 and from latitude 18 to 62, starting at 7 hr until the end of the 13-hr simulation. A combination of tolerance-based refinement and manually enforced refinement is used here because waves in some regions of the domain will not affect our location of interest within our time range of interest and therefore need not be well resolved. For instance, in Figure 7, the waves at bottom center (near Australia) will not affect our location of interest (point 2) until approximately 2 days later. As a result, enforcing a relatively coarse grid in those regions after a certain time saves a large amount of computational cost. If only tolerance-based refinement were used, those areas would end up being refined for a much longer time and to a much finer level. The choice of such refinement regions is based on practical tsunami modeling experience. A new variant of the code is under development that uses the solution of an adjoint problem to better guide the adaptive refinement and automatically determine what waves in the forward solution at any regridding time can potentially affect the solution in the target region of interest (Davis & LeVeque, 2016).
Near Crescent City, the destination we are interested in, three higher levels of refinement are enforced to resolve smaller-scale flow features near the coast:
1. Level 4 with 1-min resolution is enforced starting from 8 hr after the earthquake, in the region from longitude −126.995 to −123.535 and from latitude 40.515 to 44.495.
2. Level 5 with 12-s resolution is enforced starting from 8 hr after the earthquake, in the region from longitude −124.6 to −124.05 and from latitude 41.502 to 41.998.
3. Level 6 with 2-s resolution is enforced starting from 8.5 hr after the earthquake, in the region from longitude −124.234 to −124.143 and from latitude 41.717 to 41.783.

Figure 6 shows the three refinement regions as well as the location of gauge 2, where the time series of water surface elevation is recorded.
To ensure that the solution data on an entire grid patch can fit into the cache of the CPU for data locality, the size of each grid patch is limited to 128 by 128 cells for both the GPU and CPU cases. The Godunov-type dimensional splitting scheme is used with the second-order MC limiter applied to the waves. The problem is run for a simulation time of 13 hr. Figures 7 and 8 show snapshots during the simulation for the entire computational domain and near Crescent City, colored by the quantity defined in equation (19).

During the tsunami, wave height data were recorded at four DART (Deep-ocean Assessment and Reporting of Tsunamis) buoys near the earthquake source, the locations of which are shown in Figure 9. The blue rectangle in the figure indicates the extent of the earthquake source. However, most of the seafloor deformation is inside the red rectangular region, where 1-min topography files are used to make sure the region is well resolved. The wave heights predicted by the current GPU implementation at the four DART buoys are shown in Figure 10 and compared against observed data. The comparison shows that the predicted results agree quite well with the observations at all four buoys.

Simulation Results
Recall that the current implementation uses a dimensional splitting scheme with no refluxing. To show that this simplification gives comparable results to the original GeoClaw code, Figure 11 gives the time series of surface elevation recorded at a tide gauge near Crescent City, California, USA, during the 2011 Japan tsunami. The observation has been detided by subtracting the predicted tide level to remove the influence of the tide. A sample result from another well-known tsunami model, MOST (Titov & Gonzalez, 1997), has also been included for comparison. The comparison shows that a simplified GeoClaw CPU code implementing this dimensional splitting scheme with no refluxing gives results very close to those produced by the original GeoClaw and agrees well with the other model and the observed data. The current GPU implementation gives results identical to this simplified version of the original GeoClaw and is therefore not shown in the figure.

Figure 15. The quantity defined in equation (19) in Puget Sound after a tsunami triggered by the Seattle Fault rupture.
Table 2 shows the total number of cell updates and the total number of time steps taken on each AMR level for the simulation of the Japan 2011 tsunami. Note that a large portion of the computational cost is spent on Level 3 grid patches, which cover a large area in the middle of the Pacific Ocean. Although Level 3 grid patches are advanced for only approximately 5,000 time steps, the total number of cell updates on this level is still larger than on Level 6 grid patches, which are advanced almost 4 times more often but cover a much smaller region and have fewer cells. One limitation of the current implementation is that the total number of cells allowed on each AMR level is limited by the amount of GPU memory, which is typically smaller than that of the host CPU machine. The refinement ratios and criteria in both this numerical experiment and the one in section 6.2 have been carefully chosen such that the number of cells on each level never exceeds the limit of the GPU memory during the simulation.

Figure 12 shows the total running time and the proportion of the three components on the four machines. The total speedups are 4.3 on machine 1 and 6.4 on machine 2 for the current GPU implementation. Note that since the time spent on the non-AMR portion decreases on machines 1 and 2, the cost of regridding and updating takes up a larger portion of the total run time. However, one could still only gain a very limited additional performance increase if the regridding and updating processes were implemented on the GPU. Amdahl's law states that the theoretical speedup of the execution of a whole program is

$$S(s) = \frac{1}{(1 - p) + \dfrac{p}{s}}, \qquad (20)$$

where S is the theoretical speedup of the execution of the whole program, s is the speedup of the portion that is accelerated, and p is the proportion of the total running time that the accelerated portion takes. Furthermore, one has $S(s) \le \frac{1}{1-p}$, where equality is achieved as s approaches ∞ in equation (20). From Amdahl's law, even if the regridding and updating processes were implemented on the GPU and accelerated infinitely, the entire program would only gain a speedup of roughly 1.2 on machines 1 and 2.

Table 3 shows the three metrics for the current GPU implementation running on machine 1 and machine 2 when the Japan 2011 tsunami is simulated. The proportion of GPU computation ($P_1$) reaches about 50% on machine 1 and a higher value of 64% on machine 2. This could be due to the fact that machine 2 has a newer GPU, which has much lower overhead for kernel launches and memory transfers. The proportion of CPU computation ($P_2$) is around 80% for both machines. In other words, during 20% of the total running time the CPU is idle. $P_3$, the proportion of extra time introduced by transferring data between the CPU and the GPU memory, is less than 5% for both machines. This shows that even if the data could be transferred infinitely fast between the CPU and the GPU memory, so that the data transfer had no effect on execution time at all, the total running time of the entire program would be reduced by at most 5%. Thus, having to transfer data between the CPU and the GPU memory is not a critical issue affecting the performance of the code.

A Local Tsunami Triggered by Near-Field Sources

Problem Setup
The second benchmark problem is the modeling of a local tsunami triggered by a near-field earthquake, which typically hits the shoreline much earlier than a tsunami triggered by a far-field earthquake. The tsunami is triggered by a hypothetical Mw 7.3 earthquake on the Seattle Fault, which cuts across Puget Sound (through Seattle and Bainbridge Island; see Figure 13) and can create a tsunami causing significant inundation and high currents in some coastal communities around Puget Sound. The event was designed to model an earthquake that occurred roughly 1,100 years ago and for which geologic data are available for the uplift or subsidence at several locations. Here, we focus on modeling this local tsunami and predicting its impact on Eagle Harbor on Bainbridge Island, the location of which is shown in Figure 14. The ground deformation file for generating the tsunami was obtained from the NCTR, and this test problem has been used for a recent model comparison and validation study of GeoClaw and MOST as part of a tsunami hazard assessment of Bainbridge Island. More details about the modeling, along with additional comparisons of model results from the two codes, can be found in the project report (Titov et al., 2018).

Figure 14 also shows the computational domain, which extends from longitude −123.61 to −122.16 and latitude 47 to 48.7 in spherical coordinates, with structured quadrilateral grid cells. Since this is a local tsunami in an enclosed sound surrounded by land, the tsunami waves are soon reflected by shorelines and spread out to cover the full domain shortly after the earthquake. Thus, instead of using a refinement tolerance parameter, we enforce mesh refinement everywhere in the domain regardless of wave amplitude and never regenerate new grid patches. Four levels of refinement are used, as indicated by the rectangles in Figure 14, which denote regions where refinement is enforced. Starting from the coarsest level (Level 1), which has a resolution of 30 min, the refinement ratios are 5, 3, and 6, giving a resolution of 6 min on Level 2, 2 min on Level 3, and 1/3 min on Level 4.

Note that for this benchmark problem a large proportion of the domain is dry land and the shorelines are much longer and more complex. As a result, many branches occur along the execution path when solving the Riemann problems for the shallow water system, since more distinct situations arise, for example, a Riemann problem with one state being dry initially but becoming wet, or staying dry, depending on the flow depth and velocity in the neighboring cell. On the GPU, if the 32 threads in a warp do not take the same execution path, each extra branch is executed by the entire warp, introducing significant extra execution time. Thus, the irregularity of the wet area in this benchmark problem makes it challenging for some CUDA kernels to use the GPU hardware efficiently.

Figures 15 and 16 show snapshots at several moments during the simulation in Puget Sound and near Eagle Harbor, colored by the quantity defined in equation (19). The black solid line denotes the original shoreline before the earthquake. At the entry of the harbor, deep inundation occurred at several places as early as 3 min after the earthquake. Even at the very end of Eagle Harbor, the influence of the tsunami is also significant, causing more than 2-m-deep inundation in several places starting at 9 min after the earthquake.
One wave gauge is placed inside Eagle Harbor to record the inundation depth during the tsunami. Figure 14 shows the location of the wave gauge. As this is a hypothetical event modeling an earthquake roughly 1,100 years ago, there is no surface elevation observation available for comparison. Hence, the results from the current implementation are compared with those from the MOST tsunami model (Titov & Gonzalez, 1997) (Figure 17). Additional comparisons of GeoClaw and MOST results can be found in the comparison study recently performed by Titov et al. (2018). For that study the original CPU version of GeoClaw was used, with the unsplit algorithm and refluxing, but we have confirmed that very similar results are obtained with the GPU code, at least in Eagle Harbor.

Table 4 shows the total number of cell updates and the total number of time steps taken on each AMR level for the simulation of the tsunami triggered by the Seattle Fault rupture. The majority (88%) of the computational cost is spent on Level 4, which is designed to cover only the small area of interest around Eagle Harbor on Bainbridge Island. Figure 18 shows the total running time and the proportion of the two components on the four machines (there is no regridding component since regridding is never conducted). The total speedups are 3.7 on machine 1 and 5.0 on machine 2 for this benchmark problem. For the original CPU implementation, the non-AMR portion takes 98% and 99% of the total computational time, which indicates a high potential benefit from optimizing the performance of this portion. Although the proportion of the non-AMR portion decreases for the current implementation on machine 1 and machine 2, it still takes more than 95% of the total computational time, showing great potential for further improvement.

Table 5 shows the three metrics for the current GPU implementation running on machine 1 and machine 2 when the Seattle Fault tsunami is simulated. Similar values are obtained for all three metrics, showing the consistency and validity of the three metrics for evaluating the GPU implementation on different tsunami problems.

Conclusions
The shocking fatalities and infrastructure damage caused by tsunamis in the past two decades highlight the importance of developing fast and accurate tsunami models for both forecasting and hazard assessment. This paper presents a fast and accurate GPU-based version of the GeoClaw code using patch-based AMR. Arbitrary numbers of refinement levels and refinement ratios between levels are supported. The surface elevation at DART buoys and wave gauges in the benchmark problems shows the ability of the current model to produce accurate results in tsunami modeling. With the GPU, the entire tsunami model runs 3.6-6.4 times faster than the original CPU-based tsunami model for several benchmark problems on different machines. As a result, the Japan 2011 Tōhoku tsunami can be fully simulated for 13 hr in under 3.5 min of wall clock time, using a single Nvidia TITAN X GPU. Three metrics for measuring the absolute performance of a GPU-based model are also proposed to evaluate the current GPU implementation without comparing it to others; they show the ability of the current model to efficiently utilize GPU hardware resources. Other hazards such as storm surge (e.g., Mandli & Dawson, 2014) and dam failures (e.g., George, 2010) can also be modeled with GeoClaw and can similarly benefit from this GPU-enhanced version of GeoClaw.