Toward Global Stochastic River Flood Modeling

Global flood models integrate flood maps of constant probability in space, ignoring the correlation between sites and thus potentially misestimating the risk posed by extreme events. Stochastic flood models alleviate this issue through the simulation of flood events with a realistic spatial structure, yet their proliferation at large scales has historically been inhibited by data quality and computer availability. In this paper, we show, for the first time, the efficacy of modeled river discharge reanalyses in the characterization of flood spatial dependence in the absence of a dense stream gauge network. While global hydrological models may show poor correspondence with absolute observed river flows, we find that the rate at which they can simulate the joint occurrence of relative flow exceedances at two given locations is broadly similar to when a gauge‐based statistical model is used. Evidenced over the United States, flood events simulated using observed gauge data from the U.S. Geological Survey versus those generated using modeled streamflows have similar (i) distributions of site‐to‐site correlation strength, (ii) relationships between event size and return period, and, importantly, (iii) loss distributions when incorporated into a continental‐scale flood risk model. Extremal dependence is generally quantified less accurately on larger rivers, in arid climates, in mountainous terrain, and for the rarest high‐magnitude events. However, local‐scale errors are shown to broadly cancel each other out when combined, producing an unbiased flood spatial dependence model. These findings suggest that building accurate stochastic flood models worldwide may no longer be a distant aspiration.


Introduction
Hydraulic modeling at large spatial scales is a field of enquiry approaching a state of maturity, with the flood maps produced beginning to inform wide-area planning decisions, insurance pricing, and emergency response. Historically a reach-scale venture, multiple institutions from academia, industry, and elsewhere have expanded the spatial domain they consider when developing models of fluvial flood inundation up to the entire globe, as a response to the scarcity of flood hazard data in most world regions Pappenberger et al., 2012;Sampson et al., 2015;Winsemius et al., 2013;Yamazaki et al., 2011). These flood maps, however, are typically "static"; that is, they are a spatially homogeneous representation of a given probability flood. While at the local scale this may be representative of a realistic event (e.g., a small stretch of river experiencing a T-year flood event), as the size of the area considered increases, the resultant flood extents become increasingly unrealistic when viewed as a simultaneous phenomenon (e.g., an entire country experiencing a T-year flood event at the same time). Actual floods vary in their extremity across space: If a given location is extreme, you may expect proximal locations to be similarly extreme and distal locations to be decreasingly extreme . This asymptotic spatial dependence structure reflects the spatial heterogeneity of the driving physical processes in the atmosphere and how the terrestrial (sub)surface responds hydrologically (Blöschl & Sivapalan, 1997;Smith et al., 1994). Univariate (at-a-point) flood frequency analysis does not capture this spatial structure. Yet, current national-scale risk mapping is generated through assimilating a mosaic of thousands of local-scale univariate models (Gilles et al., 2012), for example, by the U.S. Federal Emergency Management Agency (FEMA) and the England Environment Agency. Equally, the simultaneous simulation across space of univariately defined probability floods is how typical large-scale flood hazard models are executed (Alfieri et al., 2014;Dottori et al., 2016;Pappenberger et al., 2012;Sampson et al., 2015;Wing et al., 2017;Winsemius et al., 2013;Yamazaki et al., 2011).
The failure to represent spatial dependence in risk calculations often leads to a misestimation of wide-area risk profiles (the probability-loss curve) when quantified using such data Hirabayashi et al., 2013;Ward et al., 2013;Wing et al., 2018;Winsemius et al., 2016). While spatially constant probability flood maps can be used to estimate average annual losses (by integrating the area under the probability-loss curve), they cannot tell us about losses which may occur in more extreme loss years. Such events are the very thing insurers must quantify for pricing (and to ensure they have access to adequate capital for regulatory and solvency reasons), that emergency managers must prepare for, and that corporations require to mitigate supply-chain risk. Metin et al. (2020), using a flood model with realistic spatial dependence for the Elbe basin, Germany, demonstrate that damages are underestimated by static flood maps for return periods more frequent than 1 in 50 years and overestimated for return periods less frequent than 1 in 200 years. They found the tail of the loss distribution (>200-year) was at least doubled when conflating the T-year discharge with the T-year loss. This erroneous conflation and its consequences have been further documented by Lamb et al. (2010), Wyncoll and Gouldby (2015), and Thieken et al. (2015).
Such findings support a move away from the spatially homogeneous risk maps that are ubiquitous in flood risk management toward an approach that accounts for the spatial structure of real flood events (Vorogushyn et al., 2018). Although understudied relative to the engineering-standard univariate case, multivariate flood frequency analyses have proliferated since the turn of the century. The characterization of spatio-temporal extreme flows is commonly performed either by continuous hydrological simulation (Falter et al., 2015(Falter et al., , 2016Grimaldi et al., 2013;Haberlandt & Radtke, 2014) or fitting statistical dependence models to samples of stream gauges (Diederen et al., 2019;Ghizzoni et al., 2012;Keef et al., 2009Keef et al., , 2013Neal et al., 2013;Quinn et al., 2019). The former involves cascading synthetic meteorological time series through a rainfall-runoff model, from which the output synthetic discharge time series can form the basis of a flood frequency analysis. This approach has the advantage of full coverage of the river network (at least, down to the resolution of the model) and simulating the full event hydrograph in preparation for unsteady hydraulic modeling, while an observation-based statistical model contends with gauge density issues and simulates only the joint occurrence of flood peaks. However, continuously simulating enough years (10,000 years is common) to adequately explore tail flood risk comes at high computational cost: lower process representation, coarser grid scales, and smaller model domains are necessitated across the model cascade. Furthermore, the use of hydrologically modeled discharges to drive hydraulic inundation models at large spatial scales is subject to substantial uncertainty (Beck et al., 2017;Prudhomme et al., 2011;Sperna Weiland et al., 2010). Such flows are often generated amidst a lack of data with which to calibrate or parameterize the hydrological models (Beven & Cloke, 2012) and are driven with meteorological data that may poorly represent extremes (Beck et al., 2019;Kendon et al., 2017;Kidd et al., 2012). These uncertainties can be exacerbated by the hydraulic modeling component (Falter et al., 2016;Grimaldi et al., 2019), cascading into loss estimates that are thus highly sensitive to the simulated flood peaks (Sampson et al., 2014;Zischg et al., 2018). For the purposes of spatial dependence modeling, not only is the accuracy of simulated discharge at-a-point of concern, so too is the ability of the hydrological model to accurately simulate the occurrence of concomitant flood peaks elsewhere. Continuous hydrological simulation studies to date are yet to evidence the validity of the joint occurrence and magnitude of modeled extreme discharges at-scale.
Adopting a gauge-based multivariate statistical modeling approach reduces accuracy issues through the use of observations of river discharge, though are still subject to uncertainties arising from measurement error, the stage-discharge relationship, and the extrapolation of short time series (McMillan et al., 2012). Such an approach also minimizes computational load through avoiding rainfall-runoff modeling and simulating the hydraulic extremes only. This permits hydraulic modeling of greater complexity and higher resolution applied across wider regions. The perennial issue in hydrology still remains, however: How do you characterize flows on ungauged streams? For the univariate case, flood frequency analyses have been regionalized by correlating extreme flow observations with catchment characteristics, and assuming unobserved flood behavior in similar catchments is thus predictable from the regionalization (Merz & Blöschl, 2009;Salinas et al., 2013;Smith et al., 2015). This time-invariant approach is, however, not applicable for modeling spatial dependence. Quinn et al. (2019), in the only continental-scale stochastic flood risk model subjected to peer-review (models similar in concept exist in the insurance industry but remain scientifically opaque), adopt a crude interpolation method whereby single river gauges transpire to represent large ungauged areas. While this may be acceptable in the gauge-rich contiguous United States (CONUS), reliably applying the Quinn et al. (2019) gauged-based model across a mostly data-poor planet is not possible (Hannah et al., 2011). Ultimately, to obtain a view of flood spatial dependence at the global scale, simulated discharge from large-scale hydrological models will necessarily play a role.
In this paper, we examine the extent to which simulated discharge time series from global hydrological models can reliably inform a stochastic view of flood risk at large spatial scales. To do so, we emulate the gauge-based approach of Quinn et al. (2019) in the CONUS, its statistical engine underpinned by the conditional exceedance model of Heffernan & Tawn (2004;hereafter, H&T), but replace the observational flow inputs with those simulated by global hydrological models. We compare the characteristics of the model-driven H&T synthetic events with those generated by a gauge-driven H&T model in terms of their spatial patterning and degree of extremity. We then produce event depth grids by linking the synthetic return periods at each gauge location for each event and the presimulated suite of CONUS-wide return period flood maps presented in Wing et al. (2017). In this way, the often inaccurate absolute discharges from global hydrological models are disregarded in favor of their exceedance probabilities, while the channel hydraulics and flood inundation dynamics are still informed by a gauge-based flow estimation procedure. With an inundation map for every event, we can examine the risk profile of a sample building inventory when run against the observation-and simulation-driven models and compare their similarity in the context of vulnerability estimation uncertainty. This analysis thus demonstrates, for the first time, the efficacy of simulated streamflow in large-scale stochastic modeling, illustrating the extent to which scientifically sound studies of flood spatial dependence globally may proliferate.

Methods
The core concepts underpinning the research presented in this paper are illustrated in Figures 1 and 2, providing a small-scale example of the kinds of analyses we are subsequently undertaking at the continental scale in section 3. These figures show an exemplar analysis for two sites in the U.S. Midwest-Platte River at Louisville, NE, and, 250 km downstream, Missouri River at St. Joseph, MO (see Figure 1a)-where information on the magnitude and frequency of flows in time and space from a global hydrological model and river gauges are contrasted. Full methodological details are presented in subsequent subsections. Figures 1b and 1c depict a model hindcast and historic observations of flow at the two sites, respectively. Noting the y axis scale in Figure 1b is double that of Figure 1c, it is evident that the hydrological model severely overestimates discharge at each location. The 1 in 5-year discharge (20% annual chance of being exceeded), as computed by fitting a generalized extreme value (GEV) distribution to annual maxima, on the Platte River is 1,800 m 3 s −1 based on observations compared to 4,300 m 3 s −1 according to the model. Similarly, the 5-year discharge on the Missouri River is 4,900 m 3 s −1 when calculated using gauged flows, compared to 12,200 m 3 s −1 when using the model. Equally, there is often model-observation disagreement in the chronology of local 5-year flow exceedance. Observed exceedances on the Platte River in 1983 and 1987 according to the river gauges were not simulated by the models, while unobserved-yet-modeled exceedances occurred in 1985, 1998, 2007, and 2014. However, these above findings-which, by most conventional hydrological metrics, would be heavily penalized in any evaluation of model performance-may not be relevant in a characterization of the correlation between the Platte River and the Missouri River:

10.1029/2020WR027692
Water Resources Research that is, the tendency for both sites to simultaneously be extreme. The Platte River exceeds its data source-specific 5-year flow in eight and seven discrete events according to the model and observations, respectively. The Missouri River concurrently exceeds its data source-specific 5-year flow during four of the eight events in the model record and four of the seven events in the gauge record. Hence, at the

Water Resources Research
arbitrary 5-year frequency threshold, the sites are 57% correlated according to the model and 50% correlated according to the gauges: There is a 57% (model) or 50% (gauge) chance of the Missouri River exceeding its 5-year discharge given that the Platte River has also exceeded its 5-year discharge. In this context, the model and the observations have produced very similar results.
We can explore this extreme flow dependency more fully by computing the magnitude and strength of the correlation between the Platte River and the Missouri River using the H&T statistical model (see section 2.2 for more details), from which we can sample any number of plausible flow co-occurrences (here denoted as an "event"). Figure 2a shows, in blue, the same data as in Figure 1b but with concurrent flows at show historic concurrent maximum 9-day discharges (Q) at the conditioning (x axis) and neighboring ( y axis) sites (in blue), as well as 10,000 samples of the fitted H&T joint distribution above a given extreme threshold (96th quantile) at the conditioning site (in red) for the hydrological model and observations, respectively. Panels (c) and (d) show the same synthetic events in panels (a) and (b) but converted to the quantiles of their respective semiempirical flow distributions for the hydrological model and observations, respectively. Panels (e) and (f) are a version of panels (c) and (d) where the probability of the Missouri River (MR) exceeding its respective flow quantile y given that the Platte River (PR) exceeds its respective flow quantile x is plotted as a surface for the hydrological model and observations, respectively. The probabilities at the numbered points are shown in Table S1 in the supporting information.
the two sites plotted against one another. Figure 2b shows the same, but for the gauged discharge observations (note the common x and y axes in Figures 2a and 2b, illustrating the gulf in flow magnitude between model and observations). The red dots depict 10,000 synthetic flood events as simulated by the H&T model, when supplied with modeled (a) and observed (b) flows. Since we are only interested in the correlation of extreme flows, we condition the H&T model on the Platte River exceeding an arbitrary extreme threshold (96th flow quantile). Hence, given Platte River is extreme, it is often the case that the Missouri River 250 km away is also extreme at some time later during the same event-and as Platte River gets more extreme, so the likelihood the Missouri River is also extreme increases. Both Figures 2a and 2b agree on this assertion, though for very different magnitudes of discharge.
Figures 2c and 2d plot those same synthetic flows (red dots in Figures 2a and 2b) but as their respective quantiles on the source-(model/gauge) and site-(Platte River/Missouri River) specific marginal, semiempirical distribution, or, equally, the cumulative probability of flow exceedance. Here, we can see that the model-(c) and gauge-(d) based simulated flow quantiles are broadly similar: Generally, the Missouri River is often extreme when the Platte River is too. We can examine this in more detail in the surface plots of Figures 2e and 2f. These are based on the same data as Figures 2c and 2d but illustrate the probability that the Missouri River exceeds the quantile on the y axis given that the Platte River exceeds the quantile on the x axis. The probabilities at the numbered points are shown in Table S1. Again, we see very similar results for both input data sources in terms of the probability that the two sites experience certain flow conditions.
The approach of the work in this paper broadly follows the concepts outlined in Figures 1 and 2 but for all available U.S. Geological Survey (USGS) river gauges (subject to quality control; see section 2.1.1), the hypothesis being that models which substantially misestimate streamflow in absolute terms may still have skill in the characterization of site-to-site correlation in terms of relative exceedance probability; in essence, "normalized" modeled streamflows (expressed as a probability) represent flood spatial dependence similarly to observations. Details on the river gauges, hydrological models, statistical model, and flood risk model are outlined in the following subsections, with tests for the similarity of flood spatial dependence modeling between gauge and hydrological model inputs described. The key steps of this research are outlined in the flowchart of Figure 3.

Input Data 2.1.1. USGS River Gauges
As per Quinn et al. (2019), we obtained daily river flow time series at approximately 9,800 gauging locations from the USGS. These records are often discontinuous and of variable quality and so are filtered whereby gauges are selected if they (i) contain data for at least 35 years between 1980 and 2015; (ii) have less than 2% erroneous or missing measurement days; and (iii) exhibit no discernible trend or step changes (as determined by the Kendall τ test). This resulted in 3,094 river gauges for use in the proceeding analyses.

Global Hydrological Models
We employ three global hydrological models in this analysis. Daily streamflow hindcasts simulated by each model for the period 1980-2015 were extracted at each gauge location in order to facilitate the model-observation dependence comparison. To ensure the hydrological model output is analogous to each respective gauge (grid coarseness and/or errors in the underlying hydrography data may produce a divergence in the size of the simulated and observed river), the drainage area upstream of each gauge (as reported by the USGS) needed to be within 30% of that in the model hydrography. To ensure a fair intercomparison, gauge locations were only retained where this condition was met for all three models: of the 3,094, 1,674 gauges were retained. Unsurprisingly, the rejected gauges were generally on smaller rivers (median drainage area of rejected gauges = 450 km 2 ), meaning the median drainage area of 1,300 km 2 from the original 3,094 sites increased to 3,400 km 2 in the accepted set of 1,674. This focus on larger rivers-necessitated by the fidelity of available wide-area elevation and hydrography data used by global hydrological models-is consistent with the use of daily streamflow observations in this analysis, which may fail to capture the flashier rainfall-runoff response in smaller catchments. Further, the application of large-scale hydrological models in the context of flood modeling is commonly restricted to larger rivers, owing to their coarse grid resolution Pappenberger et al., 2012;Ward et al., 2013;Winsemius et al., 2013;Wu et al., 2014;Yamazaki et al., 2011). The method in this paper can therefore be viewed as an analysis of fluvial flood spatial dependence, while flooding on smaller rivers (and hydrologically isolated local depressions)-better 10.1029/2020WR027692

Water Resources Research
characterized as pluvial flood events-remain outside the scope of this work. A brief description of the three hydrological models follows.

CaMa-Flood (UTokyo)
CaMa-Flood (Catchment-based Macro-scale Floodplain model) is a hydrodynamic global river routing model, which was developed mainly as a river module of global climate models and also as a tool for large-scale flood risk assessment (Yamazaki et al., 2011(Yamazaki et al., , 2012. It represents floodplain inundation dynamics as a subgrid physical process based on high-resolution topography data and calculates river discharge using a simplified shallow water flow equation ) along a coarse-resolution river network map including river channel bifurcation (Yamazaki et al., 2014). For this study, we used the latest version of CaMa-Flood (v3.95), which is based on the topography and hydrography data sets MERIT DEM and MERIT Hydro (Yamazaki et al., 2017(Yamazaki et al., , 2019. The model was run, uncalibrated, at 6 arc minutes (~11 km at the equator) spatial resolution from 1980-2015 forced by the daily runoff product calculated by the HTESSEL land surface model for the Earth2Observe project .  Winsemius et al., 2016). For this study, we implemented the latest version, PCR-GLOBWB 2 (Sutanudjaja et al., 2018), at the spatial resolution of 5 arc minutes (~10 km at the equator). The meteorological forcing files were based on monthly CRU TS 3.2 (Harris et al., 2014), daily ERA-40 (Uppala et al., 2005), and ERA-Interim (Dee et al., 2011) for hindcast simulations of the period 1958-2015.
No calibration was performed to the standard parameterization sets. The simulation setup for this study used an advanced surface water kinematic wave routing scheme, which includes floodplain inundation (see, e.g., Winsemius et al., 2013).

GloFAS (JRC/ECMWF)
The third considered data set is the v3.0 reanalysis of the Global Flood Awareness System (GloFAS; Alfieri et al., 2020). It includes estimates of daily river discharge for 1980-2018 with quasi-global coverage (90°N to 60°S) and a spatial resolution of 6 arc minutes (~11 km at the equator). The GloFAS streamflow reanalysis was produced with the LISFLOOD hydrological model (van der Knijff et al., 2010) and ECMWF's atmospheric reanalysis ERA5 (Herzbach et al., 2018) as meteorological input. In the GloFAS-Reanalysis v3.0, LISFLOOD was calibrated using at least 4 years of observed discharges at 1,226 stations in 66 countries, of which 278 were in the United States. GloFAS is designed for medium to large river basins with drainage area larger than 5,000 km 2 . Performance at the 278 calibration stations in the United States have median scores of 0.66 Kling-Gupta Efficiency, 0.75 correlation, and 5.5% bias (Alfieri et al., 2020).

H&T Model
In order to define independent events, we need to be explicit in our meaning of "simultaneous," since different sites may not necessarily experience peak flows from the same event on the same day. Following Quinn et al. (2019), we consider a temporal window of 9 days: suitably long to capture likely flood wave travel times during a single event, and suitably short so as to avoid capturing spurious correlations. However, this time window would require a given conditioning site to be regressed against every gauge nine times (once for each day in the window), presenting a cumbersome computational problem. As such, we temporally coarsen the input data to 3-day maxima so that these time-lagged regressions are required only three times for each neighboring site. To further reduce the number of required joint regressions and prevent unrealistic, spurious correlations, we apply spatial limits to which gauge locations can be considered dependent. The statistical model underpinning the dependence characterization was first presented in Heffernan and Tawn (2004), and its application to stochastic flood modeling is detailed more exhaustively in Keef et al. (2009Keef et al. ( , 2013, Neal et al. (2013), Wyncoll and Gouldby (2015), and Quinn et al. (2019). First, a marginal distribution is fitted to the flow data at each location, defining the probability a certain discharge is exceeded at the given site-independent of flows occurring at neighbors. These distributions are "semiempirical," whereby the original (empirical) data characterizes exceedance probabilities below a given quantile threshold and a generalized Pareto distribution is fitted to data above this threshold to define the tails. This step requires a careful balance between a quantile threshold low enough to have a suitable number of data points with which to estimate the generalized Pareto model (i.e., to keep variance low) yet high enough to avoid model misspecification bias. This "bias-variance trade-off" is guided by the cross-validatory threshold selection method of Northrop et al. (2017). The modal threshold across all sites and data sources was the 96th quantile. These marginal distributions then need to be transformed onto a common scale, in this case, the Laplace distribution, since this induces a linear relationship between joint extremes.
A generic formulation of the H&T model, simulating discharges at neighbor site Y j given conditioning site Y i is extreme, can be described as where y is a vector of discharges at site Y i that exceed a specified extreme flow quantile on the Laplace margin; Y j can be every neighbor to the conditioning site within the spatial limits; parameter α j|i controls the overall strength of the dependence at site Y j given Y i and falls within the limit [−1,1]; parameter β j|i , where an event is defined, in this instance, as the conditioning site (in blue) and a given neighbor (white-red) exceeding the 5-year flow, when simulated using USGS gauge and CaMa-Flood input discharge data, respectively; panel (e) shows the difference between gauge-( obs ) and model-( mod ) based event co-occurrence rates. Blue lines represent river channels with drainage area exceeding 500 km 2 .
controls how different values of y change this dependence and is limited to <1; and Z j|i is a vector of residual terms, which, by virtue of their independence from variable Y i , permit nonparametric modeling of Y j |(Y i = y) beyond the measured (or, in the case of the hydrological models, simulated) range of the input data. Equation 1 is repeated for the temporal lags at Y j within the 9-day window.
In this analysis, we repeat Equation 1 for all conditioning sites, each of the conditioning site's neighbors, and each neighbor's temporal lag for 10,000 values of y (extreme flow samples of equal occurrence probability). The maximum lagged streamflow from a given conditioning-neighbor-y combination is extracted, meaning every conditioning site has 10,000 (extreme) discharge samples and every neighbor within the spatial limits has 10,000 corresponding discharge samples-each one of the 10,000 groups of discharges is considered to be an "event." With 10,000 events at each conditioning site and 1,674 sites in total, we thus analyze 16,740,000 synthetic flood events across the United States for each input data source.

Fathom-US CAT Model
The simulated streamflows from each synthetic event are transformed into a return period derived from the historical site-specific discharges from each input data source (observed or simulated). Maximum likelihood estimation determined the closest fit of either a Gamma or GEV distribution to annual maxima of these historical simulated or observed flows. From these distributions, the synthetic event discharges were converted to return periods. These return periods informed an "offline" extraction of a water depths from presimulated catalog of return period flood maps, which were then intersected with exposure and vulnerability information to calculate losses arising from each event. This flood catastrophe (CAT) model, Fathom-US CAT, is outlined in the following subsections. For more detailed information on the components of this model: extreme flow estimation , inundation modeling Wing et al., 2017), loss calculations (Wing et al., 2018(Wing et al., , 2020, and event set generation .

Flood Inundation Model
Spatially continuous flood maps of the CONUS were presented and validated in Wing et al. (2017) based on the global flood modeling methodology of Sampson et al. (2015). Each map, simulated at 1 arc second (~30 m at the equator) resolution, represents one of 17 "constant return period in space" flood layers of the CONUS, ranging between 1 in 5 and a 1 in 1,000 years (20% and 0.1% annual exceedance probability). These probabilities are derived from an estimation of extreme flows using a U.S.-only version of the global regionalized flood frequency analysis (RFFA) of Smith et al. (2015). This approach is underpinned by the assumption that extreme flows on ungauged streams are predictable from their catchment characteristics (Meigh et al., 1997;Salinas et al., 2013). The information from gauged streams of similar characteristics to a given ungauged stream-in terms of annual average rainfall, upstream drainage area, dominant climatology, and average slope-is "transferred" to predict extreme flows on the latter. Median errors in the estimation of a 1 in 100-year discharge globally were reportedly 56%, though the uncertainty in the gauged observation of the 100-year flow itself can be at least 40% due to measurement error and rating curve configuration (Coxon et al., 2015;Di Baldassarre & Montanari, 2009;McMillan et al., 2012). Further, in its application to the inundation model, biases in the RFFA are dampened somewhat by ensuring river channels can convey the 1 in 2-year discharge, a standard geomorphological assumption. The computational hydraulic engine is based on LISFLOOD-FP (Bates et al., 2010;de Almeida et al., 2012), which numerically solves a local inertial formulation of the shallow water equations in 2-D over the USGS National Elevation Dataset in this instance. Further details are available from Wing et al. (2017).
Each synthetic event guides an extraction from this inventory of flood hazard layers, depending on the return period of the flow simulated by the H&T model at a given location. As per Quinn et al. (2019), each gauge location is assigned a representative river catchment-a discrete spatial unit within which we can reasonably expect the return period to remain constant during a given event. To define this, we use the HydroBASINS data set (Lehner & Grill, 2013). Each gauge site is assigned a Level 8 basin or, where multiple gauges fall within the assigned catchment, a Level 10 basin (see Figure 4b). The synthetic flood events at specific points in space (as simulated by the H&T models) are thus converted to inundation depth grids which are a composite sample of varying probability, presimulated flood maps. For more information on the catchment-based sampling procedure, see Quinn et al. (2019) and .

Water Resources Research
however, is not necessarily directly relevant to the analysis in this paper. These are simply used as a means of comparing the risk output of a gauge-based versus a model-based flood event generator. We do not "validate" the event-based depth grids; rather, we are interested in the similarity of the resultant outputs given differing input data rather than interrogating the substantive meaning of the outputs per se.

Loss Calculations
Each synthetic event was intersected with a database of U.S. buildings-the FEMA National Structure Inventory-which contains information on the location, value, and occupancy type of over 100M structures. Event depths were sampled at each structure location, and a probabilistic depth-damage function was applied, depending on the building type, to compute a range of losses to each building during each event as a proportion of its value. These functions are based on the analysis of Wing et al. (2020), who used 2M flood insurance claims in the United States to derive a stochastic relationship between depth and damage. For instance, a flood depth of 0.4 m has a 43%, 28%, 14%, and 15% chance of causing 0-10%, 10-30%, 30-50%, and 50-100% relative damage to a one-story residential building, respectively. At 1.8 m, there is a 21%, 20%, 19%, and 40% chance of causing 0-10%, 10-30%, 30-50%, and 50-100% relative damage, respectively. We sample from these loss distributions 100 times, producing a spread of damages arising from each synthetic flood event. The relative differences in loss between events generated by the different sources of flow data are then compared.

Model Tests
The meaningful results from this analysis will come from the inter-input data comparison of the spatial footprint and relative magnitude of the synthetic events and the damages they cause. This section describes the tests carried out to interrogate these differences between the gauge-and model-driven structures.
At a given conditioning site (Y i ) for a given input data source, an event co-occurrence rate (COR) is computed under a variety of scenarios for each neighboring site (Y j ): where p is a specified return period and the subscripts i and j indicate that this pertains to the conditioning and neighboring sites, respectively. In simpler terms, COR indicates the probability an event occurs at both the neighboring and the conditioning sites simultaneously, with "event" defined by the return period thresholds p i and p j . Varying these thresholds permits us to examine the similarity between the different H&T models at different degrees of extremity. COR can take any value between 0% and 100%. Figures 4c  and 4d show COR 5|5 for an example conditioning site and its neighbors for the gauge-and model-driven event sets respectively: In other words, this is the probability a neighbor exceeds the 5-year flow given the conditioning site exceeds the 5-year flow.
The difference between the COR using gauges versus using modeled streamflow at each neighbor can be summarized by the root mean square error at each conditioning site; the co-occurrence rate error (CORE), in units of percentage points (%pts). The CORE is also computed under varying scenarios to better understand model performance: where N is the number of neighbors to the conditioning site, obs and mod superscripts indicate whether COR is calculated using the gauge-or model-based events, respectively, and x is a COR threshold above which errors are computed. From Figure 4e, it is evident errors (COR obs − COR mod ) at distal neighbors are low simply because both gauge and model approaches simulate (close to) 0 correlation between such locations and the conditioning site. It may not be particularly discriminatory to reward the model-based method for correctly simulating zero dependence at distal sites, so varying x permits us to examine CORE where conditioning-neighbor correlations are of varying strengths. CORE 5|5,0 (5-year event co-occurrence error at all neighbors) for the conditioning site in Figure 4 is 5.1%pts, while CORE 5|5,5 (errors at the sites where neighbor co-occurrence >5%) is 9.3%pts.

Water Resources Research
We further scrutinize model-based H&T error by splitting COREs based on relevant geophysical and climatic characteristics. Each site is grouped within a hydrologic landscape region (based on gauge location) as defined by the USGS (Winter, 2001;Wolock, 2003), which specifies the climatic setting (humid, subhumid, semiarid, and arid), land surface form (plain, plateau, and mountain), and geologic texture (permeable/impermeable soils/bedrock) of river basins across the United States. These hydrologic regions are depicted in Figure S1. Further, the USGS gauge metadata are used to divide sites into three classes of upstream drainage area (small [<1,800 km 2 ], medium [1,800-7,200 km 2 ], and large [>7,200 km 2 ]; each group an equal-population tertile). Since the original COR calculation is pairwise, a given conditioning/neighboring site pair often do not fall within the same category. Hence, COREs are reported as an interaction within or between each grouping for climatology, morphology, geology, and drainage area. For example, the conditioning site in Figure 4 falls in the small drainage area category. CORE 5|5,0 for this conditioning site and all other small rivers (small-small) is 4.7%pts, for this site and medium rivers (small-medium) is 5.0%pts, and for this site and large rivers (small-large) is 4.8%pts.
To examine the spatial footprint of the synthetic events and how it scales with event magnitude, we perform two comparative tests between the gauge-and model-based event sets. First, where a given conditioning site exceeds a given return period threshold, we compute the median return period across these events experienced at each possible neighbor (based on the spatial limits set out in section 2.2). Assimilating these data for all conditioning sites and plotting median return period against distance from the conditioning site illustrates the distance decay in extremity simulated by the gauge-and model-based approaches. It is here that we expect to see evidence of the aforementioned asymptotic dependence of river flows and the extent to which this differs between data sources. Second, we can examine how the areal extent of flood events change as the conditioning site gets increasingly extreme. Our expectation here is that more extreme events are more widespread in space, while less extreme events are more localized. Again, we are interested in how this relationship differs between H&T model data sources.
For comparisons in a risk-based context, we can plot gauge-and model-based H&T losses on a probability-loss curve. In this instance, probability refers to the empirical event loss quantile; that is, the 99th quantile loss is the loss that has a 1% chance of being exceeded in the event set (the 167,400th most damaging event). Since we have 100 samples of loss for each event-representative of the uncertainty in depth-damage relationships-we can examine how different the probability-loss curves are in the context of this variability.
This suite of analyses, applied to each set of model input data in turn and contrasted with its gauge-based counterpart, can be summarized as follows: • event co-occurrence rate calculated at each site for different levels of extremity; • co-occurrence rate isolated for different hydrologic characteristics; • examination of return period decay with distance when the conditioning site is extreme, and how this scales with increasing extremity; • examination of event size when the conditioning site is extreme and how this scales with increasing extremity; and • calculation of flood damages arising from the synthetic event set.

Results and Discussion
We can first examine the CORs simulated by the model-based approaches and contrast these with those simulated by the gauge-based H&T model. Figure 5 illustrates COR 5|5 (an event defined as exceeding the 5-year flow at conditioning and neighboring sites) for all conditioning-neighboring site pair combinations. The total height of the stacked bars (Figures 5a-5c) represent the proportion of model-based CORs that fall within four subjective bins (COR of 0-2%, 2-5%, 5-10%, and 10-100%) relative to the binned distribution of gauge-based CORs. This distribution is selected to ensure large enough sample size across bins which loosely capture a variety of dependence strengths. A bar height of 100% indicates the same number of model and gauge location pairs fall within a particular bin. Most bin heights are close to the 100% optimum, meaning the distribution of site-to-site dependence for all location pairs is broadly similar between model and gauge methods. CaMa-Flood (91%) and PCR-GLOBWB (94%) experience a slight drop in the number of high-dependence (COR of 10-100%) locations relative to the gauges, while GloFAS (125%) predicts too many high-dependence sites. Low-to medium-dependence (COR 0-10%) sites are closely replicated by CaMa-Flood and PCR-GLOBWB, but there are too few of them simulated by GloFAS (91% and 89% in 0-2% and 2-5% bins, respectively).
While this represents very high performance versus the gauge-based method in aggregate, the colors in each bar stack indicate that, to some degree, nearing the 100% optimum masks some local-level errors. Each discrete color patch corresponds to the COR bin the gauge-based method places a particular location-pair in and thus indicates whether the model approaches have placed individual location-pairs in the same bin as the gauge-based method. The optimum result would be the entirety of the 0-2% COR mod bin bar having The colored dots indicate the COR at every location pair, with darker colors signaling a greater density of data points. The red lines are COR obs versus COR mod quantiles, with a perfect match between these illustrated by the dashed line.

10.1029/2020WR027692
Water Resources Research the lightest gray color of the COR obs 0-2% bin (according to the figure legend), the entirety of the 2-5% COR mod bin bar having the slightly darker gray color of the COR obs 2-5% bin, and so on. Most of the model-predicted high-dependence site pairings (COR mod of 10-100%) consist of gauge-predicted highdependence site pairings (COR obs of 10-100%), yet other COR mod bins also contain 10-100% COR obs bin locations. 72%, 61%, and 57% of the model-predicted high-dependence locations are also gauge-predicted high-dependence locations for CaMa-Flood, PCR-GLOBWB, and GloFAS, respectively. The remaining 10-100% COR mod bin site-pairs are made up of locations where COR obs ≠ 10-100%. Model-predicted low-dependence locations (COR mod of 0-2%) are similarly mostly populated by gauge-predicted low-dependence locations (COR obs of 0-2%). 68 percent (CaMa-Flood), 63% (PCR-GLOBWB), and 67% (GloFAS) of model-predicted low-dependence locations are also gauge-predicted low-dependence locations. The central medium-dependence bins (COR mod of 2-10%), however, are clearly populated by a mixture of location-pairs with COR obs values from different bins. Indeed, the 5-10% COR mod bin (for all models) is almost equally constituted of location-pairs with COR obs from all bins: Only 29% (CaMa-Flood), 23% (PCR-GLOBWB), and 23% (GloFAS) of the 5-10% COR mod bins are locations with 5-10% COR obs . Thus, location-pairs which have (subjectively) high or low dependence strengths according to the gauge-based method are generally also classified as such by the models. Location-pairs of (subjectively) medium-dependence strength-which only occasionally experience joint extreme flooding-are misidentified by the models more often. Figure S2 illustrates that the picture is much the same when event definition is altered to account for return periods of different magnitude at the conditioning site (COR 5|20 , COR 5|50 , and COR 5|100 ). CaMa-Flood and PCR-GLOBWB have near-identical binned distributions of pairwise correlation magnitude to the gauge-driven H&T model, with slightly lower proportions of high-dependence locations (COR mod of 10-100% < COR obs of 10-100%). GloFAS slightly underpredicts the number of low-dependence locations (COR mod of 0-10% < COR obs of 0-10%) while overpredicting the number of high-dependence locations (COR mod of 10-100% > COR obs of 10-100%). It appears, then, that this set of results is insensitive to p i thresholds in the determination of COR. This holds true when each COR mod bar stack is broken down into the COR obs of its component location-pairs also. Low-(0-2%) and high-dependence (10-100%) COR obs locations are generally modeled as such, while medium-dependence COR mod locations (2-10%) are generally a mixture of locations with varying COR obs values.
Overall, at the aggregate nationwide level, overpredictive and underpredictive errors in the characterization of flood dependence seem to cancel out to produce a population of model-based CORs that are broadly similar to gauge-based CORs. This is further evidenced by the plots in Figures 5d-5f, where the bulk of COR obs versus COR mod (darker shading) are clustered around the 1:1 line, albeit with much scatter. The quantiles of the gauge-based CORs versus those of CaMa-Flood and PCR-GLOBWB are a very close fit to quantiles of a normal distribution centered on 0 error. The Q-Q plot of GloFAS illustrates a tendency toward overpredicting location-pair dependence strength, particularly for highly correlated sites. Figure 6 illustrates the COR error (CORE 5|5,0 ) at each conditioning site, where an event is defined as a conditioning-neighboring site pair jointly exceeding their 5-year flow and errors are computed for all locations within the spatial limits. As a whole and for all models, COREs appear relatively low: The majority of errors lie within the 5-10%pts bin (e.g., a location-pair with COR obs of 24% and COR mod of 17% would fall in this group). 21% of CaMa-Flood COREs are <5%pts, 77% are <10%pts. For PCR-GLOBWB, 4% and 70% of COREs are <5%pts and <10%pts, respectively. GloFAS COREs are <5%pts and <10%pts for 4% and 59% of conditioning sites, respectively. Geographic patterns of model skill are similar across all input data sources. Model-and gauge-based CORs seem to converge in the south: around the Arkansas and Ohio Rivers and at sites in Alabama, Mississippi, Tennessee, and Kentucky in southeast United States. COREs increase in the west: most notably across the Rocky Mountains, where COREs exceed 15%pts, and also in New England. Certainly, in the arid Rockies, this is consistent with the quality of the rainfall forcing, where unresolved steep topography can result in an underprediction of extreme precipitation from orographic uplift (Elvidge et al., 2019). Similarly, drylands flooding is mostly due to localized convective storms, meaning hydrological model grid and temporal coarseness may smooth out precipitation extremes. Equally, the proportion of river flow retained in reservoirs or diverted for human use is high in the arid United States, meaning modeled flows may overestimate their true values: Only PCR-GLOBWB attempts to address this where dam operation data are available. Further, by definition, the infrequency of flood events in arid climates results in sensitivity of the fitted extreme value distribution tail to relatively few data points. Slight errors in the simulated flow can thus cascade into large deviations in the gradient of the growth curve. Table S2a isolates these errors (CORE 5|5,0 ) for different climatologies. As above, we can see a deterioration in model skill in more arid regions, where hydrological processes are complex, flood response is spatially and temporally heterogeneous, and precipitation data are more uncertain (Beck et al., 2019). Model-driven event simulation involving more humid locations generally replicate the spatial footprint of gauge-driven events more closely. Table S2b illustrates higher model performance on the plains and lower performance in mountainous areas, consistent with results from Figure 6 and known inaccuracies in rainfall products in mountainous areas (Beck et al., 2019). Skill on plateaus lies somewhere in between that of plains and mountains, perhaps due to the diversity of flood drivers in such regions (e.g., snowmelt and ice-jam floods as well as rainfall-driven events; Stein et al., 2019). In Table S2c, model performance appears relatively insensitive to the permeability of the bedrock but improves as soils become more impermeable. This is likely consistent with our lack of knowledge on subsurface processes (Fan, 2019) and the quality of subsurface process

Water Resources Research
representation in global hydrological models . Where the translation from rainfall to streamflow is predominantly surface runoff, as in impermeable catchments, modeled event footprints more closely resemble gauge-driven ones. In Table S2d, COREs are stratified by river size (expressed as catchment area). Here, we can see that errors in the characterization of site-to-site correlations consistently decrease as rivers get smaller. This may be due to large river flooding requiring the accurate simulation of the timing of multiple flood waves as they aggregate from tributaries, while smaller river flooding may be mostly controlled by the spatial patterning in the meteorological forcing. Equally, larger rivers are much more likely to be subject to human influence via flow regulation, which are liable to increase COREs where this is unaccounted for.
Overall, this thread of analysis indicates that event co-occurrence errors are sensitive to (in order of decreasing sensitivity) climatology, drainage area, land surface form, soil permeability, and bedrock permeability. More humid regions exhibit higher performance than more arid regions; smaller rivers have higher model performance than larger rivers; performance on flatter terrain is higher than in mountainous areas; events on impermeable soils are better simulated than on permeable soils; and performance is generally insensitive to bedrock permeability. This is consistent across all three global hydrological models studied.
We further explore how model performance changes with different return period and COR thresholds (p i and x in Equation 3), as CORE 5|5,0 may (i) mask errors in the characterization of more extreme events than the 1 in 5 year and (ii) be misrepresented by numerous distal, low-dependence sites which are easier to model as such. Figure 7 displays the same information contained in the histograms of

Water Resources Research
Figure 6 but as boxplots with varied p i and x thresholds. COREs appear to increase as the conditioning site becomes more extreme and as it encompasses the errors of more dependent sites only. CaMa-Flood median CORE 5|5,0 is 6.5%pts, while CORE 5|100,0 is 16.1%pts. Removing low-dependence sites from this summary metric, CaMa-Flood median CORE 5|5,5 and CORE 5|100,5 are 9.1%pts and 19.7%pts, respectively. This is perhaps as expected, since the spatial averaging effect of coarse meteorological forcing tends to blunt the largest extremes. This is especially true for smaller river basins, whose flood response is governed by smaller, poorly resolved weather events. Even if we are only interested in relative flow exceedances, the effect of this is to flatten the tails of annual flow maxima. Meanwhile, the model-based events appear more skillful when the test involves a more forgiving classification of extreme versus nonextreme flows (where p i = 5). It should also be noted that increasing values of p i increasingly rely on extrapolations beyond the observation or simulation period of 35 years, meaning the gauge-based H&T benchmark is less reliable in the computation of extremal COREs. The wealth of data simulated by large ensembles of climate models, commonly used in detection and attribution studies, has the potential for addressing the issue of short record length by leveraging many thousands of years of realizations of past weather, rather than relying on a single version of history as we do in this analysis (Mizuta et al., 2017;. Even so, in spite of rising COREs with increasing p i , characterizing site-tosite correlations within~20%pts on average even for very rare events may well be acceptable for application in flood risk modeling, where uncertainties related to inundation modeling, exposure geolocation, and flood vulnerability may dominate beyond those involved in generating the event sets. Rather than taking the relative flow exceedances as output, disparities between model-and gauge-based approaches can be considered when the absolute volumetric flows simulated by the hydrological models and synthesized by the H&T model are used instead. The results of this for CaMa-Flood are shown in Figure 8. Return periods for the absolute volumetric discharges simulated by the CaMa-Flood-based events are computed based upon the flood frequency curve of the relevant USGS gauge, while the relative flow exceedance return periods are computed from the flood frequency curve based on the CaMa-Flood simulated discharge time series at the given location. Median absolute co-occurrence rate errors, with events defined fairly loosely (CORE 5|5 ), are double relative COREs. CORE 5|5,0 increases from 6.5 to 13.9%pts (Figure 8a), CORE 5|5,5 increases from 9.1 to 18.6%pts (Figure 8b). Indeed, across all return period thresholds ( p i ) errors increase substantially, to the extent that even the lower quartiles (Q 25 ) of absolute volumetric errors are greater than the median of relative exceedance errors in all cases. The between-site co-occurrence rate during 100-year floods (when COR > 5%) is misestimated by 28.0%pts on average when absolute flows are used, compared to 19.7%pts when relative exceedances are used instead. Although a benchmark for acceptability is difficult to define in terms of CORE, it is clear that the use of absolute volumetric flows simulated by global hydrological models results in a set of synthetic events that are considerably less plausible than if relative flow exceedances are employed.
We can further compare more general characteristics of the model-and gauge-based event sets through examinations of the size of simulated events and their extremity. In Figure 9, gauge-based H&T events Figure 9. Median return period at a neighboring site versus its distance from the conditioning site, given the conditioning site exceeds the 1 in 5-, 20-, 50-, and 100-year return period (left-right column). Each row corresponds to a separate H&T input data source.

Water Resources Research
show a decay of return period with distance from the conditioning site. As the conditioning site becomes more extreme, extreme events generally occur at greater distance from the conditioning site, but the most extreme return periods during such events remain more localized (illustrated by the sharper decline in return period with distance for more extreme conditioning site events). These phenomena are generally well replicated by the hydrological model-based H&T events. CaMa-Flood and PCR-GLOBWB generally show sharper declines in return period with distance than the gauge-based approach, meaning distal sites become less extreme too quickly or events are generally too localized. Conversely, the distance decay of return period illustrated by GloFAS is gentler than that of the USGS gauges. GloFAS events are generally too widespread, where extreme return periods are experienced at greater distances from conditioning sites than for events simulated used river gauge data. One possible cause of this may be the lack of a floodplain inundation component in GloFAS, which would serve to dampen the flood wave peak. CaMa-Flood and PCR-GLOBWB, where synthetic event footprints are not as widespread, have some representation of inundation in their model structures. Equally, the effect of GloFAS calibration (CaMa-Flood and PCR-GLOBWB are uncalibrated) may have unintentionally altered the correlation of relative flow exceedances, in spite of improving the correspondence between observed and modeled absolute discharges.
The variation in the number of sites experiencing an event when the conditioning site exceeds certain return periods is illustrated for the different input data sources in Figure 10 and Table S3, where observations  Table S3 summarizes the data in this graphic. made in light of Figure 9 are generally reaffirmed. Gauge-based events are generally more widespread as their return period increases. GloFAS-based events impact too many sites across all return periods, while CaMa-Flood and PCR-GLOBWB are a closer match to the gauge-based events (yet impact too few sites). For instance, from Table S3, 22% of gauge-based 100-year events impact at least 100 sites, while for CaMa-Flood, PCR-GLOBWB, and GloFAS, the event proportions are 22%, 19%, and 37%, respectively. For 5-year events, the proportions of events which impact at least 100 sites are 10%, 6%, 6%, and 15% for gauge, CaMa-Flood, PCR-GLOBWB, and GloFAS, respectively.
On the whole, the relationships between flood event probability and spatial extent exhibited by the gauge-based approach are well replicated by the model-based methods. Return period decay with distance -and the scaling of this decay with increasing extremity-are similar across input data sources. Equally, the extent of flood events given their return period-and how the extent increases with increasing extremity-appear consistent too.
The final step in the comparison of synthetic events generated by different input flow data involves a comparison of losses when these events are cascaded through a hydraulic model to produce inundation depths arising from each event, which, in turn, are intersected with building and vulnerability information to generate losses. The vulnerability functions are probabilistic, meaning they relate a given inundation depth to an array of possible flood damages (following Wing et al., 2020). As a result, each synthetic event has a probability distribution of possible damages: The upper and lower quartiles of this are plotted in Figure 11 (the most extreme 20% of events are shown). Comparing model-and gauge-based approaches, losses are virtually indistinguishable for CaMa-Flood and PCR-GLOBWB. Even for GloFAS, the median event loss never falls outside of the 50% confidence interval, though the assertion of events being too large is evident in the event-loss curve. For the most damaging events, losses appear to converge. This perhaps indicates that the largest possible flood events are captured by the behavior inherent to simulated flow time series or that there is some constraint in the physical reality of a synthetic event imparted by the hydraulics.

Water Resources Research
From Table S4, the 99th quantile event-one of the most extreme simulated-generates an average of $5.4 billion of damage according to the gauge-based approach, but the central 50% of losses given vulnerability uncertainty ranges between $1.8 and $9.5 billion. The median 99th quantile loss from model-based approaches are $6.2, $4.8, and $5.6 billion, while the central 50% of losses are $2.1-10.8, $1.6-8.5, and $1.8-9.8 billion, for CaMa-Flood, PCR-GLOBWB, and GloFAS, respectively. It is evident, then, that all flow data sources when used as input to a stochastic flood risk model produce losses to a nationwide inventory of U.S. buildings that are within the uncertainty inherent in the characterization of flood vulnerability. While this uncertainty is very large, preceding tests incorporating only the physical modeling further evidence the correspondence between model-and gauge-driven approaches.

Conclusions
The proliferation of global hydrological models in recent years has enabled a step change in our understanding of the response of the terrestrial water cycle to meteorology and climate, though their defensible application in high-resolution flood risk frameworks which capture the spatial structure of real flood events has yet to be evidenced. In this study, we use simulated and observed discharge time series as input to a U.S.-wide conditional exceedance model based on Heffernan and Tawn (2004) and Quinn et al. (2019) to generate tens of millions of plausible extreme flood events. We find that, while the absolute simulated extreme flows may be unreliable for global hydrological models, the relative simulated flow exceedances largely resemble flood events based on observations from gauge data. Using a relatively loose definition of "an event footprint" (sites experiencing >5-year flow simultaneously), errors in the quantification of site-to-site correlations are generally less than 10%pts. Where events are defined more specifically (e.g., sites experiencing >100-year flow), errors are generally below 20%pts for all return periods. Co-occurrence quantification errors are generally higher for larger rivers, in mountainous areas, and in arid climates, while errors are lower for small streams, on plains, and in humid climates. These location-level errors are mostly mitigated (positive and negative biases cancel) in aggregate, as evidenced by the similar number of location-pairs which fall within subjective bins describing the magnitude of their correlation across all input flow data sources. The relationships between extremity and event size exhibited by the gauge-generated flood events are also broadly replicated by the model-based approaches. In their application within a catastrophe modeling framework to a portfolio of >100M buildings, risk quantifications are largely indistinguishable regardless of the input flow data source. To reiterate, any of these global hydrological models appear fit-for-purpose in this context.
It is important to note that conclusions drawn over the United States may not be directly transferrable globally. The climate reanalysis forcing these hydrological models is likely of higher quality over the United States than globally, since the richness of meteorological observations here will have been assimilated into the global product. Further, the effect of model calibration to streamflow observations may also limit the applicability of these results globally, since the United States has a dense network of river gauges with which to tune model parameters. That said, both CaMa-Flood and PCR-GLOBWB are uncalibrated and generally outperformed the calibrated GloFAS in this analysis of spatial dependence characterization. This is, perhaps, a promising conclusion for the prospect of obtaining comparable skill in the majority of world regions where rigorous calibration is not possible: Event footprints may be largely determined by the climate and topology of the river network rather than model parameterization. It is also worth noting that these U.S.-only results may, in some cases, be an understatement of global hydrological model skill in the context of spatial correlation. U.S. rivers are among the most intensively engineered and managed in the world. The presence and operation of dams and other flow control structures presents a formidable modeling challenge, especially for a global hydrological model. Rivers in other parts of the globe (certainly in the developing world), in general, have flow regimes much more akin to their natural behavior and thus may be simulated more accurately by global models that lack these local anthropogenic details. Further, the United States consists of diverse climatological regions with complex hydrological responses to meteorological events, including the arid and semiarid western United States; topographically complex Rocky, Appalachian, Sierra Nevada, and Cascade mountain ranges; snow-dominated regions in the Rockies and the northern United States; atmospheric river-induced rainfall in the west; and hurricane-induced rainfall in the southeast and east. These characteristics mean many parts of the United States are highlighted as areas of significant bias in evaluations of global climate forcing data (Beck et al., 2017), perhaps indicating some other regions of the globe may be modeled with greater fidelity. In summary, it is likely that different results may be obtained in different regions of the world, and we cannot yet quantify how representative these U.S. results may be of global performance.
To this end, a replication of this analysis in similarly well-instrumented Europe will be a crucial addition to the results presented here. Equally, examinations outside of Europe and North America where suitable river gauge data are available will constrain the accuracy of such an approach amidst true discharge data paucity. Further work must also examine a computationally tractable discretization of sampling points from hydrologically modeled flow time series. In this analysis-for the purposes of intercomparison-time series were sampled from USGS gauge locations. For operational use in a global catastrophe model, sampling for integration into a H&T-like model will need to be at locations which minimize the heterogeneity of return period upstream while maintaining computational feasibility for the calculation of pairwise dependence. The results from this analysis indicate, for the first time, the suitability of global hydrological models in the characterization of extreme flow dependence. This paves the way for the development of global stochastic flood risk models, which properly simulate and capture the physical structure of real events, to more accurately constrain the risk of global populations and assets to extreme flooding.