Volume 57, Issue 9 e2020WR029001
Commentary
Open Access

The Abuse of Popular Performance Metrics in Hydrologic Modeling

Martyn P. Clark (corresponding author), Centre for Hydrology, University of Saskatchewan, Canmore, AB, Canada ([email protected])
Richard M. Vogel, Tufts University, Medford, MA, USA
Jonathan R. Lamontagne, Tufts University, Medford, MA, USA
Naoki Mizukami, National Center for Atmospheric Research, Boulder, CO, USA
Wouter J. M. Knoben, Centre for Hydrology, University of Saskatchewan, Canmore, AB, Canada
Guoqiang Tang, Centre for Hydrology, University of Saskatchewan, Canmore, AB, Canada
Shervan Gharari, Centre for Hydrology, University of Saskatchewan, Saskatoon, SK, Canada
Jim E. Freer, Centre for Hydrology, University of Saskatchewan, Canmore, AB, Canada
Paul H. Whitfield, Centre for Hydrology, University of Saskatchewan, Canmore, AB, Canada
Kevin R. Shook, Centre for Hydrology, University of Saskatchewan, Saskatoon, SK, Canada
Simon Michael Papalexiou, Centre for Hydrology, University of Saskatchewan, Saskatoon, SK, Canada

First published: 02 August 2021

Abstract

The goal of this commentary is to critically evaluate the use of popular performance metrics in hydrologic modeling. We focus on the Nash-Sutcliffe Efficiency (NSE) and the Kling-Gupta Efficiency (KGE) metrics, which are both widely used in hydrologic research and practice around the world. Our specific objectives are: (a) to provide tools that quantify the sampling uncertainty in popular performance metrics; (b) to quantify sampling uncertainty in popular performance metrics across a large sample of catchments; and (c) to prescribe the further research that is needed to improve the estimation, interpretation, and use of popular performance metrics in hydrologic modeling. Our large-sample analysis demonstrates that there is substantial sampling uncertainty in the NSE and KGE estimators. This occurs because the probability distribution of squared errors between model simulations and observations has heavy tails, meaning that performance metrics can be heavily influenced by just a few data points. Our results highlight obvious (yet ignored) abuses of performance metrics that contaminate the conclusions of many hydrologic modeling studies: It is essential to quantify the sampling uncertainty in performance metrics when justifying the use of a model for a specific purpose and when comparing the performance of competing models.

Key Points

  • We provide tools to quantify the sampling uncertainty in the Nash-Sutcliffe Efficiency (NSE) and Kling-Gupta Efficiency (KGE) metrics

  • Our large-sample analysis demonstrates that there is substantial sampling uncertainty in the estimates of NSE and KGE

  • We prescribe further research to improve the estimation, interpretation, and use of system-scale performance metrics in hydrologic modeling

1 Introduction

A performance metric summarizes the accuracy of a model. In hydrologic modeling, system-scale performance metrics are typically based on the differences between simulated and observed streamflow at the catchment outlet. The most popular system-scale performance metrics in hydrologic modeling are the Nash-Sutcliffe Efficiency (NSE; Nash & Sutcliffe, 1970) and the Kling-Gupta Efficiency (KGE; Gupta et al., 2009). System-scale performance metrics are widely used as an objective function in model calibration, to justify the use of a model for a specific purpose, and to compare competing models.

The use of performance metrics is constrained by their substantial sampling uncertainty (Lamontagne et al., 2020; Newman, Clark, Sampson, et al., 2015). Such sampling uncertainty can make it difficult to justify the use of a model for specific applications or to compare competing models. For example, NSE and KGE have historically been used to define a "good" model, for example, a model with NSE (or KGE) scores above an arbitrarily defined threshold (e.g., see Beven & Binley, 1992; Moriasi et al., 2015). It is uncommon to consider the sampling uncertainty in system-scale metrics when classifying a model as "good" and justifying its use for a specific application. Similarly, it is uncommon to consider the sampling uncertainty in performance metrics when comparing alternative models or during optimization. Given these limitations, it is possible that the selection of models using these metrics cannot be supported, and the conclusions drawn from them may be suspect.

The purpose of this commentary is to critically evaluate performance metrics that are habitually used in hydrologic modeling. Our specific objectives are three-fold: (a) provide tools to quantify the sampling uncertainty in performance metrics; (b) quantify the sampling uncertainty in the popular performance metrics across a large sample of catchments; (c) prescribe further research that is needed to improve the estimation, interpretation, and use of performance metrics in hydrologic modeling. Our overall intent is to highlight the obvious (yet ignored) abuses of system-scale performance metrics that contaminate the conclusions from many hydrologic modeling studies.

The remainder of this paper is organized as follows. Section 2 reviews the development of model performance metrics commonly used in hydrologic modeling. Section 3 introduces the database of existing hydrologic model simulations used in this study. Sections 4 and 5 present the results and discussion. Section 6 summarizes the main conclusions of this study.

2 Review of System-Scale Performance Metrics

We examine both the theoretical properties of the Mean Squared Error (MSE), the NSE, and the KGE, as well as their estimation from actual data. We use standard statistical notation where hats denote the sample estimators of theoretical statistics; that is, $\widehat{\mathrm{MSE}}$, $\widehat{\mathrm{NSE}}$, and $\widehat{\mathrm{KGE}}$ define the sample estimators of the theoretical MSE, NSE, and KGE statistics. This distinction is necessary to separate the theoretical properties of performance metrics, which do not depend on data, from their sample estimators, which depend on the characteristics of the data in a given modeling application, such as skewness, coefficient of variation, periodicity, persistence, and outliers (Lamontagne et al., 2020).

The MSE, NSE and KGE statistics can be summarized as follows. The MSE is the single most widely used performance metric in the fields of signal processing (Wang & Bovik, 2009) and statistics in general (see Everitt, 2002). The NSE is simply a normalized variant of the MSE (see Equation 6 below). The development of KGE was motivated by algebraic decompositions of the MSE into bias, variance, and correlation components. KGE is only loosely related to NSE and thus MSE, with a complex relationship between NSE and KGE that depends on several factors. For general cases, the relationship between NSE and KGE depends on the coefficient of variation (CV) of the observations (see Equation A1 or sample-based examples for various values of CV in Figure A1 in Knoben et al., 2019, or Equation 10 in Lamontagne et al., 2020). In the special case of unbiased models, the relationship between NSE and KGE still remains complex (e.g., see Figure 1 and Equation 12 of Lamontagne et al., 2020). Lamontagne et al. (2020, Section 3) document the unusual conditions under which NSE and KGE are equivalent.

2.1 Mean Squared Error (MSE)

The MSE is a metric that evaluates the goodness of fit between model simulations and observations (Fisher, 1920). The MSE is defined as
$$\mathrm{MSE} = E\left[\left(X_s - X_o\right)^2\right] \qquad (1)$$
where $E[\cdot]$ is the expectation operator, and the random variables $X_s$ and $X_o$ define the time series of the model simulations and observations. Once data are introduced, the MSE metric can be estimated from a sample of $n$ pairs of model simulations and observations:
$$\widehat{\mathrm{MSE}} = \frac{1}{n}\sum_{t=1}^{n}\left(x_{s,t} - x_{o,t}\right)^2 \qquad (2)$$
where $x_{s,t}$ and $x_{o,t}$ define the model simulations and observations for time step $t$. Note that the lower-case values in Equation 2, $x_{s,t}$ and $x_{o,t}$, denote sample realizations of the theoretical random variables $X_s$ and $X_o$.
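
As a concrete illustration, the estimator in Equation 2 is a one-line computation. The sketch below is ours (not from the paper's code); `sim` and `obs` are assumed to be equal-length NumPy arrays of paired daily streamflow values.

```python
# Minimal sketch of the sample MSE estimator in Equation 2.
import numpy as np

def mse_hat(sim: np.ndarray, obs: np.ndarray) -> float:
    """Mean squared error over n simulation-observation pairs."""
    return float(np.mean((sim - obs) ** 2))
```
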
The expectation in Equation 1 can be expanded to (e.g., see Lamontagne et al., 2020)
$$\mathrm{MSE} = \left(\mu_s - \mu_o\right)^2 + \left(\sigma_s - \sigma_o\right)^2 + 2\sigma_s\sigma_o\left(1 - \rho\right) \qquad (3)$$
where $\mu_s$ and $\mu_o$ denote the means of the random variables $X_s$ and $X_o$, $\sigma_s^2$ and $\sigma_o^2$ denote the variances of $X_s$ and $X_o$, and $\rho$ defines the Pearson correlation between $X_s$ and $X_o$. The expansion in Equation 3 was previously derived by Murphy (1988) using sample estimators of the various terms, rather than their population values.
Equation 3, as defined in Murphy (1988), is algebraically identical to Equation 5 in Gupta et al. (2009). Expanding the squared difference in standard deviations as
$$\left(\sigma_s - \sigma_o\right)^2 = \sigma_s^2 + \sigma_o^2 - 2\sigma_s\sigma_o \qquad (4)$$
and substituting Equation 4 in 3, the MSE metric can be written as
$$\mathrm{MSE} = \left(\mu_s - \mu_o\right)^2 + \sigma_s^2 + \sigma_o^2 - 2\rho\sigma_s\sigma_o \qquad (5)$$

Equation 5 provides an algebraic decomposition of the MSE that includes the bias in the mean (the first term), the variances (the second and third terms), and the covariance (the final term). Note from Equation 5 that the algebraic decomposition of the MSE is not particularly effective because the variance and covariance terms are not independent of one another (see also Gupta et al., 2009; Mizukami et al., 2019).
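
The algebraic identity between Equations 1, 3, and 5 is easy to verify numerically. The following sketch uses synthetic data and our own variable names; it assumes the population (1/n) moment estimators, for which the identity is exact.

```python
# Numerical check (ours, not from the paper's code) that the moment-based
# decompositions in Equations 3 and 5 reproduce the direct MSE estimate.
import numpy as np

rng = np.random.default_rng(0)
obs = rng.gamma(shape=2.0, scale=3.0, size=1000)   # synthetic "observations"
sim = 0.8 * obs + rng.normal(0.0, 1.0, size=1000)  # synthetic "simulations"

mu_s, mu_o = sim.mean(), obs.mean()
sd_s, sd_o = sim.std(), obs.std()                  # 1/n standard deviations
rho = np.corrcoef(sim, obs)[0, 1]                  # Pearson correlation

mse_direct = np.mean((sim - obs) ** 2)
mse_eq3 = (mu_s - mu_o) ** 2 + (sd_s - sd_o) ** 2 + 2 * sd_s * sd_o * (1 - rho)
mse_eq5 = (mu_s - mu_o) ** 2 + sd_s ** 2 + sd_o ** 2 - 2 * rho * sd_s * sd_o

print(np.allclose(mse_direct, mse_eq3), np.allclose(mse_eq3, mse_eq5))  # True True
```
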

2.2 The Nash-Sutcliffe Efficiency (NSE)

The $\widehat{\mathrm{NSE}}$ is an estimator of a standardized skill score that measures the fractional improvement over a benchmark. The theoretical version of the NSE is
$$\mathrm{NSE} = 1 - \frac{\mathrm{MSE}}{\sigma_o^2} \qquad (6)$$
The algebraic decomposition of the NSE can be derived by making use of the decomposition in Equation 3. Substituting Equation 3 into 6 provides a decomposition of the NSE
$$\mathrm{NSE} = \rho^2 - \left(\rho - \frac{\sigma_s}{\sigma_o}\right)^2 - \left(\frac{\mu_s - \mu_o}{\sigma_o}\right)^2 \qquad (7)$$
Equation 7 corresponds to the estimator version in Murphy (1988, his Equation 11), and is identical to the "new" decomposition of NSE presented by Gupta et al. (2009) in their Equation 4, that is,
$$\mathrm{NSE} = 2\alpha\rho - \alpha^2 - \beta_n^2 \qquad (8)$$
where $\alpha = \sigma_s/\sigma_o$ and $\beta_n = (\mu_s - \mu_o)/\sigma_o$. As in Equation 5, the algebraic decomposition of the NSE is limited because the variance and correlation terms cannot be separated cleanly.
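
A short sketch (ours, with illustrative names) makes the equivalence of Equations 6 and 8 concrete: computing NSE directly from the squared errors and from the decomposition gives identical values when the same 1/n moment estimators are used throughout.

```python
# Sketch: the NSE estimator (Equation 6) versus its decomposition (Equation 8).
import numpy as np

def nse_two_ways(sim: np.ndarray, obs: np.ndarray) -> tuple:
    sd_s, sd_o = sim.std(), obs.std()          # 1/n standard deviations
    rho = np.corrcoef(sim, obs)[0, 1]          # Pearson correlation
    alpha = sd_s / sd_o                        # variability ratio
    beta_n = (sim.mean() - obs.mean()) / sd_o  # sigma_o-normalized bias
    nse_direct = 1.0 - np.mean((sim - obs) ** 2) / sd_o ** 2
    nse_decomp = 2 * alpha * rho - alpha ** 2 - beta_n ** 2
    return nse_direct, nse_decomp              # equal up to rounding
```
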

2.3 The Kling-Gupta Efficiency (KGE)

The KGE metric differs from the NSE metric in that it is not derived from the MSE; the KGE is simply the Euclidean distance computed using the coordinates of bias, standard deviation, and correlation (Gupta et al., 2009). The theoretical version of the KGE metric is
$$\mathrm{KGE} = 1 - \sqrt{\left(\beta - 1\right)^2 + \left(\alpha - 1\right)^2 + \left(\rho - 1\right)^2} \qquad (9)$$
where $\beta = \mu_s/\mu_o$. Note that the definition of $\beta$ in Equation 9 is different from the definition of $\beta_n$ in Equation 8. The bias terms are related as $\beta_n = (\beta - 1)/\mathrm{CV}_o$ (Knoben et al., 2019), where $\mathrm{CV}_o = \sigma_o/\mu_o$ is the coefficient of variation of the observations.
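
For completeness, a sketch of the KGE estimator in Equation 9, again with our own illustrative names:

```python
# Sketch of the KGE estimator (Equation 9).
import numpy as np

def kge_hat(sim: np.ndarray, obs: np.ndarray) -> float:
    beta = sim.mean() / obs.mean()           # bias ratio
    alpha = sim.std() / obs.std()            # variability ratio
    rho = np.corrcoef(sim, obs)[0, 1]        # linear correlation
    return 1.0 - np.sqrt((beta - 1) ** 2 + (alpha - 1) ** 2 + (rho - 1) ** 2)
```
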

3 Data and Methods

3.1 Large-Sample Model Simulations for the CAMELS Catchments

In this study we analyze hydrologic model simulations from a large sample of catchments across the contiguous USA (Figure 1). Our analysis uses existing hydrologic model simulations from the Variable Infiltration Capacity model (VIC version 4.1.2h) applied to the 671 catchments in the CAMELS data set (Catchment Attributes and MEteorology for Large-sample Studies). Mizukami et al. (2019) provide details on the large-sample VIC configuration; Newman, Clark, Sampson, et al. (2015) and Addor et al. (2017) provide details on the hydrometeorological and physiographical characteristics of the CAMELS catchments. The CAMELS catchments are those with minimal human disturbance (i.e., minimal land use changes or disturbances, minimal water withdrawals), and are hence almost exclusively smaller, headwater-type catchments (median basin size of 336 km2).

Figure 1. Location and mean elevation of the catchments in the CAMELS data set.

The calibration and evaluation procedure used by Mizukami et al. (2019) is as follows. The VIC model is forced using the daily basin-average meteorological data described by Maurer et al. (2002) and calibrated and evaluated using streamflow data obtained from the USGS National Water Information System server (http://waterdata.usgs.gov/usa/nwis/sw). The VIC model is calibrated using the dynamically dimensioned search (DDS, Tolson & Shoemaker, 2007) algorithm. In each of the 671 CAMELS catchments, the VIC model is calibrated separately for NSE and KGE (Mizukami et al., 2019). The hydrometeorological data are split into a calibration period (October 1, 1999–September 30, 2008) and an evaluation period (October 1, 1989–September 30, 1999), with a prior 10-year warm-up period. To maximize the sample size in our analysis, we analyze $\widehat{\mathrm{NSE}}$ and $\widehat{\mathrm{KGE}}$ computed over the combined 19-year calibration and evaluation period (October 1, 1989–September 30, 2008).

3.2 Analysis of the Influence of Individual Data Points

The uncertainties in system-scale performance metrics may be large because the estimates are shaped by a small fraction of the simulation-observation pairs (Clark et al., 2008; Fowler et al., 2018; Lamontagne et al., 2020; McCuen et al., 2006; Newman, Clark, Sampson, et al., 2015; Wright et al., 2019); that is, a small number of simulation-observation pairs have a disproportionate influence on performance metrics. In particular, there is enormous sampling variability associated with streamflow statistics in arid regions (see also Ye et al., 2021). The influence of individual data points can be quantified by successively deleting observations and evaluating their impact on a statistic of interest (e.g., see Efron, 1992; Hampel et al., 1986)—such methods are commonly used in applications of the Jackknife method.

It is straightforward and intuitive to calculate the influence of individual data points on the $\widehat{\mathrm{MSE}}$ estimates. Let $e_t = (x_{s,t} - x_{o,t})^2$ be the squared difference between simulations $x_{s,t}$ and observations $x_{o,t}$ for a given time step $t$, and let $e_{(1)} \leq e_{(2)} \leq \ldots \leq e_{(n)}$ be the ranked values of squared errors for all time steps, where $e_{(1)}$ and $e_{(n)}$ are respectively the smallest and largest errors. The influence of the $k$ largest errors on the $\widehat{\mathrm{MSE}}$ estimates, $f_{\mathrm{MSE}}(k)$, is simply
$$f_{\mathrm{MSE}}(k) = \frac{\sum_{t=n-k+1}^{n} e_{(t)}}{\sum_{t=1}^{n} e_{(t)}} \qquad (10)$$
where $1 \leq k \leq n$. Equation 10 is used in two ways: first, we set $k = 10$ to quantify the influence of the 10 days with the largest errors on the $\widehat{\mathrm{MSE}}$ estimates; second, we identify the $k$ largest errors that jointly contribute 50% of the $\widehat{\mathrm{MSE}}$ estimate.
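
The two uses of Equation 10 translate directly into code. The sketch below is ours; it ranks the squared errors and reports (a) the fraction of the total contributed by the k largest errors and (b) the smallest k that reaches half of the total.

```python
# Sketch of Equation 10 and its two uses in this study.
import numpy as np

def f_mse(sim: np.ndarray, obs: np.ndarray, k: int = 10) -> float:
    """Fraction of the sum of squared errors contributed by the k largest errors."""
    e = np.sort((sim - obs) ** 2)            # ranked errors e_(1) <= ... <= e_(n)
    return float(e[-k:].sum() / e.sum())

def k_for_half(sim: np.ndarray, obs: np.ndarray) -> int:
    """Smallest k whose largest errors jointly contribute 50% of the total."""
    e = np.sort((sim - obs) ** 2)[::-1]      # largest errors first
    return int(np.argmax(np.cumsum(e) / e.sum() >= 0.5)) + 1
```
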

3.3 Quantifying Uncertainties in the $\widehat{\mathrm{NSE}}$ and $\widehat{\mathrm{KGE}}$ Estimates

It is particularly important to quantify the sampling uncertainty in model performance metrics when the error distributions exhibit heavy tails, as is the case with the errors obtained from daily streamflow simulations. Parallels to this problem exist in the meteorological community, where it is common to quantify the uncertainty in the performance or skill metrics used to describe probabilistic forecasts of rare events (e.g., Bradley et al., 2008; Jolliffe, 2007).

Some attractive approaches to quantify sampling uncertainty are based on the bootstrap (e.g., Vogel & Shallcross, 1996), because they are relatively easy to implement and understand, and because they replace complex theoretical statistical methods with simple brute-force computations (see Appendix A). Clark and Slater (2006) used bootstrap methods to quantify uncertainties in the performance metrics that they used to evaluate probabilistic estimates of precipitation extremes. Bootstrap methods have also been used to quantify the uncertainty in NSE estimates (Ritter & Muñoz-Carpena, 2013). Bootstrap methods are likely to find increasing use in hydrology due to the ease with which they can be applied compared to more complex methods. Given their simplicity, it is surprising how rarely the bootstrap has been used in hydrology.

The sampling uncertainty in the $\widehat{\mathrm{NSE}}$ and $\widehat{\mathrm{KGE}}$ estimates is quantified using a mixture of Jackknife and Bootstrap methods. First, we use the Jackknife and Bootstrap methods to compute the standard error in the $\widehat{\mathrm{NSE}}$ and $\widehat{\mathrm{KGE}}$ estimates. These methods resample from the original data sample using the Non-overlapping Block Bootstrap (NBB) strategy of Carlstein (1986), using data blocks of length one year. The use of data blocks of length one year reduces the issues with substantial seasonal non-stationarity in shorter data blocks, while preserving the within-year autocorrelation and seasonal periodicity of streamflow series. Bootstrapping methods are only effective if the blocks used are approximately independent. Second, we use the Bootstrap methods to compute tolerance intervals for the $\widehat{\mathrm{NSE}}$ and $\widehat{\mathrm{KGE}}$ estimates, where the 90% tolerance intervals are defined as the difference between the 95th and 5th percentiles of the empirical probability distribution of the $\widehat{\mathrm{NSE}}$ and $\widehat{\mathrm{KGE}}$ estimates. Tolerance intervals differ from confidence intervals, because tolerance intervals are intervals corresponding to a random variable, rather than random confidence intervals around some true value. These bootstrap tolerance intervals are computed using 1,000 bootstrap samples. Finally, we use the Jackknife-After-Bootstrap method (Efron, 1992) to estimate the standard error in the Bootstrap tolerance intervals, which enables us to evaluate how sensitive the resulting uncertainty intervals are to individual years (blocks). The implementation details of the uncertainty quantification methods discussed above are summarized in Appendix A; the open-source "gumboot" package has been developed to quantify the sampling uncertainty in performance metrics (https://github.com/CH-Earth/gumboot; https://cran.r-project.org/package=gumboot).
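
The actual implementation is the R package "gumboot" cited above; the Python sketch below is only an illustration of the NBB procedure under stated assumptions: `years` labels each daily pair with its water year, and `metric` is any function of (sim, obs), such as the kge_hat() sketch in Section 2.3.

```python
# Illustrative non-overlapping block bootstrap with one-year blocks:
# resample water years with replacement, recompute the metric, and report
# the 5th-95th percentile tolerance interval (cf. Appendix A).
import numpy as np

def nbb_tolerance_interval(sim, obs, years, metric, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    blocks = np.unique(years)                      # e.g., the 19 water years
    stats = np.empty(n_boot)
    for b in range(n_boot):
        sampled = rng.choice(blocks, size=blocks.size, replace=True)
        idx = np.concatenate([np.flatnonzero(years == y) for y in sampled])
        stats[b] = metric(sim[idx], obs[idx])
    p05, p95 = np.percentile(stats, [5.0, 95.0])
    return p05, p95, stats                         # interval width = p95 - p05
```
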

It is important to note that the methods implemented here quantify the sampling uncertainty in the $\widehat{\mathrm{NSE}}$ and $\widehat{\mathrm{KGE}}$ estimates for a given hydrologic model and a given sample of streamflow observations. The model itself will contain uncertainty (e.g., uncertainty in the meteorological inputs; uncertainty in the model parameters and model structure). The observations used to compute the $\widehat{\mathrm{NSE}}$ and $\widehat{\mathrm{KGE}}$ estimates also contain uncertainty, especially for the high flow extremes that can have a large influence on the $\widehat{\mathrm{NSE}}$ and $\widehat{\mathrm{KGE}}$ estimates. The model and data uncertainty are not explicitly included in the estimates of sampling uncertainty (we will return to this point in Section 5.3).

4 Results

The probability distribution of squared errors between model simulations and observations has heavy tails, meaning that the estimates of sum-of-squared error statistics can be heavily influenced by a small fraction of the simulation-observation pairs (Clark et al., 2008; Fowler et al., 2018; Lamontagne et al., 2020; Newman, Clark, Sampson, et al., 2015). To document this issue, Figure 2 uses Equation 10 to quantify the influence of the $k$ largest errors on the $\widehat{\mathrm{MSE}}$ estimates, repeating the analysis of Newman, Clark, Sampson, et al. (2015) with the VIC model. Figure 2a quantifies the influence of the 10 individual days with the largest errors on the $\widehat{\mathrm{MSE}}$ estimates, demonstrating that, in many catchments, 10 days in the 19-year period contribute over 50% of the sum-of-squared errors between simulated and observed streamflow. Figure 2b identifies the $k$ largest errors that jointly contribute 50% of the $\widehat{\mathrm{MSE}}$ estimate, expressed as a percentage of the total sample length $n$. Figure 2b demonstrates that, in many catchments, 50% of the sum-of-squared errors is caused by less than 0.5% of the simulation-observation pairs. These results suggest that there will be large uncertainty in the $\widehat{\mathrm{NSE}}$ and $\widehat{\mathrm{KGE}}$ metrics.

Figure 2. Contribution of a subset of days to the $\widehat{\mathrm{MSE}}$ estimate. The upper plot shows the fraction of the $\widehat{\mathrm{MSE}}$ estimate contributed by the 10 days with the largest errors. The lower plot shows the percentage of days that contribute 50% of the $\widehat{\mathrm{MSE}}$ estimate.

Figure 3 quantifies the uncertainty in $\widehat{\mathrm{NSE}}$ and $\widehat{\mathrm{KGE}}$ across the CAMELS catchments, illustrating considerable uncertainty in both the $\widehat{\mathrm{NSE}}$ and $\widehat{\mathrm{KGE}}$ values. Figure 3 illustrates that the 90% tolerance intervals for both NSE and KGE (as obtained by the bootstrap methods described in Appendix A) are greater than 0.1 for more than half of the CAMELS catchments. The results in Figure 3 illustrate that both the bootstrap and jackknife methods yield consistent standard error estimates. The large uncertainty in $\widehat{\mathrm{NSE}}$ and $\widehat{\mathrm{KGE}}$ is evident regardless of whether NSE or KGE is used as the calibration target.

Figure 3. Estimates of uncertainty in the $\widehat{\mathrm{NSE}}$ and $\widehat{\mathrm{KGE}}$ estimates across the CAMELS catchments. The uncertainty is quantified using standard error estimates (×2) obtained using Jackknife and Bootstrap estimates (see Appendix A for implementation details), along with tolerance intervals computed as the difference between the 95th and 5th percentiles of the Bootstrap samples. Results are shown for calibrations obtained by maximizing the NSE metric (upper plots) and by maximizing the KGE metric (lower plots).

The jackknife-after-bootstrap methods enable an evaluation of the degree of precision and accuracy associated with the bootstrap tolerance intervals. While there is considerable sampling uncertainty in the tolerance intervals (estimated using the jackknife-after-bootstrap methods; Figure 4), that uncertainty is considerably smaller than the uncertainty associated with $\widehat{\mathrm{NSE}}$ and $\widehat{\mathrm{KGE}}$ shown in Figure 3. As we discuss in the next section, the sampling uncertainty depicted in Figure 3 may be under-estimated in situations where there is extremely high skewness in daily streamflows.

Figure 4. Standard error in the Bootstrap tolerance intervals shown in Figure 3. The standard error in the Bootstrap tolerance intervals is estimated using the jackknife-after-bootstrap method of Efron (1992), as summarized in Appendix A. Results are shown for calibrations obtained by maximizing the NSE metric (upper plots) and by maximizing the KGE metric (lower plots).

5 Discussion

5.1 It Is Necessary to Quantify the Uncertainty in Performance Metrics

The high uncertainty associated with the estimators $\widehat{\mathrm{NSE}}$ and $\widehat{\mathrm{KGE}}$ underscores the need to quantify the uncertainty in the performance metric estimators used in hydrologic modeling applications. Quantifying the sampling uncertainty in model evaluation statistics is easily accomplished using appropriate bootstrap methods. Moreover, bootstrap methods can be applied to any performance metric estimator. Quantifying the uncertainty in performance metric estimators should arguably become a routine part of the hydrologic modeling enterprise. As our results show, the width of the 90% tolerance intervals associated with the estimators $\widehat{\mathrm{NSE}}$ and $\widehat{\mathrm{KGE}}$ is greater than 0.1 in at least half of the analyzed catchments. Such wide 90% tolerance intervals indicate considerable uncertainty associated with each of these metrics. These results imply that the conclusions from many hydrologic modeling studies may not be justified in light of the high sampling uncertainty in system-scale performance metric estimators.

In spite of the ease with which the bootstrap may be applied as a post-processing approach to developing uncertainty intervals, there is a need for additional research on methods to quantify the sampling uncertainty. Our experiments (not shown) demonstrate that traditional bootstrap methods may severely under-estimate the sampling uncertainty in the estimators $\widehat{\mathrm{NSE}}$ and $\widehat{\mathrm{KGE}}$ in situations where there is extremely high skewness (see also Chernick & LaBudde, 2011). These under-estimates in uncertainty occur because bootstrap methods "recycle" the observations, and the bootstrap samples do not adequately encapsulate the uncertainty associated with the few extraordinary errors in the thick upper tail of the error distribution. Indeed, our Jackknife-after-Bootstrap analyses demonstrate that there are large standard errors in our bootstrap estimates of uncertainty in $\widehat{\mathrm{NSE}}$ and $\widehat{\mathrm{KGE}}$. Thus, given the extremely high skewness of daily streamflow observations in some watersheds, we recommend future research which compares the uncertainty intervals derived from various bootstrap methods against the uncertainty intervals derived from more advanced stochastic methods (e.g., Papalexiou, 2018).

5.2 It Is Necessary to Improve the Estimates of System-Scale Performance Statistics

A variety of approaches can be introduced to improve estimates of the theoretical NSE and KGE statistics; that is, to develop more robust estimates $\widehat{\mathrm{NSE}}$ and $\widehat{\mathrm{KGE}}$ that have lower sampling uncertainty. For example, Fowler et al. (2018) calculated the NSE metric separately for each year before averaging across years; Lamontagne et al. (2020) introduced alternative estimators of NSE and KGE based on a bivariate lognormal monthly mixture model. Variance reduction methods introduced in the fields of machine learning and statistics (e.g., Nelson & Schmeiser, 1986) can be used to improve estimates of the theoretical NSE and KGE performance metrics. More generally, the approaches of bagging and bragging could be tested, where the performance metrics are estimated using the median or the mean of multiple bootstrap samples (Berrendero, 2007). Further work is needed to better understand the characteristics of data points that have high leverage in order to devise methods that improve estimates of the theoretical NSE and KGE statistics.

5.3 It Is Necessary to Put Performance Metrics in Context

The growing field of model benchmarking seeks to put performance metrics into context, for example, by asking whether models meet our a-priori expectations, or whether models adequately use the information that is available to them. Recent efforts in model benchmarking have focused on defining lower and upper benchmarks to provide context for model performance (Nearing et al., 2018; Newman et al., 2017; Seibert et al., 2018). Lower benchmarks evaluate the extent to which models surpass expectations (Seibert, 2001), for example, the extent to which model simulations perform better than a benchmark such as climatology, persistence, simulations from another model (Wilks, 2011), or departures from the seasonal cycle (Knoben et al., 2020; Schaefli & Gupta, 2007). A key component of defining the lower benchmark is defining our a-priori expectations of model capabilities. We define the upper benchmark to quantify the predictability of the system, that is, the maximum information content in the forcing-response data (Nearing et al., 2018; Newman et al., 2017). For example, Best et al. (2015) demonstrated that many mechanistic land models were out-performed by simple statistical models, implying that modern land models do not adequately use the information that is available to them. Much work still needs to be done to quantify our expectations for model performance (the lower benchmark) as well as to quantify system-scale predictability (the upper benchmark).

Benchmarking is important in the context of performance metrics because the NSE and KGE embody rather weak a-priori expectations of model performance. The NSE uses the variance of the observations as the benchmark. This means that $\mathrm{NSE} > 0$ if the MSE is smaller than the variance of the observations. In other words, $\mathrm{NSE} > 0$ if the model simulations are better than the reference case where $x_{s,t} = \mu_o$ for all time steps. Knoben et al. (2019) point out that the KGE estimates do not have the same benchmark as NSE estimates: the implied benchmark associated with estimates of NSE, that is, that model simulations are always equal to the observed mean (i.e., $x_{s,t} = \mu_o$), occurs when the estimate of $\mathrm{KGE} = 1 - \sqrt{2}$ (i.e., when the estimate of $\mathrm{KGE} \approx -0.41$). The observed mean is often used as a benchmark with the KGE metric as well, imposing old expectations on a new metric. Using stricter, purpose-specific benchmarks can give a clearer idea of model strengths and weaknesses.

It is also necessary to evaluate system-scale performance metrics in the context of the uncertainties in the model inputs (e.g., spatial meteorological forcing data), the uncertainties in the hydrologic model (e.g., uncertainties in model parameters and model structure), and the uncertainties in the system-scale response (e.g., streamflow observations). Many groups are now developing ensemble spatial meteorological forcing fields in order to understand how uncertainties in the model forcing data affect uncertainties in the hydrologic model simulations (e.g., Clark & Slater, 2006; Cornes et al., 2018; Frei & Isotta, 2019; Newman, Clark, Craig, et al., 2015; Tang, Clark, Papalexiou, et al., 2021). There is also a wealth of approaches to quantify hydrologic model uncertainty. Vogel (2017) introduced the concept of stochastic watershed models (SWMs), which involve methods for generating likely stochastic traces of daily streamflow from deterministic watershed models. All of the methods for developing SWMs reviewed by Vogel (2017), including the very general blueprint introduced by Montanari and Koutsoyiannis (2012), may be employed to develop uncertainty intervals associated with either streamflow predictions or other water resource system variables. There is now also substantial effort dedicated to quantifying uncertainty in streamflow observations (e.g., see the comparison of uncertainty estimation techniques by Kiang et al., 2018, as well as Coxon et al., 2015, and Mansanarez et al., 2019). The key issue is that the most uncertain observations of streamflow are in the upper tail; these observations also have the most influence on the KGE and NSE metrics. Further research is needed to understand how these sources of uncertainty are manifest in system-scale performance metrics.

5.4 It Is Necessary to Understand the Limitations of System-Scale Performance Metrics

It is well known that minimizing the sum-of-squared errors in calibration results in simulated streamflows with smaller variance than the observations (e.g., Gupta et al., 2009). This occurs because of the interplay between estimates of the variance of the flows and correlation in NSE described in Section 2; specifically, the quantity $\alpha$ appears in both the second and third terms in Equation 8, meaning that NSE is maximized when $\alpha = \rho$. This is problematic because optimization studies that minimize the $\widehat{\mathrm{MSE}}$ (or maximize the $\widehat{\mathrm{NSE}}$) result in $\sigma_s < \sigma_o$ because $\rho$ is always smaller than unity. Mizukami et al. (2019) illustrate these issues when using $\widehat{\mathrm{NSE}}$ as an objective function in a large-sample hydrologic model calibration study. They showed that the calibrated simulations substantially under-estimated high flow events, such as the annual peak flows that are used for flood frequency estimation. Underestimation of variance, as well as of all other upper moments, is a general problem associated with simulation models and is not limited to the use of a particular objective function (see Farmer & Vogel, 2016).
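
For fixed correlation $\rho$ and bias $\beta_n$, differentiating the decomposition in Equation 8 with respect to $\alpha$ makes this result explicit:

$$\frac{\partial\,\mathrm{NSE}}{\partial\alpha} = \frac{\partial}{\partial\alpha}\left(2\alpha\rho - \alpha^2 - \beta_n^2\right) = 2\rho - 2\alpha = 0 \quad\Longrightarrow\quad \alpha_{\mathrm{opt}} = \rho < 1$$

so any calibration that maximizes NSE with imperfect correlation necessarily drives $\sigma_s$ below $\sigma_o$.
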

There are also problems with the KGE metric. As discussed by Santos et al. (2018), the definition of the bias term in KGE, $\beta = \mu_s/\mu_o$, can lead to very large values of $\beta$ (and hence low KGE scores) when $\mu_o$ is small. Such problems with amplified $\beta$ values are potentially more pronounced for variables that cross zero (e.g., log-transformed flows, temperature) because $\mu_o$ could be very small. Citing drawbacks of the NSE as justification, part of the community has switched to using KGE over NSE. We argue that this did not solve but only changed the problems related to system-scale performance metrics. It is important to be aware of the theoretical behavior of system-scale performance metrics, along with their limits of applicability, and to use additional metrics that are tailored to suit specific applications.

5.5 It Is Necessary to Use Additional Performance Metrics

A key problem with system-scale performance metrics is that they do not make adequate use of the full information content in the data. Gupta et al. (2008) point out that global calibration of hydrologic models (e.g., using $\widehat{\mathrm{NSE}}$ or $\widehat{\mathrm{KGE}}$ as the objective function) entails compressing the information in the model output and observations into a single performance metric, and then using that single metric to infer values of multiple model parameters and all aspects of hydrological processes. Such global calibration methods can lead to problems of compensatory parameters, providing the "right" results for the wrong reasons (Kirchner, 2006). Specifically, parameters in one part of the model may be assigned unrealistic values that compensate for unrealistic parameter values in another part of the model, or that compensate for errors in the model forcing data and weaknesses in model structure (Clark & Vrugt, 2006). Addressing this problem requires asking a different question: Instead of asking "how good is my model?", it may be more appropriate to ask "what is my model good for?" This second question is more relevant when designing a modeling experiment for a specific application.

One approach is to develop alternative system-scale performance metrics. This includes efforts to develop variants of KGE; for example, Kling et al. (2012) introduced a modified version of KGE, termed $\mathrm{KGE}'$, by using the ratio of the simulated and observed coefficients of variation ($\mathrm{CV}_s/\mathrm{CV}_o$) instead of the ratio of the simulated and observed standard deviations. Their intent is to reduce the impact of bias on the variability term in KGE. Pool et al. (2018) developed alternative estimates of $\rho$ and $\alpha$ for use with KGE, with the intent of reducing the impact of outliers. Note that the Pool et al. (2018) estimates of $\rho$ and $\alpha$ are for different theoretical statistics than the $\rho$ and $\alpha$ statistics that are used in Equations 8 and 9 (see Barber et al., 2019; Lamontagne et al., 2020). Other alternative system-scale metrics include variable transformations, such as the log-transform or Box-Cox transform (to reduce skewness and focus more on low flows), or methods to compare distributions of modeled extremes to observed extremes. The work to develop alternative system-scale performance metrics recognizes that estimates of correlation-based metrics are often inflated, in the sense that high values can occur for mediocre and poor models, and that estimators of correlation-based metrics are sensitive to outliers and data asymmetry (Legates & McCabe, 1999; Willmott, 1981; see also Barber et al., 2019; Mo et al., 2014).

In this context, it is worth pointing out that it is straightforward to redefine the KGE metric to address the problems with amplified $\beta$ values described above. For example, the bias component of the mean in the KGE metric could be represented as $\beta_n = (\mu_s - \mu_o)/\sigma_o$, as it is in the NSE metric. It is hence straightforward to modify the KGE metric such that
$$\mathrm{KGE}'' = 1 - \sqrt{\beta_n^2 + \left(\alpha - 1\right)^2 + \left(\rho - 1\right)^2} \qquad (11)$$
where, as in Equation 7, $\beta_n = (\mu_s - \mu_o)/\sigma_o$. The $\mathrm{KGE}''$ metric has been used by Tang et al. (2021a, 2021b). These modifications to the KGE metric avoid the amplified $\beta$ values when $\mu_o$ is small. Note that since streamflow is constrained to be positive, its zero-bounded structure means that normalizing by $\sigma_o$ will not have the same problems as normalizing by $\mu_o$ in the original KGE or $\mathrm{KGE}'$ metrics.
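
A sketch of the modified metric in Equation 11, with our own illustrative names; the only change relative to the kge_hat() sketch in Section 2.3 is the bias term.

```python
# Sketch of Equation 11: KGE with the sigma_o-normalized bias term beta_n,
# which avoids amplified bias values when the observed mean is near zero.
import numpy as np

def kge_mod(sim: np.ndarray, obs: np.ndarray) -> float:
    beta_n = (sim.mean() - obs.mean()) / obs.std()   # bias normalized by sigma_o
    alpha = sim.std() / obs.std()                    # variability ratio
    rho = np.corrcoef(sim, obs)[0, 1]                # linear correlation
    return 1.0 - np.sqrt(beta_n ** 2 + (alpha - 1) ** 2 + (rho - 1) ** 2)
```
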

Another approach is to use additional non-global metrics (e.g., multiple diagnostic signatures of hydrologic behavior). For example, much of the research on model calibration and evaluation now focuses on multi-criteria methods, including analysis of trade-offs among multiple objective functions (e.g., Fenicia et al., 2007; Yapo et al., 1998), analysis of the temporal variability of model errors (Coxon et al., 2014; Reusser et al., 2009), and scrutinizing diagnostic signatures of hydrologic behavior in order to identify model weaknesses (Gupta et al., 2008; Rakovec et al., 2016). A key part of this analysis is to understand the sensitivity of different non-global metrics to individual parts of a model (e.g., Markstrom et al., 2016; Van Werkhoven et al., 2009). As such, these alternative metrics can focus attention on aspects of the model that may be more relevant for specific modeling applications.

6 Conclusions

The goal of this commentary is to critically evaluate the performance metrics that are habitually used in hydrologic modeling. Our focus is on the Nash-Sutcliffe Efficiency (NSE) and the Kling-Gupta Efficiency (KGE) metrics, which are both widely used in science and applications communities around the world. Our contributions in this paper are three-fold:
  1. We provide tools to enable hydrologic modelers to quantify the sampling uncertainty in system-scale performance metrics. We use the non-overlapping block bootstrap method to obtain probability distributions and associated tolerance intervals of estimates of NSE and KGE, and we use the jackknife-after-bootstrap method to obtain estimates of the standard error of those bootstrap tolerance intervals. These comparisons enable us to confirm that, even though the tolerance intervals display sampling variability, that variability is always considerably smaller than the tolerance intervals themselves, thus validating the precision of the tolerance intervals.

  2. We quantify the sampling uncertainty in system-scale performance metrics across a large sample of catchments. Our results show that the probability distribution of squared errors between model simulations and observations has heavy tails, meaning that the estimates of sum-of-squared error statistics can be shaped by just a few simulation-observation pairs (Figure 2). This leads to substantial uncertainty in the estimators $\widehat{\mathrm{NSE}}$ and $\widehat{\mathrm{KGE}}$ (Figures 3 and 4). The implication of these results is that the conclusions from many hydrologic modeling studies are based on values for these metrics that fall well within the metrics' uncertainty bounds. Such conclusions may thus not be justified.

  3. We define the further research that is needed to improve the estimation, interpretation, and use of system-scale performance metrics in hydrological modeling.

More generally, our commentary highlights the obvious (yet ignored) abuses of performance metrics that contaminate the conclusions of many hydrologic modeling studies. We look forward to additional studies that improve the scientific basis of model evaluation.

Acknowledgments

We appreciate the constructive comments from the four reviewers. Martyn Clark, Wouter Knoben, Guoqiang Tang, Shervan Gharari, Jim Freer, Paul Whitfield, Kevin Shook, and Simon Papalexiou were supported by the Global Water Futures program, University of Saskatchewan.

Appendix A: The Jackknife and Bootstrap Methods

In this study, we use two resampling methods, the Jackknife and the Bootstrap, to estimate the empirical probability distribution of the $\widehat{\mathrm{NSE}}$ and $\widehat{\mathrm{KGE}}$ estimators for each of the 671 CAMELS catchments. These methods estimate the empirical probability distribution of a given statistic by drawing or resampling a number of independent samples from the original sample of data.

The following sub-sections describe the implementation of the Jackknife and Bootstrap methods, including the resampling strategies, the Jackknife and Bootstrap estimates of standard error, and the Jackknife estimates of the standard error in the bootstrap-derived empirical probability distributions of $\widehat{\mathrm{NSE}}$ and $\widehat{\mathrm{KGE}}$.

A1 The Jackknife and Bootstrap Resampling Strategies

The Jackknife method is a structured approach of resampling without replacement where observations are successively deleted from the original sample of data. A Jackknife sample is the data set that remains after deleting the ith observation, or deleting the ith block of observations, that is,
$$\mathbf{y}_{(i)} = \left(y_1, y_2, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n\right) \qquad (A1)$$

The value of the ith Jackknife replicate is the value of the estimator $\hat{\theta}_{(i)} = \hat{\theta}\left(\mathbf{y}_{(i)}\right)$. In our case, the ith Jackknife replicate is $\widehat{\mathrm{NSE}}_{(i)}$ or $\widehat{\mathrm{KGE}}_{(i)}$. The Jackknife method is useful in cases where it is desirable to conduct structured analysis of the deleted-point statistics.

The Bootstrap method is much more flexible than the Jackknife method. The Bootstrap method uses the approach of resampling with replacement. A Bootstrap sample is obtained by using a random number generator to make $n$ independent draws from the original sample of data (Efron & Tibshirani, 1986), that is,
$$\mathbf{y}^{*} = \left(y_1^{*}, y_2^{*}, \ldots, y_n^{*}\right) \qquad (A2)$$
and the process is repeated to generate $B$ samples, that is, $\mathbf{y}^{*1}, \mathbf{y}^{*2}, \ldots, \mathbf{y}^{*B}$. Then, for each sample $\mathbf{y}^{*b}$, we compute the statistic of interest, that is, $\hat{\theta}^{*b} = \hat{\theta}\left(\mathbf{y}^{*b}\right)$. The empirical probability distribution of the statistic of interest can then be calculated using all of the $B$ samples.

When implementing these resampling methods, it is necessary to ensure independence between each draw from the original sample of data (Carlstein, 1986; Künsch, 1989; Vogel & Shallcross, 1996). Specifically, the errors in daily streamflow simulations are characterized by substantial periodicity and persistence; this creates complex temporal dependence structures on time scales from days (e.g., errors in the simulations of recessions after a storm event) to seasons (e.g., errors in the simulations of seasonal snow accumulation and melt, or errors in the seasonal cycle of transpiration). To address these issues, we implement a non-overlapping block resampling strategy that was developed for the Bootstrap method, the Non-overlapping Block Bootstrap (NBB) of Carlstein (1986). This approach identifies $b$ subseries of data of length $l$, where each subseries of data is statistically independent. In our implementation, the $b = 19$ subseries are the 19 water years, where each water year spans the period October 1–September 30 (e.g., water year 1990 is the period October 1, 1989–September 30, 1990).

The non-overlapping block resampling strategy is used for both the Jackknife and Bootstrap methods. The Jackknife sample for a given water year is the data set that remains after deleting the ith water year. For example, $\hat{\theta}_{(2002)} = \hat{\theta}\left(\mathbf{y}_{(2002)}\right)$, where $\mathbf{y}_{(2002)}$ contains the daily time series of simulation-observation pairs for all years except water year 2002, and $\hat{\theta}_{(2002)}$ is the $\widehat{\mathrm{NSE}}$ or $\widehat{\mathrm{KGE}}$ estimate using all daily data except in 2002. The Bootstrap method samples water years with replacement: A given bootstrap sample may include a given water year more than once, or may not include a given water year at all. The Bootstrap samples that do not include a given water year (e.g., all Bootstrap samples without water year 2002) open up opportunities to quantify the standard errors in the Bootstrap estimates of the empirical probability distributions (using the Jackknife-After-Bootstrap method introduced by Efron, 1992; we discuss this implementation in Section A3).
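
In code, the block resampling reduces to index bookkeeping over water years. The sketch below is ours (the paper's implementation is the R package "gumboot"); `year_of` is assumed to be an integer array giving the water year of each daily simulation-observation pair.

```python
# Sketch of the non-overlapping block resampling over water years:
# the Jackknife deletes one water year at a time; the Bootstrap draws
# water years with replacement.
import numpy as np

def jackknife_indices(year_of: np.ndarray):
    """Yield day indices with one water year deleted at a time."""
    for y in np.unique(year_of):
        yield np.flatnonzero(year_of != y)

def bootstrap_indices(year_of: np.ndarray, rng: np.random.Generator):
    """Day indices for one Bootstrap sample of water years."""
    years = np.unique(year_of)
    sampled = rng.choice(years, size=years.size, replace=True)
    return np.concatenate([np.flatnonzero(year_of == y) for y in sampled])
```
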

A2 Jackknife and Bootstrap Estimates of Standard Errors in the $\widehat{\mathrm{NSE}}$ and $\widehat{\mathrm{KGE}}$ Estimates

The Jackknife estimates of standard error can be obtained by first considering the simple case of the sample mean (Efron & Gong, 1983). The average of the ith Jackknife sample, $\bar{x}_{(i)}$, is
$$\bar{x}_{(i)} = \frac{1}{n-1}\sum_{j \neq i} x_j \qquad (A3)$$
where the ith observation $x_i$ is deleted. The standard error of $\bar{x}$ is then (Efron & Gong, 1983)
$$\widehat{\mathrm{se}}_{\mathrm{jack}} = \left[\frac{n-1}{n}\sum_{i=1}^{n}\left(\bar{x}_{(i)} - \bar{x}_{(\cdot)}\right)^2\right]^{1/2} \qquad (A4)$$
with $\bar{x}_{(\cdot)} = \frac{1}{n}\sum_{i=1}^{n}\bar{x}_{(i)}$.
Equation A4 can be extended to compute the standard error for any statistic of interest. If we let $\hat{\theta}_{(i)}$ be the deleted-point value of a given statistic (Efron, 1992), then the Jackknife estimate of the statistic of interest, $\hat{\theta}_{\mathrm{jack}}$, can be defined as
$$\hat{\theta}_{\mathrm{jack}} = n\hat{\theta} - (n-1)\,\hat{\theta}_{(\cdot)} \qquad (A5)$$
where $\hat{\theta}$ is the estimate of the statistic using all observations and $\hat{\theta}_{(\cdot)} = \frac{1}{n}\sum_{i=1}^{n}\hat{\theta}_{(i)}$. The standard error of $\hat{\theta}$ is then
$$\widehat{\mathrm{se}}_{\mathrm{jack}} = \left[\frac{n-1}{n}\sum_{i=1}^{n}\left(\hat{\theta}_{(i)} - \hat{\theta}_{(\cdot)}\right)^2\right]^{1/2} \qquad (A6)$$
The Bootstrap estimate of standard error is more straightforward: It is simply the standard deviation of the Bootstrap samples, that is,
$$\widehat{\mathrm{se}}_{\mathrm{boot}} = \left[\frac{1}{B-1}\sum_{b=1}^{B}\left(\hat{\theta}^{*b} - \hat{\theta}^{*(\cdot)}\right)^2\right]^{1/2} \qquad (A7)$$
where $\hat{\theta}^{*(\cdot)} = \frac{1}{B}\sum_{b=1}^{B}\hat{\theta}^{*b}$.
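
Both estimators reduce to a few lines once the replicate statistics are in hand. A sketch under our naming conventions, where `theta_i` holds the Jackknife replicates (one metric value per deleted water year) and `theta_b` holds the Bootstrap replicates:

```python
# Sketch of the Jackknife (Equation A6) and Bootstrap (Equation A7)
# standard-error estimates from vectors of replicate statistics.
import numpy as np

def jackknife_se(theta_i: np.ndarray) -> float:
    n = theta_i.size
    return float(np.sqrt((n - 1) / n * np.sum((theta_i - theta_i.mean()) ** 2)))

def bootstrap_se(theta_b: np.ndarray) -> float:
    return float(theta_b.std(ddof=1))        # 1/(B-1) normalization, as in A7
```
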

A3 Jackknife Estimates of Standard Error in the Bootstrap-Derived Probability Distributions

The Bootstrap estimates of the empirical probability distributions create a conundrum: whilst outliers can cause large uncertainty in the $\widehat{\mathrm{NSE}}$ or $\widehat{\mathrm{KGE}}$ estimates, the outliers can also create large uncertainty in the Bootstrap estimates of the empirical probability distributions. It is hence necessary to estimate the standard error in the Bootstrap methods.

Estimates of the standard error in the Bootstrap methods can be computed easily using the Jackknife-After-Bootstrap method of Efron (1992). In the previous discussion we noted that the non-overlapping block resampling strategy opens up opportunities to quantify the standard errors in the Bootstrap estimates of the empirical probability distributions. Specifically, for a given water year we can compute a Jackknife sample using all of the Bootstrap samples that do not include that water year. When such Jackknife samples are constructed for all water years, the Jackknife method can be used to estimate the standard error in the Bootstrap estimates (the Jackknife-After-Bootstrap method).

The Jackknife-After-Bootstrap method is implemented as follows (Efron, 1992). Our starting point is the $B$ estimates of the statistic of interest, that is, $\hat{\theta}^{*b},\; b = 1, \ldots, B$, that were computed from the $B$ samples $\mathbf{y}^{*b}$ obtained from the Bootstrapping. Recall that each of the $B$ samples is constructed by making $n$ draws from the original sample of data, that is, $\mathbf{y}^{*b} = \left(y_1^{*b}, \ldots, y_n^{*b}\right)$. Given this information, we can calculate the proportion of each Bootstrap sample that equals a given observation $y_i$, that is (Efron, 1992),
$$P_i^{*b} = \frac{\#\left\{y_t^{*b} = y_i,\; t = 1, \ldots, n\right\}}{n} \qquad (A8)$$
and define the resampling vector for a given observation,
$$\mathbf{P}_i^{*} = \left(P_i^{*1}, P_i^{*2}, \ldots, P_i^{*B}\right) \qquad (A9)$$
It is then straightforward to identify the subset of Bootstrap samples where $P_i^{*b} = 0$ (i.e., the subset of Bootstrap samples that do not include the observation $y_i$) and define the samples of the statistic of interest where $P_i^{*b} = 0$,
$$\hat{\boldsymbol{\theta}}_{(i)}^{*} = \left\{\hat{\theta}^{*b} : P_i^{*b} = 0\right\} \qquad (A10)$$
where $\hat{\boldsymbol{\theta}}_{(i)}^{*} \subseteq \hat{\boldsymbol{\theta}}^{*}$, and $\hat{\boldsymbol{\theta}}^{*} = \left(\hat{\theta}^{*1}, \ldots, \hat{\theta}^{*B}\right)$ is the statistic of interest for all Bootstrap samples. It is then possible to compute statistics from the subset of Bootstrap samples, that is,
$$\hat{\phi}_{(i)} = \phi\left(\hat{\boldsymbol{\theta}}_{(i)}^{*}\right) \qquad (A11)$$
where $\phi$ may be a statistic such as the 5th or 95th percentile.

The Jackknife estimate of standard error uses Equation A6 with $\hat{\phi}_{(i)}$ as the value of the ith Jackknife replicate in place of $\hat{\theta}_{(i)}$.
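
Putting Equations A8-A11 together, a compact sketch of the Jackknife-After-Bootstrap under our assumptions: `theta_b` holds the metric for each Bootstrap sample and `years_drawn[b]` records which water years sample b drew, so the test $P_i^{*b} = 0$ is simply set membership.

```python
# Sketch of the Jackknife-After-Bootstrap: for each water year, take the
# Bootstrap replicates whose samples never drew that year, recompute the
# percentile of interest (phi), and feed the results into Equation A6.
import numpy as np

def jab_se(theta_b: np.ndarray, years_drawn: list, all_years, q: float = 95.0):
    phi = []
    for y in all_years:
        keep = [b for b, drawn in enumerate(years_drawn) if y not in drawn]
        phi.append(np.percentile(theta_b[keep], q))  # assumes keep is non-empty
    phi = np.asarray(phi)
    n = phi.size
    return float(np.sqrt((n - 1) / n * np.sum((phi - phi.mean()) ** 2)))
```
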

Data Availability Statement

The data for the large-domain model simulations are publicly available from the National Center for Atmospheric Research at https://ral.ucar.edu/solutions/products/camels. The source code to quantify the sampling uncertainty in performance metrics (the "gumboot" package) is available at https://github.com/CH-Earth/gumboot.