The Abuse of Popular Performance Metrics in Hydrologic Modeling
Abstract
The goal of this commentary is to critically evaluate the use of popular performance metrics in hydrologic modeling. We focus on the Nash-Sutcliffe Efficiency (NSE) and the Kling-Gupta Efficiency (KGE) metrics, which are both widely used in hydrologic research and practice around the world. Our specific objectives are: (a) to provide tools that quantify the sampling uncertainty in popular performance metrics; (b) to quantify sampling uncertainty in popular performance metrics across a large sample of catchments; and (c) to prescribe the further research that is needed to improve the estimation, interpretation, and use of popular performance metrics in hydrologic modeling. Our large-sample analysis demonstrates that there is substantial sampling uncertainty in the NSE and KGE estimators. This occurs because the probability distribution of squared errors between model simulations and observations has heavy tails, meaning that performance metrics can be heavily influenced by just a few data points. Our results highlight obvious (yet ignored) abuses of performance metrics that contaminate the conclusions of many hydrologic modeling studies: it is essential to quantify the sampling uncertainty in performance metrics when justifying the use of a model for a specific purpose and when comparing the performance of competing models.
Key Points

We provide tools to quantify the sampling uncertainty in the Nash-Sutcliffe Efficiency (NSE) and Kling-Gupta Efficiency (KGE) metrics

Our large-sample analysis demonstrates that there is substantial sampling uncertainty in the estimates of NSE and KGE

We prescribe further research to improve the estimation, interpretation, and use of system-scale performance metrics in hydrologic modeling
1 Introduction
A performance metric summarizes the accuracy of a model. In hydrologic modeling, system-scale performance metrics are typically based on the differences between simulated and observed streamflow at the catchment outlet. The most popular system-scale performance metrics in hydrologic modeling are the Nash-Sutcliffe Efficiency (NSE; Nash & Sutcliffe, 1970) and the Kling-Gupta Efficiency (KGE; Gupta et al., 2009). System-scale performance metrics are widely used as objective functions in model calibration, to justify the use of a model for a specific purpose, and to compare competing models.
The use of performance metrics is constrained by their substantial sampling uncertainty (Lamontagne et al., 2020; Newman, Clark, Sampson, et al., 2015). Such sampling uncertainty can make it difficult to justify the use of a model for specific applications or to compare competing models. For example, NSE and KGE have historically been used to define a “good” model, for example, as a model with NSE (or KGE) scores above an arbitrarily defined threshold (e.g., see Beven & Binley, 1992; Moriasi et al., 2015). It is uncommon to consider the sampling uncertainty in system-scale metrics when classifying a model as “good” and justifying its use for a specific application. Similarly, it is uncommon to consider the sampling uncertainty in performance metrics when comparing alternative models or during optimization. Given these limitations, it is possible that the selection of models using these metrics cannot be supported, and the resulting conclusions may be suspect.
The purpose of this commentary is to critically evaluate performance metrics that are habitually used in hydrologic modeling. Our specific objectives are threefold: (a) provide tools to quantify the sampling uncertainty in performance metrics; (b) quantify the sampling uncertainty in popular performance metrics across a large sample of catchments; and (c) prescribe further research that is needed to improve the estimation, interpretation, and use of performance metrics in hydrologic modeling. Our overall intent is to highlight the obvious (yet ignored) abuses of system-scale performance metrics that contaminate the conclusions from many hydrologic modeling studies.
The remainder of this paper is organized as follows. Section 2 reviews the development of model performance metrics commonly used in hydrologic modeling. Section 3 introduces the database of existing hydrologic model simulations used in this study. Sections 4 and 5 present the results and discussion. Section 6 summarizes the main conclusions of this study.
2 Review of System-Scale Performance Metrics
We examine both the theoretical properties of the Mean Squared Error (MSE), the NSE, and the KGE, as well as their estimation from actual data. We use standard statistical notation in which hats denote the sample estimators of theoretical statistics; that is, the hatted quantities define the sample estimators of the theoretical MSE, NSE, and KGE statistics. This distinction is necessary to separate the theoretical properties of performance metrics, which do not depend on data, from their sample estimators, which depend on the characteristics of the data in a given modeling application, such as skewness, coefficient of variation, periodicity, persistence, and outliers (Lamontagne et al., 2020).
The MSE, NSE, and KGE statistics can be summarized as follows. The MSE is the single most widely used performance metric in the fields of signal processing (Wang & Bovik, 2009) and statistics in general (see Everitt, 2002). The NSE is simply a normalized variant of the MSE (see Equation 6 below). The development of KGE was motivated by algebraic decompositions of the MSE into bias, variance, and correlation components. KGE is only loosely related to NSE and thus MSE, with a complex relationship between NSE and KGE that depends on several factors. For general cases, the relationship between NSE and KGE depends on the coefficient of variation (CV) of the observations (see Equation A1 or sample-based examples for various values of CV in Figure A1 in Knoben et al., 2019, or Equation 10 in Lamontagne et al., 2020). In the special case of unbiased models, the relationship between NSE and KGE remains complex (e.g., see Figure 1 and Equation 12 of Lamontagne et al., 2020). Lamontagne et al. (2020, Section 3) document the unusual conditions under which NSE and KGE are equivalent.
2.1 Mean Squared Error (MSE)
Equation 5 provides an algebraic decomposition of the MSE that includes the bias in the mean (the first term), the standard deviation (the second term), and the covariance (the third term). Note from Equation 5 that the algebraic decomposition of the MSE is not particularly effective because the second and third terms are not independent of one another (see also Gupta et al., 2009; Mizukami et al., 2019).
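As a concrete check on this decomposition, the following sketch numerically verifies the standard identity MSE = (mu_s − mu_o)^2 + (sigma_s − sigma_o)^2 + 2·sigma_s·sigma_o·(1 − r), which we assume corresponds to Equation 5 (population moments, ddof = 0); the synthetic data and function names are purely illustrative.

```python
import numpy as np

def mse_decomposition(sim, obs):
    """Decompose the MSE into bias, standard-deviation, and covariance terms
    (population moments, ddof=0). Returns the direct MSE and the sum of the
    three decomposition terms, which should agree to machine precision."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    mse = np.mean((sim - obs) ** 2)
    bias_term = (sim.mean() - obs.mean()) ** 2          # bias in the mean
    std_term = (sim.std() - obs.std()) ** 2             # bias in the std dev
    r = np.corrcoef(sim, obs)[0, 1]                     # linear correlation
    cov_term = 2.0 * sim.std() * obs.std() * (1.0 - r)  # covariance term
    return mse, bias_term + std_term + cov_term

rng = np.random.default_rng(42)
obs = rng.gamma(shape=2.0, scale=5.0, size=365)    # synthetic "observed" flows
sim = 0.8 * obs + rng.normal(0.0, 2.0, size=365)   # synthetic "simulated" flows
mse, recomposed = mse_decomposition(sim, obs)
assert np.isclose(mse, recomposed)
```

Note that the standard deviation of the simulations appears in both the second and third terms, which is precisely the lack of independence noted above.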
2.2 The Nash-Sutcliffe Efficiency (NSE)
2.3 The Kling-Gupta Efficiency (KGE)
3 Data and Methods
3.1 Large-Sample Model Simulations for the CAMELS Catchments
In this study we analyze hydrologic model simulations from a large sample of catchments across the contiguous USA (Figure 1). Our analysis uses existing hydrologic model simulations from the Variable Infiltration Capacity model (VIC version 4.1.2h) applied to the 671 catchments in the CAMELS data set (Catchment Attributes and MEteorology for Large-sample Studies). Mizukami et al. (2019) provide details on the large-sample VIC configuration; Newman, Clark, Sampson, et al. (2015) and Addor et al. (2017) provide details on the hydrometeorological and physiographical characteristics of the CAMELS catchments. The CAMELS catchments are those with minimal human disturbance (i.e., minimal land use changes or disturbances, minimal water withdrawals), and are hence almost exclusively smaller, headwater-type catchments (median basin size of 336 km^{2}).
The calibration and evaluation procedure used by Mizukami et al. (2019) is as follows. The VIC model is forced using the daily basin-average meteorological data described by Maurer et al. (2002) and calibrated and evaluated using streamflow data obtained from the USGS National Water Information System server (http://waterdata.usgs.gov/usa/nwis/sw). The VIC model is calibrated using the dynamically dimensioned search (DDS; Tolson & Shoemaker, 2007) algorithm. In each of the 671 CAMELS catchments, the VIC model is calibrated separately for NSE and KGE (Mizukami et al., 2019). The hydrometeorological data are split into a calibration period (October 1, 1999–September 30, 2008) and an evaluation period (October 1, 1989–September 30, 1999), with a prior 10-year warm-up period. To maximize the sample size in our analysis, we analyze NSE and KGE computed over the combined 19-year calibration and evaluation period (October 1, 1989–September 30, 2008).
3.2 Analysis of the Influence of Individual Data Points
The uncertainties in system-scale performance metrics may be large because the estimates are shaped by a small fraction of the simulation-observation pairs (Clark et al., 2008; Fowler et al., 2018; Lamontagne et al., 2020; McCuen et al., 2006; Newman, Clark, Sampson, et al., 2015; Wright et al., 2019); that is, a small number of simulation-observation pairs have a disproportionate influence on performance metrics. In particular, there is enormous sampling variability associated with streamflow statistics in arid regions (see also Ye et al., 2021). The influence of individual data points can be quantified by successively deleting observations and evaluating their impact on a statistic of interest (e.g., see Efron, 1992; Hampel et al., 1986); such methods are commonly used in applications of the Jackknife method.
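A minimal sketch of this kind of influence analysis (function and variable names are our own, not from the study): rank the squared errors and report the fraction of the total sum of squared errors contributed by the k largest.

```python
import numpy as np

def top_error_contribution(sim, obs, k=10):
    """Fraction of the total sum of squared errors contributed by the k
    simulation-observation pairs with the largest squared errors."""
    se = (np.asarray(sim, float) - np.asarray(obs, float)) ** 2
    return np.sort(se)[-k:].sum() / se.sum()

# With heavy-tailed errors, a handful of days can dominate the total:
rng = np.random.default_rng(1)
obs = rng.pareto(a=1.5, size=6940)              # ~19 years of daily values
sim = obs * rng.lognormal(0.0, 0.3, obs.size)   # multiplicative "model" error
frac = top_error_contribution(sim, obs, k=10)
print(f"10 largest errors contribute {100 * frac:.1f}% of the sum of squares")
```

When all squared errors are equal, the contribution of the k largest is exactly k/n, so values far above k/n flag high-leverage points.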
3.3 Quantifying Uncertainties in the NSE and KGE Estimates
It is particularly important to quantify the sampling uncertainty in model performance metrics when the error distributions exhibit heavy tails, as is the case with the errors obtained from daily streamflow simulations. Parallels to this problem exist in the meteorological community, where it is common to quantify the uncertainty in the performance or skill metrics used to describe probabilistic forecasts of rare events (e.g., Bradley et al., 2008; Jolliffe, 2007).
Some attractive approaches to quantify sampling uncertainty are based on the bootstrap (e.g., Vogel & Shallcross, 1996), because they are relatively easy to implement and understand, and because they replace complex theoretical statistical methods with simple brute-force computations (see Appendix A). Clark and Slater (2006) used bootstrap methods to quantify uncertainties in the performance metrics that they used to evaluate probabilistic estimates of precipitation extremes. Bootstrap methods have also been used to quantify the uncertainty in NSE estimates (Ritter & Muñoz-Carpena, 2013). Bootstrap methods are likely to find increasing use in hydrology due to the ease with which they can be applied compared to more complex methods. Given their simplicity, it is surprising how rarely bootstrap methods have been applied in hydrology.
The sampling uncertainty in the NSE and KGE estimates is quantified using a mixture of Jackknife and Bootstrap methods. First, we use the Jackknife and Bootstrap methods to compute the standard error in the NSE and KGE estimates. These methods resample from the original data sample using the Non-overlapping Block Bootstrap (NBB) strategy of Carlstein (1986), using data blocks of length one year. The use of data blocks of length one year reduces the issues with substantial seasonal nonstationarity in shorter data blocks, while preserving the within-year autocorrelation and seasonal periodicity of streamflow series. Bootstrapping methods are only effective if the blocks used are approximately independent. Second, we use the Bootstrap methods to compute tolerance intervals for the NSE and KGE estimates, where the 90% tolerance intervals are defined as the difference between the 95th and 5th percentiles of the empirical probability distribution of the NSE and KGE estimates. Tolerance intervals differ from confidence intervals, because tolerance intervals correspond to a random variable, rather than being random confidence intervals around some true value. These bootstrap tolerance intervals are computed using 1,000 bootstrap samples. Finally, we use the Jackknife-After-Bootstrap method (Efron, 1992) to estimate the standard error in the Bootstrap tolerance intervals, which enables us to evaluate how sensitive the resulting uncertainty intervals are to individual years (blocks). The implementation details of the uncertainty quantification methods discussed above are summarized in Appendix A; the open-source “gumboot” package has been developed to quantify the sampling uncertainty in performance metrics (https://github.com/CH-Earth/gumboot; https://cran.r-project.org/package=gumboot).
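The bootstrap step of the workflow above can be sketched as follows; this is a simplified stand-in for the “gumboot” package (NSE only, illustrative names), not its actual implementation.

```python
import numpy as np

def nse(sim, obs):
    """Nash-Sutcliffe Efficiency of simulations against observations."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

def nbb_tolerance_interval(sim, obs, years, n_boot=1000, seed=0):
    """Non-overlapping Block Bootstrap (blocks = water years) of the NSE.
    Whole years are resampled with replacement, preserving within-year
    autocorrelation and seasonality; returns the 90% tolerance interval
    (5th and 95th percentiles of the bootstrap replicates)."""
    rng = np.random.default_rng(seed)
    sim, obs, years = map(np.asarray, (sim, obs, years))
    blocks = {y: np.flatnonzero(years == y) for y in np.unique(years)}
    labels = list(blocks)
    reps = np.empty(n_boot)
    for b in range(n_boot):
        chosen = rng.choice(labels, size=len(labels), replace=True)
        idx = np.concatenate([blocks[y] for y in chosen])
        reps[b] = nse(sim[idx], obs[idx])
    return np.percentile(reps, [5.0, 95.0])
```

The same resampling loop applies to KGE or any other metric by swapping the `nse` call.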
It is important to note that the methods implemented here quantify the sampling uncertainty in the NSE and KGE estimates for a given hydrologic model and a given sample of streamflow observations. The model itself will contain uncertainty (e.g., uncertainty in the meteorological inputs; uncertainty in the model parameters and model structure). The observations used to compute the NSE and KGE estimates also contain uncertainty, especially for the high-flow extremes that can have a large influence on the NSE and KGE estimates. The model and data uncertainty are not explicitly included in the estimates of sampling uncertainty (we will return to this point in Section 5.3).
4 Results
The probability distribution of squared errors between model simulations and observations has heavy tails, meaning that the estimates of sum-of-squared-error statistics can be heavily influenced by a small fraction of the simulation-observation pairs (Clark et al., 2008; Fowler et al., 2018; Lamontagne et al., 2020; Newman, Clark, Sampson, et al., 2015). To document this issue, Figure 2 uses Equation 10 to quantify the influence of the largest errors on the NSE estimates, repeating the analysis of Newman, Clark, Sampson, et al. (2015) with the VIC model. Figure 2a quantifies the influence of the 10 individual days with the largest errors on the NSE estimates, and demonstrates that, in many catchments, 10 days in the 19-year period contribute over 50% of the sum-of-squared errors between simulated and observed streamflow. Figure 2b identifies the largest observations that jointly contribute 50% of the sum-of-squared errors, expressed as a percentage of the total sample length. Figure 2b demonstrates that, in many catchments, 50% of the sum-of-squared errors is caused by less than 0.5% of the simulation-observation pairs. These results suggest that there will be large uncertainty in the NSE and KGE metrics.
Figure 3 quantifies the uncertainty in NSE and KGE across the CAMELS catchments, illustrating considerable uncertainty in both metrics. Figure 3 shows that the 90% tolerance intervals for both NSE and KGE (as obtained by the bootstrap methods described in Appendix A) are greater than 0.1 for more than half of the CAMELS catchments. The results in Figure 3 also illustrate that the bootstrap and jackknife methods yield consistent standard error estimates. The large uncertainty in NSE and KGE is evident regardless of whether NSE or KGE is used as the calibration target.
The jackknife-after-bootstrap methods enable an evaluation of the precision and accuracy of the bootstrap tolerance intervals. While there is considerable sampling uncertainty in the tolerance intervals (estimated using the jackknife-after-bootstrap methods; Figure 4), that uncertainty is considerably smaller than the uncertainty associated with NSE and KGE shown in Figure 3. As we discuss in the next section, the sampling uncertainty depicted in Figure 3 may be underestimated in situations where there is extremely high skewness in daily streamflows.
5 Discussion
5.1 It Is Necessary to Quantify the Uncertainty in Performance Metrics
The high uncertainty associated with the NSE and KGE estimators underscores the need to quantify the uncertainty in the performance metric estimators used in hydrologic modeling applications. Quantifying the sampling uncertainty in model evaluation statistics is easily accomplished using appropriate bootstrap methods. Moreover, bootstrap methods can be applied to any performance metric estimator. Quantifying the uncertainty in performance metric estimators should arguably become a routine part of the hydrologic modeling enterprise. As our results show, the widths of the 90% tolerance intervals associated with the NSE and KGE estimators are greater than 0.1 in at least half of the analyzed catchments. Such wide 90% tolerance intervals indicate considerable uncertainty associated with each of these metrics. These results imply that the conclusions from many hydrologic modeling studies may not be justified in light of the high sampling uncertainty in system-scale performance metric estimators.
In spite of the ease with which the bootstrap may be applied as a post-processing approach to developing uncertainty intervals, there is a need for additional research on methods to quantify the sampling uncertainty. Our experiments (not shown) demonstrate that traditional bootstrap methods may severely underestimate the sampling uncertainty in the NSE and KGE estimators in situations where there is extremely high skewness (see also Chernick & LaBudde, 2011). These underestimates in uncertainty occur because bootstrap methods “recycle” the observations, and the bootstrap samples do not adequately encapsulate the uncertainty associated with the few extraordinary errors in the thick upper tail of the error distribution. Indeed, our Jackknife-after-Bootstrap analyses demonstrate that there are large standard errors in our bootstrap estimates of uncertainty in NSE and KGE. Thus, given the extremely high skewness of daily streamflow observations in some watersheds, we recommend future research that compares the uncertainty intervals derived from various bootstrap methods against the uncertainty intervals derived from more advanced stochastic methods (e.g., Papalexiou, 2018).
5.2 It Is Necessary to Improve the Estimates of System-Scale Performance Statistics
A variety of approaches can be introduced to improve estimates of the theoretical NSE and KGE statistics; that is, to develop more robust estimates of NSE and KGE that have lower sampling uncertainty. For example, Fowler et al. (2018) calculated the metric separately for each year before averaging across years; Lamontagne et al. (2020) introduced alternative estimators of NSE and KGE based on a bivariate lognormal monthly mixture model. Variance reduction methods from the fields of machine learning and statistics (e.g., Nelson & Schmeiser, 1986) can be used to improve estimates of the theoretical NSE and KGE performance metrics. More generally, the approaches of bagging and bragging could be tested, where the performance metrics are estimated using the mean or the median, respectively, of multiple bootstrap samples (Berrendero, 2007). Further work is needed to better understand the characteristics of data points that have high leverage in order to devise methods that improve estimates of the theoretical NSE and KGE statistics.
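As an illustration of the bragging idea, the sketch below estimates NSE as the median of block-bootstrap replicates (water-year blocks). This is a sketch under our own naming conventions, not an estimator taken from the studies cited above.

```python
import numpy as np

def nse(sim, obs):
    """Nash-Sutcliffe Efficiency of simulations against observations."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

def bragging_nse(sim, obs, years, n_boot=500, seed=0):
    """'Bragging' estimate of the NSE: the median of NSE values computed on
    bootstrap resamples of whole water years, intended to be more robust to
    a few high-leverage simulation-observation pairs than the plain estimate.
    Using the mean of the replicates instead gives the bagging variant."""
    rng = np.random.default_rng(seed)
    sim, obs, years = map(np.asarray, (sim, obs, years))
    labels = np.unique(years)
    reps = []
    for _ in range(n_boot):
        chosen = rng.choice(labels, size=labels.size, replace=True)
        idx = np.concatenate([np.flatnonzero(years == y) for y in chosen])
        reps.append(nse(sim[idx], obs[idx]))
    return np.median(reps)
```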
5.3 It Is Necessary to Put Performance Metrics in Context
The growing field of model benchmarking seeks to put performance metrics into context, for example, by asking whether models meet our a priori expectations, or whether models adequately use the information that is available to them. Recent efforts in model benchmarking have focused on defining lower and upper benchmarks to provide context for model performance (Nearing et al., 2018; Newman et al., 2017; Seibert et al., 2018). Lower benchmarks evaluate the extent to which models surpass expectations (Seibert, 2001), for example, the extent to which model simulations perform better than a benchmark such as climatology, persistence, simulations from another model (Wilks, 2011), or departures from the seasonal cycle (Knoben et al., 2020; Schaefli & Gupta, 2007). A key component of defining the lower benchmark is defining our a priori expectations of model capabilities. We define the upper benchmark to quantify the predictability of the system, that is, the maximum information content in the forcing-response data (Nearing et al., 2018; Newman et al., 2017). For example, Best et al. (2015) demonstrated that many mechanistic land models were outperformed by simple statistical models, implying that modern land models were not adequately using the information that is available to them. Much work still needs to be done to quantify our expectations for model performance (the lower benchmark) as well as to quantify system-scale predictability (the upper benchmark).
Benchmarking is important in the context of performance metrics because the NSE and KGE embody rather weak a priori expectations of model performance. The NSE uses the variance of the observations as the benchmark. This means that NSE is positive if the MSE is smaller than the variance of the observations; in other words, NSE is positive if the model simulations are better than the reference case where the simulation equals the observed mean for all time steps. Knoben et al. (2019) point out that KGE estimates do not have the same benchmark as NSE estimates: the benchmark implied by NSE, that is, model simulations that are always equal to the observed mean, corresponds to a KGE estimate of 1 − √2 ≈ −0.41 (i.e., an NSE estimate of 0). The observed mean is often used as a benchmark with the KGE metric as well, imposing old expectations on a new metric. Using stricter, purpose-specific benchmarks can give a clearer idea of model strengths and weaknesses.
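The benchmark values quoted above can be checked directly. In the sketch below, the limiting convention r = 0 and alpha = 0 for a constant simulation follows Knoben et al. (2019); the data are illustrative.

```python
import numpy as np

def nse(sim, obs):
    """Nash-Sutcliffe Efficiency of simulations against observations."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

obs = np.array([1.0, 4.0, 2.0, 8.0, 5.0])   # illustrative observations
sim_mean = np.full_like(obs, obs.mean())    # "model" that predicts the mean

# The mean-flow benchmark scores exactly NSE = 0:
assert np.isclose(nse(sim_mean, obs), 0.0)

# For KGE = 1 - sqrt((r-1)^2 + (alpha-1)^2 + (beta-1)^2), a constant
# simulation has beta = 1 and, in the limiting convention, r = 0 and
# alpha = 0, so the same benchmark corresponds to KGE = 1 - sqrt(2):
kge_benchmark = 1.0 - np.sqrt((0 - 1) ** 2 + (0 - 1) ** 2 + (1 - 1) ** 2)
assert np.isclose(kge_benchmark, 1.0 - np.sqrt(2.0))  # approximately -0.41
```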
It is also necessary to evaluate the system-scale performance metrics in the context of the uncertainties in the model inputs (e.g., spatial meteorological forcing data), the uncertainties in the hydrologic model (e.g., uncertainties in model parameters and model structure), and the uncertainties in the system-scale response (e.g., streamflow observations). Many groups are now developing ensemble spatial meteorological forcing fields in order to understand how uncertainties in the model forcing data affect uncertainties in the hydrologic model simulations (e.g., Clark & Slater, 2006; Cornes et al., 2018; Frei & Isotta, 2019; Newman, Clark, Craig, et al., 2015; Tang, Clark, Papalexiou, et al., 2021). There is also a wealth of approaches to quantify hydrologic model uncertainty. Vogel (2017) introduced the concept of stochastic watershed models (SWMs), which involve methods for generating likely stochastic traces of daily streamflow from deterministic watershed models. All such methods of developing SWMs reviewed by Vogel (2017), including the very generalized blueprint introduced by Montanari and Koutsoyiannis (2012), may be employed to develop uncertainty intervals associated with either streamflow predictions or other water resource system variables. There is now also substantial effort dedicated to quantifying uncertainty in streamflow observations (e.g., see the comparison of uncertainty techniques by Kiang et al., 2018, and also Coxon et al., 2015, and Mansanarez et al., 2019). The key issue is that the most uncertain observations of streamflow are in the upper tail; these observations also have the most influence on the KGE and NSE metrics. Further research is needed to understand how these sources of uncertainty are manifested in system-scale performance metrics.
5.4 It Is Necessary to Understand the Limitations of System-Scale Performance Metrics
It is well known that minimizing the sum-of-squared errors in calibration results in simulated streamflows with smaller variance than the observations (e.g., Gupta et al., 2009). This occurs because of the interplay between estimates of the variance of the flows and the correlation in NSE described in Section 2: the standard deviation of the simulations appears in both the second and third terms in Equation 8, meaning that NSE is maximized when the standard deviation of the simulations equals the correlation multiplied by the standard deviation of the observations. This is problematic because optimization studies that minimize the MSE (or maximize the NSE) produce simulations whose variance is smaller than the observed variance, because the correlation is always smaller than unity. Mizukami et al. (2019) illustrate these issues when using NSE as an objective function in a large-sample hydrologic model calibration study. They showed that the calibrated simulations had substantial underestimates of high-flow events, such as the annual peak flows that are used for flood frequency estimation. Underestimation of variance, as well as all other upper moments, is a general problem associated with simulation models and is not limited to the use of a particular objective function (see Farmer & Vogel, 2016).
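This behavior can be made explicit with the NSE decomposition of Gupta et al. (2009); the sketch below assumes their notation, with alpha the ratio of simulated to observed standard deviations, beta_n the normalized mean bias, and r the linear correlation.

```latex
% NSE in terms of the relative variability \alpha = \sigma_s/\sigma_o,
% the normalized bias \beta_n = (\mu_s - \mu_o)/\sigma_o, and the
% correlation r (Gupta et al., 2009):
\mathrm{NSE} = 2\alpha r - \alpha^2 - \beta_n^2
% Holding r and \beta_n fixed and maximizing over \alpha:
\frac{\partial\,\mathrm{NSE}}{\partial \alpha} = 2r - 2\alpha = 0
\quad\Longrightarrow\quad \alpha^{*} = r
% Since r < 1 in practice, the NSE-optimal simulation has
% \sigma_s = r\,\sigma_o < \sigma_o: the simulated variance is too small.
```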
There are also problems with the KGE metric. As discussed by Santos et al. (2018), the definition of the bias term in KGE as the ratio of the simulated and observed means can lead to very large values of the bias term (and hence low KGE scores) when the observed mean is small. Such problems with amplified bias values are potentially more pronounced for variables that cross zero (e.g., log-transformed flows, temperature), because the observed mean could be very small. Citing drawbacks of the NSE as justification, part of the community has switched to using KGE over NSE. We argue that this did not solve but only changed the problems related to system-scale performance metrics. It is important to be aware of the theoretical behavior of system-scale performance metrics, along with their limits of applicability, and to use additional metrics that are tailored to suit specific applications.
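A small numerical illustration of this failure mode (the data are contrived; KGE is computed from its standard r, alpha, and beta components): a simulation with perfect correlation, perfect variability, and a tiny absolute offset receives a strongly negative KGE simply because the observed mean is close to zero.

```python
import numpy as np

def kge(sim, obs):
    """Kling-Gupta Efficiency from its correlation (r), variability (alpha),
    and bias (beta) components."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    r = np.corrcoef(sim, obs)[0, 1]
    alpha = sim.std() / obs.std()
    beta = sim.mean() / obs.mean()
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

# "Observations" whose mean is near zero (e.g., an anomaly-like variable):
obs = np.linspace(-1.0, 1.0, 100) + 0.01   # mean is exactly 0.01
sim = obs + 0.1                            # constant offset; r = 1, alpha = 1

# beta = 0.11 / 0.01 = 11, so KGE = 1 - sqrt(0 + 0 + 10^2) = -9
assert np.isclose(kge(sim, obs), -9.0)
```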
5.5 It Is Necessary to Use Additional Performance Metrics
A key problem with systemscale performance metrics is that they do not make adequate use of the full information content in the data. Gupta et al. (2008) point out that global calibration of hydrologic models (e.g., using or as the objective function) entails compressing the information in the model output and observations into a single performance metric, and then using that single metric to infer values of multiple model parameters and all aspects of hydrological processes. Such global calibration methods can lead to problems of compensatory parameters, providing the “right” results for the wrong reasons (Kirchner, 2006). Specifically, parameters in one part of the model may be assigned unrealistic values that compensate for unrealistic parameter values in another part of the model, or that compensate for errors in the model forcing data and weaknesses in model structure (Clark & Vrugt, 2006). Addressing this problem requires asking a different question: Instead of asking “how good is my model?”, it may be more appropriate to ask “What is my model good for?” This second question is more relevant when designing a modeling experiment for a specific application.
One approach is to develop alternative system-scale performance metrics. This includes efforts to develop variants of KGE. For example, Kling et al. (2012) introduced a modified version of KGE that uses the ratio of the simulated and observed coefficients of variation (CV) instead of the ratio of the simulated and observed standard deviations; their intent was to reduce the impact of bias on the variability term in KGE. Pool et al. (2018) developed alternative estimates of the correlation and variability terms for use with KGE, with the intent of reducing the impact of outliers. Note that the Pool et al. (2018) estimates are for different theoretical statistics than the correlation and variability statistics that are used in Equations 8 and 9 (see Barber et al., 2019; Lamontagne et al., 2020). Other alternative system-scale metrics include variable transformations, such as the log transform or Box-Cox transform (to reduce skewness and focus more on low flows), or methods to compare distributions of modeled extremes to observed extremes. The work to develop alternative system-scale performance metrics recognizes that estimates of correlation-based metrics are often inflated, in the sense that high values can occur for mediocre and poor models, and that estimators of correlation-based metrics are sensitive to outliers and data asymmetry (Legates & McCabe, 1999; Willmott, 1981; see also Mo et al., 2014; Barber et al., 2019).
Another approach is to use additional nonglobal metrics (e.g., multiple diagnostic signatures of hydrologic behavior). For example, much of the research on model calibration and evaluation now focuses on multicriteria methods, including analysis of tradeoffs among multiple objective functions (e.g., Fenicia et al., 2007; Yapo et al., 1998), analysis of the temporal variability of model errors (Coxon et al., 2014; Reusser et al., 2009), and scrutinizing diagnostic signatures of hydrologic behavior in order to identify model weaknesses (Gupta et al., 2008; Rakovec et al., 2016). A key part of this analysis is to understand the sensitivity of different nonglobal metrics to individual parts of a model (e.g., Markstrom et al., 2016; Van Werkhoven et al., 2009). As such, these alternative metrics can focus attention on aspects of the model that may be more relevant for specific modeling applications.
6 Conclusions

We provide tools to enable hydrologic modelers to quantify the sampling uncertainty in system-scale performance metrics. We use the non-overlapping block bootstrap method to obtain probability distributions and associated tolerance intervals of estimates of NSE and KGE, and we use the jackknife-after-bootstrap method to obtain estimates of the standard error of those bootstrap tolerance intervals. These comparisons enable us to confirm that, even though the tolerance intervals display sampling variability, that variability is always considerably smaller than the tolerance intervals themselves, thus validating the precision of the tolerance intervals.

We quantify the sampling uncertainty in system-scale performance metrics across a large sample of catchments. Our results show that the probability distribution of squared errors between model simulations and observations has heavy tails, meaning that the estimates of sum-of-squared-error statistics can be shaped by just a few simulation-observation pairs (Figure 2). This leads to substantial uncertainty in the NSE and KGE estimators (Figures 3 and 4). The implication of these results is that the conclusions from many hydrologic modeling studies are based on values for these metrics that fall well within the metrics' uncertainty bounds. Such conclusions may thus not be justified.

We define further research that is needed to improve the estimation, interpretation, and use of system-scale performance metrics in hydrological modeling.
More generally, our commentary highlights the obvious (yet ignored) abuses of performance metrics that contaminate the conclusions of many hydrologic modeling studies. We look forward to additional studies that improve the scientific basis of model evaluation.
Acknowledgments
We appreciate the constructive comments from the four reviewers. Martyn Clark, Wouter Knoben, Guoqiang Tang, Shervan Gharari, Jim Freer, Paul Whitfield, Kevin Shook, and Simon Papalexiou were supported by the Global Water Futures program, University of Saskatchewan.
Appendix A: The Jackknife and Bootstrap Methods
In this study, we use two resampling methods, the Jackknife and the Bootstrap, to estimate the empirical probability distribution of the NSE and KGE estimators for each of the 671 CAMELS catchments. These methods estimate the empirical probability distribution of a given statistic by drawing or resampling a number of independent samples from the original sample of data.
The following subsections describe the implementation of the Jackknife and Bootstrap methods, including the resampling strategies, the Jackknife and Bootstrap estimates of standard error, and the Jackknife estimates of the standard error in the bootstrap-derived empirical probability distributions of NSE and KGE.
A1 The Jackknife and Bootstrap Resampling Strategies
The value of the ith Jackknife replicate is the value of the estimator computed from the sample with the ith data point (or block) deleted. In our case, the ith Jackknife replicate is the NSE or KGE estimate computed after deleting the ith block. The Jackknife method is useful in cases where it is desirable to conduct a structured analysis of the deleted-point statistics.
When implementing these resampling methods, it is necessary to ensure independence between each draw from the original sample of data (Carlstein, 1986; Künsch, 1989; Vogel & Shallcross, 1996). Specifically, the errors in daily streamflow simulations are characterized by substantial periodicity and persistence; this creates complex temporal dependence structures on time scales from days (e.g., errors in the simulations of recessions after a storm event) to seasons (e.g., errors in the simulations of seasonal snow accumulation and melt, or errors in the seasonal cycle of transpiration). To address these issues, we implement a non-overlapping block resampling strategy that was developed for the Bootstrap method, the Non-overlapping Block Bootstrap (NBB) of Carlstein (1986). This approach identifies subseries of the data, where each subseries is statistically independent. In our implementation, the subseries are each of the 19 water years, where the water years span the period October 1 to September 30 (e.g., water year 1990 is the period October 1, 1989 to September 30, 1990).
The non-overlapping block resampling strategy is used for both the Jackknife and Bootstrap methods. The Jackknife sample for a given water year is the data set that remains after deleting that water year. For example, the Jackknife sample for water year 2002 contains the daily time series of simulation-observation pairs for all years except water year 2002, and the corresponding Jackknife replicate is the NSE or KGE estimate computed using all daily data except those in 2002. The Bootstrap method samples water years with replacement: a given bootstrap sample may include a given water year more than once, or may not include a given water year at all. The Bootstrap samples that do not include a given water year (e.g., all Bootstrap samples without water year 2002) open up opportunities to quantify the standard errors in the Bootstrap estimates of the empirical probability distributions (using the Jackknife-After-Bootstrap method introduced by Efron, 1992; we will discuss this implementation in Section A3).
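The delete-one-year Jackknife described above can be sketched as follows (illustrative names); the classic Jackknife standard error formula is applied to the water-year replicates.

```python
import numpy as np

def nse(sim, obs):
    """Nash-Sutcliffe Efficiency of simulations against observations."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

def jackknife_se(sim, obs, years, metric=nse):
    """Delete-one-block Jackknife standard error of a performance metric,
    deleting one water year at a time:
    se = sqrt((n - 1)/n * sum_i (theta_i - theta_bar)^2)."""
    sim, obs, years = map(np.asarray, (sim, obs, years))
    labels = np.unique(years)
    # theta_i: the metric computed with the ith water year deleted
    reps = np.array([metric(sim[years != y], obs[years != y]) for y in labels])
    n = labels.size
    return np.sqrt((n - 1) / n * np.sum((reps - reps.mean()) ** 2))
```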
A2 Jackknife and Bootstrap Estimates of Standard Errors in the NSE and KGE Estimates
A3 Jackknife Estimates of Standard Error in the Bootstrap-Derived Probability Distributions
The Bootstrap estimates of the empirical probability distributions create a conundrum: whilst outliers can cause large uncertainty in the NSE or KGE estimates, the outliers can also create large uncertainty in the Bootstrap estimates of the empirical probability distributions. It is hence necessary to estimate the standard error in the Bootstrap methods.
Estimates of the standard error in the Bootstrap methods can be computed easily using the Jackknife-After-Bootstrap method of Efron (1992). In the previous discussion we noted that the non-overlapping block resampling strategy opens up opportunities to quantify the standard errors in the Bootstrap estimates of the empirical probability distributions. Specifically, for a given water year we can construct a Jackknife sample using all of the Bootstrap samples that do not include that water year. When such Jackknife samples are constructed for all water years, the Jackknife method can be used to estimate the standard error in the Bootstrap estimates (the Jackknife-After-Bootstrap method).
The Jackknife estimate of standard error uses Equation A6, with the statistic computed from the Bootstrap samples that exclude the ith water year serving as the value of the ith Jackknife replicate.
Open Research
Data Availability Statement
The data for the large-domain model simulations are publicly available from the National Center for Atmospheric Research at https://ral.ucar.edu/solutions/products/camels. The source code to quantify the sampling uncertainty in performance metrics (the “gumboot” package) is available at https://github.com/CH-Earth/gumboot.