Volume 58, Issue 5 e2021WR031818
Research Article

Achieving Robust and Transferable Performance for Conservation-Based Models of Dynamical Physical Systems

Feifei Zheng (Corresponding Author)
College of Civil Engineering and Architecture, Zhejiang University, Hangzhou, China
Correspondence to: F. Zheng, [email protected]

Junyi Chen
College of Civil Engineering and Architecture, Zhejiang University, Hangzhou, China

Holger R. Maier
School of Civil, Environmental and Mining Engineering, The University of Adelaide, Adelaide, SA, Australia

Hoshin Gupta
Department of Hydrology and Atmospheric Sciences, The University of Arizona, Tucson, AZ, USA
First published: 19 May 2022
Citations: 6

Abstract

Because physics-based models of dynamical systems are constrained to obey conservation laws, they must typically be fed long sequences of temporally consecutive (TC) data during model calibration and evaluation. When memory time scales are long (as in many physical systems), this requirement makes it difficult to ensure distributional similarity when partitioning the data into independent, TC, calibration and evaluation subsets. The consequence can be poor and/or uncertain model performance when the model is applied to new situations. To address this issue, we propose a novel strategy for achieving robust and transferable model performance. Instead of partitioning the data into TC calibration and evaluation periods, the model is run in continuous simulation mode for the entire period, and specific time steps are assigned (via a deterministic data-allocation approach) for use in computing the calibration and evaluation metrics. Generative adversarial testing shows that this approach results in consistent calibration and evaluation data subset distributions. When tested using three conceptual rainfall-runoff models applied to 163 catchments representing a wide range of hydro-climatic conditions, the proposed “distributionally consistent (DC)” strategy consistently resulted in better overall performance than that achieved using the traditional “TC” strategy. Testing on independent data periods confirmed superior robustness and transferability of the DC-calibrated models, particularly under conditions of larger runoff skewness. Because the approach is generally applicable to physics-based models of dynamical systems, it has the potential to significantly improve the confidence associated with prediction and uncertainty estimates generated using such models.

Key Points

  • A novel strategy is proposed to calibrate and validate CRR models by discarding the use of time-consecutive data

  • The proposed approach achieves robust and transferable CRR model performance based on results from 163 catchments

  • The proposed method is generally applicable to all physics-based models with potential to significantly improve model prediction confidence

Plain Language Summary

When developing models (whether physical or conceptual) it is common practice to partition the available data into separate calibration and evaluation subsets. Further, these subsets need to consist of temporally consecutive data due to the long residence times (memory) of the system state. Such an approach can result in low robustness and poor generalization ability, due to significant variation in the hydro-climatic conditions represented by the two subsets. Here, we show that by discarding the idea of partitioning historical data into time-consecutive subsets, and instead running model simulations in a continuous manner through the entire data set, it becomes possible to ensure that the data used for calibration and evaluation come from statistically similar distributions, thereby improving the quality of the calibrated model. An additional benefit is that the model states need only be initialized once at the outset of the simulation. An important feature of our strategy is that, by removing the requirement that the calibration and evaluation data consist of temporally consecutive observations, a very large number of data partitions can be achieved, which significantly improves the possibility of ensuring distributional similarity of the two subsets, thereby improving the robustness and generalization ability of the model.

1 Introduction

1.1 The Problem of Calibrating Conservation-Based Models of Dynamical Physical Systems

Physics-based models of dynamical systems are the mainstay of how causal predictive understanding is developed in the Earth Sciences (Dodov & Foufoula-Georgiou, 2004). Such models impose theoretical prior knowledge as physical constraints on the dynamic Markovian evolution of the system state in response to inputs and boundary conditions (Rosatti, 2002). Because these models are constrained to obey mass, energy, and/or momentum balance within a fixed control volume (Pascolini-Campbell et al., 2020), we will hereafter use the term “conservation-based models (CBMs)” to distinguish them from machine learning (ML) models that focus primarily on information flows and that are not regularized to obey physical conservation principles. CBMs are used across the full range of Earth Science disciplines; hydrological examples used for streamflow prediction (among other things) include relatively simple spatially lumped physical–conceptual rainfall-runoff models such as SIMHYD (Vaze et al., 2010) and more complex spatially distributed process-based rainfall-runoff models such as TOPKAPI (Janabi et al., 2021).

The strength of the CBM approach is its ability to represent causal relationships in a manner that is consistent with physical principles and to assign “meaning” to the various components, fluxes, and state variables of the model (Quesnel & Ajami, 2019). This provides a basis for the progressive development of scientific understanding. However, because CBMs typically incorporate a variety of simplifying assumptions, and because all of the information necessary for complete parameter specification is usually unavailable, such models must be calibrated and evaluated before they can be applied with any degree of confidence to new situations. Calibration refers to the process of “tuning” model parameters using “location-specific” (point or regional) historical data, while evaluation refers to the process of testing to ensure that the calibrated model is able to provide consistent simulations/predictions when applied under conditions that may not have been fully (or sufficiently well) represented by the available historical data (Guo et al., 2020).

Further, since simulations generated by the CBM typically depend (to a greater or lesser extent) on the initial values assigned to the system states, initialization errors/uncertainties associated with system states can have long residence times (memory) that may persist for extended periods of time (Hoell et al., 2017). Consequently, CBMs cannot, in principle, be calibrated/trained using short, randomized minibatches of data (as is traditionally done with data-based ML models) without biasing the calibrated values of the model parameters in unpredictable ways (Zheng et al., 2018). In other words, CBM calibration requires the use of temporally consecutive observations, to ensure that model states evolve in a manner consistent with the state of the dynamic physical system that the model is intended to represent (e.g., the rainfall-runoff process).

For example, in the case of conceptual rainfall-runoff (CRR) models, the traditional approach is to partition the available observational data into two independent time-consecutive subsets, one of which is used for model calibration and the other for evaluation (Zhou et al., 2021). We will hereafter refer to this as the “temporally consecutive (TC)” strategy. This partitioning can be done so that each subset represents 50% of the available data (other fractions such as 80/20 or 70/30 are sometimes used), but with the requirement that each subset consists of TC data records. When the inputs from each subset are applied to the model to generate sequence trajectories of model outputs, the states of the model must be set to reasonable initial values. To handle this, it is common to specify a “burn-in/spin-up” period (typically 3–12 months, depending on system residence times) at the beginning of the simulation sequence, during which the outputs of the model are treated as unreliable due to initialization errors (Guo et al., 2020). Note that the calibration (or evaluation) subset need not correspond to the first or second half of the observational period; it can be any TC portion of the total record, provided that a spin-up period is allocated for each nonconsecutive sequence, as described above (Guo et al., 2020).
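To make this concrete, the following minimal sketch (in Python) illustrates a 50/50 TC split with spin-up exclusion; the function name tc_split, the 365-step spin-up length, and the use of index arrays are illustrative assumptions rather than details taken from the studies cited above.

```python
import numpy as np

def tc_split(n_steps, calib_fraction=0.5, spinup_steps=365):
    """Temporally consecutive (TC) partitioning of a daily record into
    calibration and evaluation index sets, discarding a spin-up period
    at the start of each block to damp state-initialization errors."""
    split = int(n_steps * calib_fraction)
    # Calibration metric uses the first block, minus its spin-up.
    calib_idx = np.arange(spinup_steps, split)
    # The evaluation block needs its own spin-up, because the model
    # states must be re-initialized at the start of this second sequence.
    eval_idx = np.arange(split + spinup_steps, n_steps)
    return calib_idx, eval_idx

# Example: a 30-year daily record.
calib_idx, eval_idx = tc_split(30 * 365)
```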

The rationale for using two time-consecutive subsets for CBM calibration and evaluation is to attempt to achieve an independent basis for checking whether or not the result of the calibration process is “consistent” (Martinez & Gupta, 2011). The idea is that the model calibration process can be considered successful if simulations generated on the calibration and evaluation subsets are mutually consistent. In other words, a satisfactory model calibration should ensure that the behavioral performance on the evaluation subset is essentially (statistically) similar to that obtained on the calibration subset. This implies that the model has not been “over-fit” to the calibration portion of the data. Provided that the available historical data are sufficiently representative of the kinds of events that the system can be expected to experience, consistency of model performance between the calibration and evaluation periods can serve to indicate confidence that the model will continue to perform well under new conditions (Guo et al., 2020), such as when the model is used for forecasting/prediction. Poor consistency between calibration and evaluation period performance is therefore taken as an indication of “low transferability” (Coron et al., 2012; Gibbs et al., 2018).

In this regard, when using the traditional “TC” strategy for CBM calibration, model performance is often notably worse on the evaluation subset than on the calibration subset (Broderick et al., 2016). In hydrology, this fact has been commented on since at least the 1980s (V. K. Gupta & Sorooshian, 1985; Hsu et al., 1995; Sorooshian et al., 1983; Yapo et al., 1996) and was recently confirmed by an extensive study (Guo et al., 2020), which revealed that a major reason is significant variation in the hydro-climatic conditions experienced by the system between the calibration and evaluation subsets. To oversimplify the problem: if the calibration subset mainly represents relatively wet hydro-climatic conditions, then model performance can deteriorate significantly when the model is applied to relatively dry conditions, and vice versa (Coron et al., 2012; Vaze et al., 2010).

Therefore, to achieve robust and consistent model performance across different hydro-climatic conditions, it is generally recommended that the calibration subset be selected in such a manner that it contains hydrologically relevant information about the full range of kinds of events that the watershed can be expected to experience (i.e., the calibration subset is “informationally similar” to the entire data). In the context of statistical modeling, various strategies have been developed to select a representative subset of the available samples for calibration and evaluation. Generally, these methods aim to create two data subsets with similar statistical properties, based purely on data analytical methods such as distance-based approaches (Kennard & Stone, 1969; Snee, 1977) and D-optimality methods (e.g., Wu et al., 1996). Subsequently, many other methods have been proposed in the literature for data splitting; for a review see Reitermanova (2010) and for a comprehensive comparative study see Xu and Goodacre (2018).

More importantly, in the context of hydrological modeling, several studies have previously examined the quality and amount of information contained in data used for model calibration and evaluation. For example, Wagener et al. (2003) showed that the information content in a data series is not uniformly distributed and that, by using only a certain period of observation, one can obtain information regarding the hydrological characteristics of a catchment. Montanari and Toth (2007) demonstrated that a long series of data is not required for calibration; essentially, the only information required is the spectral density function of the actual process simulated by the model. Bastola et al. (2011) suggested that the way to achieve this is to use a sufficiently long calibration period so that a wide range of dry, medium, and wet conditions is encompassed. Similarly, Li et al. (2012) and Gibbs et al. (2018) suggested that the calibration period should be selected to be as hydro-climatically similar as possible to the conditions under which the model is to be applied (e.g., the evaluation period). Seibert and Beven (2009) argued that just a few runoff measurements can contain as much information as the entire runoff time series, one example being that a single event consisting of 10 observations during high flows can provide the same information as 3 months of continuous data (Seibert & McDonnell, 2013). They also showed that maximum-flow series contain more information than minimum- or mean-flow series. Melsen et al. (2014) showed that the season (5 months) with the highest precipitation is sufficient to give a robust simulation of high flows over the full observation period.

While some progress has arguably been made in procedures for improving the robustness and transferability of CRR models across different data periods, such procedures still retain a somewhat ad hoc quality, being based on subjective assessments of the consistency of data distributions across time. Consequently, model performance is often unsatisfactory, especially when dealing with watersheds with different hydro-climatic conditions and where extensive data records are not available (Guo et al., 2020). As mentioned above, the low robustness and transferability of CBMs across time is a consequence of lack of consistency in data properties (distribution, etc.) across the calibration and evaluation subsets. Therefore, an important task is to find a suitable way to ensure that the data used for model calibration are informationally similar to the entire historical record, while the remaining data used for model evaluation are informationally similar as well; in other words, both the calibration and evaluation subsets should be as informationally similar to the full record as possible. However, this goal is difficult, if not impossible, to achieve using the traditional “TC” strategy for partitioning the data.

To realize this outcome, one possible strategy is to partition the data into numerous short “batches” of TC data and to then distribute the batches between calibration and evaluation subsets in such a manner as to achieve statistical (hydro-climatic) consistency. This is similar to the way in which minibatches are assigned to training, selection, and evaluation subsets in a data-based ML procedure. However, such a data-partitioning strategy becomes impractical in the case of CBMs, because the model states must be suitably initialized at the beginning of each batch of TC data. Since a model spin-up period must be assigned to mitigate the effects of initialization errors, one rapidly runs into the problem of not having enough data for the procedure to be statistically sound (i.e., one is severely limited in the number of data partitions/batches that can be achieved).

The focus of this paper is to propose and test a way around this model consistency problem. At the outset, we recognize that the fundamental goal of model development is to arrive at a result that does not over-fit to the data used for calibration and that is properly informed by the full variety of hydro-climatic conditions represented by the historical data available for model development and testing. As such, the “TC” strategy for partitioning the data is simply one possible means to this end, and one that has not proven successful. Accordingly, it behooves us to test alternative strategies that may actually (demonstrably) achieve the desired goal.

1.2 A Proposed Solution

Our proposed approach, developed and tested in this paper, is to discard the idea of partitioning the historical data set into time-consecutive subsets for use in model calibration and evaluation. Instead, the model is run in a continuous manner from start to finish through the entire data set, so that initialization of the model states (and model spin-up) need only be done once at the outset of the simulation. Of course, this requires that the available data record be free from gaps in the forcing (input) data; where this is not possible, model reinitialization must be done after each gap, but this problem must be dealt with regardless of the calibration strategy followed.

With initialization problems thus minimized and temporal continuity of the simulations assured, it becomes possible to assign the model output data in a distributionally consistent (DC) manner to the desired calibration and evaluation subsets, without needing to preserve or account for temporal ordering when computing the model performance metrics (temporal gaps in the output data are also fundamentally not a problem). In fact, if so desired, the “randomized minibatch” procedure commonly used in ML can also be implemented during model calibration, so that parameter estimation (model training) can be conducted using stochastic gradient descent. Whereas in ML methods the randomized minibatches consist of mutually indexed input and output data values, the difference here is that only the system output data are nonsequentially allocated to the calibration and evaluation subsets. We will refer to this as the “DC” strategy for model calibration and evaluation, regardless of whether a full-batch (entire calibration data) or a minibatch procedure is used during the parameter optimization/tuning process.

An important feature of the DC strategy described above is that, since the output observations selected for model calibration (or evaluation) do not need to be TC, a huge number of calibration–evaluation data partitions can be achieved. For example, if a 50/50 partition is used, a total of $\binom{N}{N/2}$ (where $N$ is the number of output data values) possible data partitions are available for use. This significantly improves the possibility of ensuring distributional similarity of the output data in the calibration and evaluation subsets (of course, this is conditional on the properties of the given input data sequence, over which we have little or no control).
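As a quick worked illustration of this count (record lengths chosen arbitrarily):

```python
from math import comb

# Number of ways to allocate half of N output values to calibration.
for n in (20, 100, 3650):      # 3650 ~ 10 years of daily data
    print(n, comb(n, n // 2))
# n = 20  ->  184,756 partitions
# n = 100 ->  about 1.01e29 partitions
```

Even a tiny record of 20 output values admits nearly 185,000 distinct 50/50 partitions, whereas the TC requirement permits only a handful.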

It now becomes possible to explore a variety of different strategies (randomized, deterministic, or hybrid) for allocating data into calibration and evaluation subsets, such as the repeated k-fold cross-validation strategy commonly used in the ML domain (Nakatsu, 2020). In this study, based on findings for data-driven models reported in Chen et al. (Journal of Hydrology, in review), we adopt a deterministic method for data allocation that has been shown to ensure a high degree of distributional similarity. The adopted method, called MDUPLEX (see Section 2.2), is a modified version of the deterministic DUPLEX method (Snee, 1977), and therefore produces only one data partition for any given data set. Our reason for adopting a deterministic data-partitioning approach will become apparent when we discuss the experimental design of our demonstration case study, where we test the DC strategy by applying several different CRR models to a large sample of catchments. For actual implementation of a single model at a single location, there is no reason why the stochastic SBSS-P (or another) method could not be applied instead, in order to explore the model calibration and performance uncertainties associated with data sampling variability (Wu et al., 2013).

1.3 Contributions of This Study

The key contributions of this study include the following:
  1. A proposal of a consistent strategy for calibration of CBMs that focuses on the primary goal of achieving a model that does not over-fit to the calibration data and is properly informed by the full variety of climatic conditions represented by the historical data. This is achieved by discarding the idea of time-consecutive partitioning and instead ensuring that the output information used for both calibration and evaluation is distributionally similar, conditional on the properties of the system inputs over which little or no control can be exerted. The proposed method is generally applicable to any CBM, such as are common in the Earth Sciences.

  2. Demonstration of the use of a deterministic data allocation method (MDUPLEX) for partitioning the data into distributionally similar calibration and evaluation subsets. In addition, a generative adversarial approach is used to demonstrate the consistency of the distribution of the two data subsets.

  3. A large-sample assessment of the proposed DC model calibration method using three CRR models applied to data from 163 catchments that span a wide range of hydro-climatic conditions. We benchmark our results against various implementations of the traditional TC strategy. Further, we assess and compare the performance of the two strategies as a function of varying properties of the data (i.e., behavioral properties of the catchments).

To be clear, the core contribution of the present study is the proposal of the “DC” strategy for CRR model calibration and evaluation (Contribution 1), which differs significantly from our previous studies. Specifically, Zheng et al. (2018) and Guo et al. (2020) identified the potential impacts of different data splits for model calibration and evaluation on data-driven and CRR models, respectively, but did not provide solutions to the problem. Subsequently, Chen et al. (2021, under review) proposed several improved data-splitting methods that are applicable only to data-driven rainfall-runoff models. Accordingly, the contributions of those three studies differ significantly from the main contribution of the present study as stated above. In addition, a novel approach related to generative adversarial networks (GANs) is employed in this study to assess distributional similarity between the calibration and evaluation data sets. The proposed new data allocation method is introduced in Section 2, the details of our experimental case study are presented in Section 3, and the results are discussed in Section 4. We conclude with a discussion and future outlook in Section 5.

2 Methodology

2.1 The Proposed Calibration Method

When calibrating a CBM, the entire available data set (D) is commonly partitioned into a time-consecutive calibration subset (CS) and a time-consecutive evaluation subset (ES). We can express this data partitioning as D = {DCS, DES}. Ideally, the data allocation should be done in a manner such that both DCS and DES are as fully representative as possible of the statistical/informational properties of the entire available data period D. Unfortunately, this condition is difficult to achieve in practice.

Without loss of generality, let us assume that all of the data are to be used for either model calibration or model evaluation (this serves to require model spin-up only at the beginning of the entire period, as shown in Figure 1). Figures 1a–1c illustrate three typical implementations of such an approach, where the calibration data subset (DCS) can correspond to the beginning, end, or some intermediate time-consecutive period of the data set D. This DCS is used to compute the value of the performance metric, $F_{CS}(\theta)$, to be optimized during model calibration, in order to obtain good estimates of the model parameters $\theta$. Due to the Markovian nature of the system state, the model must be run from beginning to end of the entire data set D at each iteration of the optimization algorithm, whereas the performance metric $F_{CS}(\theta)$ used to drive the optimization process is computed over only the calibration data period corresponding to DCS. The remaining (consecutive or nonconsecutive) period DES is used for model performance evaluation. For example, if the performance metric to be optimized during model calibration is the Mean Squared Error, we have

$$F_{CS}(\theta) = \frac{1}{N_{CS}} \sum_{i=1}^{N_{CS}} \left( y_i - \hat{y}_i(\theta) \right)^2 \tag{1}$$

where $N_{CS}$ is the total number of data points in DCS, $y_i$ is the ith observed data point in DCS, and $\hat{y}_i(\theta)$ is its corresponding model-simulated value, which is a function of the model parameters $\theta$.
Figure 1. Illustration of the traditional (a–c) and proposed (d) model calibration strategies.

Because of the time-consecutive nature of the period chosen for model calibration, only a relatively small number of distinct calibration–evaluation data partitions can be achieved using this approach. Further, it becomes difficult to ensure that both DCS and DES are fully representative of the statistical/informational properties of the entire available data set D, which inevitably results in low transferability of model performance across time. For example, Figure 1 illustrates a situation where the early portion of D consists of relatively “low magnitude” events, while the late portion consists of relatively “high magnitude” events. A model calibrated using the early portion (Figure 1a) will not “see” the high magnitude events during parameter estimation. The resulting model can therefore fail to perform satisfactorily when applied to the late portion (i.e., during evaluation) and, by extension, to other (new, as yet unseen) events of relatively high magnitude. Similar behavior is expected for the other two traditional data allocation cases (Figures 1b and 1c).

In contrast, when training a data-based model, because there is no requirement for the data points in DCS and DES to be time-consecutive (memory is represented by a finite, relatively small, number of previous lagged time steps), a large number of overlapping calibration–evaluation data partitions can be achieved (Zheng et al., 2018). Upon reflection, the requirement that DCS and DES consist of time-consecutive periods is really just conventional practice. In other words, there is no reason why the metric value $F_{CS}(\theta)$ used for model calibration needs to be based on residuals computed for time-consecutive periods of the data set. Provided that the model is always run in continuous simulation mode over the entire data set D, meaning that the input data must be fed in time-consecutive fashion, one can choose any set of (time-consecutive or non-time-consecutive) output data points to be used when computing $F_{CS}(\theta)$.

Accordingly, as illustrated by Figure 1d, one could choose every alternate value, every tenth value, or even any randomly selected set of nonrepeated, non-time-ordered output values in D to make up DCS and DES. However, to be completely clear, because of the dynamical, state-space, conservation-based nature of the model (in which internal memory states representing conserved quantities are updated at each time step and carried forward), the model itself must still be run in time-ordered fashion over the entire data period D.
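A minimal sketch of this arrangement is given below, where step is a generic placeholder for any conservation-based state update (not one of the CRR models used later): the model is stepped through the full forcing record in time order, and the calibration objective is then computed only at the preselected, possibly nonconsecutive, indices that make up DCS.

```python
import numpy as np

def simulate(params, forcing, step, state0):
    """Run a conservation-based model continuously over the whole record.
    `step` is any state-update function mapping (state, forcing_t, params)
    to (new_state, output_t); the time-ordered loop preserves the
    Markovian evolution of the conserved states."""
    state, outputs = state0, []
    for x_t in forcing:                      # strictly time-ordered pass
        state, y_t = step(state, x_t, params)
        outputs.append(y_t)
    return np.asarray(outputs)

def dc_objective(params, forcing, q_obs, calib_idx, step, state0):
    """Calibration metric (here MSE, as in Equation 1) evaluated only at
    the possibly non-consecutive time steps allocated to D_CS."""
    q_sim = simulate(params, forcing, step, state0)
    return np.mean((q_obs[calib_idx] - q_sim[calib_idx]) ** 2)
```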

By discarding the requirement that DCS and DES consist of time-consecutive system/model output data values, it becomes considerably easier to ensure that the data used for model calibration and performance evaluation are fully representative of the entire range of hydro-climatic events experienced by the watershed. In other words, the points in DCS and DES can be selected in a statistically representative manner to index the full range of system behaviors. In rainfall-runoff modeling, these behaviors can be represented by streamflow magnitudes (across the entire flow duration curve), hydrological processes (precipitation driven and nondriven, snowmelt dominated and not, evapotranspiration driven and nondriven, etc.), and/or hydro-climatic conditions (cold and warm, water limited and energy limited) that might influence the evolution of the system.

An additional benefit is that, by classifying different (behavioral) portions of the data to be statistically represented when computing the calibration period metric(s) (see the discussion of diagnostic system signatures in H. V. Gupta et al. [2008]), it becomes easier to assess in detail how well the model can be expected to perform under those conditions when applied to new situations. In other words, the evaluation process becomes more representative, robust, and diagnostically informative about the quality of the calibrated model.

A possible objection is that the DC approach makes it difficult, if not impossible, to conduct the model performance assessment on a portion of the data that is somehow “independent” of the portion used for model calibration. However, one must bear in mind that “independence” in the context of CBMs is a relative concept, since any system state/output depends unavoidably on the past input/output history and initial state of the system. Accordingly, even when D is partitioned into time-consecutive calibration–evaluation periods {DCS, DES}, it is effectively impossible to achieve dynamical independence of the trajectories represented by DCS and DES. In principle, a relative degree of “independence” is only possible when
  1. A sufficiently long temporal gap is left between the last temporally ordered data point in DCS and the first temporally ordered data point in DES, with the hope/assumption that any dynamic state-variable continuity effects are damped out during the gap and are therefore no longer relevant (e.g., there is minimal memory carry-over from the past to influence the future behavior of the system).

  2. The system state at the beginning of DES is randomly reinitialized and a burn-in period is used to minimize initialization effects.

Given this reality, a more important question is whether the goal of achieving dynamical independence (between DCS and DES) outweighs the importance of achieving a robust and stable calibration of the model—one that can be relied upon to enable consistent predictive performance across the full range of hydro-climatic conditions historically measured. We argue that the latter is a much more important goal and should therefore exert a stronger influence on the design of the model calibration process.

2.2 The MDUPLEX Data Allocation Approach

A key requirement of the DC method is the ability to partition the model output data into DC portions to be used for calibration and evaluation. Once the time-consecutive requirement is removed, this can be achieved via a variety of formal data allocation methods developed in the context of ML. In the context of data-driven hydrological modeling of a large sample of catchments, Zheng et al. (2018) found that different data allocation strategies can produce remarkably different evaluation period performance. Further, none of the methods tested (including the semideterministic SS approach [Baxter & Bartlett, 2001], the deterministic DUPLEX method [Snee, 1977], the stochastic SBSS-N method [May et al., 2010], and the stochastic SBSS-P method [May et al., 2010]) was able to provide satisfactory performance as measured by simulation bias and variance. In a follow-up study, Chen et al. (2021, Journal of Hydrology, in review) showed that two new methods (a stochastic approach entitled SOMPLEX and a deterministic approach entitled MDUPLEX) were able to overcome these limitations when applied to data-driven models.

For the work reported here, we show results using the deterministic MDUPLEX method for partitioning the system output data into calibration and evaluation subsets. Besides keeping the analysis relatively simple, this choice reflects the fact that the main variability we are concerned with is performance over a large number of catchments spanning different hydro-climatic conditions. Use of the stochastic SOMPLEX (or another) approach would provide distributions of performance at each catchment, adding a degree of freedom (related to sampling variability) that is unnecessary for the purposes of this study. In practice, however, when calibrating a model for a specific single location, application of a stochastic method would enable a more complete assessment of prediction uncertainty.

MDUPLEX is derived from the traditional DUPLEX method. DUPLEX proceeds by allocating the two data points that lie farthest apart (in Euclidean distance) from all others in the data set to DCS, and the next farthest pair to DES. To illustrate, if we assume a data set containing just eight data points A, B, C, D, E, F, G, and H, where A < B < C < … < G < H, the points (A, H) will be assigned to DCS and the points (B, G) will be assigned to DES. This ensures that extreme events (dry or wet hydrologic data points) tend to be distributed equally between the two data subsets. Once this first step has been completed, the single-linkage distance between the already-assigned points (A, B, G, and H) and the data points yet to be assigned (C, D, E, and F) is computed to determine the assignment order. Specifically, assuming a much larger number of points in the data set, the assignment process continues in this manner, alternating between DCS and DES (Snee, 1977). The single-linkage distance quantifies the difference between a data point to be assigned and the data points already in a subset, with larger distances representing greater differences in hydrological properties. This strategy is designed to ensure hydrological diversity within DCS and DES, so that the final data in each subset have similar distributional statistics.

In this manner, subsequent data points are iteratively sampled pair by pair until one subset is filled, after which the remaining data (which are often “normal” data points) are assigned to the other subset (Snee, 1977). Because DUPLEX sampling is fully deterministic, only one allocation is obtained for any given data set, resulting in zero sample variance. However, when the size of DCS is larger than that of DES, significant bias can be observed during model evaluation (Zheng et al., 2018), because a substantially larger number of normal (less extreme) data points are allocated to the larger (calibration) subset. This biases the calibration toward normal events and causes the model to perform relatively poorly on extreme events, resulting in a pessimistic assessment of model performance during evaluation.
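As a minimal sketch of the DUPLEX-style allocation just described (assuming a 1-D output series, simplified tie handling, and at least two points per subset; duplex_1d is an illustrative name, not the authors' implementation):

```python
import numpy as np

def duplex_1d(values, n_calib):
    """Deterministic DUPLEX-style allocation (after Snee, 1977) of a 1-D
    output series into calibration (D_CS) and evaluation (D_ES) index
    sets. Distances are absolute differences between output values."""
    v = np.asarray(values, dtype=float)
    order = list(np.argsort(v))
    # Seed D_CS with the overall farthest pair (min and max), and D_ES
    # with the next farthest pair.
    cs = [order.pop(0), order.pop(-1)]
    es = [order.pop(0), order.pop(-1)]
    remaining = set(order)
    targets = {"cs": n_calib, "es": len(v) - n_calib}
    turn = "cs"
    while remaining:
        sub = cs if turn == "cs" else es
        if len(sub) < targets[turn]:
            # Single-linkage criterion: pick the unassigned point with
            # the largest minimum distance to points already in `sub`.
            pick = max(remaining,
                       key=lambda i: min(abs(v[i] - v[j]) for j in sub))
            sub.append(pick)
            remaining.discard(pick)
        turn = "es" if turn == "cs" else "cs"
    return sorted(cs), sorted(es)
```

For a 60:40 split, one would call duplex_1d(q_obs, n_calib=int(0.6 * len(q_obs))).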

To address this bias, the MDUPLEX (modified DUPLEX) method employs a strategy in which all of the data are first allocated into different basic sampling pools, as indicated in Figure 2 (Chen et al., 2021, Journal of Hydrology, in review), after which the DUPLEX data-partitioning method is applied to each sampling pool. This differs from the traditional DUPLEX method in that the sampling process is carried out only once to generate the selection and evaluation subsets, with all of the remaining data (those with small distances) assigned to the calibration subset. For the case where DCS is desired to be larger than DES, the result is better statistical similarity between the two subsets, which improves model transferability under new conditions.

Figure 2. Pseudo code for MDUPLEX applied to conceptual rainfall-runoff (CRR) modeling.

2.3 Use of GAN-Based Adversarial Validation to Assess Distributional Similarity

To assess distributional similarity between DCS and DES, we employ an approach related to GANs (Vanderlooy & Hüllermeier, 2008). In the GAN deep learning approach, a “generative” model generates samples intended to be statistically indistinguishable from those of some desired target set, while a “discriminative” model evaluates the probability that the generated samples actually come from that target set (Bengio et al., 2014). As such, the generative and discriminative models are pitted in an adversarial relationship (Goodfellow et al., 2014) wherein both models are trained simultaneously: the discriminative model learns to get better at discriminating between samples from different distributions, while the generative model learns to get better at generating samples that are statistically similar to those from the target distribution (the metaphor of generating and detecting counterfeit currency is often used to explain this process). Such competition drives both models to improve until the generated samples are indistinguishable from those of the target distribution.

Here, we use an adversarial validation approach derived from GAN technology to assess the extent to which DCS and DES are statistically distinguishable (Ishihara et al., 2021). Specifically, a binary classifier is trained to predict whether a given sample belongs to DCS or DES, where classification performance better than random guessing indicates that the statistical distributions of DCS and DES are distinguishably different. To quantify the performance of the trained binary classifier, the area under the receiver operating characteristic (ROC) curve (denoted AUC) is used (Provost & Fawcett, 2001). AUC varies on [0, 1] (see Hand & Till, 2001), where an AUC of either 0 or 1 indicates statistical distinguishability with 100% probability, while AUC = 0.5 indicates inability to distinguish between samples from the two distributions (Bradley, 1997). The Python package “XGBoost” used to compute the AUC values is freely available from https://github.com/dmlc/xgboost.
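The following is a minimal sketch of this adversarial validation step; the use of raw runoff values as the only feature, and the classifier settings, are assumptions made for illustration, as the study does not specify them.

```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

def adversarial_auc(q_cs, q_es):
    """AUC of a classifier trained to tell calibration samples (label 0)
    from evaluation samples (label 1); AUC near 0.5 means the two
    subsets are statistically indistinguishable."""
    X = np.concatenate([q_cs, q_es]).reshape(-1, 1)
    y = np.concatenate([np.zeros(len(q_cs)), np.ones(len(q_es))])
    clf = xgb.XGBClassifier(n_estimators=200, max_depth=3)
    # Out-of-fold probabilities, so the score reflects generalization
    # rather than memorization of the training samples.
    proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
    return roc_auc_score(y, proba)
```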

3 Data, Models, and Calibration Methodology

3.1 Rainfall-Runoff Data

A set of 163 Australian catchments, representing a wide range of climatic and physical conditions (areas and slopes), was used in this study. These catchments have high-quality historical daily runoff records at their outlets, provided by the Australian Bureau of Meteorology (via http://www.bom.gov.au/water/hrs/). To drive the CRR models, we used catchment-average rainfall and PET, extracted from the Australian Water Availability Project (AWAP) gridded data set together with extracted catchment boundaries (Guo et al., 2020). The data records include daily time step precipitation (P, mm), potential evapotranspiration (PET, mm), and runoff (Q, mm). Figure 3a shows the density distribution of catchment record length in years; record lengths range from 25 to 70 years. As shown in Figure 3b, the distributions of annual mean P, PET, and Q values for these catchments span broad ranges when measured in terms of equivalent catchment precipitation. Catchment areas range from 4.5 to 47,651.5 km2, representing significant size variation, as shown in Figure 3c. The average slopes and forest cover of these catchments range from 5.2% to 23.2% and from 69.5% to 96.5%, respectively.

Figure 3. Density distributions of data properties of the 163 catchments.

3.2 The Conceptual Rainfall-Runoff Models

To demonstrate the ability of the proposed calibration method to improve model transferability, we use three daily time step mass-conservation-based CRR models of varying complexity: GR4J (Perrin et al., 2003), AWBM (W. Boughton, 2004), and CMD (Croke & Jakeman, 2004). All three models treat catchments as being spatially lumped and estimate catchment-outlet runoff from catchment-averaged rainfall (P) and potential evapotranspiration (PET) data, but vary in conceptual representation of internal hydrological processes, resulting in different input-state-output behavioral responses. GR4J has four parameters to be estimated, including soil moisture store capacity, groundwater exchange rate, 1-day runoff production store capacity, and time-base for unit hydrograph. AWBM also has four parameters, but conceptualizes the catchment as having three soil moisture stores of increasing capacity (W. C. Boughton, 1993). CMD has five parameters, and partitions net rainfall in a manner similar to GR4J, but differs by using the moisture deficit in the soil moisture store to determine such partitioning. The models are implemented in the hydromad R package (http://hydromad.catchment.org/; Andrews & Guillaume, 2013). For details see Guo et al. (2020).

3.3 Model Calibration Methodology

For model calibration, we use the Kling–Gupta Efficiency (KGE) as the performance metric (H. V. Gupta et al., 2009; Kling et al., 2012) and the Shuffled Complex Evolution global optimization procedure for parameter optimization. KGE can vary on $(-\infty, 1]$, with KGE = 1 representing perfect model performance (simulations exactly matching observations). We use a 60:40 ratio between the lengths of DCS and DES (relative to the entire data set D), which is typical for CRR model studies (Guo et al., 2018); these results are reported in Sections 4.1–4.3. As a benchmark for comparison, we also report results when the entire data set is used for model calibration (referred to as ALL); this strategy would not normally be used, as it leaves no evaluation step by which to ensure that model overfitting has not occurred. To investigate the effects of more unbalanced data allocation, we also tested 70:30 and 80:20 ratios but found little sensitivity to varying the ratio; these additional results are reported in Supporting Information S1. Note that the proposed DC sampling method aims to ensure that the selected calibration data set has a probability distribution similar to that of the evaluation data set, conditioned on a given data-splitting proportion. An important question left for future work is how many data points are sufficient to ensure that model calibration performance is within an acceptable tolerance for forecasting purposes (Melsen et al., 2014; Seibert & Beven, 2009).
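For reference, the H. V. Gupta et al. (2009) formulation of KGE combines the linear correlation $r$ between simulated and observed flows, the variability ratio $\alpha = \sigma_s / \sigma_o$, and the bias ratio $\beta = \mu_s / \mu_o$:

$$\mathrm{KGE} = 1 - \sqrt{(r - 1)^2 + (\alpha - 1)^2 + (\beta - 1)^2}$$

so that KGE = 1 is attained only when the simulations match the observations in correlation, variability, and mean.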

To assess the relative benefit of employing the proposed MDUPLEX data allocation strategy, we implement three traditional (time-consecutive) data allocation strategies, referred to as Trad1, Trad2, and Trad3; see Figure 1 and Table 1. As shown in Table 1, we compute the KGE performance metric over DCS and DES (reported as KGECS and KGEES, respectively) and also over the entire data set D (reported as KGEALL).

Table 1. Different Data Allocation Strategies and the Terminology Used in This Study

Data allocation strategies:
  • MDUPLEX: the proposed MDUPLEX data allocation strategy
  • Trad1: traditional time-consecutive method using the first 60% of the data for calibration
  • Trad2: traditional time-consecutive method using the data period from 20% to 80% for calibration
  • Trad3: traditional time-consecutive method using the last 60% of the data for calibration
  • ALL: using all of the data for calibration

Performance metrics:
  • KGEALL: KGE value computed over all of the data
  • KGECS: KGE value computed for the calibration data
  • KGEES: KGE value computed for the evaluation data
  • ΔKGE: KGEES − KGECS

4 Results and Discussions

4.1 Comparison of Overall Performance

Figure 4a presents the distributions of AUC values obtained using different data allocation strategies. In contrast to the traditional (time-consecutive) strategies, for which AUC is widely dispersed over the range 0.40–0.95, the data allocation achieved using MDUPLEX results in AUC values between 0.50 and 0.58, close to the desired value of 0.5. This shows that MDUPLEX produces a data allocation between DCS and DES that is significantly more statistically similar than that achieved using the traditional methods.

Figure 4. Distributions of AUC and KGEALL values achieved using different data allocation methods.

Figures 4b–4d show the resulting distributions of “overall” model performance, evaluated in terms of KGEALL, which is computed using all of the data (including the portions used for both calibration and evaluation). Note that these results are based on the 60:40 data allocation ratio, where 60% of the data are used to calibrate the model. We see clearly that the MDUPLEX data allocation method (red solid line) closely mirrors the performance achieved when all of the data are used for model calibration (black solid line). Further, its performance is better (shifted to the right and with a higher peak) than that achieved using the three traditional (time-consecutive) data allocation methods (Trad1, Trad2, and Trad3), for all three models. The fact that MDUPLEX provides results similar to those obtained when all of the data are used for calibration is interesting, as it implies that the 60% allocation achieved by MDUPLEX is a statistically representative sample that preserves most (if not all) of the important information content of the available data.

4.2 Transferability Performance

The results in Section 4.1 show that the MDUPLEX data allocation method produces overall larger KGEALL values than the traditional sampling approaches, which qualitatively implies that the proposed “DC” strategy has a greater likelihood of achieving acceptable KGE values (Guo et al., 2020). Conditioned on this finding, this section focuses on the transferability of model performance. The difference ΔKGE = KGEES − KGECS between performance metric values computed on the evaluation and calibration data subsets is an indicator of model transferability. Figure 5 shows the distributions of ΔKGE (for all three models) obtained using the different data allocation strategies. The traditional approaches tend to produce wide distributions of primarily negative ΔKGE, meaning that KGE is lower during evaluation than during calibration (performance deterioration). In contrast, the ΔKGE distribution for MDUPLEX tends to be narrowly peaked around zero, indicating a statistical tendency toward similar metric performance on both subsets.

Figure 5. Distributions of ΔKGE (KGEES − KGECS) for the three CRR models obtained using different data allocation strategies applied to 163 catchments.

Similarly, Figure 6 shows the distributions of differences between model parameters and those obtained when all of the data are used for model calibration. The x-axis shows percentage relative parameter errors for the GR4J model, computed as $\mathrm{RE} = 100\% \times (\hat{\theta} - \theta_{ALL}) / \theta_{ALL}$, where $\theta_{ALL}$ is the parameter value obtained using all of the data for calibration and $\hat{\theta}$ is the parameter value estimated using the given data allocation methodology. A smaller absolute value of RE indicates that the corresponding data allocation method identifies parameter values that are more similar to those obtained when using all of the available data, implying that the data subset used for model calibration is informationally similar to the entire data set. As shown in Figure 6, the distribution of RE values tends to be unbiased and more tightly peaked around 0.0 when using the MDUPLEX approach compared to the three traditional data-partitioning strategies, indicating more consistent parameter estimates.

Figure 6. Distributions of percentage relative error (RE%) in parameter estimates for the four parameters (x1, x2, x3, and x4) of the GR4J model calibrated using the proposed MDUPLEX and three traditional methods. A smaller absolute value of RE indicates that the corresponding data allocation method is able to identify model parameter values that are more similar to those obtained using all of the available data.

4.3 Variation of Performance With Data Skewness

Figure 7 digs deeper into the results to investigate how performance of the different data allocation methods varies with the degree of skewness of the runoff distribution of a particular catchment (higher skewness corresponds to lower frequency of high/extreme streamflow events). For these plots, we classify the catchments into four groups representing low [0, 10], low-moderate (10, 20], moderate (20, 30], and moderate-high (30, 60] skewness. Figure 7a shows that the AUC values for MDUPLEX remain consistently close to 0.5 regardless of data skewness, whereas AUC for the traditional methods becomes progressively worse with increasing skewness (indicating progressive statistical divergence between DES and DCS). Figures 7b–7d show how model performance assessed in terms of KGEALL varies with data skewness for the three models. From these figures, it is clear that the proposed MDUPLEX method is relatively insensitive to the skewness of the streamflow data, whereas performance of the traditional time-consecutive data allocation methods tends to decline significantly with increasing data skewness.

Figure 7. Subplot (a) shows distributions of AUC as a function of catchment runoff skewness, while subplots (b)–(d) show distributions of KGEALL as a function of catchment runoff skewness for all 163 catchments.
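As a hypothetical illustration of this grouping (the paper does not specify the skewness estimator; scipy's sample skewness is assumed here, and the function name is ours):

```python
import numpy as np
from scipy.stats import skew

def skewness_group(q_obs):
    """Assign a catchment to one of the four runoff-skewness classes
    used in Figure 7: low [0, 10], low-moderate (10, 20],
    moderate (20, 30], and moderate-high (30, 60]."""
    s = skew(np.asarray(q_obs, dtype=float))
    for upper, label in [(10, "low"), (20, "low-moderate"),
                         (30, "moderate"), (60, "moderate-high")]:
        if s <= upper:
            return label
    return "above observed range"
```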

Similarly, Figure 8 shows how the distributions of ΔKGE (evaluation period minus calibration period KGE) vary with degree of runoff skewness for the three models. Again, ΔKGE remains close to zero for MDUPLEX but tends to become progressively more negative with increasing runoff data skewness for the traditional methods. Figure 9 shows the RE (%) values between the estimated GR4J model parameters and the parameters calibrated using all available data, as a function of data skewness. We see that RE tends to be larger for the traditional data-partitioning methods when applied to catchments with larger runoff skewness. In contrast, the proposed method exhibits relatively better (and more robust) ability to identify parameter values that match those obtained using the entire data set, although a moderate degree of uncertainty is observed for catchments with high skewness. Similar results are obtained for the other two CRR models.

Figure 8. Distributions of ΔKGE (KGEES − KGECS) for the three CRR models, obtained using different data allocation strategies applied to 163 catchments with different degrees of runoff data skewness.

Figure 9. Distributions of parameter RE (%) for the GR4J model obtained using different data allocation strategies applied to 163 catchments with different degrees of runoff data skewness, where x1, x2, x3, and x4 are the four model parameters.

We further investigate the relationship between model performance and catchment size. Figure 10a shows that the MDUPLEX method (the proposed DC method) maintains the distributional similarity between the calibration and evaluation subsets well across different catchment sizes, while the traditional approach shows an overall deterioration of performance as catchment size increases. In terms of KGEALL, both MDUPLEX and Trad1 generate slightly lower values as catchment size becomes larger. As shown in Figure 10c, catchment size does not significantly affect the methods' ability to estimate parameter values (with similar RE values obtained relative to the parameter values estimated using all available data). Similar observations are made for the other two CRR models and parameter values. Combining the results in Figures 7–10, it can be deduced that the complexity of the underlying relationship between rainfall and runoff is affected not only by catchment size but also by many other factors, such as land use type and catchment slope.

Figure 10. Distributions of AUC, KGEALL, and parameter RE (%) for the GR4J model obtained using different data allocation strategies applied to 163 catchments of different sizes, where x2 is one parameter of the GR4J model.

4.4 Performance Comparison Between the Proposed DC Method and Another State-of-the-Art Approach

To further demonstrate the effectiveness of the proposed DC method, we implemented another state-of-the-art model calibration method, the depth function approach (Singh & Bárdossy, 2012), to enable a comparative evaluation of performance. The depth function identifies the critical hydrological events in a time series, that is, the events with high hydrological variability; the model is then calibrated using those selected events. We used the R package “ddalpha” (https://search.r-project.org/CRAN/refmans/ddalpha/html/depth.halfspace.html) to select critical hydrological events for the 163 catchments considered, using the Halfspace Depth (HSD) function (Dyckerhoff & Mozharovskyi, 2016; Singh & Bárdossy, 2012). The CRR model parameters were identified using these selected events, followed by computation of KGEALL based on all of the data. As shown in Figure 11, the HSD approach results in worse performance than the proposed MDUPLEX-based DC method (Figure 11a). In addition, the performance of HSD deteriorates significantly faster than that of the MDUPLEX method with increasing catchment skewness (Figure 11b). Similar observations were made for the other two CRR models.

Figure 11. Distributions of KGEALL, as well as KGEALL versus runoff skewness, for the GR4J model obtained using different data allocation strategies applied to 163 catchments (terminology as in Table 1).

5 Conclusions

Calibration of CBMs of dynamical physical systems is traditionally performed by partitioning the available data into time-consecutive periods that are used separately for model development (including structure selection and parameter estimation) and performance evaluation. The latter step is necessary to verify that the model has not been “force-fit” to the calibration data and can therefore be used with relative confidence to generate simulations/predictions under new conditions. However, poor evaluation performance is typically obtained when traditional methods for allocating the data between calibration and evaluation subsets are used. A major reason is the difficulty of generating statistically consistent partitions of the data when constrained to require that the partitions consist of continuous, time-consecutive sequences. This problem becomes even more severe when the data span a wide range of behavioral conditions, such that the distributions of system data are statistically complex, making it difficult to ensure data partitions with a high degree of statistical similarity.

To address this issue, we propose discarding the requirement of temporal continuity when allocating the model (and observed) output data into calibration and evaluation data partitions, while retaining the necessary requirement that the model be run in time-consecutive mode to generate the required simulations. Instead, we propose and test a “DC” strategy for output data partitioning (conditional on the available input data sequence) that helps to ensure that the model calibration process is informed by the fullest possible range of behavioral conditions represented by the available data, while simultaneously resulting in calibration and evaluation data subsets that are mutually statistically consistent.

Our testing of a deterministic implementation of this DC strategy against several versions of the traditional (time-consecutive) strategy shows that it results in superior robustness and transferability, particularly under conditions of larger runoff skewness. Adversarial testing showed that the MDUPLEX data allocation strategy produces data partitions that are much more statistically consistent than those achieved using the traditional approach, and that it consistently results in similar values of performance metrics computed on each subset. Further, calibrated model performance was found to be relatively insensitive to skewness of the system output data, in contrast to the traditional approaches.

While the proposed method has, to date, only been assessed in the context of hydrological CRR modeling, the nature of this model calibration problem is broadly relevant to conservation-based (input-state-output) models of dynamical physical systems, regardless of scientific discipline. Future work should more broadly investigate applicability of the method to other fields.

Equally important is the need to investigate (for each type of model/application) which system attributes/behaviors need to be statistically represented in a consistent manner when generating the calibration and evaluation data partitions. In this study, we focused primarily on output data skewness as being informative of process differences across different hydro-climatic regimes, but other attributes (such as system modes that can be characterized as precipitation driven/nondriven, snow-accumulation/snowmelt dominated, evapotranspiration driven/nondriven, and water limited/energy limited) might also prove to be important when characterizing the system. In this regard, data-driven approaches could prove useful for discovering what these representative behaviors/conditions might be. One limitation of the present study is that the proposed DC strategy was demonstrated using Australian catchments only. Future studies should apply the method to other catchment types, such as snow-dominated and/or large-scale catchments, which might exhibit high variation in information content and hydrological processes (Bilish et al., 2020; Kuentz et al., 2017).

Finally, we have examined only a deterministic version of this approach for data allocation and in doing so have ignored the unavoidable effects that data sampling variability can have on model performance uncertainty. To account for the latter, a corresponding stochastic method (such as SOMPLEX, Chen et al. [2021 (Journal of Hydrology, in review)]) should be considered. Provided that computational budgets permit, practical implementations of single models to specific locations would likely benefit from the added insights that a stochastic approach would provide. As always, we invite discussion and collaboration on these and other issues of model development and testing.

Acknowledgments

This work is funded by the National Natural Science Foundation of China (51922096 and 52179080) and Excellent Youth Natural Science Foundation of Zhejiang Province, China (LR19E080003). The last author (Gupta) acknowledges partial support from the Australian Research Council (ARC) through the Centre of Excellence for Climate Extremes grant CE170100023. We also appreciate the assistance provided by Dr Danlu Guo and Mengtian Lu in analyzing the rainfall-runoff data.

Data Availability Statement

All study catchments are available from the Australian Bureau of Meteorology (BoM) Hydrological Reference Stations, which can be downloaded from http://www.bom.gov.au/water/hrs/. The historical rainfall and PET data were obtained from the BoM Australian Water Availability Project (AWAP), at http://www.csiro.au/awap/.