Volume 44, Issue 12
Regular Article

Toward improving the reliability of hydrologic prediction: Model structure uncertainty and its quantification using ensemble-based genetic programming framework

Kamban Parasuraman

Centre for Advanced Numerical Simulation, Department of Civil and Geological Engineering, University of Saskatchewan, Saskatoon, Saskatchewan, Canada

Amin Elshorbagy

Centre for Advanced Numerical Simulation, Department of Civil and Geological Engineering, University of Saskatchewan, Saskatoon, Saskatchewan, Canada

First published: 05 December 2008

Abstract

[1] Uncertainty analysis is increasingly acknowledged as an integral part of hydrological modeling. The conventional treatment of uncertainty in hydrologic modeling is to assume a deterministic model structure and treat its associated parameters as imperfectly known, thereby neglecting the uncertainty associated with the model structure itself. In this paper, a modeling framework that can explicitly account for the effect of model structure uncertainty is proposed. The framework first generates different realizations of the original data set using a non-parametric bootstrap method, and then exploits the ability of self-organizing algorithms, namely genetic programming, to evolve their own model structure for each of the resampled data sets. The resulting ensemble of models is then used to quantify the uncertainty associated with the model structure. The performance of the proposed framework is analyzed with regard to its ability to characterize the evapotranspiration process at the Southwest Sand Storage facility, located near Ft. McMurray, Alberta. Eddy-covariance-measured actual evapotranspiration is modeled as a function of net radiation, air temperature, ground temperature, relative humidity, and wind speed. To investigate the relation between model complexity, prediction accuracy, and uncertainty, two sets of experiments were carried out by varying the set of mathematical operators that can be used to define the predictand-predictor relationship. While the first set uses only the additive operators, the second set uses both the additive and the multiplicative operators to define the predictand-predictor relationship. The results suggest that increasing the model complexity may lead to better prediction accuracy, but at the expense of increased uncertainty. Compared to model parameter uncertainty, the relative contribution of model structure uncertainty to the predictive uncertainty of a model is shown to be more important. Furthermore, the study advocates that the search for a single optimal model could be replaced by the quest to identify the possible models for characterizing hydrological processes.

1. Introduction

[2] In addition to the inherent stochastic nature of hydrological processes, uncertainty due to data, model parameters, and model structure underscores the need for uncertainty estimates to be part of the hydrological model building exercise. Hence, over the last few decades, hydrological modeling has witnessed a paradigm shift in the way models are built to characterize hydrological processes. Research in hydrological modeling is being directed toward developing models that are not only accurate but also reliable, by including uncertainty analysis as part of the model building exercise. Nevertheless, most traditional approaches to hydrological model uncertainty have dealt with the hypothesis of a deterministic model structure with parameters treated as imperfectly known [Beven and Binley, 1992; Kuczera and Parent, 1998; Vrugt et al., 2003; Pappenberger et al., 2005; Wilby, 2005]. The uncertainty estimated by these traditional approaches may capture only a minor share of the actual uncertainty, since they neglect the uncertainty associated with the model structure by assuming it to be deterministic.

[3] In order to overcome the limitations inherent in the conventional treatment of uncertainty in hydrological modeling, studies by Shamseldin et al. [1997], Georgakakos et al. [2004], Beven [2006], Duan et al. [2007], Ajami et al. [2007], Vrugt and Robinson [2007], and Parasuraman et al. [2007a] advocate the use of multi-model methods for hydrologic predictions. In multi-model methods, instead of relying on the predictions of a single deterministic model, predictions are drawn from several competing models, which differ both in their structure and in their associated parameters. The adoption of multi-model methods in hydrologic prediction was first initiated by Shamseldin et al. [1997]. Although the scheme of multi-model methods is relatively new to hydrologic modeling, the concept has been in practice in economic and weather forecasting since the late 1960s [e.g., Bates and Granger, 1969; Dickinson, 1973; Newbold and Granger, 1974; Dickinson, 1975; Thompson, 1976]. Shamseldin et al. [1997] used five different rainfall-runoff models on eleven different catchments, and concluded that better discharge estimates can be obtained by combining the outputs of different models. Georgakakos et al. [2004] used eleven different models for flow simulation on six catchments located in the south-central United States, and concluded that multi-model ensembles are promising in estimating streamflows. Duan et al. [2007] used ensemble predictions from three different hydrologic models using three distinct objective functions to illustrate the advantages of adopting Bayesian model averaging (BMA) in generating more accurate streamflow predictions. Ajami et al. [2007] proposed a framework, the Integrated Bayesian Uncertainty Estimator (IBUNE), to account for the major sources of uncertainty in hydrologic rainfall-runoff predictions. Within the IBUNE framework, Ajami et al. [2007] considered multi-model ensemble predictions from three different conceptual models to account for the model structure uncertainty. The study by Vrugt and Robinson [2007] used predictions from eight competing models to evaluate the performance of the ensemble Kalman filter and BMA in improving the predictive skills of streamflow models.

[4] Despite the significant progress made in addressing model structure uncertainty in hydrological modeling, most of the above-mentioned studies do not consider a comprehensive number of plausible models to account for the uncertainty arising from the model structure. Usually, a limited number of conceptual mathematical models are considered plausible for characterizing a particular hydrological process, and the model structure uncertainty is calculated from the outputs of these few competing models. The uncertainty calculated by this approach may be biased and underestimated, as the competing models considered in these studies are chosen subjectively and may not cover a significant portion of the plausible model space. In addition, the highly nonlinear nature of hydrological processes compounds the issue of manually identifying the set of plausible models required for addressing the uncertainty emanating from model structure deficiency.

[5] In general, considering a limited number of conceptual models, and thereby restricting model sampling to a small portion of the plausible model space, may lead to a biased and unrealistic evaluation of model structure uncertainty. Hence, developing a framework that relies on predictions from a large number of competing models, identified from an exhaustive search of the plausible model space, would make the quantification of model structure uncertainty more realistic. To this end, an ensemble-based genetic programming (GP) framework is proposed in this study. The framework exploits the ability of GP to explore the plausible model space in a comprehensive manner, thereby resulting in a more realistic quantification of model structure uncertainty. Genetic programming, proposed by Koza [1992], is a promising inductive data-driven technique capable of constructing populations of models using stochastic search methods, namely evolutionary algorithms. An important characteristic of GP is that both the variables and the constants of the candidate models are optimized. Hence, compared to other regression techniques, the model structure does not have to be chosen a priori. This attribute makes GP a strong candidate for characterizing model structure uncertainty. In water-related studies, GP has been applied to model different geophysical processes including, but not limited to, flow over a flexible bed [Babovic and Abbott, 1997]; the rainfall-runoff process [Whigham and Crapper, 2001; Savic et al., 1999]; runoff forecasting [Khu et al., 2001]; urban fractured-rock aquifer dynamics [Hong and Rosen, 2002]; temperature downscaling [Coulibaly, 2004]; the rainfall-recharge process [Hong et al., 2005]; soil moisture [Makkeasorn et al., 2006]; evapotranspiration [Parasuraman et al., 2007b]; and saturated hydraulic conductivity [Parasuraman et al., 2007a].

[6] The main objective of this paper is to build an ensemble-based genetic programming framework to address the issue of model structure uncertainty in hydrological modeling. The performance of the proposed framework is demonstrated with regard to its ability to model the eddy-covariance (EC)-measured actual evapotranspiration flux from the Southwest Sand Storage (SWSS) facility, a reconstructed watershed located near Ft. McMurray, Northern Alberta, Canada. The specific objectives of the paper include: (1) demonstrating the ability of the ensemble-based genetic programming framework to quantify the model structure uncertainty, (2) investigating the effect of model complexity on predictive accuracy and uncertainty in characterizing the EC-measured evapotranspiration flux, and (3) identifying the relative contributions of the model parameter and the model structure uncertainty to the predictive uncertainty of the model.

2. Genetic Programming

[7] Genetic Programming (GP), introduced by Koza [1992], is an evolutionary algorithm based on the concepts of natural selection and genetics. The concept of GP was derived from more elementary evolutionary computation methods like Evolutionary Programming [Fogel et al., 1966], Genetic Algorithms [Holland, 1975], and Evolution Strategies [Schwefel, 1981]. In simple terms, GP can be considered an extension of genetic algorithms: whereas genetic algorithms evolve an optimal set of model parameters within a pre-defined parameter space and model structure, GP extends this search to include the model space, so that both the model structure and the associated model parameters can be optimized in unison. Genetic symbolic regression (GSR) is a special application of GP in the area of symbolic regression, where the objective is to find a mathematical expression in symbolic form that provides an optimal fit between a finite sample of values of the independent variables and the associated values of the dependent variable [Koza, 1992]. GSR can be considered an extension of numerical regression problems, in which the objective is to find the set of numerical coefficients that best fits a predefined model structure (linear, quadratic, or polynomial). GSR, however, does not require the functional form to be defined a priori, as it involves finding the optimal mathematical expression in symbolic form (both the discovery of the correct functional form and the appropriate numerical coefficients) that defines the predictand-predictor relationship. More information on GP is given by Koza [1992] and Babovic and Keijzer [2000].

[8] Figure 1 shows the flowchart of the GSR paradigm. For a given problem, the first step is to define the functional and terminal sets, along with the objective function and the genetic operators. The functional set and the terminal set are the main building blocks of GSR, and hence their appropriate identification is central to developing a robust GSR model. The functional set consists of basic mathematical operators {+, −, *, /, sin, exp, …} that may be used to form the model. The choice of operators considered in the functional set depends upon the degree of complexity of the problem to be modeled. The terminal set consists of independent variables and constants. The constants can either be physical constants (e.g., Earth's gravitational acceleration, specific gravity of a fluid) or randomly generated constants. Different combinations of the functional and terminal sets are used to construct a population of mathematical models. Each model (individual) in the population can be considered a potential solution to the problem. The mathematical models are usually coded in a parse tree form. For example, Figure 2 shows the parse tree notation of the mathematical model f(x, y, z) = (x*y)/(2 + z). In Figure 2, the connection points are called nodes; the inner nodes of the parse tree are made up of functions, and the terminal nodes are made up of variables and constants. Therefore, in GP terminology, the variables and constants are referred to as terminals and the functions as non-terminals. The depth of the parse tree shown in Figure 2 is three. An objective (fitness, or cost) function is used to evaluate the fitness of each individual in the population. Usually the mean squared error or the root mean squared error is used as the objective function [e.g., Savic et al., 1999; Coulibaly, 2004; Hong et al., 2005]. The genetic operators include crossover and mutation, and they are discussed in detail later in this section.
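The parse-tree encoding of Figure 2 can be sketched in a few lines of code. The following Python fragment is an illustrative sketch only (the study itself used a MATLAB toolbox); the `Node` class and its method names are hypothetical, not part of any GP library. It represents f(x, y, z) = (x*y)/(2 + z) as nested nodes and evaluates it.

```python
import operator

# A parse tree node: non-terminals hold a function and children,
# terminals hold a variable name or a constant.
class Node:
    def __init__(self, value, children=()):
        self.value = value              # function, variable name, or constant
        self.children = list(children)

    def is_terminal(self):
        return not self.children

    def evaluate(self, env):
        if self.is_terminal():
            # Terminal: look up a variable, or return the constant itself
            return env[self.value] if isinstance(self.value, str) else self.value
        args = [c.evaluate(env) for c in self.children]
        return self.value(*args)

    def depth(self):
        # Depth counted in node levels (a single node has depth 1)
        return 1 + (max(c.depth() for c in self.children) if self.children else 0)

# f(x, y, z) = (x * y) / (2 + z), as in Figure 2
tree = Node(operator.truediv, [
    Node(operator.mul, [Node("x"), Node("y")]),
    Node(operator.add, [Node(2), Node("z")]),
])

print(tree.evaluate({"x": 6.0, "y": 2.0, "z": 1.0}))  # (6*2)/(2+1) = 4.0
print(tree.depth())                                    # 3, as stated for Figure 2
```

Note how the depth of three falls out of the nesting directly, matching the tree in Figure 2.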

Figure 1. Flowchart of the GSR paradigm.
Figure 2. Parse tree notation. The connection points are called nodes. The inner nodes are made up of functions, and the terminal nodes are made up of variables and constants.

[9] Once the functional and terminal sets are defined, the next step is to generate the initial population for a given population size. The initial population can be generated in a multitude of ways, including the full method, the grow method, and the ramped half-and-half method. In the full method, new trees are generated by assigning non-terminal nodes until a pre-specified initial maximum tree depth is reached, and the last depth level is limited to terminal nodes. The full method usually results in perfectly balanced trees with branches of the same length. In the grow method, each new node is randomly chosen between the terminals and the non-terminals, with the terminals making up the nodes at the initial maximum tree depth. As a consequence, the grow method usually results in highly unbalanced trees. The ramped half-and-half method is a combination of the full and grow methods: for each depth level considered, half of the individuals are initialized using the full method and the other half using the grow method. The ramped half-and-half method is shown to produce highly diverse trees, both in terms of size and shape [Koza, 1992], and thereby provides a good coverage of the search space. More information on the different methods of generating the initial population is given by Koza [1992]. Once initialized, the fitness of each individual (mathematical model) in the population is evaluated based on the selected objective function. The better the fitness of an individual, the greater the chance of the individual breeding into the next generation. In this study, the root mean squared error (RMSE) is used as the objective function, and a lower value of RMSE indicates better fitness. At each generation, new sets of models are evolved by applying the genetic operators, crossover and mutation [Koza, 1992; Babovic and Keijzer, 2000]. These new models are termed offspring, and they form the basis for the next generation.
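The three initialization methods can be sketched as follows. This Python fragment is a toy illustration: the terminal-choice probability in the grow method (0.3) and the function/terminal names are assumptions for the sketch, not values from the paper.

```python
import random

FUNCTIONS = ["+", "-", "*", "/"]                 # non-terminals (all arity 2 here)
TERMINALS = ["NR", "AT", "GT", "RH", "WS", "const"]

def full_tree(depth):
    # Full method: non-terminals at every level until the maximum depth,
    # terminals only at the last level -> perfectly balanced trees.
    if depth == 1:
        return random.choice(TERMINALS)
    return [random.choice(FUNCTIONS), full_tree(depth - 1), full_tree(depth - 1)]

def grow_tree(depth):
    # Grow method: each node is randomly a terminal or a non-terminal,
    # except at the maximum depth -> usually unbalanced trees.
    if depth == 1 or random.random() < 0.3:
        return random.choice(TERMINALS)
    return [random.choice(FUNCTIONS), grow_tree(depth - 1), grow_tree(depth - 1)]

def ramped_half_and_half(pop_size, max_depth):
    # Half the individuals per depth level from full, half from grow.
    pop = []
    for i in range(pop_size):
        depth = 2 + i % (max_depth - 1)          # ramp depths from 2 to max_depth
        method = full_tree if i % 2 == 0 else grow_tree
        pop.append(method(depth))
    return pop

def tree_depth(t):
    return 1 if not isinstance(t, list) else 1 + max(tree_depth(c) for c in t[1:])

random.seed(1)
population = ramped_half_and_half(pop_size=20, max_depth=8)
print(max(tree_depth(t) for t in population))
```

The population size of 20 and maximum initial depth of 8 mirror the study's settings in Table 3; everything else is illustrative.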

[10] After the fitness of the individual models in the population is evaluated, the next step is to carry out selection. The objective of the selection process is to create a temporary population, called the mating pool, which can be acted upon by the genetic operators crossover and mutation. Selection can be carried out by several methods, such as truncation selection, tournament selection, and roulette wheel selection [Koza, 1992]. As roulette wheel selection is one of the most commonly used methods [e.g., Koza, 1992], it has been adopted in this study. The roulette wheel is constructed by proportioning the space on the wheel according to the fitness of each model in the population. The selection process ensures that the models with better fitness have a greater chance of being bred into the next generation.
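Roulette wheel selection can be sketched in a few lines. Because a lower RMSE means a fitter model, the raw error must be converted into a selection weight; inverse RMSE is one common convention and is an assumption here (the paper does not state its exact scaling). Function names are hypothetical.

```python
import random

def roulette_select(population, rmse, rng):
    # Fitness-proportionate (roulette wheel) selection. Lower RMSE = fitter,
    # so RMSE is converted to a weight via its inverse (an assumed scaling).
    weights = [1.0 / e for e in rmse]
    total = sum(weights)
    spin = rng.random() * total              # position of the wheel pointer
    cumulative = 0.0
    for individual, w in zip(population, weights):
        cumulative += w
        if spin <= cumulative:
            return individual
    return population[-1]

def build_mating_pool(population, rmse, size, seed=0):
    rng = random.Random(seed)
    return [roulette_select(population, rmse, rng) for _ in range(size)]

# Models with lower RMSE should dominate the mating pool.
models = ["model_A", "model_B", "model_C"]
rmse = [10.0, 50.0, 100.0]                   # model_A is the fittest
pool = build_mating_pool(models, rmse, size=1000, seed=42)
print(pool.count("model_A"), pool.count("model_B"), pool.count("model_C"))
```

With these weights, model_A occupies roughly three-quarters of the wheel, so it appears far more often in the mating pool than model_C, which is exactly the bias toward fitter models that the selection step is meant to provide.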

[11] Crossover is carried out by initially choosing two parent models from the mating pool, and selecting random crossover points for each of the parents. On the basis of the selected crossover points, the corresponding sub-tree structures are swapped between the parents to produce two different offspring with different characteristics. The number of models undergoing crossover depends upon the chosen probability of crossover, Pc. Mutation involves random alteration of the parse tree at the branch or node level. This alteration is done on the basis of the probability of mutation, Pm. For an overview of different types of computational mutations, readers are referred to Babovic and Keijzer [2000]. While the role of the crossover operator is to generate new models, which did not exist in the old population, the mutation operator guards the search against premature convergence by constantly introducing new genetic material into the population. Figure 3 demonstrates the crossover and mutation operators. The crossover point between Parent 1 and Parent 2 is shown by the dashed line, and the corresponding sub-tree structures are swapped, resulting in Offspring 1 and Offspring 2. In Offspring 1, the terminal node has undergone mutation (2 replaced by 5). The genetic operators, crossover and mutation, are shown to produce new models (Offspring), which are structurally different from their parent models (Figure 3). These operators ensure that the model space is sampled thoroughly. After the initial population has been acted upon by the genetic operators, the resultant individuals form the new population for the next generation. This iterative process is carried out for a predetermined number of iterations or until a specified value of cost function is reached.

Figure 3. Crossover coupled with mutation. The dashed lines represent the crossover points, and the shaded region represents the mutated node.
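The subtree swap and node mutation of Figure 3 can be sketched on list-encoded parse trees. This Python fragment is a toy illustration with hypothetical helper names (`nodes`, `get_subtree`, `set_subtree`), not the toolbox routines used in the study.

```python
import copy
import random

# Parse trees as nested lists: [op, left, right]; terminals are strings/numbers.
def nodes(tree, path=()):
    # Enumerate (path, subtree) pairs for every node in the tree.
    yield path, tree
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            yield from nodes(child, path + (i,))

def get_subtree(tree, path):
    for i in path:
        tree = tree[i]
    return tree

def set_subtree(tree, path, new):
    if not path:
        return new
    parent = get_subtree(tree, path[:-1])
    parent[path[-1]] = new
    return tree

def crossover(parent1, parent2, rng):
    # Swap randomly chosen subtrees between copies of the two parents,
    # producing two offspring with different characteristics.
    c1, c2 = copy.deepcopy(parent1), copy.deepcopy(parent2)
    p1, _ = rng.choice(list(nodes(c1)))
    p2, _ = rng.choice(list(nodes(c2)))
    s1 = copy.deepcopy(get_subtree(c1, p1))
    s2 = copy.deepcopy(get_subtree(c2, p2))
    return set_subtree(c1, p1, s2), set_subtree(c2, p2, s1)

def mutate(tree, rng, terminals=("NR", "AT", "GT", "RH", "WS", 5)):
    # Point mutation: replace one randomly chosen node with a terminal
    # (e.g., the constant 2 replaced by 5, as in Figure 3).
    t = copy.deepcopy(tree)
    path, _ = rng.choice(list(nodes(t)))
    return set_subtree(t, path, rng.choice(terminals))

rng = random.Random(7)
parent1 = ["/", ["*", "x", "y"], ["+", 2, "z"]]   # the tree of Figure 2
parent2 = ["-", "NR", ["+", "AT", "RH"]]
child1, child2 = crossover(parent1, parent2, rng)
print(child1, child2)
print(mutate(parent1, rng))
```

Both operators work on deep copies, so the parent models survive unchanged into the selection pool, while the offspring carry new structural material.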

[12] In a nutshell, the basic steps involved in GP can be summarized as follows:

[13] 1. Identify the functional and terminal sets, along with the fitness measure.

[14] 2. Generate initial population randomly from functional and terminal sets.

[15] 3. On the basis of the fitness measure, evaluate the fitness of each individual.

[16] 4. Apply the selection operator. The better the fitness of an individual, the higher the chance of that individual being selected and bred into the next generation (survival of the fittest). This temporary population is termed the mating pool.

[17] 5. On the basis of the probability of crossover (Pc), pairs of individuals from the mating pool are chosen and the crossover operation is performed.

[18] 6. The next step is to apply the mutation operator based on the probability of mutation (Pm). Mutation helps in ensuring that no point in the individual search space remains unexplored.

[19] 7. Copy the resultant individuals to a new population. Repeat steps 3 to 7 for a predetermined number of iterations or until a specified value of the cost function is reached.
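The seven steps above can be sketched as one compact, self-contained symbolic-regression loop. This is a toy illustration under stated assumptions: an additive functional set, inverse-RMSE roulette weights, a simplified root-level crossover, and elitism (so the best RMSE never worsens). It is not the toolbox configuration of Table 3, and all helper names are hypothetical.

```python
import copy
import math
import random

rng = random.Random(0)
FUNCS = ["+", "-"]                        # an additive functional set (GP(1)-style)
TERMS = ["a", "b", 1.0, 2.0]              # variables and constants

def random_tree(depth):                   # step 2: random trees from the two sets
    if depth == 1 or rng.random() < 0.4:
        return rng.choice(TERMS)
    return [rng.choice(FUNCS), random_tree(depth - 1), random_tree(depth - 1)]

def evaluate(tree, env):
    if not isinstance(tree, list):
        return env[tree] if isinstance(tree, str) else tree
    left, right = evaluate(tree[1], env), evaluate(tree[2], env)
    return left + right if tree[0] == "+" else left - right

def rmse(tree, data):                     # step 3: fitness measure (lower is better)
    return math.sqrt(sum((y - evaluate(tree, env)) ** 2 for env, y in data) / len(data))

def select(pop, fits):                    # step 4: roulette wheel on inverse RMSE
    weights = [1.0 / (f + 1e-9) for f in fits]
    return pop[rng.choices(range(len(pop)), weights=weights, k=1)[0]]

def crossover(p1, p2):                    # step 5: swap child subtrees (simplified)
    c1, c2 = copy.deepcopy(p1), copy.deepcopy(p2)
    if isinstance(c1, list) and isinstance(c2, list):
        i, j = rng.choice([1, 2]), rng.choice([1, 2])
        c1[i], c2[j] = copy.deepcopy(c2[j]), copy.deepcopy(c1[i])
    return c1, c2

def mutate(tree):                         # step 6: occasional subtree replacement
    return random_tree(3) if rng.random() < 0.3 else tree

# Toy target y = a + b, standing in for the predictand-predictor data.
data = [({"a": float(a), "b": float(b)}, float(a + b))
        for a in range(5) for b in range(5)]

pop = [random_tree(4) for _ in range(20)]              # step 2
best_rmse = []
for gen in range(30):                                  # step 7: iterate
    fits = [rmse(t, data) for t in pop]                # step 3
    best = min(range(len(pop)), key=fits.__getitem__)
    best_rmse.append(fits[best])
    new_pop = [copy.deepcopy(pop[best])]               # elitism keeps the best model
    while len(new_pop) < len(pop):
        c1, c2 = crossover(select(pop, fits), select(pop, fits))  # steps 4 and 5
        new_pop += [mutate(c1), mutate(c2)]                       # step 6
    pop = new_pop[:len(pop)]

print(round(best_rmse[0], 3), "->", round(best_rmse[-1], 3))
```

Because the elite is copied unchanged into each new population, the best RMSE is non-increasing across generations, which makes the iterative improvement of step 7 visible directly in `best_rmse`.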

3. Site Description and Data Set Statistics

[20] The EC-measured actual evapotranspiration data from the SWSS facility, located near Ft. McMurray, Alberta, Canada, is considered in this study. The SWSS is currently the largest operational tailings dam in the world, holding approximately 435 million cubic meters of material, covering 25 km2, and standing approximately 40 m high with a 20H:1V side-slope ratio. Side-slopes are constructed as 100 m wide berms connected by 10% slopes to form an overall slope of 5%. Soils consist of mine tailings sand overlain with 0.4 to 0.8 m of topsoil that is a mixture of peat and secondary mineral soil with a clay loam texture. Both vegetation species and composition vary across the SWSS, with dominant groundcover including horsetail (Equisetum arvense), fireweed (Epilobium angustifolia), sow thistle (Sonchus arvense), and white and yellow sweet clover (Melilotus alba, Melilotus officinalis). Tree and shrub species include Siberian larch (Larix siberica), hybrid poplar (Populus sp. hybrid), trembling aspen (Populus tremuloides), white spruce (Picea glauca), and willow (Salix sp.). For the SWSS facility, the ground-water table is located well below the rooting zone, at a depth between 0.8 and 1.0 m, and hence does not directly contribute to the evapotranspiration process. Accurate estimation of actual evapotranspiration from reconstructed watersheds is of vital importance, as it plays a major role in the water balance of the system, which links directly to ecosystem restoration strategies. The weather station located atop the SWSS facility measured the air temperature (AT) (°C), ground temperature (GT) (°C), net radiation (NR) (Wm−2), relative humidity (RH), and wind speed (WS) (ms−1). Turbulent fluxes of heat and water vapor were measured using a CSAT3 sonic anemometer and thermometer (Campbell Scientific) and an LI-7500 CO2/H2O gas analyzer (Li-Cor). Ground heat flux was measured using a CM3 radiation and energy balance (REBS) ground heat flux plate placed at 0.05 m depth. In the EC technique, the covariance of vertical wind speed with temperature and water vapor is used to estimate the sensible heat (H) and latent heat (LE) fluxes. More information on the EC technique is given by Drexler et al. [2004]. Raw turbulence measurements were made at 10 Hz, and fluxes were calculated using 30-minute block averages with a 2-D coordinate rotation.

[21] The hourly EC-measured LE flux (the product of the latent heat of vaporization and evapotranspiration) at the SWSS facility from May 15, 2005 to Sept. 10, 2005 is considered in this study. The total precipitation during this period is 275 mm, and the average daytime reference evaporation rate is 0.27 mm/hr. For modeling purposes, only the daytime (08:00 hrs.–20:00 hrs.) evapotranspiration is considered. Of the available data set, the first two-thirds and the remaining third were delineated for the training and the testing sets, respectively. Disregarding the missing values, which are distributed randomly over the modeling period, the numbers of instances considered for training and testing purposes are 787 and 408, respectively. Since evapotranspiration is commonly perceived as being highly dependent on climatic variables, the EC-measured LE flux is modeled as a function of NR, AT, GT, RH, and WS. Although it is reasonable to assume that there exists within- and between-day autocorrelation in the data, the effect of autocorrelation is not considered in this analysis, as the available data set is not continuous over the modeling period. The descriptive statistics, along with the correlation matrix of the data sets used for training and testing, are presented in Tables 1 and 2, respectively. The coefficients of variation (CV) of the different variables during training and testing are comparable (Tables 1 and 2). Compared to the other independent variables, NR has the highest correlation with the LE flux during both training (Table 1) and testing (Table 2).

Table 1. Descriptive Statistics and Correlation Matrix of the Training Data Set
NR (W/m2) AT (°C) GT (°C) RH WS (m/s) LE (W/m2)
Minimum −80.31 2.80 7.25 0.15 0.45 3.86
Maximum 726.31 29.19 22.73 0.96 7.65 464.80
Mean 275.88 18.73 16.46 0.50 3.22 158.10
SD 166.40 5.20 3.28 0.17 1.47 77.87
CV 0.60 0.28 0.20 0.34 0.46 0.49
Correlations
NR (W/m2) 1.00
AT (°C) 0.30 1.00
GT (°C) 0.21 0.79 1.00
RH −0.34 −0.73 −0.47 1.00
WS (m/s) −0.02 0.00 0.02 −0.11 1.00
LE (W/m2) 0.70 0.55 0.58 −0.45 0.08 1.00
Table 2. Descriptive Statistics and Correlation Matrix of the Testing Data Set
NR (W/m2) AT (°C) GT (°C) RH WS (m/s) LE (W/m2)
Minimum −59.23 7.39 9.93 0.27 0.45 3.00
Maximum 640.12 28.13 20.02 0.96 8.37 390.75
Mean 193.01 16.89 15.65 0.58 3.12 130.32
SD 140.91 4.61 2.27 0.16 1.48 71.61
CV 0.73 0.27 0.14 0.27 0.47 0.55
Correlations
NR (W/m2) 1.00
AT (°C) 0.32 1.00
GT (°C) 0.22 0.77 1.00
RH −0.34 −0.81 −0.76 1.00
WS (m/s) 0.18 0.05 0.20 −0.20 1.00
LE (W/m2) 0.73 0.51 0.59 −0.62 0.38 1.00

4. Ensemble-Based Genetic Programming Framework

[22] The ensemble-based genetic programming framework is constructed by synergistically combining the self-organizing ability of GP with the non-parametric bootstrap method [Efron and Tibshirani, 1993]. The central idea of this framework is to first generate possible realizations of the actual data set using the non-parametric bootstrap method, and then to exploit the ability of GP to explore, in a comprehensive manner, the plausible model space characteristic of the different realizations of the actual data set. In this way, an ensemble of competing models can be identified that may be representative of the actual data set. Because of the self-organizing ability of GP, the competing models identified within this framework can be distinctly different both in terms of their model structure and their associated parameters. Hence, using the predictions of these competing models can lead to a comprehensive, unbiased, and realistic quantification of model structure uncertainty.

[23] The bootstrap method adopted in the ensemble-based genetic programming framework assumes that the training data set is a good representation of the original population, and that this data set is only one particular realization of that population. Hence, training GP on different realizations of the population results in GP arriving at different characteristic model structures, each representative of the corresponding resampled data set. Since the resulting model structures are different, the predictions of LE from these competing models are also different. In order to account for such uncertainty in prediction, this study adopted a bootstrap size, B, of 60; hence 60 independent data sets (TB), each of size N, were generated by repeated random resampling, with replacement, of the training data set (T) of size N. Each bootstrap data set TB may therefore contain some instances of T repeated several times, while other instances are left out. It should be noted that the different realizations produced by the bootstrap method differ not in terms of relative values, but only in the order and occurrence of the data instances. Since each TB contains a different realization of T, the resulting GP models for each TB are different both in terms of their structure and their associated parameters. The performance of the ensemble-based genetic programming framework, in terms of both its prediction accuracy and uncertainty, is then evaluated on the basis of the predictions from the ensemble of 60 competing models.
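The bootstrap step can be sketched as follows. This is a minimal Python illustration: the integers stand in for the 787 training records, and the function name is hypothetical.

```python
import random

def bootstrap_resamples(training_set, B, seed=0):
    # Generate B independent resamples T_B, each of the same size N as the
    # training set T, by random sampling with replacement. Some instances
    # of T appear several times in a given T_B; others are left out.
    rng = random.Random(seed)
    N = len(training_set)
    return [[training_set[rng.randrange(N)] for _ in range(N)] for _ in range(B)]

# Toy stand-in for the 787 training instances used in the paper.
T = list(range(787))
resamples = bootstrap_resamples(T, B=60, seed=1)

print(len(resamples), len(resamples[0]))   # 60 resamples, each of size 787
# A bootstrap resample omits, on average, about 1/e (~36.8%) of the unique
# instances, which is what makes each T_B a distinct realization of T.
print(round(len(set(resamples[0])) / len(T), 3))
```

Each of the 60 lists would then be handed to an independent GP run, yielding the ensemble of 60 competing model structures described above.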

[24] The prediction accuracy and the related uncertainty of the ensemble-based genetic programming framework are calculated in the following manner: (1) since a bootstrap size of 60 is used in this study, one model is evolved for each resampled data set, and the framework thus results in 60 different competing models; (2) the performance of each of the 60 competing models is then evaluated on the actual training and testing data sets, in terms of the root mean squared error (RMSE), the mean absolute relative error (MARE), the mean residual (MR), and the correlation coefficient (R); and (3) from the 60 different sets of performance indicators (RMSE, MARE, MR, and R), the overall performance of the framework is evaluated by calculating the mean and standard deviation of each of the performance indicators over the training and testing ranges. While the mean of a performance measure indicates the prediction accuracy of the framework, the standard deviation indicates the uncertainty associated with the corresponding performance indicator. This study adopted a multi-criterion approach (RMSE, MARE, MR, and R) in evaluating the overall performance of the adopted models, as each of the performance indicators provides different information about the predictive ability of the model. The performance indicators RMSE, MARE, MR, and R are calculated using equations (1), (2), (3), and (4), respectively:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - y_i'\right)^2} \quad (1)$$

$$\mathrm{MARE} = \frac{1}{n}\sum_{i=1}^{n}\frac{\left|y_i - y_i'\right|}{y_i} \quad (2)$$

$$\mathrm{MR} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - y_i'\right) \quad (3)$$

$$R = \frac{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)\left(y_i' - \bar{y}'\right)}{\sqrt{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2\,\sum_{i=1}^{n}\left(y_i' - \bar{y}'\right)^2}} \quad (4)$$

where $y_i$ and $y_i'$ represent the measured and computed LE values, $\bar{y}$ and $\bar{y}'$ represent the means of the measured and computed LE values, respectively, and $n$ represents the number of data instances. The RMSE statistic gives more weight to high values by squaring the difference between the observed and predicted values, and hence is more representative of the model's ability to predict away from the mean. The MARE provides an unbiased error estimate because it gives appropriate weight to all magnitudes of the predicted variable. The MR statistic illustrates the bias associated with the model, and the correlation statistic R evaluates the linear correlation between the measured and the computed values.
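The four indicators can be computed directly. A minimal sketch (it assumes strictly positive measured values for MARE, and the LE numbers below are illustrative, not from the study's data set):

```python
import math

def metrics(y, y_hat):
    # y: measured LE values; y_hat: computed LE values.
    n = len(y)
    rmse = math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / n)
    mare = sum(abs(a - b) / a for a, b in zip(y, y_hat)) / n  # assumes y > 0
    mr = sum(a - b for a, b in zip(y, y_hat)) / n             # model bias
    my, mh = sum(y) / n, sum(y_hat) / n
    cov = sum((a - my) * (b - mh) for a, b in zip(y, y_hat))
    r = cov / math.sqrt(sum((a - my) ** 2 for a in y)
                        * sum((b - mh) ** 2 for b in y_hat))
    return {"RMSE": rmse, "MARE": mare, "MR": mr, "R": r}

measured = [100.0, 150.0, 200.0, 250.0]    # illustrative LE values (W/m2)
computed = [110.0, 140.0, 210.0, 240.0]
print(metrics(measured, computed))
```

The illustrative numbers make the complementary roles visible: the residuals alternate in sign, so MR is zero (no bias) even though RMSE is 10 W/m2, which is precisely why a multi-criterion evaluation is adopted.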

[25] Adopting an ensemble technique not only assists in evaluating the uncertainty associated with the proposed modeling framework, but also helps in addressing one of the pertinent issues in GP, namely generalization. Iba [1999] applied the ensemble methods of bagging and boosting within the framework of GP and obtained encouraging results in predicting financial data. Keijzer and Babovic [2000] and Folino et al. [2006] demonstrated that ensemble methods like bagging and boosting can reduce the generalization error in GP. Hence, the proposed methodology of combining a self-organizing algorithm (GP) with statistical resampling techniques is expected to reduce, if not fully overcome, the generalization error. The GP system used in this study is an adaptation of GPLAB [Silva, 2005], a GP toolbox for MATLAB. Since the values of the GP system parameters (e.g., crossover rate, mutation rate, population size) are problem dependent, the usual practice is to determine them by trial and error, with the objective of minimizing the cost function during the training process. This study adopted a similar approach in arriving at the GP parameters, and the resulting parameter values are shown in Table 3. One of the main issues that needs to be addressed in developing a GP system is that of "bloating", the exponential growth of redundant and functionally useless branches in the parse trees. Bloating is caused by the genetic operators (crossover and mutation) in their quest to arrive at better solutions. Several bloat control techniques have been proposed in the literature, and reviews of these methods are given by Soule and Foster [1999], Poli [2003], and Silva and Costa [2004]. This study adopted the heavy dynamic limit method proposed by Silva and Costa [2004], which imposes a dynamic limit on the depth of the trees allowed in the population; the limit is initially set to a low value and is raised (or lowered) only when needed to accommodate an individual with better performance that would otherwise break it. More information on the heavy dynamic limit method is given by Silva and Costa [2004].

Table 3. GP Parameters
GP Parameter Value
Population size 20
Initialization method Ramped half-and-half
Sampling method Roulette
Maximum initial tree depth 8
Probability of crossover, Pc 0.6
Probability of mutation, Pm 0.3
Cost function RMSE
No. of generations 500

5. Models Adopted

[26] To investigate the relation between model complexity, prediction accuracy, and uncertainty, two sets of experiments were carried out by varying the set of mathematical operators (complexity) that can be used to define the predictand-predictor relationship. While the first set (GP(1)) uses only the additive operators, the second set (GP(2)) uses both the additive and the multiplicative operators to define the predictand-predictor relationship. In other words, the functional sets of the GP(1) and GP(2) models are {+, −} and {+, −, /, *}, respectively, while the terminal set of both models remains the same: along with randomly generated constants, {NR, AT, GT, RH, WS} constitutes the terminal set of both the GP(1) and GP(2) models. With regard to the complexity of the mathematical operators used to define the predictand-predictor relationship, GP(2) can be considered more complex than GP(1), and hence comparing their relative performance can highlight the relationship between model complexity, prediction accuracy, and uncertainty.
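The practical consequence of restricting the functional set can be illustrated with a minimal expression-tree evaluator. This is a hypothetical Python sketch, not the GPLAB/MATLAB implementation used in the study; the tuple encoding and function names are assumptions.

```python
# Candidate GP models represented as nested tuples over the two
# functional sets: GP(1) draws operators from {+, -}; GP(2) also allows
# {*, /}. The terminal set {NR, AT, GT, RH, WS} plus random constants
# is shared by both models.
import operator

FSET_GP1 = {'+': operator.add, '-': operator.sub}
FSET_GP2 = {**FSET_GP1, '*': operator.mul, '/': operator.truediv}

def evaluate(tree, inputs, fset):
    """Recursively evaluate a parse tree given a dict of terminal values."""
    if isinstance(tree, tuple):              # internal node: (op, left, right)
        op, left, right = tree
        if op not in fset:
            raise ValueError(f"operator {op!r} not in the functional set")
        return fset[op](evaluate(left, inputs, fset),
                        evaluate(right, inputs, fset))
    if isinstance(tree, str):                # terminal variable
        return inputs[tree]
    return tree                              # ephemeral random constant

inputs = {'NR': 0.8, 'AT': 0.6, 'GT': 0.5, 'RH': 0.4, 'WS': 0.3}
additive = ('+', 'NR', ('-', 'GT', 0.1))        # expressible by both models
interaction = ('+', ('*', 'NR', 'GT'), 'WS')    # expressible by GP(2) only
evaluate(additive, inputs, FSET_GP1)            # a purely additive estimate
evaluate(interaction, inputs, FSET_GP2)         # includes an NR*GT interaction
```

Evaluating `interaction` under `FSET_GP1` raises an error, which is exactly the restriction separating the two experiments: GP(1) can only express additive combinations of the terminals, whereas GP(2) can also evolve interaction terms such as NR * GT.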

[27] It should be noted that for both the GP(1) and GP(2) models, for each TB, both the model structure and the model parameters are evolved simultaneously by the self-organizing nature of the GP algorithm. Hence the predictive uncertainty calculated by the GP(1) and GP(2) models includes the contributions of both model structure uncertainty and model parameter uncertainty. To identify the relative individual contributions of the model structure uncertainty and the model parameter uncertainty to the predictive uncertainty, two variants (GP(1)′ and GP(2)′) of the models GP(1) and GP(2) were developed. In GP(1)′ and GP(2)′, instead of simultaneously evolving both the model structure and the model parameters, the model structure was predefined and only the corresponding model parameters were optimized; this isolates the uncertainty due to the model parameters. The representative model structure used in each case is the one that would be adopted in traditional parametric uncertainty estimation: the optimal GP-evolved model trained on the original training data set (without resampling). For GP(1)′, this representative structure is equation (5), evolved with the same functional and terminal sets as the GP(1) model; similarly, the representative model structure for GP(2)′ (equation (6)) was derived from the actual training data set (without resampling) using the same functional and terminal sets as the GP(2) model. Equations (5) and (6) are represented as follows:
[equations (5) and (6) appear as images in the original article and are not reproduced here]

[28] Plots comparing the measured and the estimated LE by equations (5) and (6) are presented in Figure 4. For both GP(1)′ and GP(2)′, the representative model structures were kept static, and their model parameters alone were optimized to fit each of the 60 resampled data sets. Equations (5) and (6) have two and five parameters, respectively, which need to be optimized. The uncertainty estimate and the performance statistics of GP(1)′ and GP(2)′ can be calculated as outlined earlier. Comparing the performance of the model GP(1) with its variant GP(1)′ indicates the influence of model structure uncertainty within a simple model; similarly, comparing the performance of the model GP(2) with its variant GP(2)′ indicates the influence of model structure uncertainty within a complex model. Prior to estimating LE using the models GP(1), GP(2), and their variants GP(1)′ and GP(2)′, both the independent variables (NR, AT, GT, RH, WS) and the dependent variable (LE) are normalized by dividing each variable by its corresponding maximum value. This is done to overcome the problem of dimensional inconsistency and to achieve better generalization. These normalized values are hereinafter simply referred to as NR, AT, GT, RH, WS, and LE. For a rational comparison of the different models adopted in this study, the resampled data sets (TB) were generated once in advance, so that all models could be trained on the same resampled data sets.
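The two data-preparation steps described above, normalization by the maximum value and non-parametric bootstrap resampling of the training records, can be sketched as follows. This is a hypothetical Python/NumPy illustration with toy numbers; the paper's actual code and array layout are not specified.

```python
# Sketch of the preprocessing pipeline: (1) divide every variable by its
# maximum value, and (2) draw non-parametric bootstrap resamples
# (records sampled with replacement) to form the 60 training sets (TB).
import numpy as np

def normalize_by_max(X):
    """Divide each column (variable) by its column maximum."""
    return X / X.max(axis=0)

def bootstrap_resamples(X, y, n_resamples=60, seed=0):
    """Yield (X_b, y_b) pairs, each resampled row-wise with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y)
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)   # n records drawn with replacement
        yield X[idx], y[idx]

# Toy records: two predictors (say NR and AT) and the predictand LE.
X = np.array([[400., 15.], [250., 10.], [100., 5.]])
y = np.array([300., 180., 60.])
Xn, yn = normalize_by_max(X), y / y.max()
resamples = list(bootstrap_resamples(Xn, yn))   # the 60 fixed TB data sets
```

Generating the 60 resampled sets once, with a fixed seed, mirrors the paper's requirement that GP(1), GP(2), and their variants all be trained on identical resampled data sets.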

Figure 4. Comparison of measured and estimated LE by (a) equation (5) and (b) equation (6), for the testing data set.

6. Results and Analysis

[29] The performances of the different models in terms of RMSE, MARE, MR, and R, during both training and testing, are presented in Table 4; the values within parentheses indicate the uncertainty associated with the corresponding performance measure. In general, it is of interest to note that the RMSE for the testing data set is better than that for the training data set (Table 4). Compared to the testing data set, the training data set is dominated by higher values of LE flux (Tables 1 and 2); since the RMSE statistic gives more weight to high values, better RMSE performance during testing is expected. During training, with the exception of the MR statistic, the GP(2) model outperformed the GP(1) model in terms of the other performance statistics; during testing, the GP(2) model consistently outperformed the GP(1) model in terms of all performance statistics (Table 4). Examining the uncertainty associated with the GP(1) and GP(2) models, during training, with the exception of the uncertainty associated with the MARE and R statistics, the GP(2) model resulted in higher uncertainty estimates. Similarly during testing, compared to the GP(1) model, the GP(2) model resulted in higher uncertainty estimates for the RMSE, MARE, and MR statistics (Table 4). Bearing in mind that the GP(2) model is more complex than the GP(1) model, the above analysis indicates that increasing the complexity of the model used to characterize the actual evapotranspiration process may lead to better prediction accuracy, but at the expense of higher uncertainty. This finding underscores the need to identify an acceptable trade-off between prediction accuracy and uncertainty before choosing the level of detail (complexity) to be accommodated within the modeling framework. Compared to the uncertainty associated with the other performance indicators, the uncertainty associated with the MR statistic is relatively high for both the GP(1) and GP(2) models (Table 4), indicating that model bias is an important factor to be considered in evaluating the reliability of the models.
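The four performance measures and the ensemble summary reported in Table 4 can be reproduced schematically as follows. The sketch assumes the reported uncertainty is the spread (sample standard deviation) of each statistic across the 60 ensemble members; the paper does not state the exact estimator, so that choice, the function names, and the toy data are all assumptions.

```python
# Schematic computation of the Table 4 statistics: RMSE, MARE (mean
# absolute relative error), MR (mean residual, i.e., bias), and R
# (linear correlation), summarized over an ensemble of simulations.
import numpy as np

def rmse(obs, sim):
    return float(np.sqrt(np.mean((obs - sim) ** 2)))

def mare(obs, sim):
    return float(np.mean(np.abs((obs - sim) / obs)))

def mean_residual(obs, sim):
    return float(np.mean(obs - sim))            # MR: model bias

def corr(obs, sim):
    return float(np.corrcoef(obs, sim)[0, 1])   # R

def ensemble_summary(obs, sims, stat):
    """Mean and spread (sample std) of `stat` over the ensemble members."""
    values = np.array([stat(obs, s) for s in sims])
    return values.mean(), values.std(ddof=1)

# Toy observed LE series and two hypothetical ensemble members.
obs = np.array([300., 180., 60., 120.])
sims = [obs + np.array([5., -3., 2., -4.]),
        obs + np.array([-6., 4., -1., 3.])]
mean_rmse, rmse_spread = ensemble_summary(obs, sims, rmse)
```

With the 60 GP-evolved models in place of the two toy members, `ensemble_summary` yields exactly the "mean (uncertainty)" pairs tabulated for each statistic.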

Table 4. Mean Performance Measures and Uncertainty (in Parentheses) of 60 Models per Class, During Training and Testing

Model  | Training: RMSE, MARE, MR, R                          | Testing: RMSE, MARE, MR, R
GP(1)  | 67.75 (2.49), 0.56 (0.05), 0.60 (5.27), 0.59 (0.04)  | 59.54 (2.01), 0.81 (0.06), −8.98 (9.59), 0.60 (0.03)
GP(2)  | 46.77 (4.46), 0.33 (0.04), 2.15 (7.61), 0.82 (0.02)  | 44.19 (5.44), 0.46 (0.08), 8.76 (11.05), 0.82 (0.03)
GP(1)′ | 63.49 (0.09), 0.56 (0.01), −0.07 (2.51), 0.58 (0.00) | 61.28 (0.61), 0.93 (0.02), −16.85 (2.33), 0.59 (0.00)
GP(2)′ | 42.37 (0.31), 0.28 (0.01), 1.79 (1.87), 0.84 (0.00)  | 39.21 (0.51), 0.32 (0.01), 12.63 (1.27), 0.86 (0.00)

[30] Since GP(1) and GP(2) represent the models in which both the model structure and the associated parameters were treated as unknowns, while their variants GP(1)′ and GP(2)′ represent the models that assume a deterministic model structure with only the parameters treated as unknowns, comparing their performance helps identify the relative contributions of model structure uncertainty and model parameter uncertainty to the predictive uncertainty of the model. During training, compared to the GP(1) model, its variant GP(1)′ resulted in better RMSE and MR statistics and performed on par in terms of the MARE and R statistics. During testing, however, the performance of GP(1)′ is worse than that of the GP(1) model in terms of all performance measures (Table 4), indicating that the GP(1) model has better generalization ability than its variant GP(1)′. Examining the uncertainty associated with GP(1) and GP(1)′, the uncertainty of GP(1)′ is significantly less than that of the GP(1) model during both training and testing (Table 4). This indicates that, compared to the model parameter uncertainty, the relative contribution of the model structure uncertainty to the predictive uncertainty of the model is more significant. Similarly, comparing the performance of the GP(2) model with its variant GP(2)′, with the exception of the MR statistic, GP(2)′ performed better than the GP(2) model during both training and testing (Table 4). Investigating the uncertainty associated with the models GP(2) and GP(2)′, GP(2)′ resulted in significantly less uncertainty than the GP(2) model, again highlighting the significant contribution of model structure uncertainty to the predictive uncertainty of the model (Table 4). The analysis can be extended to assess the relative contributions of model structure uncertainty and model parameter uncertainty as a function of model complexity. Compared to the simpler models (GP(1) and GP(1)′), the reduction in uncertainty achieved by assuming a deterministic model structure with unknown parameters, as opposed to treating both the model structure and the parameters as imperfectly known, is more pronounced for the more complex models (GP(2) and GP(2)′) (Table 4). This shows that an increase in model complexity is accompanied by a relative increase in the contribution of the model structure uncertainty to the predictive uncertainty of the model.

[31] To illustrate the performance of the plausible models identified within the ensemble-based genetic programming framework, scatter plots of the plausible models in terms of their different performance measures on the testing data set were constructed. Figure 5 shows the scatter plots of the different performance measures for the relatively less complex model: the top row represents the GP(1) model, and the bottom row represents its variant, GP(1)′. The scatter associated with the model that accommodates the model structure uncertainty (GP(1)) is more pronounced than the scatter associated with the model that assumes the model structure to be deterministic (GP(1)′). Similarly, Figure 6 shows the scatter plots of the different performance measures for the relatively more complex model: the top row represents the GP(2) model, and the bottom row represents its variant, GP(2)′. As in the previous comparison, the scatter associated with the model that accommodates the model structure uncertainty (GP(2)) is more pronounced than the scatter associated with the model that assumes the model structure to be deterministic (GP(2)′).

Figure 5. Scatter plots of different performance indices for the relatively less complex model on the testing data set. Each dot represents the performance of one plausible model. The top row indicates the model structure uncertainty (GP(1)), and the bottom row indicates the model parameter uncertainty (GP(1)′).
Figure 6. Scatter plots of different performance indices for the relatively more complex model on the testing data set. Each dot represents the performance of one plausible model. The top row indicates the model structure uncertainty (GP(2)), and the bottom row indicates the model parameter uncertainty (GP(2)′).

[32] To highlight the relationship between model structure uncertainty and model parameter uncertainty as a function of model complexity, the uncertainty bounds associated with the RMSE statistic (RMSE ± uncertainty) for all the models are plotted in Figure 7. Although uncertainty bounds can be calculated for each of the performance measures reported in Table 4, only the bounds based on the RMSE statistic are presented here, as it is the designated cost function within the ensemble-based genetic programming framework. From Figure 7 it is evident that the uncertainty bounds of the GP(1) and GP(2) models are distinctly different. Figure 7 also reveals that increasing the complexity of the model could lead to better prediction accuracy but at the cost of higher uncertainty. As evident from Figure 7, for the less complex model, the uncertainty bounds of the GP(1) model and its variant GP(1)′ partly overlap each other. This indicates that, for the less complex model, accounting for the model structure uncertainty could account for most, if not all, of the uncertainty associated with the model parameters. For the more complex model, the uncertainty bounds of GP(2)′ lie within the uncertainty envelope of the GP(2) model. This signifies that for a complex model, which is more realistic and rational, accounting for the model structure uncertainty could implicitly account for the uncertainty emanating from the model parameters. However, generalizing such findings requires applying the proposed framework to multiple case studies.
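The interval comparison underlying Figure 7 can be sketched with simple arithmetic on the Table 4 testing values, taking each band as RMSE ± uncertainty. The helper names are hypothetical; the overlap test is ordinary interval intersection.

```python
# Sketch of the Figure 7 comparison: RMSE uncertainty bands built from
# the Table 4 testing statistics, plus a check of whether two bands
# intersect. Band = (RMSE - uncertainty, RMSE + uncertainty).

def band(mean_rmse, unc):
    """Uncertainty band as (lower bound, upper bound)."""
    return mean_rmse - unc, mean_rmse + unc

def overlap(a, b):
    """True if the two bands share any RMSE values."""
    return a[0] <= b[1] and b[0] <= a[1]

gp1  = band(59.54, 2.01)   # GP(1): structure + parameter uncertainty
gp1p = band(61.28, 0.61)   # GP(1)': parameter uncertainty only
gp2  = band(44.19, 5.44)   # GP(2)
gp2p = band(39.21, 0.51)   # GP(2)'

overlap(gp1, gp1p)   # the simple model's bands partly overlap
overlap(gp2, gp2p)   # the complex model's bands also intersect
overlap(gp1, gp2)    # the GP(1) and GP(2) bands are distinctly separate
```

The last check confirms numerically what Figure 7 shows graphically: the GP(1) and GP(2) envelopes do not intersect, while each model's band intersects that of its fixed-structure variant.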

Figure 7. Relationship between model structure uncertainty and model parameter uncertainty as a function of model complexity. RMSE represents the root mean squared error, and UB and LB represent the upper and lower bounds, respectively.

[33] Another advantage of the ensemble-based genetic programming framework proposed in this study is its ability to identify the most relevant input variables required for characterizing a particular hydrological process. The values in Table 5 represent the percentage of occurrence of each variable of the terminal set in the 60 plausible models. For the GP(1) model, where only the additive operators “+” and “−” are included in the functional set, the percentages of occurrence of NR, AT, GT, RH, and WS among the plausible model structures identified by the framework are 4.2%, 2.8%, 84.7%, 0%, and 8.3%, respectively. When the multiplicative operators “*” and “/” are added to the functional set, as in the GP(2) model, the percentages of occurrence of NR, AT, RH, and WS increase to 35.5%, 11.9%, 6.6%, and 9.9%, respectively, while the percentage of occurrence of GT decreases to 36.1%. While the GP(1) model identified GT as the most important variable, the GP(2) model identified both NR and GT as the most important variables in characterizing the EC-measured LE flux (Table 5), illustrating that NR and GT exhibit a strong associative relationship in estimating the EC-measured LE flux. The most relevant inputs identified by the proposed framework reiterate the findings of Parasuraman et al. [2006, 2007b], where it was argued that NR accounts for the “energy-limited” conditions, and GT, as a surrogate of soil moisture (owing to the strong link between soil thermal properties and the corresponding moisture status), accounts for the “supply-limited” conditions of the evapotranspiration process. Table 6 shows some of the plausible model structures identified by the GP(2) model, along with their RMSE statistics based on the testing data set. Although the plausible model structures identified by the GP(2) model differ in both their structures and their parameters, they still result in similar values of the RMSE statistic, highlighting the presence of “equifinality” [Beven and Binley, 1992] with respect to the model structures.
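An input-relevance tally of the kind summarized in Table 5 can be sketched as follows, assuming each evolved model is available as an expression string. Counting raw occurrences of each terminal across the ensemble is an assumption about how the paper's percentages were computed; the function name and the regex-based tokenization are likewise hypothetical.

```python
# Sketch of the input-relevance tally behind Table 5: count how often
# each terminal variable appears across an ensemble of evolved model
# expressions, and express the counts as percentages of all variable
# occurrences.
import re
from collections import Counter

TERMINALS = ['NR', 'AT', 'GT', 'RH', 'WS']

def occurrence_percentages(models):
    counts = Counter()
    for expr in models:
        for var in re.findall(r'\b(?:NR|AT|GT|RH|WS)\b', expr):
            counts[var] += 1
    total = sum(counts.values())
    if total == 0:
        return {v: 0.0 for v in TERMINALS}
    return {v: 100.0 * counts[v] / total for v in TERMINALS}

# Three of the plausible GP(2) structures from Table 6 as toy input:
models = [
    'LE = 0.12*NR + 0.49*NR*GT + 0.06*GT + 0.25*GT**2',
    'LE = 0.58*NR*GT + 0.17*GT*WS + 0.17*GT',
    'LE = 0.46*NR + 0.07*WS + 0.26*GT**2',
]
occurrence_percentages(models)   # GT and NR dominate; AT and RH absent
```

Even on this three-model subset, GT and NR dominate the tally while AT and RH never appear, which is consistent with the relative importance pattern reported for the full 60-model ensemble in Table 5.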

Table 5. Percentage of Various Input Selections in the GP Models

Model  | NR   | AT   | GT   | RH  | WS
GP(1)  | 4.2  | 2.8  | 84.7 | 0.0 | 8.3
GP(2)  | 35.5 | 11.9 | 36.1 | 6.6 | 9.9
Table 6. Performance of Some Representative Plausible Model Structures Identified by the GP(2) Model on the Testing Data Set

Model Structure | RMSE
LE = 0.12 * NR + 0.49 * NR * GT + 0.06 * GT + 0.25 * GT^2 | 38.63
LE = 0.58 * NR * GT + 0.17 * GT * WS + 0.17 * GT | 38.91
LE = 0.46 * NR + 0.07 * WS + 0.26 * GT^2 | 39.29

7. Summary and Conclusions

[34] Uncertainty emanating from the model structure has not been accorded the status it merits if reliability of hydrologic models is considered an important criterion. The status quo of uncertainty estimation in hydrologic modeling is to assume a deterministic model structure with its parameters treated as imperfectly known. The uncertainty estimated by these methods uncovers only a partial share of the actual uncertainty, since they neglect the uncertainty associated with the model structure by assuming it to be deterministic. Driven by the objective of improving the reliability of hydrologic models by accounting for model structure uncertainty, the concept of adopting multimodel ensembles for hydrologic prediction has been gaining interest lately. Although most of the literature on multimodel ensembles for hydrologic prediction endorses the idea of identifying plausible models to account for model structure uncertainty, none of these methods base their recommendations on a comprehensive number of plausible models. Most use a limited number of conceptual mathematical models as the plausible models, and hence the uncertainty calculated by these methods may be biased and underestimated. In this study, an ensemble-based genetic programming framework has been proposed, which exploits the ability of GP to explore the plausible model space in a comprehensive manner, thereby resulting in a more realistic quantification of model structure uncertainty. The EC-measured actual evapotranspiration flux from the SWSS facility, a reconstructed watershed located near Ft. McMurray, Alberta, Canada, is used to demonstrate the robustness of the proposed modeling framework. The framework was also used to (1) investigate the effect of model complexity on predictive accuracy and uncertainty, and (2) identify the relative contributions of model parameter uncertainty and model structure uncertainty to the predictive uncertainty of the model.

[35] Results from this study indicate that increasing the complexity of a model could lead to improved prediction accuracy, but at the expense of increased uncertainty, implying that it is hard, if not impossible, to simultaneously improve prediction accuracy and reduce uncertainty. This finding underscores the need to identify an acceptable trade-off between prediction accuracy and uncertainty before choosing the level of detail (complexity) to be accommodated within the modeling framework. Analyzing the relative contributions of the model structure uncertainty and the model parameter uncertainty to the predictive uncertainty of the model, the contribution of model structure uncertainty is shown to be more significant. On the basis of the results from this study, it is argued that model structure uncertainty is an important factor that needs to be accounted for if the intent is to develop hydrologic models that are not only accurate but also reliable. It is also shown that an increase in model complexity is accompanied by a relative increase in the contribution of the model structure uncertainty to the predictive uncertainty of the model. The results corroborate the concept of “equifinality” [Beven and Binley, 1992], which argues that many different model structures, and many different parameter sets within a chosen model structure, may be behavioral or acceptable in reproducing the observed behavior of a complex environmental system. The proposed framework could be further refined by selecting different training and testing sets, not conditioned on a particular selection process, as this could result in better quantification of uncertainty. Finally, on the basis of this study, it is recommended that the search for the optimal model be replaced by the quest to unearth plausible models for characterizing hydrological processes.

Acknowledgments

[36] The authors would like to thank Sean Carey, Carleton University, Canada, for providing the data set used in this study. Financial support from the University of Saskatchewan, Natural Science and Engineering Research Council (NSERC) of Canada, and Cumulative Environmental Management Association (CEMA) is gratefully acknowledged.