Volume 54, Issue 3
Research Article

A Primer for Model Selection: The Decisive Role of Model Complexity

Marvin Höge

Corresponding Author

E-mail address: marvin.hoege@uni-tuebingen.de

Institute for Modelling Hydraulic and Environmental Systems (LS3)/SimTech, University of Stuttgart, Stuttgart, Germany

Center for Applied Geoscience, University of Tübingen, Tübingen, Germany

Correspondence to: M. Höge, E-mail address: marvin.hoege@uni-tuebingen.de
Thomas Wöhling

Department of Hydrology, Technical University of Dresden, Dresden, Germany

Lincoln Environmental Research, Lincoln Agritech, Hamilton, New Zealand

Wolfgang Nowak

Institute for Modelling Hydraulic and Environmental Systems (LS3)/SimTech, University of Stuttgart, Stuttgart, Germany

First published: 22 February 2018

Abstract

Selecting a “best” model among several competing candidate models is a frequently encountered problem in water resources modeling (and other disciplines that employ models). For a modeler, the best model fulfills a certain purpose best (e.g., flood prediction), which is typically assessed by comparing model simulations to data (e.g., stream flow). Model selection methods find the “best” trade‐off between good fit with data and model complexity. In this context, the interpretations of model complexity implied by different model selection methods are crucial, because they represent different underlying goals of modeling. Over the last decades, numerous model selection criteria have been proposed, but modelers who primarily want to apply a model selection criterion often face a lack of guidance for choosing the right criterion that matches their goal. We propose a classification scheme for model selection criteria that helps to find the right criterion for a specific goal, i.e., which employs the correct complexity interpretation. We identify four model selection classes which seek to achieve high predictive density, low predictive error, high model probability, or shortest compression of data. These goals can be achieved by following either nonconsistent or consistent model selection and by either incorporating a Bayesian parameter prior or not. We allocate commonly used criteria to these four classes, analyze how they represent model complexity and what this means for the model selection task. Finally, we provide guidance on choosing the right type of criteria for specific model selection tasks. (A quick guide through all key points is given at the end of the introduction.)

1 Introduction

Model selection is often explained as finding a compromise between the ability of a model to fit data and the complexity of the model required to do so. Fitting data comprises both matching known data and meeting yet unknown data in forecasts (Guthke, 2017; White, 2017). While there are clear definitions of measures for the quality of fit, this is not the case for model complexity. Therefore, a better understanding of complexity is decisive for appropriate model choice.

Complexity is an elusive property (Van Emden, 1971). Everyone has an intuitive notion about whether something is complex or not. However, there is neither a clear definition of what complexity actually means nor a unique approach to quantifying it (Gell‐Mann, 1995a). The main motivation for defining and quantifying complexity can be summarized by three intentions (Bialek et al., 2001a): (1) to prove in general that things become more complex as they evolve, e.g., organisms and turbulent flow; (2) to measure how complex a certain explanation, hypothesis, or model of a phenomenon is; and (3) to evaluate how hard it is to logically describe the state of an interrelated system. These intentions refer to evolution, model complexity, and system complexity, respectively.

In the attempt to define complexity, various insights have been worked out (Gell‐Mann, 1995b) that are broadly accepted as properties of complexity: complexity is maximal between order and randomness—both a fully ordered system and a completely random system are considered to be not complex (Ladyman et al., 2012; Prokopenko et al., 2009; Rudnicki et al., 2016; Wiesner, 2015). Complexity is scale dependent (Allen et al., 2017)—while heterogeneity of a system property or structure on a small scale might be highly complex, it can sometimes be considered (quasi‐)homogeneous on a larger scale. Complexity is related to emergence (Bawden & Robinson, 2015)—a complex system is more than the sum of its parts due to interaction and interdependence (Bozdogan, 2000). Complexity is subextensive (Bialek et al., 2001b)—the complexity of multiple identical objects is smaller than the complexity of one object times the number of objects (Bawden & Robinson, 2015). A discussion of further properties of complexity can for example be found in Edmonds (1999, 2000). The conceptual understanding of these properties has led to numerous measures of complexity, of which many are collected in a nonexhaustive list (Du, 2016; Lloyd, 2001).

The above characteristics show that complexity should not be confused with related but different system properties like chaos or entropy. Chaos can emerge from very simple systems such as a small set of nonlinear, coupled ordinary differential equations. Accordingly, chaotic systems are not necessarily complex systems. Complexity and chaos might share their relation to nonlinearity in the modeled dynamics, but chaos can emerge from both simple and complex systems (Baranger, 2000). Entropy—in both its physical and information‐theoretic meanings—grows monotonically with the level of disorder and is minimal for complete order, while complexity is maximal between the two extremes (Rudnicki et al., 2016). Also, entropy is additive (extensive), whereas complexity is only subextensive (Bawden & Robinson, 2015).

1.1 Model Complexity

When working with different hypotheses or models, we follow the second motivation for measuring complexity mentioned at the beginning: evaluating model complexity. This is a frequently encountered issue in many disciplines: quantifying the complexity of a certain conceptual or mathematical model is of major interest for model rating and selection in the search for a good data fit at low model complexity (e.g., Schöniger et al., 2015). This kind of model evaluation is frequently performed for hydro(geo)logic models (Martinez & Gupta, 2011; Pande et al., 2009; Poeter & Anderson, 2005; Schöniger et al., 2014; von Gunten et al., 2014). These range from fully integrated approaches based on partial differential equations (e.g., Mendoza et al., 2015; von Gunten et al., 2014) via semiphysical models to lumped (e.g., Arkensteijn & Pande, 2013) or regressive models that do not include physical laws (e.g., Zhang & Zhao, 2012). These two extremes (e.g., Bialek et al., 2001a) of model types are also referred to as mechanistic and data‐driven models (Schoups et al., 2008), respectively. To bridge these extremes, metamodeling approaches that combine different types have been shown to merge their respective strengths, e.g., for supporting decision making (e.g., Fienen et al., 2016).

With increasing computational power, water resources models have become increasingly detailed in the processes they represent or approximate (Hill, 2006; Orth et al., 2015; Pande et al., 2014; Refsgaard & Abbott, 1996). The more processes in nature are understood, the stronger the tendency to incorporate that knowledge as physical or semiphysical equations into mechanistic models (Hunt et al., 2007) or to use more advanced versions of data‐driven models at the other end of the spectrum of model classes. A common motivation for extending mechanistic models is that additional functional relations and parameters can sometimes constrain predictions to a more plausible physical range. Likewise, regression models with more terms offer more flexibility. Yet the major problem with extended models of any type is nonuniqueness in calibration and poor parameter identifiability (Schoups et al., 2008). Additionally, models with more terms and parameters are not necessarily better than models with fewer terms in matching the data (Perrin et al., 2001), unless the additional terms and parameters are chosen well. The same problems are encountered in ecology (e.g., Hooten & Hobbs, 2015; Johnson & Omland, 2004; Merow et al., 2014), psychology (e.g., Mulder & Wagenmakers, 2016; Pitt et al., 2002), socioeconomic sciences (e.g., Elliott & Timmermann, 2013), or machine learning (e.g., Friedman et al., 2001), to name just a few examples.

Adding parameters is qualitatively accepted as an increase in complexity. However, this only relates to the parametric complexity of the model (Vanpaemel, 2009) and does not include the mathematical functions which build the model (e.g., Pande et al., 2015). Further, it does not account for interactions among parameters and does not answer the question of which model has the optimal complexity for a particular purpose (Claeskens, 2016). An optimal model is neither too complex (Myung, 2000; Orth et al., 2015; Warren & Seifert, 2011) nor too simple (Hunt et al., 2007; Mendoza et al., 2015) for the modeling task at hand.

As a rule of thumb stemming from regression models, a model which is considered to be too complex is prone to overfitting. This means it suffers from excessive flexibility and poorly identifiable parameters and exhibits a large variance of predictions (Lever et al., 2016), unless large and informative data sets are available to constrain its parameters (Pande et al., 2015, and references therein). The opposite is true for too simple models that underfit. Then, the variance is small because the few model parameters are easily constrained with the available data. However, the model shows a high bias between its predictions and the data, because it is too simple to accurately describe the system and match the data. Bias describes how well the model fits the data on average, i.e., the model's accuracy. Variance corresponds to the spread of squared residuals around this average (as a measure of uncertainty), i.e., the precision of the model. Both are visualized in Figure 1a.

Figure 1. Concepts and effects of bias and variance. (a) Accuracy and precision visualized as bias and variance of shots on a target. Bias is the distance between the target center and the average position of shots. Variance is the spread of shots around their average. (b) Decomposition of total squared error into squared bias and variance (after e.g., Friedman et al., 2001). Bias is supposed to decrease and variance is supposed to grow with increasing model complexity, both due to growing model flexibility. Their superposition forms a minimum that marks optimal model complexity.

Underfitting and overfitting are direct results of these two parts of the total error, and both of them deteriorate a model's ability to explain and forecast data (e.g., Guthke, 2017; White, 2017). The tension between these two makes it difficult to justify picking a certain model as the appropriate one, because an underfitting model is missing a necessary detail while the reason for overfitting is unnecessary complexity (Vanlier et al., 2014). Following this idea of overfitting and underfitting, the optimal model of optimal complexity performs best in a bias‐variance trade‐off as depicted in Figure 1b.
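
The bias‐variance decomposition in Figure 1b can be reproduced numerically. The following minimal sketch (in Python) fits polynomials of increasing degree to many replicate noisy data sets drawn from a hypothetical sinusoidal truth; all names and settings are illustrative assumptions, not part of the original study.

```python
# Sketch of the bias-variance trade-off in Figure 1b: polynomials of growing
# degree are calibrated to replicate noisy data sets from an assumed truth.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
truth = np.sin(2 * np.pi * x)            # hypothetical data-generating process
n_replicates, sigma = 200, 0.3

for degree in range(1, 10):
    preds = np.empty((n_replicates, x.size))
    for r in range(n_replicates):
        data = truth + rng.normal(0, sigma, x.size)    # one noisy data set
        coeff = np.polyfit(x, data, degree)            # least-squares fit
        preds[r] = np.polyval(coeff, x)
    bias2 = np.mean((preds.mean(axis=0) - truth) ** 2)  # squared bias
    var = np.mean(preds.var(axis=0))                    # prediction variance
    print(f"degree {degree}: bias^2={bias2:.4f}, var={var:.4f}, "
          f"total={bias2 + var:.4f}")
```

The squared bias shrinks and the variance grows with the degree, so their sum exhibits a minimum at an intermediate complexity.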

1.2 Model Selection

To decide which model is optimal, formal methods for model selection (e.g., Claeskens, 2016; Guyon et al., 2010) can be applied—ideally to find the model that neither overfits nor underfits, practically the one that suffers the least from either of the two. All selection techniques follow some sort of Occam's razor, i.e., the principle of parsimony (e.g., Schöniger et al., 2014; Vandekerckhove et al., 2015). Briefly, this means as simple as possible, as complex as necessary. Model selection techniques seek to find an optimal trade‐off between the ability of the model to fit data and the model's required complexity to do so (Kemeny, 1953; Rasmussen & Williams, 2006). With regressive models like polynomials in mind, it also becomes apparent why goodness of fit to data alone is not enough for model selection (Vrieze, 2012): a polynomial model of a sufficiently high degree will always fit a data series better than one of a lower degree. However, such a comparison does not consider the “effort” (higher degree) the better fitting model has to make. Hence, in model selection, model complexity is penalized. This means that a highly complex model has to justify its complexity by a correspondingly high goodness of fit. A model that does not fit the data as well but is considered to be much simpler will rank better than a more complex one that does not provide a good enough fit to justify its higher complexity.
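
That the in‐sample fit of a polynomial can only improve with its degree is easy to verify. The short sketch below (synthetic data, illustrative settings only) shows the within‐sample sum of squared errors decreasing monotonically with degree, which is why a complexity penalty is needed on top of the goodness of fit.

```python
# Sketch: on a single data set, the within-sample error of a polynomial fit
# can only decrease as the degree grows, so fit alone favors the most complex
# candidate. Synthetic example with an assumed linear truth.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 25)
data = 1.5 * x + rng.normal(0, 0.2, x.size)

for degree in range(1, 8):
    coeff = np.polyfit(x, data, degree)
    sse = np.sum((np.polyval(coeff, x) - data) ** 2)   # within-sample error
    print(f"degree {degree}: SSE = {sse:.4f}")         # decreases monotonically
```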

There are different interpretations of the term model selection (Claeskens, 2016; Guyon et al., 2010). Often, model selection is interpreted as adjusting model complexity, so model selection is rather about selecting the right degree of regularization (Arkensteijn & Pande, 2013; Marconato et al., 2013; Vaiter et al., 2015; Warren & Seifert, 2011). Regularization is essentially restricting the flexibility of a model, by penalizing extreme parameter values or predictions in order to counteract overfitting (Mallick & Yi, 2013). In this regard, model selection is sometimes also called model complexity control (Schoups et al., 2008).

Model selection in this primer refers to selecting one model out of several ($N_M$) fully specified and competing candidate models. The model selection methods and their corresponding criteria discussed in the remainder of this primer all contain a model complexity representation but mean something different by it. Hence, the model ranking by a certain selection method might differ from the one by another method (Burnham & Anderson, 2002; Schöniger et al., 2014; Ye et al., 2008), depending on how model complexity is interpreted and measured. Still, model selection methods and their corresponding criteria are often chosen arbitrarily in practical application, regardless of the respective take on model complexity and regardless of what this means for the selected model in terms of being “best.”

The purpose of this primer is to give insight into how different model selection criteria define the notion of a model, which mentalities for defining complexity and for selecting models follow from that, how model selection methods quantify model complexity, and how common selection criteria can be grouped and classified accordingly. We also outline how these different classes can be used to pursue different goals in model selection.

The remainder of this article is structured as follows: the major philosophies and their fundamental differences in model selection are introduced in section 2. In section 3, we describe and discuss the most commonly used model selection criteria and their respective interpretations of model complexity. Then, we propose our classification system and discuss implications on how models are rated within each group in section 4. We suggest which type of model selection criterion is suitable for a particular modeling task. Finally, in section 5, we summarize what has to be considered when model selection is performed and conclude.

A quick guide addressing all key points of this article is given by (1) the sketch of purpose‐dependent model selection in section 2.1.5, (2) the so‐called levels of Bayesianism in section 2.2, (3) the resulting classification scheme in section 4.1, and (4) the outlined links between model selection classes and modeling tasks in section 4.4.

2 Philosophies in Model Selection

2.1 Consistency in Model Selection

The two major types of model selection are called consistent (e.g., Hurvich & Tsai, 1989; Shibata, 1986) and nonconsistent. The prefix “non” does not imply inconsistent, it simply refers to having different properties than consistent model selection. Yet both employ a trade‐off between fit and model complexity but pursue two different goals (Aho et al., 2014). A consistent model selection tries to identify which of the models produced the observed data $\mathbf{D}$, i.e., the data‐generating process (DGP). Nonconsistent model selection focuses on the data yet to come and uses the already available “within‐sample” data $\mathbf{D}$ to estimate which of the models might be best in predicting the future “out‐of‐sample” data $\tilde{\mathbf{D}}$. In this sense, it selects the model that approaches the DGP best but does not seek to identify it. The ability of a model to predict “out‐of‐sample” data $\tilde{\mathbf{D}}$ accurately and precisely is known as generalizability (Friedman et al., 2001).

Accordingly, consistent model selection is sometimes also called confirmatory (Aho et al., 2014), i.e., confirming the identified DGP by the given data $\mathbf{D}$ in hindsight. Nonconsistent model selection is also called conservative (Leeb & Pötscher, 2009) or exploratory (Aho et al., 2014), i.e., the model selected to approach the DGP is appropriate to conservatively predict or explore new data $\tilde{\mathbf{D}}$ in foresight.

In the past, it was discussed whether the two types of model selection are (anti)correlated (e.g., MacKay, 1992) or uncorrelated (e.g., Bishop, 1995) with each other. Although such behavior might appear coincidentally, it was generally shown that no model selection method can be optimal in both respects (Arlot & Celisse, 2010; Hurvich & Tsai, 1989; Yang, 2005).

At first sight, this is confusing because, for the “best” model, there should be no difference between being selected based on the highest predictive capability (for $\tilde{\mathbf{D}}$) and being identified as the true model (for $\mathbf{D}$). However, depending on the DGP that shall be modeled and the ability of a model to resemble it, there are differences in the behavior and optimality of the two types of model selection.

These differences are elaborated by considering their respective takes on the dimensionality of the truth (section 2.1.1), their ability to provide best predictions (section 2.1.2), the mechanisms behind overfitting and underfitting (section 2.1.3), the risks that come with either model selection type (section 2.1.4), and the modeling purpose for which each type is ideally suited (section 2.1.5).

2.1.1 Truth Dimensionality

The dimensionality of the DGP (here also called truth or true model) that is modeled has to be considered to understand the two types of model selection. Dimensionality refers to the number of explanatory variables which are needed to model the quantity of interest (Leeb & Pötscher, 2009), of which $\mathbf{D}$ and $\tilde{\mathbf{D}}$ are observed and potential instances, respectively.

Imagine a scenario, where the truth is finite dimensional, like a fully isolated system under controlled conditions. It can be parametrized by a finite number of functional relations and be represented by a parametric model candidate, i.e., a model with a finite set of parameters. In this case, consistent model selection (e.g., Schwarz, 1978; Shibata, 1986) is optimal to identify the true model among different candidates. It assumes (although it might be practically questionable) that the DGP is among the model candidates under consideration and has the goal to identify this true model (see Table 1). Therefore, as core property, consistent model selection converges toward the model candidate which represents the finite‐dimensional DGP best. With more and more data coming in, this preference for a candidate becomes stronger until it is clearly identified as the truth. Figuratively speaking, consistent model selection steadily closes the doors toward alternative model candidates.

Table 1. Conditions for Preferably Applying Either Nonconsistent or Consistent Model Selection Regarding the Data‐Generating Truth, the Respective Goals in Model Selection and the Relations to Data Including Overfitting and Underfitting

      | Nonconsistent | Consistent
Truth | Infinite dimensional | Finite dimensional
Goal  | Find model with largest predictive capability (approach truth) | Find data‐generating process (represent truth)
Data  | Model of higher complexity can be justified with more data | Identification of truth becomes clearer with more data
Fit   | Too little flexibility (underfit) versus fitting of noise (overfit) | Simpler than truth (underfit) versus more complex than truth (overfit)

Alternatively, let us think of a scenario in which the DGP is infinite dimensional, e.g., a linear stationary process with infinitely many parameters (Shibata, 1980). Such a scenario is also called nonparametric (Liu & Yang, 2011) because it cannot be represented by a parametric model. Hence, a finite‐dimensional model can also not be identified as DGP. Nonconsistent model selection is optimally suited for this scenario because it comes with a core property: it selects the model which promises the highest predictive capability for potential new data $\tilde{\mathbf{D}}$, while at the same time the required model complexity to do this still has to be supported by the already observed amount of data (information) $\mathbf{D}$. This does not require the DGP to be one of the candidate models. The model set only has to be able to “grow” with the increasing number of data points $N_s$ (information) originating from that truth (Leeb & Pötscher, 2009): in theory, each new batch of information reveals more details about the unknown DGP. Then, a more complex model (e.g., of higher dimensionality) in the set can eventually be supported. With the ability to switch between model candidates, the DGP is progressively approached (Liu & Yang, 2011). Figuratively speaking, in nonconsistent model selection, the doors toward other models are still kept open.

2.1.2 Asymptotic Efficiency

Optimality of either nonconsistent or consistent model selection depends on the dimensionality of the truth and the corresponding model candidates. This can be demonstrated using so‐called asymptotic (loss) efficiency (Shibata, 1980). Let us briefly clarify the term “efficiency.” Here efficiency does not refer to efficiency with respect to time or resources; rather, it refers to providing a high predictive capability for unseen data by using only a given finite sample size (Shibata, 1980). This is related to the minimax principle (Vrieze, 2012, and references therein): using limited information (data $\mathbf{D}$) to select the model which minimizes the so‐called worst‐case risk or potential “out‐of‐sample” error and thus maximizes predictive capability.

In the large‐sample limit, asymptotic behavior can be explained best by using a so‐called loss $L$, here defined as the sum of squared errors between observations and model predictions (Shao, 1997). In a scenario where a finite‐dimensional data‐generating process is represented by one of the candidate models, the loss $L_{opt}$ of an optimal model $M_{opt}$ will be equal to the true model's loss $L_{true}$ in the limit of infinite sample size $N_s \rightarrow \infty$. Then, the probability of the true model being equal to the optimal model follows $P(M_{opt} = M_{true}) \rightarrow 1$ (Leeb & Pötscher, 2009; Shao, 1997). Only in this case is consistent model selection also asymptotically efficient, because $M_{true}$ would then also provide the best predictions (Leeb & Pötscher, 2009; Shao, 1997).

In an infinite‐dimensional truth scenario, the DGP cannot be represented by a parametric model candidate. Hence, when switching models in nonconsistent model selection, the ratio $L_{opt}/L_{true}$ between the losses of the optimal parametric model and the true model will asymptotically approach 1, but the two will not be equal (Shao, 1997; Zhang, 2010). In such a scenario, nonconsistent model selection is therefore asymptotically efficient (Shao, 1997), because even without identifying the true model it will always select the model that yields the best loss ratio given the current sample size and will thereby successively approach the truth.

2.1.3 Mechanisms Behind Overfitting and Underfitting

Depending on the chosen model selection type, the alleged mechanisms behind overfitting and underfitting have a slightly different notion (see Table 1). In nonconsistent model selection, a model overfits if it fits noise rather than only the data, and it underfits if it considers variability in the data to be noise while it is actually not. With more data that contain new information becoming available, the risk of fitting noise decreases, and a model that was previously rated too complex (overfitting) can be justified and gets selected. In consistent model selection, an overfitting model is more complex than the truth and incorporates unnecessary processes which are not part of the true model, while an underfitting model lacks necessary functional relationships and is therefore too simple (Vanlier et al., 2014).

2.1.4 Risks in Model Selection

The previous insights regarding the two model selection types and the conditions for which they are ideally suited are summarized in Table 1. As a rough preliminary summary, one could say that nonconsistent criteria are optimal for the case of an infinite‐dimensional truth (nonparametric true model) while consistent criteria are optimal for the case of a finite dimensional truth (parametric true model; Liu & Yang, 2011).

The major problem is that we do not know in which scenario we are. And even if we knew the dimensionality of the truth, we could have an infinite‐dimensional truth with only a few dimensions that are really relevant in terms of reproducing the observable effects in modeling (Leeb & Pötscher, 2009). In this case, consistent criteria could work, too, in order to identify a finite‐dimensional model which parametrizes all relevant relations and neglects the irrelevant dimensions. Contrarily, we could have a parametric truth with so many relevant dimensions that the parametric candidate models (which are based on the limited current state of knowledge) cannot fully cover it. In that case, it does not make a difference whether the true model's dimensionality is finite but far beyond the relevant dimensions, or actually infinite. Then, nonconsistent model selection might be the better choice since it is able to find the one model among the candidates which approaches this truth as closely as possible.

One could think that this is not a severe problem because all model selection criteria still somehow manage the trade‐off between fit and complexity. However, the modeler has to be aware of the risks that come with choosing one type of model selection over the other: if the data‐generating process were among the candidate models and could be identified in a consistent way, it would also be the ideal one for predictions. But usually we do not know whether this is the case. Then, we risk that (in a situation where no candidate model really represents the truth) a consistent criterion identifies one of the models as (apparently) “best” based on the given limited data. However, this model might be nonoptimal for predictions since consistent criteria are not designed to perform in a minimax‐optimal way (Yang, 2005). If our goal was to use the model for predictions afterward, we might still get presumably acceptable predictions, but consistent model selection was shown not to yield the optimal model choice if the true model is not among the candidates (Vrieze, 2012). Consistent methods tend toward underfitting models in a play‐it‐safe manner and come with the risk of selecting a model which is too simple (Burnham & Anderson, 2002; Hurvich & Tsai, 1989).

In such a situation, another model candidate, as provided by a nonconsistent selection, might be better suited for predictions. This kind of model selection, however, runs the risk of not being able to identify a parametric true model even if it were among the candidates. A nonconsistent method could select a model which contains functional relations that are not really relevant in reality but, based on limited data (that are also subject to measurement error), still appear to be relevant (Leeb & Pötscher, 2009). With this, the minimax property of obtaining potentially best predictions from finite data might be fulfilled and will be honored by a nonconsistent criterion. However, nonconsistent model selection methods are prone to overfitting and tend toward models of higher complexity (Burnham & Anderson, 2002; Hurvich & Tsai, 1989). Further, in a scenario where a representation of the true model could actually be found among the candidates, this prevents the selection from converging toward it, ultimately preventing the modeler from understanding the DGP. Nonconsistent model selection is not designed for this identification purpose, and hence one cannot expect that the best representation of the truth has been found rather than an approximation that just seems to work well. Rather than risking convergence toward a wrong model and identifying it as the truth, it (preliminarily) selects the model that promises the smallest risk of misprediction.

2.1.5 Purpose‐Dependent Model Selection

Plainly, this could be termed the dilemma of model selection: not knowing the (relevant) dimensionality of the truth and not knowing to which level our models approach or represent this truth, we are forced to make a decision between nonconsistent and consistent model selection. There is no final solution to this problem but the modeling task itself can suggest which of the two model selection types might be preferential for model selection. Aho et al. (2014) provide a list of guiding questions in this respect. A very general attempt to compress this guide is as follows: if prediction capability is the (primary) goal of the model selection procedure (without necessary process understanding), nonconsistent criteria are the right choice. If models are set up for (primarily) identifying the (physical) relationships which generated the observations, i.e., process understanding, the consistent view should be taken for model selection. Hence, the purpose of modeling should be at the core of choosing the right model selection method.

In water resources modeling, we usually face a natural system which is governed by presumably infinite‐dimensional physics. The key argument in favor of this is heterogeneity (possibly at all scales) of many natural systems. Hence, one could think that nonconsistent criteria are the preferential choice because then they should be asymptotically efficient. Yet since finite‐dimensional models can be sufficient, too, by covering all relevant relations, employing consistent criteria might be justified when process understanding is pursued. Then, the selection method shall not switch between model candidates with new data but identify the best representation of the DGP. A consistent model selection which switches models whenever new data come in primarily indicates that the true model is not among the candidates. Contrarily, for a nonconsistent method, switching between more suitable models is anticipated and resting with one candidate primarily indicates that a more complex model could be justified and is missing in the set. Model selection methods act in one way or the other. Therefore, being aware of this behavior might help the modeler to make a well‐informed decision about the appropriate choice of a selection method for her purpose. This is illustrated in the following thought experiment.

2.1.5.1 Illustrative Thought Experiment

The exploratory or confirmatory natures of the two model selection types can be illustrated by a simple thought experiment: imagine two modelers A and B who seek to model a controlled laboratory experiment (e.g., a tracer flow‐through column experiment). Due to the fully controlled conditions, it can be assumed that this lab‐scale truth is of (relevant) finite dimensionality. Modeler A, e.g., an engineer or manager, assumes that there are too many dimensions to be covered by a fixed parametric model but still wants to find the best model for future predictions. Accordingly, she picks a type of model which is allowed to grow with incoming new information and starts off with operational data‐driven models, e.g., regressive models. Modeler B, e.g., a fundamental scientist, wants to identify the true data‐generating process and hence prefers parametric physics‐based models. One might think that the two purposes are the same thing, but from the perspectives of nonconsistent versus consistent model selection, they are not.

Each of them starts with three models of their preferred model type with increasing complexity: a simple first model, a more complex second model and a highly complex third model. Let us assume that the second model of modeler B actually represents the truth (which is an idea borrowed from consistent selection), i.e., employs the right physical equations. On the same level of complexity, the second model of modeler A mimics the data best, but as a data‐driven empirical model it is clear that it cannot represent the true data‐generating process.

Both modelers collect and use the same data continuously in order to perform a model selection procedure as soon as a new batch of data, i.e., new and nonredundant information, comes in. According to her modeling purpose, modeler A uses a nonconsistent model selection criterion targeting the highest predictive performance. Modeler B performs consistent model selection to identify the truth and to understand the underlying physics. This procedure is shown schematically in Figure 2.

Figure 2. Differences in model rating following nonconsistent (A‐type) and consistent (B‐type) model selection for increasing data size. The models are rated on a normalized scale between 0 and 1. Models 1–3 resemble increasing stages of complexity. The A‐type models are data driven and the B‐type models are mechanistic. Both rankings start with identical uniform rating between the models before data is considered.

In Phase 0, before having any data, both modelers start with uniform model choice preferences across their candidate models. In Phase 1, with little data available, no complex model can be supported, so the simple first model of each modeler is selected. However, with more incoming and informative data (Phase 2), a more complex model provides a better trade‐off between fit and complexity. Hence, the second models of both modelers get selected by their respective criteria. With more and more data becoming available in Phase 3, the two rankings become fundamentally different in the large‐sample limit: for modeler B, the third physical model (which is more complex than the truth) will never stand a chance in a model selection process in the long run. Its additional complexity would be called excessive. However, the third data‐driven model of modeler A can be justified as the model with the best trade‐off between fit and complexity from a nonconsistent perspective.

This is because, for modeler B, the second model revealed itself as representing the data‐generating process, and as such a simpler (first) or more complex (third) model is rejected by the consistent model selection procedure. For modeler A, it was clear from the beginning that the truth is not among the data‐driven candidate models. Then, a more complex model is justifiable with more available observations. More data reduce the risk of merely fitting noise, so from the efficiency perspective a more complex model promises the best future predictions and wins the model selection.

The illustrated behavior of consistent model selection, i.e., to identify and stick to the best representation of the truth, can be found in Schöniger et al. (2015). In this study on mechanistic models for a laboratory‐scale artificial aquifer, several increasingly complex parametrizations of the hydraulic conductivity distribution are ranked. Under growing data size, the consistent selection procedure converges toward the model that represents the true zoned distribution, and it devalues simpler (homogeneous) and more complex (geostatistical) approaches. Contrarily, the tendency of nonconsistent model selection to prefer increasingly complex models is demonstrated in Vrieze (2012) for regression models.

2.2 Bayesianism in Model Selection

While consistency refers to the goal of the model selection task at hand, Bayesianism refers to the statistical perspective used to achieve it. Many of the nonconsistent and consistent model selection methods are Bayesian to some degree. The Bayesian view allows assigning distributions to both data and parameters (M. Betancourt, A unified treatment of predictive model comparison, arXiv preprint arXiv:1506.02273, 2015). Therefore, generally expressed, model selection methods which are Bayesian consider model complexity as “a measure for stochastic dependence between observations and parameters” (van der Linde, 2012). Model selection can be Bayesian in one or more of the following respects:
  1. within‐model expert knowledge:

    incorporation of prior probability distribution for parameters;

  2. model‐representative quantities:

    measures of fit and complexity are marginalized over the probability distribution of the whole parameter space, rather than being based only on point estimates (e.g., the best fit); and

  3. between‐models expert knowledge:

    model weights as prior beliefs about the model ranking are used and updated, representing model probabilities.

The first level of Bayesianism in model selection addresses what the parameter space of a model looks like. In the Bayesian perspective, there is a probability measure (here represented for simplicity by a probability density function, pdf) over the parameter values $\boldsymbol{\theta}$ which expresses the belief about what suitable parameter values could be. Even before observations are considered, there is such a belief, and it is called the parameter prior pdf $p(\boldsymbol{\theta})$. Observations $\mathbf{D}$ are then used to update this prior knowledge to a conditional distribution called the parameter posterior pdf $p(\boldsymbol{\theta} \mid \mathbf{D})$.

The second level addresses whether a point (single parameter set) or a marginalized (averaged over the parameter pdf) estimator (van der Linde, 2012) shall be used to evaluate goodness of fit and model complexity. Often, goodness of fit is evaluated with the best parameter calibration possible for a model, i.e., with the maximum likelihood estimator (MLE). Being one specific set of parameters, the MLE is one of the classic point estimators. Contrarily, the fully Bayesian spirit is to marginalize over the whole parameter pdf (Piironen & Vehtari, 2017), and to use averaged quantities such as the marginal likelihood to represent the overall fit. Marginal likelihood is often referred to as Bayesian model evidence (BME; e.g., Schöniger et al., 2014).
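
As an illustration of this second level, the marginal likelihood (BME) can be approximated by averaging the likelihood over samples drawn from the parameter prior. The sketch below uses a toy one‐parameter linear model with Gaussian errors and simple Monte Carlo integration; all settings are illustrative assumptions, and more robust estimators exist for higher‐dimensional problems.

```python
# Sketch of a marginal-likelihood (BME) estimate: the likelihood is averaged
# over prior parameter samples instead of being evaluated at a single best fit.
# Toy linear model with an assumed known Gaussian error level.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 20)
data = 2.0 * x + rng.normal(0, 0.1, x.size)    # synthetic observations
sigma = 0.1                                    # assumed error standard deviation

def log_likelihood(theta):
    return stats.norm.logpdf(data, loc=theta * x, scale=sigma).sum()

prior_samples = rng.normal(0.0, 5.0, 10_000)   # prior belief about the slope
log_lik = np.array([log_likelihood(t) for t in prior_samples])
# BME = E_prior[likelihood]; log-sum-exp keeps the average numerically stable
log_bme = np.logaddexp.reduce(log_lik) - np.log(log_lik.size)
print(f"log BME = {log_bme:.2f}")
```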

The third level refers to how models are compared, ranked, or selected using model probabilities. From a Bayesian point of view, model selection is based on a belief in each model, again expressed as a probability P. There is a prior probability of each candidate model to be the model which has most likely generated the data $\mathbf{D}$. These data can then be used to update the prior model weights to their respective posteriors, just as for the parameters within the models.
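
A minimal sketch of this third level: prior model weights are updated with each model's marginal likelihood to posterior model probabilities. The log‐evidence values below are placeholders, not results for any real model set.

```python
# Sketch of updating prior model weights with Bayesian model evidences (BME):
# P(M_k | D) is proportional to P(D | M_k) * P(M_k). Values are hypothetical.
import numpy as np

log_bme = np.array([-52.3, -49.8, -50.5])        # hypothetical log-evidences
prior_weights = np.array([1 / 3, 1 / 3, 1 / 3])  # uniform prior belief

unnorm = np.exp(log_bme - log_bme.max()) * prior_weights   # stable rescaling
posterior_weights = unnorm / unnorm.sum()
print(posterior_weights)    # the model with the largest evidence gains weight
```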

For a model selection method to be Bayesian, at least the first level of Bayesianism has to be fulfilled. Figure 3 graphically summarizes the general classification scheme of model selection methods over the two stages covered in sections 2.1 and 2.2. First, the nonconsistent or consistent type has to be picked depending on the major purpose of modeling. Second, the incorporation of a Bayesian parameter prior (first level of Bayesianism) allows for a probabilistic treatment of parameters during the model selection task. The second and third levels of Bayesianism are added on top of that by specific methods, as discussed for the respective methods in section 3.

Figure 3. Classification system for model selection methods with four classes: first, nonconsistent versus consistent model selection and, second, using a Bayesian parameter prior or not.

3 Model Selection Criteria

All model selection methods that consider model complexity strike a trade‐off between goodness of fit $G$ and model complexity $C$ (e.g., Wit et al., 2012). In most general terms, there is a trade‐off score $T$ that is expressed as

$T = -G + C$    (1)

Traditionally, a model is rated better under a certain selection method the more negative $T$ is. This implies a good fit to the data (hence the negative sign) and a low complexity (hence the positive sign). The goodness of fit $G$ is rather straightforward to interpret as the accuracy of the model, either based on a representative estimator like the maximum likelihood estimator (MLE), or based on an average fit, for example marginalized over the whole distribution of possible parameter values (van der Linde, 2012). Yet the way model complexity $C$ is interpreted and quantified differs strongly between model selection methods, as will be discussed in detail in the following sections.

In nonconsistent model selection, the complexity term is constant or bounded (Leeb & Pötscher, 2009), i.e., it does not grow with the data size $N_s$. This is schematically depicted in Figure 4. Hence, nonconsistent model selection allows switching to models of higher complexity with more data as long as the higher complexity is compensated by an even stronger increase in goodness of fit. A special case of this would be nonparsimonious model selection where only the goodness‐of‐fit term $G$ is used for rating models, given by, e.g., the maximum likelihood, the smallest root‐mean‐square error, or another error metric. This implies $C = 0$ (see Figure 4) for all models and prevents generalizability or consistency for the selected model because no trade‐off is considered.

Figure 4. Schematic behavior of the complexity term $C$ under growing data size $N_s$ for nonparsimonious (light grey dots), nonconsistent (grey line), and consistent (blue line) model selection, with linear complexity growth as reference (dashed black line).

Opposed to this, the complexity representation in consistent model selection grows with increasing sample size $N_s$. However, this growth must be slower than linear (called subextensive and shown in Figure 4; Bialek et al., 2001a): mathematically, for growing data size $N_s \rightarrow \infty$ the complexity term follows $C \rightarrow \infty$ and $C/N_s \rightarrow 0$ (Leeb & Pötscher, 2009). Such growth might be contradictory to our intuitive understanding of complexity: if a model has a certain complexity, this complexity should not increase with increasing $N_s$. However, in consistent model selection, the model complexity penalty needs to grow in a way that the selection criterion can identify the true model, rather than justifying higher and higher model complexity with more and more data. While the goodness of fit will eventually get worse for all nontrue model candidates with more data, only the true model can balance the growing complexity penalty.
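
The different penalty behaviors in Figure 4 can be made concrete with a toy calculation. The sketch below contrasts a bounded penalty of the form $2 N_p$ (as in the AIC‐type, nonconsistent criteria discussed in section 3.1.1) with a subextensive, logarithmically growing penalty of the form $N_p \ln N_s$ (as in criteria of the Schwarz, 1978, type); the parameter count $N_p = 5$ is an arbitrary illustration.

```python
# Sketch of Figure 4: a bounded complexity penalty (nonconsistent criteria)
# versus a subextensive, log-growing penalty (consistent criteria), compared
# with linear growth in the sample size Ns. Np = 5 is an arbitrary choice.
import numpy as np

Np = 5
for Ns in (10, 100, 1_000, 10_000):
    c_bounded = 2 * Np                       # does not grow with Ns
    c_subextensive = Np * np.log(Ns)         # grows, but slower than linear
    print(f"Ns={Ns:>6}: bounded={c_bounded}, "
          f"log-growing={c_subextensive:.1f}, linear reference={Ns}")
```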

There are implicit methods which seek to assess the trade‐off score $T$ directly, without decomposing it according to equation 1. A nonconsistent example is cross‐validation (CV; Gelman et al., 2014; Stone, 1977), a consistent example is the direct evaluation of Bayesian model evidence (BME) in Bayesian model selection (Hoeting et al., 1999; Schöniger et al., 2014). Alternatively, some model selection methods address $G$ and $C$ explicitly and then combine them to obtain $T$. The explicit methods are called model selection criteria in the following, most of them being known as information criteria (IC; e.g., Spiegelhalter et al., 2014).
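
As an example of such an implicit method, leave‐one‐out cross‐validation rates each candidate by its out‐of‐sample error without ever writing down a complexity term. The sketch below applies it to polynomial candidates on synthetic data; all settings are illustrative.

```python
# Sketch of leave-one-out cross-validation (LOO-CV): each observation is held
# out once, the model is re-calibrated on the rest, and the held-out squared
# prediction errors are averaged into an implicit trade-off score.
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 25)
data = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)

def loo_score(degree):
    errors = []
    for i in range(x.size):
        mask = np.arange(x.size) != i                 # leave observation i out
        coeff = np.polyfit(x[mask], data[mask], degree)
        errors.append((np.polyval(coeff, x[i]) - data[i]) ** 2)
    return np.mean(errors)

for degree in range(1, 8):
    print(f"degree {degree}: LOO mean squared error = {loo_score(degree):.4f}")
```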

In their original derivation, many model selection criteria assume that residuals between observations $\mathbf{D}$ and the corresponding model predictions can be described as white noise (zero mean, uncorrelated, and finite variance) or even as independent and identically distributed (i.i.d.; e.g., Leeb & Pötscher, 2009). This holds for the uncorrelated Gaussian case but rarely occurs in reality—especially in hydro(geo)logy. Hence, a reasonable treatment of the errors is required when model selection criteria are applied (e.g., Schoups & Vrugt, 2010). However, it was shown that model selection criteria generally work under weaker assumptions on the errors than being Gaussian or i.i.d. (Leeb & Pötscher, 2009, and references therein). In principle, it has to be noted that all criteria are conditional on the choice of the error distribution (also known as loss, cost, or likelihood function; e.g., Tarantola, 2006). The expectation over a loss function is usually called a risk function (Vrieze, 2012).

The selection criteria presented in the following are those most widely used by the majority of practitioners (see Boisbunon et al., 2014; Mallick & Yi, 2013). We classify and discuss them with respect to their incorporation of model complexity. Based on this classification, we explain why similarly looking but different selection methods yield contradicting model ratings (Schöniger et al., 2014; Ye et al., 2008) and put them into perspective, going beyond similar earlier attempts (e.g., McQuarrie & Tsai, 1998).

For readability, nonconsistent model selection criteria are referred to as A‐type and consistent criteria are called B‐type in the following. Both worlds (A and B) each contain two classes (1 and 0) with respective target quantities for model selection. Criteria of class 1 are at least level‐1‐Bayesian, i.e., they require a parameter prior distribution, while criteria of class 0 are not Bayesian (see Figure 3). All criteria within a class resemble approximations to the class‐specific target quantity as scores $T$ for model selection.

3.1 A‐Type Model Selection: Nonconsistency

As stated earlier, nonconsistent model selection will find the model that promises the largest predictive capability given the already observed data, assuming an unknown nonparametric truth. Predictive capability can be assessed through the target quantities of predictive density (A1, section 3.1.1) or predictive error (A0, section 3.1.2).

3.1.1 A1: Predictive Density

Model selection criteria which try to estimate predictive density assume that there is an (infinite‐dimensional) true model with a predictive density function $p_{true}(\mathbf{D})$ for an observable random variable $\mathbf{D}$. They are called predictive information criteria (IC). The exact pdf $p_{true}$ is generally unknown, but the observed (within‐sample) data $\mathbf{D}$ and future (out‐of‐sample) data $\tilde{\mathbf{D}}$ are both assumed to follow $p_{true}$.

Generally, models are rated on how strongly their own predictive density $p(\tilde{\mathbf{D}} \mid \mathbf{D})$ deviates from the true $p_{true}(\tilde{\mathbf{D}})$. In practice, predictive IC try to estimate the model's logarithmic predictive density $\ln p(\tilde{\mathbf{D}} \mid \mathbf{D})$ of potential future data $\tilde{\mathbf{D}}$ (Geisser & Eddy, 1979; Gelman et al., 2014). Since the model parameters are assumed to be probabilistic, this logarithmic predictive density is obtained by marginalizing $p(\tilde{\mathbf{D}} \mid \boldsymbol{\theta})$ over the model's whole posterior parameter distribution $p(\boldsymbol{\theta} \mid \mathbf{D})$:

$\ln p(\tilde{\mathbf{D}} \mid \mathbf{D}) = \ln \int p(\tilde{\mathbf{D}} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta} \mid \mathbf{D}) \, \mathrm{d}\boldsymbol{\theta}$    (2)

Since the potential future data $\tilde{\mathbf{D}}$ themselves are unknown, equation 2 is marginalized over the true model's $p_{true}(\tilde{\mathbf{D}})$, from which $\tilde{\mathbf{D}}$ is expected to come (Gelman et al., 2014). This yields the expected log predictive density $\mathrm{E}_{p_{true}}\left[\ln p(\tilde{\mathbf{D}} \mid \mathbf{D})\right]$ of the model as the model rating score $T$.
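
In practice, the marginalization in equation 2 is carried out numerically over posterior parameter samples. The sketch below (toy linear model, synthetic posterior samples standing in for MCMC output) computes the log pointwise predictive density of the within‐sample data; evaluated this way it overestimates the expected out‐of‐sample density, which is exactly the offset that the effective‐parameter corrections discussed next are meant to remove.

```python
# Sketch of the marginalization in equation (2): the pointwise likelihood is
# averaged over posterior samples before taking the logarithm. Evaluated on
# the within-sample data, the result is optimistic and needs a correction.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = np.linspace(0, 1, 20)
data = 2.0 * x + rng.normal(0, 0.1, x.size)
sigma = 0.1
theta_post = rng.normal(2.0, 0.05, 4_000)      # stand-in for MCMC output

# pointwise log likelihoods, shape (n_posterior_samples, n_observations)
log_lik = stats.norm.logpdf(data, loc=np.outer(theta_post, x), scale=sigma)
lppd = np.sum(np.logaddexp.reduce(log_lik, axis=0) - np.log(theta_post.size))
print(f"log pointwise predictive density = {lppd:.2f}")
```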

However, without actual out‐of‐sample data $\tilde{\mathbf{D}}$, predictive IC can only approximate the expected log predictive density using the given within‐sample data $\mathbf{D}$. This results in an offset (Hooten & Hobbs, 2015) that is caused by testing a model on the data set on which it was conditioned (fitted). Predictive IC incorporate this offset by an effective number of parameters $p_{\mathrm{eff}}$ and use it as the complexity representation of the model (Akaike, 1973). This correction can be interpreted as a quantification of how much the predictive density for $\mathbf{D}$ increases by fitting the model's $N_p$ parameters to $\mathbf{D}$ (Gelman et al., 2014).

The most popular predictive IC is the Akaike information criterion AIC (Akaike, 1973, 1974, 1978). It is an approximation to the information‐theoretic Kullback‐Leibler divergence $D_{KL}$ that quantifies the information loss between the predictive distributions of a hypothetical true model $p_{true}$ and the candidate model $p(\cdot \mid \boldsymbol{\theta})$ (Aho et al., 2014) in the model space:

$D_{KL}\left(p_{true} \,\|\, p\right) = \int p_{true}(\tilde{\mathbf{D}}) \ln p_{true}(\tilde{\mathbf{D}}) \, \mathrm{d}\tilde{\mathbf{D}} - \int p_{true}(\tilde{\mathbf{D}}) \ln p(\tilde{\mathbf{D}} \mid \boldsymbol{\theta}) \, \mathrm{d}\tilde{\mathbf{D}}$    (3)

The first term on the right‐hand side of equation 3 is a constant for all compared models. Therefore, the AIC addresses only the second term, called the relative expected KL‐information (Burnham & Anderson, 2004), resembling the expected logarithmic predictive density. The approximation of equation 3 by the AIC was derived for asymptotically normal posterior distributions in the large‐sample limit (e.g., linear models with an uninformative parameter prior and a normal error distribution). In this special case, the point estimate $\hat{\boldsymbol{\theta}}_{\mathrm{MLE}}$ summarizes the posterior parameter distribution. Therefore, the model's expected log predictive density is conditional on $\hat{\boldsymbol{\theta}}_{\mathrm{MLE}}$ and given by $\mathrm{E}_{p_{true}}\left[\ln p(\tilde{\mathbf{D}} \mid \hat{\boldsymbol{\theta}}_{\mathrm{MLE}})\right]$ (Gelman et al., 2014). This cannot be calculated directly, but with all candidate models being conditioned on the same data $\mathbf{D}$, it can be approximated via the log likelihood value $\ln L(\hat{\boldsymbol{\theta}}_{\mathrm{MLE}} \mid \mathbf{D})$ evaluated at the model‐specific MLE $\hat{\boldsymbol{\theta}}_{\mathrm{MLE}}$ plus a correction for the approximation offset. Under the above conditions, this correction naturally appears in the derivation as simply the number of model parameters $N_p$ (e.g., Burnham & Anderson, 2002). Hence, the AIC writes as (with a factor of two for historical reasons):

$\mathrm{AIC} = -2 \ln L(\hat{\boldsymbol{\theta}}_{\mathrm{MLE}} \mid \mathbf{D}) + 2 N_p$    (4)

A model with many parameters can improve its apparent fit to $\mathbf{D}$ by adjusting all of its $N_p$ parameters. Since the AIC was derived for uninformative priors, all the model parameters have to be constrained by $\mathbf{D}$ instead of by prior information. Hence, in equation 4, the goodness of fit (negative first term) in the model selection criterion has to be reduced by this “potential” for improving the fit (positive second term), i.e., by the number of independently adjustable parameters (Akaike, 1974). Interestingly, the factor of two converts the first term in equation 4 to a plain sum of squared errors for an uncorrelated normal error distribution. A version of the AIC corrected for finite data size $N_s$ (AICc) was developed (Hurvich & Tsai, 1989) to compensate for smaller sample sizes for which the above asymptotic behavior cannot be assumed, i.e., when $N_s$ is too small for the IC to reliably select the model with the largest predictive capability.
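
For orientation, the following sketch evaluates equation 4 and the small‐sample correction AICc for a least‐squares model with Gaussian errors; the residual vectors and parameter counts are hypothetical stand‐ins, and the error variance is counted as one of the $N_p$ parameters.

```python
# Sketch of the AIC (equation 4) and the finite-sample corrected AICc for a
# Gaussian error model whose variance is estimated from the residuals.
import numpy as np

def aic_gaussian(residuals, n_params):
    """AIC and AICc; n_params includes the estimated error variance."""
    n = residuals.size
    sigma2_mle = np.mean(residuals ** 2)                    # MLE of variance
    max_log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2_mle) + 1.0)
    aic = -2.0 * max_log_lik + 2.0 * n_params
    aicc = aic + 2.0 * n_params * (n_params + 1) / (n - n_params - 1)
    return aic, aicc

# hypothetical residuals of two competing models calibrated to the same data
rng = np.random.default_rng(5)
res_simple = rng.normal(0, 0.30, 30)
res_complex = rng.normal(0, 0.25, 30)
print("simple model :", aic_gaussian(res_simple, n_params=3))
print("complex model:", aic_gaussian(res_complex, n_params=8))
```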

A generalization of the AIC was proposed as deviance information criterion (DIC; Spiegelhalter et al., 2002). In contrast to the AIC, the DIC was designed for informative priors and can therefore be seen as a more Bayesian version of the AIC (Spiegelhalter et al., 2014). The deviance $D(\boldsymbol{\theta})$ is defined as the doubled negative logarithmic likelihood (NLL): $D(\boldsymbol{\theta}) = -2 \ln L(\boldsymbol{\theta} \mid \mathbf{D})$ (as it was used for the AIC). The DIC is evaluated at the posterior parameter mean $\bar{\boldsymbol{\theta}}$:

$\mathrm{DIC} = D(\bar{\boldsymbol{\theta}}) + 2 p_D$    (5)

In contrast to the AIC, model complexity is measured as an effective number of parameters $p_D$, which does not necessarily equal the straightforward parameter count $N_p$. The DIC does not require asymptotic normality in the large‐sample limit. This extends the applicability of the DIC to nonlinear models and to informative priors (in contrast to the AIC), as long as the posterior parameter distribution can be sufficiently approximated by a Gaussian even under limited sample size. Being evaluated at the posterior mean $\bar{\boldsymbol{\theta}}$, the DIC uses an averaged quantity based on the assumed normality. In principle, this is more Bayesian than just using the MLE as a point estimate, but it relies heavily on the Gaussian assumption. This is not a real marginalization (see the second level of Bayesianism) and makes the DIC subject to criticism (e.g., Piironen & Vehtari, 2017).

The derivation of urn:x-wiley:00431397:media:wrcr23140:wrcr23140-math-0096 in equation 5 based on the deviance urn:x-wiley:00431397:media:wrcr23140:wrcr23140-math-0097 is as follows: if the posterior parameter distribution is multivariate Gaussian, the deviance automatically follows a urn:x-wiley:00431397:media:wrcr23140:wrcr23140-math-0098 distribution. This is typically given for errors being normally distributed (Clark & Gelfand, 2006). As a property of the urn:x-wiley:00431397:media:wrcr23140:wrcr23140-math-0099 distribution, the difference between the mean density, urn:x-wiley:00431397:media:wrcr23140:wrcr23140-math-0100, and the density at the mean, urn:x-wiley:00431397:media:wrcr23140:wrcr23140-math-0101, is equal to the statistical degrees of freedom ν of the urn:x-wiley:00431397:media:wrcr23140:wrcr23140-math-0102 distribution. The DIC uses this difference to approximate ν and then defines it as the number of effective parameters urn:x-wiley:00431397:media:wrcr23140:wrcr23140-math-0103 (Spiegelhalter et al., 2002):
$p_D = \overline{D(\boldsymbol{\theta})} - D(\bar{\boldsymbol{\theta}})$ (6)
Exploiting another property of the $\chi^2$ distribution, Gelman et al. (2004) suggested using half of the variance of the deviance over the posterior to estimate the effective number of parameters. This is possible because, just like the difference in equation 6, $\tfrac{1}{2}\operatorname{var}\!\left[D(\boldsymbol{\theta})\right]$ also equals the distribution's statistical degrees of freedom:
$p_V = \tfrac{1}{2}\operatorname{var}_{\boldsymbol{\theta}|\mathbf{D}}\!\left[D(\boldsymbol{\theta})\right]$ (7)

Spiegelhalter et al. (2002) describe $p_D$ or $p_V$ as the dimension of parameter space that can be constrained by the given data, calling it a model dimensionality. Since $p_D$ is not necessarily equal to Np, it shall not be confused with parameter space dimensionality. However, $p_D$ reduces to Np if the prior is uninformative (Meyer, 2014; van der Linde, 2012) and the DIC reduces to the AIC.
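As an illustration of equations 5-7, the sketch below (a minimal illustration under our own assumptions, not part of the original paper) estimates $D(\bar{\boldsymbol{\theta}})$, $p_D$, $p_V$, and the DIC from posterior parameter draws; a sampler that produced the draws and a log likelihood function for the full data set are assumed to be available:

```python
import numpy as np

def dic_from_posterior(theta_samples: np.ndarray, log_lik):
    """DIC and effective number of parameters from posterior draws.

    theta_samples : array of shape (n_draws, n_params), posterior samples of theta
    log_lik       : callable theta -> ln L(theta | D) for the full data set D
    """
    deviance = np.array([-2.0 * log_lik(t) for t in theta_samples])
    d_at_mean = -2.0 * log_lik(theta_samples.mean(axis=0))  # D(theta_bar)
    p_d = deviance.mean() - d_at_mean                        # equation 6
    p_v = 0.5 * deviance.var(ddof=1)                         # equation 7
    dic = d_at_mean + 2.0 * p_d                              # equation 5
    return dic, p_d, p_v
```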

The AIC and the DIC are limited to so-called regular models. This means that certain regularity conditions hold, e.g., the Fisher information matrix F exists and is positive definite (Watanabe, 2010). Otherwise, a model is called singular. $\mathbf{F}$ is defined as the negative Hessian of the log likelihood $\ln L(\boldsymbol{\theta}|\mathbf{D})$ with respect to the parameters: $\mathbf{F} = -\nabla_{\boldsymbol{\theta}}^{2}\ln L(\boldsymbol{\theta}|\mathbf{D})$. The inverse of $\mathbf{F}$ is an estimator of the posterior covariance matrix of the parameters. For singular models, F is not positive definite. Hence, there can be parameters with infinite variance even after calibration.

Because the AIC and DIC are limited to regular models (van der Linde, 2012), the "widely applicable" (or "Watanabe-Akaike") information criterion (WAIC) was developed (Watanabe, 2010) as a generalization of the AIC and DIC to singular models (Betancourt, arXiv:1506.02273, 2015), such as Gaussian mixture models, strongly overparametrized models (causing underdetermined inverse problems; Gelman et al., 2004), or artificial neural networks (Watanabe, 2010). The WAIC writes as
$\text{WAIC} = -2\sum_{i=1}^{N_s}\ln\!\left(\mathbb{E}_{\boldsymbol{\theta}|\mathbf{D}}\!\left[L(\boldsymbol{\theta}|D_i)\right]\right) + 2\,p_{WAIC}$ (8)
Again, model complexity is measured by the effective number of parameters, in two versions called $p_{WAIC1}$ and $p_{WAIC2}$ in the following. $p_{WAIC1}$ is estimated in a similar way as $p_D$ in equation 6. However, for $p_{WAIC1}$, the difference is evaluated for each observation Di in $\mathbf{D}$ independently over the whole parameter space and then summed over all Ns observations, approximating leave-one-out (LOO) cross-validation (CV); details on LOO follow at the end of this section.
$p_{WAIC1} = 2\sum_{i=1}^{N_s}\left(\ln\!\left(\mathbb{E}_{\boldsymbol{\theta}|\mathbf{D}}\!\left[L(\boldsymbol{\theta}|D_i)\right]\right) - \mathbb{E}_{\boldsymbol{\theta}|\mathbf{D}}\!\left[\ln L(\boldsymbol{\theta}|D_i)\right]\right)$ (9)
Similarly to $p_V$ above, a variance-based estimator of the effective number of parameters exists, yielding a second version of the WAIC:
$p_{WAIC2} = \sum_{i=1}^{N_s}\operatorname{var}_{\boldsymbol{\theta}|\mathbf{D}}\!\left[\ln L(\boldsymbol{\theta}|D_i)\right]$ (10)

It is still debated whether the variance-based estimators in the DIC and the WAIC can be seen as generalizations of each other (Watanabe, 2010). For practical purposes, however, both are sometimes advantageous over the two difference-based estimators ($p_D$ and $p_{WAIC1}$) because they cannot become negative (Gelman et al., 2014).

In the WAIC-related equations 8-10, expectations and variances of the log predictive density are evaluated for each data point in $\mathbf{D}$ and then summed up. This is different from the approaches in the AIC and DIC, where the log likelihood function $\ln L(\boldsymbol{\theta}|\mathbf{D})$ of the entire data set $\mathbf{D}$ is used. Further, with the underlying assumptions in the AIC and the DIC, they may use point estimators to estimate the predictive density. The WAIC only uses quantities averaged over the whole parameter space and all independent observations. Therefore, the WAIC is considered the only fully Bayesian one among the predictive IC (Gelman et al., 2014).
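The WAIC of equations 8-10 can be evaluated directly from a matrix of pointwise log likelihoods over posterior draws. The sketch below is a minimal illustration under this assumption (the log likelihood matrix is taken as given); it is not code from the original paper:

```python
import numpy as np
from scipy.special import logsumexp

def waic(log_lik: np.ndarray):
    """WAIC from pointwise log likelihoods.

    log_lik : array of shape (n_draws, n_obs); entry [j, i] = ln L(theta_j | D_i)
    Returns (WAIC_1, WAIC_2, p_waic1, p_waic2) on the deviance scale.
    """
    n_draws = log_lik.shape[0]
    # log of the posterior-averaged predictive density of each observation D_i
    lppd_i = logsumexp(log_lik, axis=0) - np.log(n_draws)
    p_waic1 = 2.0 * np.sum(lppd_i - log_lik.mean(axis=0))    # equation 9
    p_waic2 = np.sum(log_lik.var(axis=0, ddof=1))            # equation 10
    waic1 = -2.0 * (np.sum(lppd_i) - p_waic1)                # equation 8
    waic2 = -2.0 * (np.sum(lppd_i) - p_waic2)
    return waic1, waic2, p_waic1, p_waic2
```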

Predictive IC resemble explicit approximations to different implicit Bayesian cross-validation (BCV) methods (Gelman et al., 2014; Stone, 1977). However, the introduced A1-type criteria are criticized for not always evaluating predictive model capability reliably. In cases where their assumptions might be violated, implicit methods (e.g., Piironen & Vehtari, 2017) can be applied. These are computationally more demanding but do not split the model evaluation score into goodness of fit and complexity. A popular example is LOO (Bayesian) cross-validation (Vehtari & Lampinen, 2002). In this BCV method, the posterior parameter pdf is inferred by conditioning on $\mathbf{D}_{-o}$, which is obtained by leaving out one data point $D_o$ from $\mathbf{D}$. Then, the predictive density for each $D_o$ is evaluated using the accordingly modified equation 2. After repeating this procedure over all $D_o \in \mathbf{D}$, the expected predictive density is obtained as final score by averaging these individual predictive densities.
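For a simple one-parameter model, LOO Bayesian cross-validation can be spelled out explicitly. The sketch below is a toy illustration (Gaussian data with unknown mean and known error standard deviation, posterior evaluated on a parameter grid); all settings are assumptions for demonstration and not part of the original paper:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sigma = 1.0
data = rng.normal(loc=2.0, scale=sigma, size=20)      # toy observations D

mu_grid = np.linspace(-5.0, 10.0, 2001)               # discretized parameter space
prior = stats.norm.pdf(mu_grid, loc=0.0, scale=5.0)   # assumed parameter prior

def loo_expected_log_predictive_density(data: np.ndarray) -> float:
    """Average over all D_o of ln p(D_o | D_{-o}), the LOO-BCV score."""
    scores = []
    for o in range(len(data)):
        d_rest = np.delete(data, o)
        # posterior over mu conditioned on D_{-o}
        log_lik = stats.norm.logpdf(d_rest[:, None], loc=mu_grid, scale=sigma).sum(axis=0)
        post = np.exp(log_lik - log_lik.max()) * prior
        post /= np.trapz(post, mu_grid)
        # predictive density of the held-out point D_o, averaged over the posterior
        pred = np.trapz(stats.norm.pdf(data[o], loc=mu_grid, scale=sigma) * post, mu_grid)
        scores.append(np.log(pred))
    return float(np.mean(scores))

print("LOO expected log predictive density:", round(loo_expected_log_predictive_density(data), 3))
```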

In summary, model complexity quantified as the effective number of parameters $p_D$ is an offset correction for estimating the predictive density of unknown out-of-sample data $\tilde{\mathbf{D}}$ by only using known within-sample data $\mathbf{D}$. $p_D$ is conditional on $\mathbf{D}$ and can therefore be interpreted as the number of parameters that are constrained by $\mathbf{D}$ rather than just by the prior information on parameters (Gelman et al., 2014).

3.1.2 A0: Predictive Error

An alternative starting point for assessing the predictive capability of a model is to interpret the observables $\mathbf{D}$ deterministically rather than as random variables. Then, future data $\tilde{\mathbf{D}}$ shall be met as exactly as possible, and model predictions shall only show minimal residuals when compared to data. This is related to, but different from, A1-type criteria, which see data as samples of random variables and ask models to match the right probability distributions.

In statistical learning (Friedman et al., 2001), calibration error and validation error are often referred to as training and test error, respectively. A common way to measure the training error is the residual sum of squares $\text{RSS} = \sum_{i=1}^{N_s}\left(D_i - \hat{D}_i\right)^2$ between the observations $D_i$ and the corresponding best fit estimates $\hat{D}_i$ of the model. Accordingly, the test error writes as $\text{RSS}_{test} = \sum_{i=1}^{N_s}\left(\tilde{D}_i - \hat{D}_i\right)^2$. As the potential future data $\tilde{\mathbf{D}}$ are unknown, the test error has to be estimated by just using the training error from $\mathbf{D}$. Since the training error underestimates the test error, a penalty term has to be added to the training error which accounts for the gap between the two types of errors. This term is assumed to be proportional to the so-called model degrees of freedom (DoF; e.g., Friedman et al., 2001; Zou et al., 2007)—not to be confused with the statistical degrees of freedom ν in section 3.1.1. Adding these DoF, scaled by a known error variance $\sigma^2$, to the RSS yields a common estimate for the test error, called the expected prediction error (EPE; Janson et al., 2015):
$\text{EPE} = \text{RSS} + 2\,\text{DoF}\,\sigma^2$ (11)

This expression for the EPE in equation 11 is also known as Mallows' Cp (Efron, 1986; Janson et al., 2015; Mallows, 1973). It is easy to see that the RSS is proportional to a negative Gaussian log likelihood, which leads to the same first term (yet with a different derivation) as in the AIC. Therefore, the EPE is often interpreted similarly to the predictive information criteria from section 3.1.1, and the model DoF are considered to be yet another measure of model complexity (Hooten & Hobbs, 2015; Ye, 1998). However, the DoF as model complexity are neither motivated by predictive density nor do they require any kind of Bayesian parameter prior.

DoF are meant to measure the sensitivity of model predictions $\hat{\mathbf{D}}$ with respect to perturbations in the data $\mathbf{D}$ used for training (Ye, 1998). A model with high sensitivities does not allow for a unique calibration when calibrated to different data sets that deviate only slightly from one another. This interpretation directly links to the stability of model inversion (e.g., Tarantola, 2005). In this sense, a model with high sensitivities to data is considered to have many degrees of freedom and is rated complex.

A widely accepted and used formulation of the DoF in model selection is the so-called expected optimism (Efron, 1983, 1986), assuming i.i.d. errors of finite variance $\sigma^2$:
$\text{DoF} = \dfrac{1}{\sigma^2}\sum_{i=1}^{N_s}\operatorname{cov}\!\left(\hat{D}_i, D_i\right)$ (12)

In this approach, $\operatorname{cov}(\hat{D}_i, D_i)$ is estimated on repetitively perturbed data $\mathbf{D}^*$ and the corresponding best fit estimates $\hat{\mathbf{D}}^*$. This is a direct assessment of how sensitive model predictions are to noise in the data. The perturbed data $\mathbf{D}^*$ can be obtained, for example, using residual bootstrapping (Efron & Tibshirani, 1994). DoF can generally be evaluated for linear and nonlinear models (Janson et al., 2015). In the special case of Gaussian linear models, the DoF are independent of data and predictions (Ye, 1998). Gaussian refers to the error distribution; and a linear model writes as $\hat{\mathbf{D}} = \mathbf{X}\boldsymbol{\beta}$, with the parameter vector $\boldsymbol{\beta}$ and the independent variables contained in the matrix $\mathbf{X}$ of base function vectors. The least squares estimator is $\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{D}$, which yields $\hat{\mathbf{D}} = \mathbf{X}\hat{\boldsymbol{\beta}}$. This can be reformulated as $\hat{\mathbf{D}} = \mathbf{S}\mathbf{D}$, where the so-called projection matrix (aka hat, smoother, or influence matrix) $\mathbf{S} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ describes the projection from observations to least squares estimates (Cardinali et al., 2004). The diagonal elements of S are called leverages. The sum of leverages, i.e., the trace of S, is interpreted as the model DoF. Thus, for linear models, equation 12 turns into $\text{DoF} = \operatorname{tr}(\mathbf{S})$ and yields the number of linearly independent predictors (Janson et al., 2015), i.e., the number of parameters: $\text{DoF} = N_p$.
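For the Gaussian linear case, the chain from projection matrix to DoF to EPE (equations 11 and 12) can be written out in a few lines. The sketch below uses synthetic data and an assumed known error variance purely for illustration; it is not part of the original paper:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 0.5 ** 2                                   # assumed known error variance
n_s, n_p = 40, 4
X = rng.normal(size=(n_s, n_p))                     # matrix of base function vectors
beta_true = np.array([1.0, -2.0, 0.5, 0.0])
D = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=n_s)

# Projection ("hat") matrix S; its trace (sum of leverages) gives the model DoF
S = X @ np.linalg.solve(X.T @ X, X.T)
dof = np.trace(S)                                   # equals Np for a linear model

# Training error and expected prediction error (equation 11, Mallows' Cp form)
D_hat = S @ D
rss = np.sum((D - D_hat) ** 2)
epe = rss + 2.0 * dof * sigma2
print(f"DoF = {dof:.2f}, RSS = {rss:.2f}, EPE = {epe:.2f}")
```

For a nonlinear model, the DoF could instead be estimated via equation 12, e.g., by bootstrapping perturbed data sets and recording the covariance between perturbed observations and their refitted predictions.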

In general, the interpretation of the model DoF has to be done carefully. In linear (polynomial) regression, the DoF are equal to the number Np of nonredundant free parameters in the model and are therefore accepted as a model complexity measure (van der Linde, 2012). For nonlinear models, the DoF can be smaller or larger than the actual number of parameters (Janson et al., 2015). This counteracts our intuition of counting flexible parts of the model and makes it all the more important to consider the model DoF as a complexity measure representing sensitivities rather than a parameter count. In a similar spirit, there are methods (which can also be used for model selection) that bound the predictive error using, e.g., structural risk minimization (e.g., Friedman et al., 2001), the so-called covering number (Cucker & Smale, 2002), or related concepts (e.g., Pande et al., 2009, 2015).

In summary, the sensitivity-based DoF estimated from the available data $\mathbf{D}$ quantify the risk of poorly predicting new data $\tilde{\mathbf{D}}$ due to unstable model inversion. This is a shortcut to classic (non-Bayesian) cross-validation approaches (e.g., Friedman et al., 2001).

3.1.3 A1‐Type Versus A0‐Type Criteria

For linear models and uncorrelated Gaussian errors, it can be shown that the A0-type EPE (equation 11) is equivalent to the A1-type AIC (equation 4; Boisbunon et al., 2014; Mallick & Yi, 2013). In this special case, the model complexity representation by Np coincides in both criteria. This coincidence is one of the reasons why model DoF and $p_D$ are often used interchangeably to quantify model complexity despite their different motivations. In classical statistics, DoF are a measure for the "number of dimensions in which a random vector may vary" (Janson et al., 2015). This interpretation also suits the flexible-parts view on DoF and further encourages the interchangeable use with the effective number of parameters $p_D$. Further, in the large-sample limit $N_s \rightarrow \infty$, the influence of the Bayesian parameter prior in the DIC and WAIC declines. This makes A1-type criteria asymptotically equivalent to A0-type criteria.

Despite these similarities, they are not the same thing: A0-type DoF in model selection refer to the difference between two kinds of error (training and test). They resemble the summed-up sensitivities of predictions to perturbations in the calibration data and can be seen as quantifying the stability of the model inversion. A1-type $p_D$ refers to the information-theoretically motivated distance between probability distributions of observations and model predictions, which requires the incorporation of a Bayesian parameter prior. Due to their different motivations, they correspond to different nonconsistent model selection classes, as shown in Figure 5.

Figure 5. Classifying common nonconsistent model selection criteria based on whether they are Bayesian or not. Subscripts display the level of Bayesianism of the particular method. A1-type criteria are Bayesian and target predictive density. The implicit (indicated by *) method approximated by the predictive IC in this class is Bayesian cross-validation (BCV*). The strength of underlying assumptions increases from WAIC to AIC. A0-type criteria are non-Bayesian and target predictive error, exemplified by the expected prediction error (EPE). In the large-sample limit ($N_s \rightarrow \infty$), the influence of the Bayesian parameter prior declines. Then, A1-type and A0-type criteria approach asymptotic equivalence (dashed line).

Overall, the common ground of nonconsistent model selection (shown in A1 and A0) can be summarized by three points:
  1. Model complexity is an estimate for the lack of generalizability to unseen data $\tilde{\mathbf{D}}$ after seeing $\mathbf{D}$. In a quite counterintuitive manner, this is estimated based on just the calibration data $\mathbf{D}$ (in IC), i.e., without actually considering a validation data set $\tilde{\mathbf{D}}$ (as in CV).
  2. Model complexity is evaluated for the model having a certain parametrization (calibration) conditional on the data $\mathbf{D}$.
  3. Their behavior in the limit of $N_s \rightarrow \infty$ is asymptotically equivalent.

3.2 B‐Type Model Selection: Consistency

Consistent model selection will identify the data‐generating true model in the large‐sample limit if it is among the candidate models. This can be pursued based on either model probabilities (B1, section 3.2.1) or code length (B0, section 3.2.2) as target quantities.

3.2.1 B1: Posterior Model Probability

Model selection based on model probabilities is commonly known as Bayesian model selection (BMS; Hoeting et al., 1999). BMS estimates the probability of a model Mk to represent the DGP out of a set of NM discrete candidate models. The Bayesian model probability framework allows incorporating both prior knowledge on model parameters and on the probabilities of models being the true one, i.e., it covers all levels of Bayesianism. In BMS, observations are used to update prior model probabilities $P(M_k)$, called weights, to posterior model weights $P(M_k|\mathbf{D})$. The posterior model probability $P(M_k|\mathbf{D})$ of a model Mk in an ensemble of NM models writes as
$P(M_k|\mathbf{D}) = \dfrac{P(\mathbf{D}|M_k)\,P(M_k)}{\sum_{l=1}^{N_M} P(\mathbf{D}|M_l)\,P(M_l)}$ (13)
with the model-specific marginal likelihood $P(\mathbf{D}|M_k)$.

Theoretically, with more data $\mathbf{D}$ coming in, the posterior probability $P(M_{true}|\mathbf{D})$ of the true model converges to 1 as a result of the consistency property of BMS (Aho et al., 2014). The ratio of posterior model probabilities between two models Mk and Ml is often referred to as posterior odds, which are the prior odds updated by the so-called Bayes factor $B_{kl} = P(\mathbf{D}|M_k)/P(\mathbf{D}|M_l)$ (Kass & Raftery, 1995): $\frac{P(M_k|\mathbf{D})}{P(M_l|\mathbf{D})} = B_{kl}\,\frac{P(M_k)}{P(M_l)}$.

The term $P(\mathbf{D}|M_k)$ in the denominator of equation 13 denotes the Bayesian model evidence (BME; as introduced in section 2.2). It is used to update from prior to posterior model weights. BME is the likelihood $p(\mathbf{D}|\boldsymbol{\theta}_k, M_k)$ that model Mk with parameters $\boldsymbol{\theta}_k$ generated the data $\mathbf{D}$, marginalized over its parameter prior $p(\boldsymbol{\theta}_k|M_k)$:
$P(\mathbf{D}|M_k) = \int p(\mathbf{D}|\boldsymbol{\theta}_k, M_k)\,p(\boldsymbol{\theta}_k|M_k)\,d\boldsymbol{\theta}_k$ (14)
BME describes an implicit trade-off between goodness of fit and model complexity (Schöniger et al., 2014), when the likelihood $p(\mathbf{D}|\boldsymbol{\theta}_k, M_k)$ is averaged over the entire parameter prior pdf. A good model can achieve high likelihood values for well-chosen parameter values, but a complex model dilutes its high likelihood with lower likelihood values in a wide parameter distribution. BME is the normalizing constant in Bayes' theorem for parameter inference, which can be rearranged as
$P(\mathbf{D}|M_k) = \dfrac{p(\mathbf{D}|\boldsymbol{\theta}_k, M_k)\,p(\boldsymbol{\theta}_k|M_k)}{p(\boldsymbol{\theta}_k|\mathbf{D}, M_k)}$ (15)

In this interpretation, model complexity is expressed as the ratio between the parameter prior and the parameter posterior. This ratio is interpreted as the shrinking of the prior toward the posterior, i.e., the information about the parameters which is gained by incorporating $\mathbf{D}$ (Bishop, 1995; Ghahramani, 2013; MacKay, 1992). Clearly, this interpretation of model complexity seeks to quantify the lack of identifiability of parameters.

It is important to note that the Bayesian Occam's razor only includes information about parameters that are constrained by (or sensitive to) the data. Parameters for which there is no shrinking—because they do not relate to the data $\mathbf{D}$—do not have an effect on the BME (Trotta, 2008): if changing a certain parameter does not affect the model fit, the likelihood function is constant along this parameter dimension. Then, marginalizing over this parameter in equation 14 does not change $P(\mathbf{D}|M_k)$. Further, BME does not provide information about the generalizability of the model (Pitt et al., 2002) in the sense of yielding a score for predicting new data $\tilde{\mathbf{D}}$. The consistent point of view of the marginal likelihood exclusively answers how likely it is that the model under consideration has generated $\mathbf{D}$.

There are many numerical implicit methods for estimating BME, like prior integration, thermodynamic integration, nested sampling, or Chib's method (see Friel & Wyse, 2012; Liu et al., 2016; Schöniger et al., 2014). Prior integration is a direct evaluation of equation 14 and yields the so-called arithmetic mean estimator: parameters are drawn from the prior $p(\boldsymbol{\theta})$ and the corresponding likelihood is evaluated for each parameter sample. The average of these likelihood values is then used to estimate the marginal likelihood. With the entire prior sampled, this average converges to the BME. Prior integration has been shown to be very reliable (Schöniger et al., 2014). However, like the other implicit methods, it is computationally very demanding. Some information criteria offer computationally more feasible explicit alternatives but come with strong assumptions. For readability, the above notation of being conditional on model Mk is dropped in the following; however, all quantities are still model specific. Further, the likelihood notation $p(\mathbf{D}|\boldsymbol{\theta})$ is again substituted by the equivalent $L(\boldsymbol{\theta}|\mathbf{D})$ already used in section 3.1.
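A minimal sketch of the arithmetic mean estimator and of the resulting posterior model weights (equations 13 and 14) is given below; the two candidate models, the prior, and the data are hypothetical and serve only to illustrate the mechanics, not to reproduce any result from the paper:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
sigma = 1.0
data = rng.normal(loc=1.5, scale=sigma, size=15)      # toy observations D

def log_bme_prior_sampling(log_lik, prior_sampler, n_draws=20_000) -> float:
    """ln BME via the arithmetic mean estimator (Monte Carlo version of equation 14)."""
    thetas = prior_sampler(n_draws)
    log_lik_values = np.array([log_lik(t) for t in thetas])
    # average of the likelihood values, computed in log space for numerical stability
    return float(np.logaddexp.reduce(log_lik_values) - np.log(n_draws))

# Two hypothetical candidates: M1 fixes the mean at 0, M2 treats it as uncertain.
log_bme = {
    "M1": float(stats.norm.logpdf(data, loc=0.0, scale=sigma).sum()),  # no free parameter
    "M2": log_bme_prior_sampling(
        lambda mu: stats.norm.logpdf(data, loc=mu, scale=sigma).sum(),
        lambda n: rng.normal(loc=0.0, scale=5.0, size=n)),             # prior on the mean
}

# Posterior model weights (equation 13) under equal prior model probabilities
log_vals = np.array(list(log_bme.values()))
weights = np.exp(log_vals - np.logaddexp.reduce(log_vals))
for name, w in zip(log_bme, weights):
    print(name, "posterior model weight:", round(float(w), 3))
```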

For linear models with a Gaussian parameter prior and a Gaussian likelihood, the Kashyap information criterion (KIC; Kashyap, 1982) provides an analytically correct solution to the marginal likelihood via $\text{KIC} = -2\ln P(\mathbf{D}|M_k)$. It is based on the Laplace approximation around the maximum a posteriori estimator (MAP) $\hat{\boldsymbol{\theta}}_{MAP}$ (Schöniger et al., 2014) and it can be applied whenever this approximation is valid:
$\text{KIC} = -2\ln L\!\left(\hat{\boldsymbol{\theta}}_{MAP}|\mathbf{D}\right) - 2\ln p\!\left(\hat{\boldsymbol{\theta}}_{MAP}\right) + N_p\ln\!\left(\dfrac{N_s}{2\pi}\right) + \ln\left|\bar{\mathbf{F}}\!\left(\hat{\boldsymbol{\theta}}_{MAP}\right)\right|$ (16), with the Fisher information matrix normalized by the number of observations, $\bar{\mathbf{F}} = \mathbf{F}/N_s$.

The KIC allows for splitting up the logarithmic marginal likelihood explicitly into a goodness of fit term (first term in equation 16) and a so-called logarithmic Occam factor (OF; MacKay, 1992) comprising three complexity terms. A detailed discussion on the effect of each of these terms can be found in Schöniger et al. (2014). In summary, the first two of these terms penalize complexity with respect to the number and prior uncertainty of parameters and partially balance each other by mutual compensation. The last term can be interpreted as a penalty for low parameter sensitivity toward the data, i.e., for poor parameter identifiability by the given data $\mathbf{D}$.

As an alternative to evaluating the KIC at the MAP, the KIC is frequently evaluated at the maximum likelihood estimator (MLE; Neuman, 2003; Schöniger et al., 2014; Ye et al., 2004). For large (informative) data sets, the likelihood function dominates the posterior parameter distribution and the MAP approaches the MLE. Hence, in these cases, BME can be reasonably approximated by evaluating the KIC terms at the MLE.

The popular Schwarz or Bayesian IC (SIC/BIC; Schwarz, 1978) is the most compact approximation to BME. The BIC is derived in the limit of infinite sample size $N_s \rightarrow \infty$. Then, again, the MLE $\hat{\boldsymbol{\theta}}_{MLE}$ becomes equal to the MAP. In this limit, all Occam factor terms in equation 16 that are not affected by Ns drop because they become negligible. The Occam factor that remains is $N_p\ln N_s$. The BIC therefore writes as
$\text{BIC} = -2\ln L\!\left(\hat{\boldsymbol{\theta}}_{MLE}|\mathbf{D}\right) + N_p\ln N_s$ (17)

In theory, the BIC converges to BME in the limit of infinite sample size. In practice, it is criticized for yielding unsatisfactory results even for large Ns (Kass & Raftery, 1995). Nonetheless, the BIC is the most popular consistent information criterion due to its simplicity. Hence, the whole branch of consistent model selection is often referred to as BIC‐type model selection (Aho et al., 2014). Like the AICc in nonconsistent model selection, there is a proposed correction for small sample sizes called BICc (McQuarrie, 1999).
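For completeness, the BIC of equation 17 can be computed in the same way as the AIC earlier, only with a penalty that grows with ln(Ns); the candidate values below are hypothetical and purely illustrative:

```python
import numpy as np

def bic(max_log_lik: float, n_params: int, n_samples: int) -> float:
    """BIC = -2 ln L(theta_MLE | D) + Np ln(Ns)  (equation 17)."""
    return -2.0 * max_log_lik + n_params * np.log(n_samples)

# Hypothetical candidates fitted to the same Ns = 200 observations: (max ln L, Np)
candidates = {"M1": (-310.4, 2), "M2": (-305.9, 5), "M3": (-305.1, 9)}
for name, (log_lik, n_p) in candidates.items():
    print(name, "BIC:", round(bic(log_lik, n_p, 200), 1))
```

Because the BIC approximates $-2\ln P(\mathbf{D}|M_k)$, normalized values of exp(-BIC/2) over the candidate set give an approximation to the posterior model probabilities of equation 13 under equal prior model weights.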

So far, reliable BME evaluation methods either rely on strong assumptions, like the above explicit criteria, or are computationally demanding, like the mentioned implicit schemes. Therefore, it is not possible to measure model complexity in the above sense explicitly when these assumptions do not hold (Schöniger et al., 2014). Criteria like the Watanabe-Bayesian IC (WBIC; Watanabe, 2013) have been proposed to resolve this issue, but in various cases they perform poorly in approximating the BME when tested against implicit methods that assess BME directly (see, e.g., Friel et al., 2016; Mononen, 2015).

In summary, model complexity quantified as the Occam factor (OF) as part of the BME is the knowledge gain between parameter prior and posterior. It grows with the data size Ns, and only parameters that are affected by $\mathbf{D}$ (i.e., are identified by $\mathbf{D}$) contribute to this value.

3.2.2 B0: Code Length

In coding theory, a model is considered to be “a compact representation of possible data one could observe” (Ghahramani, 2013). The coding‐theoretic Kolmogorov(‐Chaitin) complexity (KC) formalizes this concept of a model by evaluating the complexity of a sequence (Grünwald, 2000). KC is the shortest code in bits that can produce a certain output, e.g., a sequence of symbols like a series of data, and then halts (Grünwald & Vitányi, 2003). For reasons not further discussed here, KC is considered to be incomputable (Rathmanner & Hutter, 2011).

From a coding theory point of view, everything can be encoded. In this spirit, fitting a model can be considered as encoding the data. The shortest coded compression of the data $\mathbf{D}$ is the simplest statistical model that can reproduce $\mathbf{D}$ (Grünwald, 2000; Rissanen, 1978). The idea of compressing data is based on the assumption that there is pattern or structure in the observations. A set of data without any structure cannot be compressed easily and each data point has to be stored explicitly. This enlarges the code and makes the required model more complex. The more compression due to redundancy or structure is possible, the better a simple model can describe the regularities behind the observations (Myung et al., 2000).

This perspective motivated the development of model selection based on Minimum Description Length (MDL; Rissanen, 1978). The MDL of a model candidate is the code length it needs to compress $\mathbf{D}$ (Myung et al., 2006). The popular version of the MDL presented here is formalized as the so-called Fisher information approximation (e.g., Vandekerckhove et al., 2015; see section 3.1.1 for details on the Fisher information matrix F):
$\text{MDL} = -\ln L\!\left(\hat{\boldsymbol{\theta}}_{MLE}|\mathbf{D}\right) + \dfrac{N_p}{2}\ln\!\left(\dfrac{N_s}{2\pi}\right) + \ln\int\sqrt{\det\bar{\mathbf{F}}(\boldsymbol{\theta})}\,d\boldsymbol{\theta}$ (18)

The MDL consists of two parts (e.g., Barron et al., 1998; Myung et al., 2000): a first part represents the code length that is needed to describe the deviations between data and best fit model predictions (goodness of fit). A second part encodes the functional relations of the model and its parameters, i.e., the complexity of the model, called geometric complexity (GC). The idea behind GC is that a model generates (likelihood) distributions $L(\boldsymbol{\theta}|\mathbf{D})$. A model therefore represents a "family of probability distributions consisting of all the different distributions that can be reproduced by varying $\boldsymbol{\theta}$" (Myung et al., 2006). The complexity of the model then refers to how similar these distributions are (Rissanen, 1996). A model is considered to be simple if the distributions are hardly distinguishable: less distinguishability means more structure in the data, and more structure means more compressibility in code (Myung et al., 2000).

In so-called entropy encoding, code length is approximately proportional to the negative logarithmic probability density (Friedman et al., 2001; Myung et al., 2006). A high density means small deviations or small errors, which in turn require only short pieces of code to be compressed (Barron et al., 1998). Hence, the goodness of fit term and the GC terms in equation 18 can be interpreted as code lengths.

The GC in equation 18 can be seen as the logarithm of the counted number of distinguishable distributions over the model's whole parameter space, hence growing with Np. The counting is based on a differential-geometric distance measure which employs the Fisher information matrix normalized by the number of observations, $\bar{\mathbf{F}} = \mathbf{F}/N_s$. With this metric it is possible to quantify how "close" the distributions are, i.e., whether they can be distinguished and counted separately or not—for more details refer to, e.g., Myung et al. (2000) and references therein.

In summary, GC is a coding-theoretic counter for the distinguishable distributions produced by the model, and it grows with data size Ns. MDL selects the model with the highest ratio between goodness of fit and the number of distributions the model can generate (Myung et al., 2000). It can be used for non-Bayesian consistent model selection (Lanterman, 2001) but is numerically demanding when no closed-form solutions for the evaluation of GC are available.
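For models whose Fisher information is available in closed form, the GC integral in equation 18 can be evaluated numerically. The following toy sketch (our own illustration, not from the original paper) does this for a one-parameter Bernoulli "coin flip" model, for which the per-observation Fisher information is 1/(θ(1−θ)):

```python
import numpy as np
from scipy.integrate import quad

def geometric_complexity_bernoulli(n_samples: int) -> float:
    """GC term of equation 18 for a Bernoulli model with Np = 1.

    The integral of sqrt(det F_bar(theta)) over [0, 1] equals Beta(1/2, 1/2) = pi;
    it is computed numerically here to mimic the general case.
    """
    integral, _ = quad(lambda th: (th * (1.0 - th)) ** -0.5, 0.0, 1.0)
    n_p = 1
    return 0.5 * n_p * np.log(n_samples / (2.0 * np.pi)) + np.log(integral)

def mdl_bernoulli(data: np.ndarray) -> float:
    """MDL = -ln L(theta_MLE | D) + GC for a sequence of 0/1 observations."""
    n_s, k = len(data), int(data.sum())
    theta_mle = k / n_s
    neg_log_lik = -(k * np.log(theta_mle) + (n_s - k) * np.log(1.0 - theta_mle))
    return neg_log_lik + geometric_complexity_bernoulli(n_s)

data = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 1])
print("MDL (nats):", round(mdl_bernoulli(data), 3))
```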

3.2.3 B1‐Type Versus B0‐Type Criteria

Interestingly, the integrand of the GC in equation 18 coincides with the so-called Jeffreys prior for the parameters, $p(\boldsymbol{\theta}) \propto \sqrt{\det\bar{\mathbf{F}}(\boldsymbol{\theta})}$ (Myung et al., 2006). This prior is used as a kind of noninformative or "objective" prior in Bayesian model selection (Barron et al., 1998). For large Ns and using the Jeffreys prior on parameters, BMS is identical to model selection using MDL (Myung et al., 2006). Further, in the limit of $N_s \rightarrow \infty$, the last term of the MDL in equation 18 becomes negligible due to its independence of sample size, and the prior model weights and prior parameter distribution become irrelevant. Then, MDL becomes proportional to the BIC because the complexity terms in both criteria scale equivalently with $\ln N_s$ and Np (Barron & Cover, 1991; Hansen & Yu, 2001; Myung et al., 2000; Shiffrin et al., 2016).

Despite this asymptotic equivalence, the criteria and complexity representations differ fundamentally between the two classes (B1 and B0), as depicted in Figure 6: model complexity in B1-type criteria (OF) relates to how much knowledge about the parameters was inferred from the data $\mathbf{D}$, shrinking the prior to the posterior distribution. A complex model in this sense is a model whose parameters can hardly be constrained and identified with $\mathbf{D}$. B0-type model complexity according to coding theory is measured as GC and relates to the compressibility of data. A complex model in this sense is one that needs a long code to describe the regularities of the data $\mathbf{D}$. Apart from the special cases above, real BMS requires using posterior model weights and Bayesian parameter and model distributions, none of which is supported by MDL. Therefore, the two classes generally lead to different selection results (Grünwald, 2000), unless the true model is actually among the candidates and will eventually be selected.

Figure 6. Classifying common consistent model selection criteria based on them being Bayesian or not. Subscripts display the level of Bayesianism of the particular method. B1-type criteria are Bayesian and target at model probability. The implicit (indicated by *) method approximated by the other criteria in this class is direct evaluation of the Bayesian model evidence (BME*). The strength of underlying assumptions increases from KIC to BIC. B0-type criteria are non-Bayesian and target at code length, exemplified by the Minimum Description Length (MDL). In the large-sample limit ($N_s \rightarrow \infty$), the influence of the Bayesian parameter prior declines and B1-type and B0-type criteria approach asymptotic equivalence (dashed line).

Overall, the common ground of consistent model selection (shown in B1 and B0) can be summarized by three points:
  1. Model complexity is a measure for the lack of identifiability of a model and its parameters as representation of the data‐generating process.
  2. Model complexity is an integrated quantity over all possible model parametrizations, unconditional on the data $\mathbf{D}$. Consistent model selection compares model predictions to the data $\mathbf{D}$ over the whole parameter space.
  3. Their behavior in the limit of $N_s \rightarrow \infty$ is asymptotically equivalent.

3.3 A‐Type Versus B‐Type: Large‐Sample Limit

In the limit of infinitely large sample size $N_s \rightarrow \infty$, the parameter prior distribution becomes negligible, and the complexity terms of the criteria within the A-type classes converge, as do those within the B-type classes. However, nonconsistent model selection differs fundamentally from consistent model selection especially in this limit (as schematically depicted in Figure 2 of the illustrative thought experiment). The criteria designed for this limit are the AIC and the BIC, respectively. This is why model selection criteria are often sorted into the so-called AIC-world and BIC-world (Vrieze, 2012; Aho et al., 2014), which conforms with the A-type and B-type classes used in this primer. The respective model complexity terms are shown in Figure 7 in order to visualize the fundamental difference between the two worlds. Remember that the two criteria were designed for the large-sample limit. Nonetheless, AIC and BIC are displayed for small sample sizes in Figure 7 for two reasons: first, they are often applied in practice regardless of this assumption. Second, these prominent members of the two model selection types are perfectly suited to display the deviating model complexity representations between nonconsistent and consistent criteria and what this implies for the selection of models.

Figure 7. Model complexity representation of the AIC (twice the number of parameters: $2N_p$) versus the BIC (the number of parameters scaled by the logarithmic number of observations: $N_p\ln N_s$). Blue isolines display equal complexity in the $(N_s, N_p)$ space, with lighter blue indicating higher complexity (penalty).

In the AIC-world, the complexity penalty generally does not grow with growing Ns, as can be seen in Figure 7. In the AIC, complexity is totally independent of Ns and is given as twice the number of parameters. This resembles the most classic way of bounded complexity representation (Leeb & Pötscher, 2009) in nonconsistent model selection, enabling A-type criteria to successively approach the (infinite-dimensional) truth by switching to "closer" models under growing data size. Opposed to this, in the BIC-world, the complexity penalty constantly increases with $\ln N_s$, as shown in Figure 7. This depicts the consistent nature of model selection criteria in the BIC-world in the simplest way, enabling these criteria to identify the (finite-dimensional) true model that is assumed to be among the model candidates.
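The different growth behavior behind Figure 7 is easy to reproduce: the AIC penalty stays at 2 Np, while the BIC penalty grows with ln(Ns) and overtakes the AIC penalty once ln(Ns) > 2, i.e., for Ns of roughly eight or more. A minimal sketch with an illustrative parameter count:

```python
import numpy as np

n_p = 5  # illustrative number of model parameters
for n_s in (10, 100, 1_000, 10_000):
    aic_penalty = 2 * n_p                 # independent of sample size
    bic_penalty = n_p * np.log(n_s)       # grows with ln(Ns)
    print(f"Ns = {n_s:>6}:  AIC penalty = {aic_penalty:>3},  BIC penalty = {bic_penalty:6.1f}")
```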

3.4 Alternative Model Selection Criteria

Apart from the model selection criteria presented above (AIC, BIC, etc.), many other criteria were developed over the last decades—to the point that nearly a whole alphabet of criteria can be set up (Spiegelhalter et al., 2014). In most of them, model complexity is interpreted and measured differently; some are advances or refinements of other criteria. Additional examples of widely used model selection criteria and complexity measures are the nonconsistent ICOMP (Bozdogan, 1990, 2000), Moody's effective number of parameters (Moody et al., 1992), or the Vapnik-Chervonenkis (VC) dimension (e.g., Friedman et al., 2001), which is used in structural risk minimization (Guyon et al., 2010). Additional consistent model selection criteria are Hannan-Quinn (Hannan & Quinn, 1979) or various versions of encoding complexity (Myung et al., 2006; Rissanen, 1987).

Covering all of these in detail would go beyond the purpose of this primer, but the completed classification scheme in section 4.1 may allocate them as well. Crucial in their application is always how exactly they consider model complexity.

4 Model Selection Put Into Perspective

4.1 Classification Scheme

Choosing the right model selection class, as summarized in Figure 8, starts with asking what the purpose of the model is. This leads to either A-type (approaching truth) or B-type (identifying truth) model selection. The next step is to decide whether a Bayesian perspective, starting with the incorporation of a parameter prior, shall or can be used.

Figure 8. Classifying all introduced criteria based on them being first nonconsistent versus consistent and second Bayesian or not. Subscripts display the level of Bayesianism of the particular method/criterion. A1-type and B1-type criteria are Bayesian, using at least a Bayesian parameter prior. The corresponding criteria are sorted according to the strength of underlying assumptions in approaching their respective implicit (indicated by *) target model selection scores (BCV* versus BME*). A0-type and B0-type criteria are not Bayesian; they are represented by their most common members, EPE and MDL, respectively. In the large-sample limit $N_s \rightarrow \infty$, the influence of the Bayesian parameter prior declines and, respectively, nonconsistent and consistent model selection criteria become asymptotically equivalent (dashed line).

Predictive information criteria (A1‐type) are nonconsistent, and also Bayesian to a certain degree: all of them cover the first level of Bayesianism; DIC and WAIC can incorporate informative priors. Even the AIC assumes a Bayesian parameter distribution, but just a noninformative one. Further, the WAIC uses averaged goodness of fit and model complexity terms (second level). However, none of these criteria is designed to work with Bayesian prior and posterior model weights (third level).

B1-type model selection is consistent and covers all three levels of Bayesianism. Methods of this kind use prior and posterior probabilities for both parameters and models. Although point estimates are used in some of the criteria for assessing the goodness of fit, e.g., in the KIC, they are used under conditions where they coincide with averaged estimators, i.e., the peak of a Gaussian likelihood function is both the mean and the maximum of the distribution. Further, although the first Bayesian level is covered by the BIC, it is irrelevant in the assumed infinite sample size limit.

The (weak-assumptions) end-members of the A1-type and B1-type model selection classes are the implicit evaluation of a Bayesian cross-validation (BCV) score and the implicit evaluation of the Bayesian model evidence (BME), respectively.

Looking back at the three levels of Bayesianism, the third level (model probabilities/weights) occurs only in Bayesian model selection (BMS; Hoeting et al., 1999). The other two levels may occur in both the nonconsistent and the consistent model selection world. As an information‐theoretic equivalent to Bayesian model weights, so‐called Akaike weights (Burnham & Anderson, 2002, 2004) can be used in a similar way. However, these shall not be confused with the concept of Bayesian model weights, because Akaike weights are nonconsistent and have no connection to the notion of (true) model probability.

While the Bayesian perspective is part of the underlying assumptions for A1-type and B1-type criteria, A0-type (e.g., EPE) as well as B0-type (e.g., MDL) selection criteria do not require a Bayesian parameter prior, as depicted in Figure 8. They allow for prior-free nonconsistent or consistent model selection, respectively. Hence, they are immune to misspecified priors but also cannot benefit from the potential advantages of using a prior. However, this does not mean that they cannot be extended in a Bayesian fashion. For example, EPE can be employed with a Bayesian parameter prior as a form of regularization (Mallick & Yi, 2013). Similarly, the MDL (B0-type) presented here can be derived in a non-Bayesian context (Lanterman, 2001) but can be extended to the normalized maximum likelihood (NML) approach (Shiffrin et al., 2016), which is able to incorporate a Bayesian prior.

4.2 The Role of Priors in Model Selection

Generally, the use of priors (for parameters and models) in model selection is a double-edged sword: on the one hand, an inappropriately chosen prior can yield problematic results or even allow a modeler to manipulate a model ranking in favor of a certain candidate model (Gelman et al., 2014). A prior that is too vague does not help either, preventing a clear model selection (Bartlett, 1957; Gelfand & Dey, 1994). The search for priors that are less susceptible to the subjectivity of the modeler is still a large field of ongoing research. Among others, uniform, maximum entropy, or reference priors (van der Linde, 2012) are investigated as such "objective" priors.

On the other hand, using an appropriate (e.g., physics‐based) parameter prior in consistent model selection might sometimes be the only way to get close to the data‐generating truth. From a classic statistics point of view, infinitely many data points have to be collected until the best fit parameter estimate of the true candidate model converges to the true parameters. In reality this is simply impossible, especially when expensive field data are collected. The parameter prior might be the missing piece to select the true model with limited data. Further, it serves as a natural regularization of the model that counteracts overfitting (VanderPlas, 2014). Hence, if for instance mechanistic models are used and a reasonable physical prior for the parameters is available, it shall be used (Vanpaemel, 2009).

4.3 Contrasting the Views on Models and Their Complexity

The four introduced model selection classes differ in the definition of models, the meaning of what a complex model is, how model complexity can be quantified and what the respective complexity measures are. Therefore, Table 2 summarizes the foundations on which the four selection classes try to identify the respective “best” model.

Table 2. Class-Specific Consideration of What Models Are in Principle, What a Best Model Provides and Based on Which Score This Is Measured; Respective Properties of Complex Models, How Complexity Is Represented and Quantified, and Which Model Selection Criteria Work Accordingly

Type | Model is a … | Best model … | … based on … | A complex model … | Complexity … | … quantifies … | Criteria
A1 | Probabilistic attempt to approach truth | Has largest predictive capability | Predictive density | Has low predictive coverage | $p_D$ | Data-constrained parameter number | AIC, DIC, WAIC (BCV)
A0 | Flexible regression of data | Poses most stable inversion | Predictive error | Poses a nonunique inversion problem | DoF | Sensitivity to data perturbations | EPE, Mallows' Cp
B1 | Probabilistic attempt to represent truth | Is most likely data-generating process | Model probability | Allows only weak parameter inference | OF | "Posterior-prior ratio" | BIC, KIC (BME)
B0 | Compression of data series | Is most compact data representation | Code length | Is a too long code | GC | Distinguishable likelihoods | MDL

A1-type criteria consider a model to be a probabilistic attempt to approach the infinitely complex data-generating truth—but only approaching, not representing. The best model achieves the highest predictive capability based on predictive density. A complex model shows a large offset (large $p_D$) between the estimated out-of-sample ($\tilde{\mathbf{D}}$) predictive density and the within-sample ($\mathbf{D}$) predictive density (Vehtari & Ojanen, 2012), because the complexity of the model does not allow the data $\mathbf{D}$ to sufficiently constrain the parameters.

In a similar spirit, A0-type criteria assume that a model is just a more or less flexible regression of the data. This does not need to be inspired by the physical truth behind the data either. The best model obtains the highest predictive capability based on its predictive error for $\tilde{\mathbf{D}}$. A0-type criteria are concerned with the instability (flexibility) of the model inversion. A complex model only allows an unstable or nonunique parameter inversion, which is measured by large sensitivities (large DoF) of model predictions with respect to perturbations in the data $\mathbf{D}$.

B1-type criteria take each model as a probabilistic attempt to truly represent the data-generating process, believing that the true model exists and is among the candidate models. The best model is most likely to have generated the data $\mathbf{D}$ and achieves the highest probability of being the true model. B1-type criteria expect the strongest parameter inference for this model and its prior when faced with the data $\mathbf{D}$. A complex model shows weak parameter identifiability (large Occam factor OF), quantified as the shrinkage ratio from the prior toward the posterior parameter distribution.

Alternatively, B0-type criteria consider each model to be a compression of the data. Thus, they state that the best possible compression of the data requires just a certain code length. The best model is the most compact one, which according to coding theory coincides with the data-generating truth. Compactness of a model is quantified as the number of distinguishable (likelihood) distributions over its parameter space. A complex model in a B0-type sense is a too long compression of $\mathbf{D}$.

4.4 Matching Model Selection Classes With Model Types

The choice of a certain model selection criterion is specific to the model selection task at hand. We outline imaginable extreme cases of matchings for the field of water resources in the following, but there are equivalents in practically all other fields where mathematical or numerical models are employed.

For A‐type model selection in an infinite (relevant) dimensional truth scenario, matching suggestions between selection criteria purpose and models one could think of are the following:
  1. A1-type ($p_D$). Providing high predictive capability via high predictive density for unseen data. The probabilistic nature allows for incorporating prior parameter knowledge—example: bucket-type models for stream discharge or flood forecasting (e.g., Orth et al., 2015). Such models normally include (semi)physical relationships and corresponding prior parameter distributions.
  2. A0‐type (DoF). Obtaining the highest predictive capability in a nonprobabilistic manner via predictive error for unseen data with the most uniquely calibrated model. Besides, the effect of regularizations can be assessed in a second step because reduction in DoF means reduced risk of overfitting—example: regression models (e.g., artificial neural networks) for time series predictions (e.g., R. J. Tibshirani, Degrees of freedom and model search, arXiv preprint arXiv:1402.1920, 2014). Such flexible models are hardly uniquely calibrated and often require some sort of regularization.
For B‐type model selection in a finite (relevant) dimensional truth scenario, matching suggestions between selection criteria purpose and models one could think of are the following:
  1. B1‐type (OF). Identifying the data‐generating model via Bayesian model selection, given that a reasonable prior is provided (e.g., based on physical quantities)—example: partial differential equations (pde)‐based models for groundwater flow (e.g., von Gunten et al., 2014). These mechanistic models and their parameters are subject to prior knowledge and physical meaning. However, it is crucial that prior information on parameters covers potential subsurface heterogeneity, scale‐dependence, etc. to obtain a maximally unbiased prior (e.g., for hydraulic conductivity).
  2. B0‐type (GC). Approaching the true model via minimal required code length, without the need to specify a certain parameter prior distribution—example: stochastic rainfall generators (e.g., Golder et al., 2014). Such a model represents the statistics of the process of interest. The parameters are the statistical moments which describe the process pattern.

These extreme examples of matching highlight the importance of keeping the model purpose, the type of model (data-driven, mechanistic, etc.), and the information about the model parameters in mind (Guthke, 2017) when an appropriate model selection class has to be picked. Further, the consideration of the (relevant) dimensionality of the truth to be modeled affects whether a certain kind of model selection matches with a model (Leeb & Pötscher, 2009).

In the B1-type example for a well-observed system of finite dimensionality, a pde-based model that parametrizes all dimensions can be identified as the true model and will therefore also yield the best predictions. Yet it might be questionable whether real-world problems can ever be considered to be of finite dimensionality. A system can contain so many different state variables and phenomena (fluxes, unclear boundary conditions, heterogeneity at all scales) that even a highly resolved physical model cannot really capture this truth. Then, assuming that the large dimensionality of the underlying true system cannot be fully represented, a nonconsistent method might be a better choice for rating such physical models in terms of approaching the true model (rather than identifying it) to assure maximum predictive capability.

Within the bounds of how accurately and precisely (and at what scale) we can observe the real systems we seek to model, a model can be fully sufficient without consisting of pde-based process representations on the finest imaginable discretization level of time and space. For example, a hydrologic system can be fully covered by a conceptual hydrologic bucket model in cases where some system components and processes have only a small impact on the intended objective or scale of the model, or are overlaid by measurement error. Then, the conceptual model resembles the characteristics of the whole system to a sufficient degree. Hence, a B-type model selection might identify it as the true model because it covers all the relevant dimensions of the truth that affect the predicted quantities of interest. The risk in such a scenario is that the selected model differs from the one that would have been selected by an A-type criterion for highest predictive capability (Vrieze, 2012).

In case fully data-driven models are used, it has to be kept in mind that such models mimic rather than represent the truth. Model predictions might be based on correlation rather than causality. In terms of physical comprehensibility, it is therefore questionable whether B-type model selection can ever identify a resemblance of the data-generating process among such model candidates. From this perspective, A-type model selection appears to be the preferential choice for this type of models, yielding optimal predictive capability. However, it can be expected that with more data coming in, A-type criteria might primarily justify adding more and more terms to the model, which points to one of the core issues of machine learning and related fields (Breiman, 2001; Friedman et al., 2001).

In summary, while our presented scheme helps to allocate model selection criteria and methods, a clear connection to the models which shall be rated is not always straightforward. Nonetheless, being aware of the capabilities of the model selection criteria within the presented classes (when applied to certain models under a more or less graspable truth dimensionality) enables a structured way to model selection. In this sense, the extreme cases above indicate potential, but nonexclusive, matchings. Matching certain model types with other model selection criteria is likewise possible, and we think our scheme can support the reasoning behind it:
  1. Thinking about the dimensionality of the truth, i.e., its relevant system components and processes that have to be modeled and how well they are observed (measurement errors).
  2. Finding corresponding models (i.e., model purpose, model type and model parameters) to approach or represent the truth.
  3. Applying our scheme to find a match between models and model selection method/criterion.

The last two points are part of the model development process, when the degree of detail for each candidate model has to be set. In this process, briefly, it has to be decided how many and which processes, state variables, boundary conditions, etc. shall be parametrized, in which fashion and resolution this shall be done, and how much of that is supported by observables. Hence, these two steps should be iterated throughout model improvement and reassessment.

Once a certain kind of model selection class is picked, one has to note the trade-offs that come with it: if an A-type model selection is chosen, one can legitimately switch toward a more complex model if new data support it. However, if the highest rated model does not change anymore when new data come in, this does not mean that the true model was identified. The method might then only indicate that a model of an even higher level of complexity could be added to the tested set and would be supported by the data. If a B-type model selection method is chosen, one cannot justify switching to another of the candidate models when the selection metric suggests it under new data. However, this kind of model selection is able to identify the true model if it is in the set. If the method switches between candidates in the set under incoming data, the message is that none of the candidates represents the true model well enough for the criterion to converge toward it. Hence, further model development has to be considered and the new model should become part of the tested ensemble.

5 Summary and Conclusion

Model selection methods perform an explicit or implicit trade‐off between goodness of fit with data and model complexity. Generally, no complexity metric in model selection works without incorporating data, which means that there is no unique intrinsic model complexity (e.g., Du, 2016) that could be quantified based only on the model's functional relationships and parameters. The counted number of parameters Np fully represents the complexity of the model only in special cases of model selection (see A‐type).
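As a brief illustration of this point (using the standard textbook forms of two widespread explicit criteria, not a derivation specific to this work), consider how Np enters the Akaike information criterion and the Bayesian information criterion; here $L(\hat{\theta})$ denotes the maximized likelihood and $N_s$ the number of data points, both symbols introduced only for this example:

$$
\mathrm{AIC} = -2\,\ln L(\hat{\theta}) + 2\,N_p, \qquad
\mathrm{BIC} = -2\,\ln L(\hat{\theta}) + N_p \ln N_s .
$$

Only in the AIC‐like (A‐type) case does the penalty reduce to a pure parameter count; in the BIC the weight of Np grows with the size of the data set, and in both cases the criterion as a whole can only be evaluated against data.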

It may seem nonintuitive that the two major model selection types (nonconsistent and consistent) do not lead to selecting the same model. However, they are optimal under different assumptions about the dimensionality of the truth that is modeled. If this truth is infinite dimensional, a model selection method is optimal if it can progressively approach this truth by sticking with one model only until more data justifies switching to another (more complex) one that approaches the truth even more closely (A‐type model selection). Alternatively, if the truth is of finite (relevant) dimensionality, a model selection method is optimal if it identifies the model that fully parametrizes this truth (B‐type model selection). Hence, the two types of model selection pursue different target quantities and can yield deviating results when applied to the same modeling task.

The model purpose must be considered when a particular model selection method is chosen. From a pragmatic point of view, nonconsistent model selection is the right choice for finding the best model for predictions in situations where the modeler cannot be sure that the truth can be sufficiently represented. Then, nonconsistent methods enable optimal use of a certain model until more observations become available and a more complex model can be legitimately employed. Driven by the philosophy of finding the model that represents the truth, a model selected in a consistent manner will avoid being falsified when more data arrives. Consistent selection therefore ranks candidate models (hypotheses) according to how strongly they resist being proven wrong by the data. Therefore, consistent model selection is the right choice for process understanding and scientific hypothesis testing because it is philosophically in line with the scientific approach.

Centered around the specific interpretations of model complexity, we conclude the following major points:
  1. When choosing between model selection criteria, the truth (dimensionality) that shall be approached or represented by a certain type of model indicates the appropriate type of model selection. Whether this modeling purpose can be pursued in a Bayesian way or not directs toward the right model selection class. The assumptions met by the modeling task at hand justify the corresponding method/criterion within each class.
  2. Model selection methods that incorporate Bayesian priors should only be applied if “reasonable” priors can be assigned. The purpose of the prior is to provide a meaningful context for testing models (Nearing & Gupta, 2015), which means it should be neither too vague nor too constrained, in order to allow for a fair model selection. In cases where a “reasonable” prior cannot be assigned, non‐Bayesian model selection methods offer an alternative.
  3. Some of the explicit model selection criteria rely on strong assumptions in order to reliably quantify what they consider to be model complexity. If these assumptions do not hold, we recommend an admittedly more computationally costly but more reliable implicit method, e.g., (Bayesian) cross‐validation (nonconsistent) or direct evaluation of Bayesian model evidence (BME) (consistent); a minimal sketch of both options is given after this list.
  4. For general discussions during qualitative model development and comparison, it does not seem necessary to force our intuitive notion of complexity into a specific definition, which certainly would not be comprehensive and would itself be subject to discussion (see Gell‐Mann, 1995b). However, as soon as a model selection technique is applied, a specific definition and role of model complexity is used and the models are ranked accordingly. A comparison of different model selection metrics therefore only makes sense if the metrics either belong to the same class (e.g., B1) or if their respective interpretations of model complexity are part of the discussion of the results.
  5. Rather than claiming the “best” model was found with a certain model selection criterion, it would be more appropriate to call it “best given the complexity interpretation” of the particular criterion. All of the criteria give the right answer (within their underlying assumptions and limitations), but to different questions.
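To make point 3 concrete, the sketch below (our own minimal illustration under assumed settings, not an implementation from this study) contrasts the two implicit routes for a single linear candidate model: a leave‐one‐out predictive log score as a simple form of cross‐validation, and a brute‐force Monte Carlo estimate of the BME obtained by averaging the likelihood over samples from the parameter prior. The model, the noise level, and the uniform prior box are illustrative assumptions.

```python
# Minimal sketch (assumed toy setting): leave-one-out cross-validation (nonconsistent)
# versus brute-force Monte Carlo estimation of the Bayesian model evidence (consistent)
# for one linear candidate model.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = 0.5 + 1.5 * x + rng.normal(0.0, 0.2, x.size)       # synthetic observations
sigma = 0.2                                             # assumed known measurement error

def simulate(theta, x):
    """Candidate model: straight line with theta = (intercept, slope)."""
    return theta[0] + theta[1] * x

def loo_log_score(x, y):
    """Leave-one-out predictive log score (higher is better)."""
    score = 0.0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        A = np.vstack([np.ones(mask.sum()), x[mask]]).T     # refit without point i
        theta, *_ = np.linalg.lstsq(A, y[mask], rcond=None)
        score += stats.norm.logpdf(y[i], simulate(theta, x[i:i+1])[0], sigma)
    return score

def log_bme(x, y, n_samples=20000):
    """Brute-force BME: average the likelihood over samples from a uniform prior."""
    thetas = rng.uniform([-2.0, -2.0], [4.0, 4.0], size=(n_samples, 2))  # assumed prior box
    loglik = np.array([stats.norm.logpdf(y, simulate(t, x), sigma).sum() for t in thetas])
    return np.logaddexp.reduce(loglik) - np.log(n_samples)  # stable log of the mean likelihood

print("LOO predictive log score:", loo_log_score(x, y))
print("log BME (Monte Carlo)   :", log_bme(x, y))
```

In practice, both quantities would be computed for every candidate model and the models ranked accordingly: the cross‐validation score targets predictive capability (A‐type), whereas the BME targets model probability (B‐type).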

Regardless of which explicit or implicit approach is suitable and used for model selection, we want to emphasize that one should consider and report how the particular method interprets complexity and what this means for the model which is selected.

Acknowledgments

The authors thank the German Research Foundation (DFG) for financial support of the project within the Research Training Group “Integrated Hydrosystem Modelling” (RTG 1829) at the University of Tübingen and the “Cluster of Excellence in Simulation Technology” (EXC 310/1) at the University of Stuttgart. Further, the authors thank Saket Pande, Michael N. Fienen, and one anonymous reviewer for their constructive comments and suggestions that have helped to improve the manuscript. For the manuscript, no models or data were generated or used.
