Evaluation of Methods for Causal Discovery in Hydrometeorological Systems
Abstract
Understanding causal relations is of utmost importance in hydrology and climate research for system identification, prediction, and understanding system behavior in a changing climate. Traditionally, researchers in hydrometeorology have studied causal questions by conducting controlled experiments using numerical models. In most cases of interest, however, this approach provides uncertain results because the models are approximate representations of the natural system. An alternative approach that has recently drawn significant attention in several fields is to infer causal relations from purely observational data. This approach is particularly useful in hydrometeorology because of the rapid accumulation of in situ and remotely sensed data records. The first objective of this study is to present a brief description of four causal discovery methods (Granger causality, Transfer Entropy, graph-based algorithms, and Convergent Cross Mapping) with special emphasis on the assumptions on which they are built. Second, using synthetic data generated from a hydrological model, we assess their performance in retrieving causal information, taking into account sensitivity to sample size and presence of noise. Last, we use causal analysis to examine and formulate hypotheses on causal drivers of evapotranspiration in a shrubland region during summer and winter seasons. An interpretation of the hypotheses based on canopy seasonal dynamics and evapotranspiration processes is presented. It is hoped that the results presented here can guide researchers studying hydrometeorological systems as to which causal method is most appropriate to the characteristics of the system under study.
Key Points
- A brief description of the fundamentals and assumptions of four causal discovery methods is provided
- The methods' performance and sensitivity to sample length and presence of noise are assessed using synthetic data from a hydrologic model
- Causal analysis is applied to examine and formulate hypotheses on the significance of environmental drivers of evapotranspiration
Plain Language Summary
The old cliché “correlation does not imply causation” is grounded in the pitfalls of conventional correlation analysis. Despite its well-known shortcomings, the use of correlation permeates the hydrometeorological literature. Recently, however, methods of causal inference have attracted a lot of attention, and they are increasingly being used to address a wide range of problems. Given the rapid increase in hydrometeorological data acquisition, whether in situ or remotely sensed, the use of causal discovery methods is timely. This article aims to contribute to the burgeoning community effort in advancing the use of causal discovery methods in hydrometeorological applications. It first reviews the assumptions underlying four causal discovery methods: Granger Causality, Transfer Entropy, PC algorithm, and Convergent Cross Mapping. Next, it evaluates the performance of these methods using synthetic data generated from a hydrological bucket model, where the model is considered as “ground truth.” The article concludes by applying one of the methods to examine the relative contributions of environmental variables in regulating evapotranspiration in a shrubland region during summer and winter seasons. The methods used are not exhaustive, since there are countless nuanced variants of causal discovery methods. Nonetheless, they represent long-standing, general classes of causal discovery methods.
1 Introduction
In hydrometeorology, as in other branches of natural sciences, causal inference plays a central role in the acquisition of objective scientific knowledge. Arguably, most questions encountered in hydrometeorology can be framed in the context of cause and effect. Causal questions in the form of hypothesis formulation and validation regarding the interaction of variables and processes are ubiquitous in the literature of hydrology and climate. Most notable are studies concerned with understanding the impact of climate change on the hydrologic cycle (e.g., Barnett et al., 2005; Held & Soden, 2006; Milliman et al., 2008; Piao et al., 2010), resolving ambiguities in the interactions of the coupled Land-Ocean-Atmosphere system (e.g., Charney, 1975; Eltahir, 1998; Entekhabi et al., 1996; Koster et al., 2004) and understanding the impact of land cover and anthropogenic land-use on atmospheric circulations (e.g., de Noblet-Ducoudré et al., 2012; Findell et al., 2017). The underlying question among all these studies is a causal one, and the aim is to understand how a specific variable will change as a result of an intervention in the system. It is important to distinguish causal interactions from associations; the latter characterize the dependence between variables as in standard statistical analysis (regression and correlation), while the former extend the analysis by identifying the variables in a dependence relationship as cause and effect. In order to identify causal interactions, all causal inference methods rest on different sets of assumptions that identify invariant relationships in systems under intervention (Pearl, 2009a). Such a distinction between causality and association in hydrological systems was pointed out by Klemes (1982), who wrote of the relationships obtained by empirical analysis: “The relationships initially discovered are of necessity simple …. They generally tell us what change in one observed quantity correspond to a change in another … they tell us what happens but do not derive the outcome from the dynamic mechanisms governing the process.”
Although experimental research, manipulation, and controlled testing provide a framework to understand causal processes, using such an approach in hydrometeorology is infeasible (e.g., manipulating global or regional climate), costly (e.g., experimental catchments), or inaccurate (e.g., using numerical models for controlled experiments). With these considerations in mind, causal inference from observations is an alternative avenue. Simply stated, the goal of all causal inference methods is to extract information regarding causal interactions among variables in a given system utilizing time series measurements with prior knowledge incorporated in the selection of variables. The last few decades have witnessed significant advancements in the theories as well as the algorithms needed for causal inference from observational data sets. The earliest significant work in empirical causal inference was proposed by Granger (1969). In a seminal paper, Granger formulated a statistical method for causality which states that variable X has a causal effect on variable Y if variable X provides statistically significant information about future values of variable Y (Granger, 1969). The original framework of Granger causality (GC) was later extended by introducing concepts of information flow, resulting in Transfer Entropy (TE) as a measure of causality that is sensitive to both linear and nonlinear relationships (Schreiber, 2000). In addition to the Granger framework, a fundamental work that influenced the field of causality was the introduction of probabilistic graphical models (e.g., Bayesian Networks) and causal diagrams (Pearl, 1988, 1995, 2009b). Furthermore, almost simultaneously with the advances in causal discovery in the fields of statistics and machine learning, fundamental contributions in detecting causality from time series were conceived in the field of dynamical systems (Deyle & Sugihara, 2011; Sugihara et al., 2012; Sugihara & May, 1990). These contributions were built on the theories of time-delay embedding and reconstruction of attractors from time series (Takens, 1981).
The application of causal inference methods in hydrology and climate research has been gaining attention in recent years. Ruddell and Kumar (2009a, 2009b) adapted TE to characterize process networks of ecohydrological systems from observational data sets. Similarly, Sharma and Mehrotra (2014) developed an information theoretic measure to be used in system identification of natural systems. Recently, TE concepts have been adopted to develop process networks taking into account the partitioning of information into synergistic, unique, and redundant components (Goodwell & Kumar, 2017a, 2017b). On the other hand, methods of causal detection based on time-delay embedding have been used to investigate the soil moisture–rainfall feedback (Wang et al., 2018). The outcomes of these studies are encouraging, and they highlight the potential of causal inference in improving identification of hydrometeorological systems from observational data sets. However, comparative studies are lacking, specifically studies that compare the performance of different causal methods in the context of hydrometeorological systems while taking into consideration the challenges and practical limitations frequently encountered in such systems, such as nonlinearity (e.g., threshold behavior), sample size, process and observational noise, and synchronization due to seasonality.
In view of the above discussion, the aim of the present paper is threefold. First, to present briefly four main causal inference methods: GC, TE, graph-based causal methods (PC algorithm), and Convergent Cross Mapping (CCM), and to discuss their theoretical underpinnings and assumptions. Second, to contrast the performance of these methods using synthetic data generated from a hydrological model that is simple yet representative of most features common to environmental systems, and to investigate the impact of sample size and presence of noise on the performance of each method. Third, to use causal analysis in examining the significant environmental drivers of evapotranspiration in a shrubland region and to identify their relative contributions during summer and winter seasons. The overarching objective of this paper is to make contemporary advances in causal inference more accessible to the hydrologic community by bridging the gap between the theoretical notions of causality on one hand and practical applications in hydrology on the other.
To this end, the paper is organized as follows. In section 2, a brief comparative description of the main causal inference methods (GC, TE, PC algorithm, and CCM) is provided while illuminating their fundamental assumptions. Section 3 presents the design of the hydrological model used for evaluating the performance of the causal inference methods; hence, setting the scene for section 4 which reports and discusses the performance of the methods and their sensitivity to sample size and presence of noise. Section 5 examines the significance and relative contributions of environmental variables in regulating evapotranspiration in a shrubland region by causal analysis of observational data sets. Section 6 sums up the main findings of the paper, discusses potential applications of causal methods in hydrology and climate research and suggests possible topics for future research.
2 Measures of Causality
Several methods have been developed over the last few decades for inferring causal relationships from observational data. Broadly speaking, they can be classified into four categories: methods based on linear and nonlinear autoregressive modeling (e.g., GC and TE), graph-based methods (e.g., the PC algorithm), methods founded on the theory of time-delay embedding (e.g., CCM), and Structural Causal Models (SCMs); see Runge et al. (2019) for an overview of various causal discovery methods. The SCM framework is a promising approach for discovering causal relationships, and a handful of methods have been developed based on it (Hoyer et al., 2009; Peters et al., 2017). However, they have mostly been restricted to acyclic graphs (Hoyer et al., 2009), and they are yet to be applied in hydrometeorological research. In this study, we focus on the first three categories of causal inference methods; specifically, the four methods of GC, TE, PC algorithm, and CCM. The first three methods assume the underlying system to be purely stochastic, while CCM assumes that causal interactions arise from a dynamical system (see Table 1). This raises the question of whether hydrometeorological systems should be viewed as stochastic or chaotic dynamical systems. Several arguments exist in support of both views. Here, we do not engage in this debate; however, it is worthwhile to mention that historically both random-process theory and chaotic determinism have been used for the analysis of hydrologic systems. Bras and Rodriguez-Iturbe (1993) provide an overview of applications of stochastic theories in hydrology. On the other hand, several studies examined chaotic determinism in hydrologic systems; Sivakumar (2000) provides a comprehensive review of such studies. Overall, previous studies indicated the existence of deterministic chaos in several hydrological variables such as rainfall (Hense, 1987; Rodriguez-Iturbe et al., 1989) and river flow (Porporato & Ridolfi, 1996).
Property | Granger causality (GC) | Transfer entropy (TE) | PC algorithm | Convergent cross mapping (CCM) |
---|---|---|---|---|
Number of hyperparameters | 1 | 1–4^{a} | 1–4^{a,b} | 4 |
Significance test | Parametric | Nonparametric | Parametric/Nonparametric^{b} | Nonparametric |
Type of system | Stochastic | Stochastic | Stochastic | Deterministic (chaotic dynamical) |
Assumption of linearity | Yes^{c} | No | Yes/No^{b} | No |
Detection of contemporaneous causal interactions | No | No | Yes/No^{d} | Yes |
Reference | (Granger, 1969) | (Schreiber, 2000) | (Spirtes & Glymour, 1991) | (Sugihara et al., 2012) |
Extensions and variants | (Geweke, 1982); (Geweke, 1984); (Barnett & Seth, 2014) | (Staniek & Lehnertz, 2008) | (Spirtes & Glymour, 2000); (Runge et al., 2017) | (Ye et al., 2015) |
- ^{a} Contingent upon the method used to estimate 𝕀(X; Y ∣ Z).
- ^{b} Contingent upon the independence test implemented in the algorithm (partial correlation, conditional mutual information … etc.).
- ^{c} GC assumes a linear model can represent the process; this does not necessarily mean that the system is linear as some nonlinear processes can be modeled by linear regression models.
- ^{d} Depends on whether the assumption of time precedence (the cause precedes the effect) is used for directing the links or not.
In this section, our intent is to succinctly present the fundamentals of the four methods (GC, TE, PC, and CCM), and shed light on their assumptions with an eye to hydrometeorological systems. In the following discussion, we will generally make three assumptions about the system subjected to causal analysis. First, causal sufficiency (i.e., all variables are directly observed, and there are no unmeasured or unmeasurable variables in the system). Second, causal faithfulness (i.e., there exists an underlying causal structure such that all probability distributions generated by the system satisfy independence relations implied by its causal structure). Finally, time series are stationary.
2.1 Granger Causality
GC was developed in the late 1960s by Clive Granger (Granger, 1969), and it is perhaps the first practical method to test for causality. GC is defined in both time and frequency domains, and it relies primarily on a fundamental assumption that the cause precedes the effect in time (i.e., two variables occurring at the same time step cannot be causally related). Although this assumption may appear to be trivial and intuitive, it has significant implications because causality is interpreted based on the time ordering of events; hence, the original method (Granger, 1969) is unable to detect contemporaneous causality. In addition, two secondary assumptions underlie GC. First, the cause provides useful information for predicting the effect at future time steps. Second, any variable in the system can be represented linearly by lagged values of system variables and an error term. That is, the system can be represented by a vector autoregressive (VAR) model. The latter condition implies that the underlying system is linear and stochastic, although some nonlinear processes can be modeled as VARs. It should be noted that the definition of causality in GC does not conform to stricter definitions of causality such as that of Pearl (2009b).
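For a three-variable system (X, Y, Z) and the hypothesis that Y causes X, the two regression models compared by GC can be sketched as below. This is a reconstruction from the coefficient definitions that follow, not a verbatim reproduction: the tilde used to mark quantities of the second (restricted) model and the lag range k = 1, …, p are notational assumptions.

```latex
% Unrestricted model (equation 2): lags of X, Y, and Z
X_t = \sum_{k=1}^{p} c_{xxk}\, X_{t-k}
    + \sum_{k=1}^{p} c_{xyk}\, Y_{t-k}
    + \sum_{k=1}^{p} c_{xzk}\, Z_{t-k}
    + \varepsilon_{xt}

% Restricted model (equation 3): lags of Y omitted
X_t = \sum_{k=1}^{p} \tilde{c}_{xxk}\, X_{t-k}
    + \sum_{k=1}^{p} \tilde{c}_{xzk}\, Z_{t-k}
    + \tilde{\varepsilon}_{xt}
```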
where c_{xxk} and its second-model counterpart are the regression coefficients of X_{t} regressed on X_{t−k} in the first and second models, respectively. Similarly, c_{xzk} and its second-model counterpart are the regression coefficients of X_{t} regressed on Z_{t−k} in the first and second models, respectively. c_{xyk} is the regression coefficient of X_{t} regressed on Y_{t−k}, while ε_{xt} and its second-model counterpart are the residuals of the two models.
The first model (equation 2) is an unrestricted regression model that includes lags of variable Y, while the second model (equation 3) is a restricted regression model that omits them. The null hypothesis (H_{0}) that Y does not cause X is rejected if the first model is shown to improve estimation compared to the second model; that is, if the difference between the residuals of the two models is statistically significant according to an F test at a given significance level (α). See Text S1 in the supporting information for the exact F-statistic computation. Apart from α, the model order p must be defined in order to perform GC. The model order p specifies the maximum time lag included in the model, and it is commonly defined by adding significant terms according to a t test. Alternatively, techniques such as the Akaike and Bayesian information criteria (AIC and BIC) are used to define the model order (Barnett & Seth, 2014). GC is a common method to test for causality due to its simplicity, and it has been used for testing causal interactions in hydrometeorological research (e.g., Green et al., 2017; Tuttle & Salvucci, 2017). For more information on the implementation of the GC method, see Text S1.
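The comparison of restricted and unrestricted models can be illustrated with a minimal sketch. It assumes ordinary least squares with an intercept and the standard nested-model F statistic; the exact computation used in the study is given in Text S1 and may differ. The function names (`lagged`, `rss`, `granger_f`) and the synthetic series are hypothetical and serve only as an illustration.

```python
import numpy as np

def lagged(series, p):
    """Columns series[t-1], ..., series[t-p] for t = p, ..., n-1."""
    n = len(series)
    return np.column_stack([series[p - k : n - k] for k in range(1, p + 1)])

def rss(target, design):
    """Residual sum of squares of an OLS fit with an intercept."""
    design = np.column_stack([np.ones(len(target)), design])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    return float(resid @ resid)

def granger_f(x, y, z, p):
    """F statistic for H0: Y does not Granger-cause X, conditioning on Z."""
    target = x[p:]
    restricted = np.column_stack([lagged(x, p), lagged(z, p)])
    unrestricted = np.column_stack([restricted, lagged(y, p)])
    rss_r = rss(target, restricted)
    rss_u = rss(target, unrestricted)
    dof = len(target) - unrestricted.shape[1] - 1  # minus the intercept
    return ((rss_r - rss_u) / p) / (rss_u / dof)

# Synthetic example (assumed for illustration): Y drives X with a lag of 1.
rng = np.random.default_rng(0)
n = 2000
y = rng.standard_normal(n)
z = rng.standard_normal(n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.5 * x[t - 1] + 0.8 * y[t - 1] + 0.1 * rng.standard_normal()

f_y_to_x = granger_f(x, y, z, p=2)  # large: lags of Y improve the fit of X
f_x_to_y = granger_f(y, x, z, p=2)  # small: lags of X do not help predict Y
```

Under H_{0}, the statistic would be compared with an F distribution with p and `dof` degrees of freedom at the chosen significance level α.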
2.2 Transfer Entropy
The generality of TE (i.e., its sensitivity to both linear and nonlinear relationships) comes at a hefty cost. Specifically, unlike GC, there is no analytical form for the test statistic under the null hypothesis; thus, statistical significance is determined through numerical approximations. One issue of concern when evaluating TE is estimating the probability density functions (pdfs) in equation 4. Several methods exist for approximating pdfs; these include partitioning the data into finite size bins, kernel density estimators (Moon et al., 1995), and nearest neighbor approaches (Kraskov et al., 2004). In this study, we use the nearest neighbor approach; its main advantage is that it is data-adaptive, such that the resolution is higher where data are numerous (Kraskov et al., 2004). This means that the neighborhood is smaller where data are dense and larger where data are sparse. It is efficient when there is a high probability that a variable takes specific values (e.g., zeros); such variables are common in hydrometeorological systems due to threshold behavior (e.g., rainfall and other fluxes; see Figure 3c). For a comparison of nearest neighbor and kernel estimators, refer to Runge (2017). In practice, the sets of lagged values that enter equation 4 cannot be infinite; therefore, the model order p must be defined. Furthermore, contingent upon the estimator chosen to approximate equation 4, a set of additional hyperparameters must be specified (see Table 1). See Text S2 for more information on assigning values to these parameters and the computation of TE.
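To make the role of the density estimator concrete, the following minimal sketch estimates TE with the simplest of the options listed above, a plug-in (binned) estimator, rather than the nearest neighbor estimator used in this study. It assumes already-discretized series, a single lag (p = 1), and no conditioning on additional variables; the function names are hypothetical.

```python
import math
import random
from collections import Counter

def entropy(samples):
    """Plug-in Shannon entropy (bits) of a list of hashable symbols."""
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def transfer_entropy(source, target):
    """Plug-in TE(source -> target) with lag 1, in bits.

    Uses I(A; B | C) = H(A, C) + H(B, C) - H(C) - H(A, B, C) with
    A = target_t, B = source_{t-1}, C = target_{t-1}.
    """
    now = target[1:]
    past = target[:-1]
    src_past = source[:-1]
    h_ac = entropy(list(zip(now, past)))
    h_bc = entropy(list(zip(src_past, past)))
    h_c = entropy(list(past))
    h_abc = entropy(list(zip(now, src_past, past)))
    return h_ac + h_bc - h_c - h_abc

# Synthetic example (assumed for illustration): x copies y with a lag of 1.
rng = random.Random(0)
y = [rng.randint(0, 1) for _ in range(5000)]
x = [0] + y[:-1]

te_y_to_x = transfer_entropy(y, x)  # close to 1 bit
te_x_to_y = transfer_entropy(x, y)  # close to 0 bits
```

Continuous series would first have to be binned, and the number of bins is then one of the additional hyperparameters mentioned above.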
2.3 Graph-Based Methods
In graph theory, a directed graph representation describes the causal relationships among variables in a given system. The representation consists of nodes representing variables, and directed edges between nodes representing direct causal influences (Dechter, 2013). Kinship terminology (parent, child, descendants, etc.) is commonly used to describe relations in a causal graph. If the graph encodes probabilistic information (i.e., each directed edge in the graph is quantified by the conditional probability distribution of the child given its parent), the causal graph is called a Bayesian network (Darwiche, 2009; Pearl, 1988). Graph-based causal algorithms utilize a set of graphical rules that govern the retrieval of system causal graphs from nonexperimental data. These rules include, among others, the “d-separation” criterion (Pearl, 1988) and the Causal Markov condition (Kiiveri et al., 1984; Pearl & Verma, 1991). See Pearl (1995) for an in-depth discussion of graphical rules.
The PC algorithm (Spirtes & Glymour, 1991) utilizes graphical rules to recover causal relations among variables from observational data. Given time series of three variables X, Y, and Z, lagged time series of each variable are first constructed, and every lagged series is treated as a distinct variable. For illustration, let us consider only variable X_{t} (black node in Figure 1). First, the PC algorithm starts with the complete (fully connected) undirected graph over all of these variables (Figure 1a). The set of nodes connected to X_{t} (neighbors of X_{t}) is denoted by ne(X_{t}) (red nodes in Figure 1). Second, each node in ne(X_{t}) is tested for independence with X_{t} conditioned on an n-dimensional set of variables drawn from the remaining nodes in ne(X_{t}). At the beginning, n = 0, and edges connecting nodes that are independent of X_{t} are removed (Figure 1b). Then, the algorithm increases n in increments of 1, removing edges whenever conditional independence is found, until n equals the number of elements in ne(X_{t}) (Figure 1c). Finally, the algorithm directs the links using graphical rules.
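The edge-removal stage described above can be sketched for a single target node as follows. This is a minimal illustration, not the implementation used in the study (see Text S3): it assumes a partial-correlation independence test with a Fisher z-transform, treats each column of `data` as one (possibly lagged) variable, and omits the final link-orientation step. All function names are hypothetical.

```python
import math
from itertools import combinations
import numpy as np

def partial_corr(data, i, j, cond):
    """Partial correlation of columns i and j given the columns in cond."""
    idx = [i, j, *cond]
    prec = np.linalg.inv(np.corrcoef(data[:, idx], rowvar=False))
    return -prec[0, 1] / math.sqrt(prec[0, 0] * prec[1, 1])

def independent(data, i, j, cond, alpha):
    """Fisher z-test of zero partial correlation (linear test, assumed here)."""
    n = data.shape[0]
    r = max(min(partial_corr(data, i, j, cond), 0.999999), -0.999999)
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - len(cond) - 3)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return p_value > alpha

def pc_neighbors(data, target, alpha=0.01):
    """Prune the neighbors of `target`, starting from all other columns."""
    neighbors = [c for c in range(data.shape[1]) if c != target]
    n_cond = 0  # dimension of the conditioning set
    while n_cond <= len(neighbors) - 1:
        for cand in list(neighbors):
            others = [c for c in neighbors if c != cand]
            for cond in combinations(others, n_cond):
                if independent(data, target, cand, cond, alpha):
                    neighbors.remove(cand)  # edge target - cand is removed
                    break
        n_cond += 1
    return neighbors

# Synthetic chain (assumed for illustration): A -> B -> C.
rng = np.random.default_rng(1)
a = rng.standard_normal(2000)
b = a + rng.standard_normal(2000)
c = b + rng.standard_normal(2000)
data = np.column_stack([a, b, c])

neighbors_of_a = pc_neighbors(data, target=0, alpha=0.001)
```

In this example, C is dependent on A when no variable is conditioned on, but the edge is expected to be removed once B enters the conditioning set, leaving only B as a neighbor of A.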
The main advantage of the PC algorithm is that it does not require the determination of high-order independence relations as in the GC and TE methods. Note that in testing causal relations in the GC and TE methods (equations 2, 3, and 5), one must condition on all the remaining variables in the system. If the number of variables in a system is very large, this leads to high-order conditioning sets, which undermine the effectiveness of GC and TE. The PC algorithm avoids this by sequentially and systematically conditioning on a specific subset of variables. Time precedence (i.e., cause precedes effect), although not used by Spirtes and Glymour (1991), can be utilized to direct causal links in the PC algorithm (note that it has been used in this study). Alternatively, graphical rules can be used to direct the links; in such a case, the PC algorithm has the advantage of detecting contemporaneous causal interactions. The conditional independence test can be a linear test (partial correlation), conditional mutual information (equation 4), or any other suitable test. For more information on the implementation of the PC algorithm, see Text S3. The PC algorithm has previously been used in several climate studies. Ebert-Uphoff and Deng (2012) used the PC algorithm to derive causal relations among four prominent modes of atmospheric low-frequency variability. Furthermore, several variants of the PC algorithm have been developed (see Table 1). For example, the FCI algorithm (Spirtes et al., 2000) accounts for unobserved common drivers, and the PCMCI algorithm (Runge et al., 2017a) uses PC as a first step and improves performance by reducing false positives that might result from highly autocorrelated time series.
2.4 Convergent Cross Mapping
Unlike the three methods described in previous subsections, CCM rests on a different paradigm of causality; namely, the theory of time-delay embedding (Takens, 1981). Based on this theory and under certain conditions, the manifold of a chaotic dynamical system can be reconstructed using time-lagged observations of a single variable (state). Let us consider a chaotic dynamical system with the variables X, Y, and Z. At any time t, the system is represented in the state space by the three-dimensional point (X_{t}, Y_{t}, Z_{t}). For a sequence of observations of length l, the trajectory of this point for t = {1, 2, 3, …, l} in the state space constructs the manifold of the system. The time-delay embedding theorem states that one can reconstruct the manifold using lagged time series of a single variable. For example, the trajectory of the point formed by X_{t} and its time-lagged values is called a shadow manifold, and it preserves certain topological properties of the original manifold. Similarly, shadow manifolds can be constructed from time series of variables Y and Z. In Figure 2, the three-dimensional manifold and shadow manifolds constructed from variables X and Y of the Lorenz system are shown.
Based on Takens' theorem, Sugihara et al. (2012) developed CCM as a test for causality. To evaluate the hypothesis Y ⟹ X, shadow manifolds of variables X and Y are constructed from the trajectories of their respective lagged-coordinate points, each consisting of E time-lagged values of the variable. Here, E is the dimension of the manifold, also known as the embedding dimension, and l̀ is the length of time series used to create the shadow manifolds, which is a fraction of the total length of observations l. Next, the shadow manifold of variable X is used to identify the nearest neighbors of each point on it and their Euclidean distances from that point. Finally, the nearest neighbors and their distances are used to identify contemporaneous points on the shadow manifold of Y; consequently, values of Y_{t} are estimated. These estimated values are compared with the observed ones using a metric, often the correlation coefficient. If the value of the correlation coefficient is significant, the hypothesis Y ⟹ X is accepted.
Note that in testing the hypothesis Y ⟹ X, the effect X is used to predict the cause Y. This is contrary to the assumption in GC and TE, in which the opposite is true; namely, the cause provides useful information for the prediction of the effect. Sugihara et al. (2012) discuss and explain this apparent contradiction. The central idea is that the cause leaves an information signature in the time series of the effect; thus, the cause can be estimated using its signature. CCM requires the identification of three hyperparameters; these are E, l̀, and the significance level α for testing the correlation coefficient. For more information on assigning values to these parameters and the implementation of CCM, see Text S4.
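The cross-mapping step can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of Text S4: it uses a unit time lag, E + 1 nearest neighbors with exponential distance weights (a common choice), and a single library length, so the convergence check over increasing library lengths is omitted. The coupled logistic maps and the function names are hypothetical test material in which X drives Y but not the reverse.

```python
import numpy as np

def shadow_manifold(series, E, tau=1):
    """Rows are (s_t, s_{t-tau}, ..., s_{t-(E-1)tau}) for every valid t."""
    start = (E - 1) * tau
    return np.column_stack(
        [series[start - k * tau : len(series) - k * tau] for k in range(E)]
    )

def cross_map_skill(effect, cause, E=2, tau=1):
    """Correlation between `cause` and its estimate from the shadow
    manifold of `effect` (the effect is used to predict the cause)."""
    manifold = shadow_manifold(effect, E, tau)
    truth = cause[(E - 1) * tau :]
    estimates = np.empty(len(manifold))
    for i, point in enumerate(manifold):
        dist = np.linalg.norm(manifold - point, axis=1)
        dist[i] = np.inf  # exclude the point itself
        nearest = np.argsort(dist)[: E + 1]
        weights = np.exp(-dist[nearest] / max(dist[nearest[0]], 1e-12))
        estimates[i] = np.sum(weights * truth[nearest]) / np.sum(weights)
    return float(np.corrcoef(estimates, truth)[0, 1])

# Coupled logistic maps (assumed for illustration): X forces Y only.
n = 1000
x = np.empty(n)
y = np.empty(n)
x[0], y[0] = 0.4, 0.2
for t in range(n - 1):
    x[t + 1] = x[t] * (3.7 - 3.7 * x[t])
    y[t + 1] = y[t] * (3.7 - 3.7 * y[t] - 0.32 * x[t])

skill_x_causes_y = cross_map_skill(effect=y, cause=x)  # X estimated from M_Y
skill_y_causes_x = cross_map_skill(effect=x, cause=y)  # Y estimated from M_X
```

Because X leaves its signature in Y, the skill of estimating X from the shadow manifold of Y is expected to exceed the skill in the opposite direction.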
3 Hydrological Model Setup
In order to test the performance of the four causal methods in identification of causal relationships from time series, we resort to the use of a hydrological model. The purpose of using synthetic data generated from the model is that, unlike natural systems, the underlying causal relationships are well known. The model we use here is a simple hydrological bucket model (Figure 3a) with four variables: rainfall R, soil storage S, interflow I, and runoff Q. This model simulates the process by which rainfall is transformed into runoff in a simplified manner. Essentially, the soil is represented as a bucket and the maximum amount of water it can store is defined by maximum storage (S_{max}). An outgoing flux of water from the bucket represents interflow I, the lateral movement of water within the unsaturated zone. Whenever the bucket is full, rainfall will spill from the bucket and transform to runoff Q. This bucket model is the building block of most conceptual hydrological models such as the SIXPAR model (Gupta & Sorooshian, 1983), HyMOD model (Boyle, 2000; Moore, 1985), and Sacramento Soil Moisture Accounting model (SAC-SMA) previously used as the National Weather Service operational model (Burnash, 1995). Each of these models enhances the representation of processes by including various components such as adding multiple soil layers in the form of connected buckets. This model also represents the cornerstone of the Manabe bucket model (Manabe, 1969), previously used as a basic model to represent land-atmosphere coupling in Global Climate Models (GCMs), which adds evapotranspiration as a function of soil moisture. Currently, state-of-the-art models in representation of the rainfall-runoff process at the catchment scale are mostly distributed models. Unlike conceptual models which treat the catchment as a single unit, these models attempt to represent the heterogeneities within a catchment by explicitly representing water movement at small spatial grids. 
They are particularly advantageous when spatially dense data are available. Clearly, the aim of the present paper is not to select the model that most accurately simulates the rainfall-runoff process, an issue that is catchment-dependent based on available data and the intricacies of hydrologic processes. Rather, the goal is to select a model that fairly represents the interactions in a rainfall-runoff process while keeping the number of variables as small as possible, so that interpretation of the causal network remains tractable. The hydrologic bucket model (Figure 3a) achieves this goal; therefore, it was selected for evaluating the causal discovery methods.
In addition to the four variables R, S, I, and Q, the model has four parameters: the maximum soil storage S_{max}, and three parameters K_{s}, δ, and ξ that define the nonlinear storage-discharge relationship. The storage-discharge relationship employed here (equation 9) is nonlinear and of concave type (i.e., Q as a function of S is a concave power law model). Botter et al. (2009) provide a detailed description of such relationships, which have previously been employed in the literature (e.g., Brutsaert & Nieber, 1977). In this relationship, δ is the soil moisture amount below which no interflow occurs, ξ is an exponent between 0.5 and 1 (Botter et al., 2009), and K_{s} characterizes how fast the bucket is depleted of water.
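The following minimal sketch shows how such a bucket could be stepped forward in time. It is an assumption-laden illustration and not equation 9: the power-law interflow I = K_{s}(S − δ)^{ξ} for S > δ, the order of the updates within a time step, the absence of process noise, and the function name `bucket_model` are all hypothetical. Default parameter values follow Table 2.

```python
import random

def bucket_model(rain, s_max=80.0, k_s=2.3, delta=10.0, xi=0.6, s0=40.0):
    """Step a single-bucket model through a rainfall series.

    Returns lists of storage S, interflow I, and runoff Q per time step.
    """
    storage, interflow, runoff = [], [], []
    s = s0
    for r in rain:
        # Concave power-law storage-discharge; no interflow below delta.
        i = k_s * max(s - delta, 0.0) ** xi
        i = min(i, s)  # cannot release more water than is stored
        s = s + r - i
        q = max(s - s_max, 0.0)  # spill to runoff when the bucket is full
        s -= q
        storage.append(s)
        interflow.append(i)
        runoff.append(q)
    return storage, interflow, runoff

# Intermittent synthetic rainfall (assumed for illustration).
rng = random.Random(1)
rain = [rng.expovariate(1 / 30.0) if rng.random() < 0.3 else 0.0
        for _ in range(3000)]
storage, interflow, runoff = bucket_model(rain)

# Mass balance: change in storage equals rainfall minus interflow and runoff.
balance_error = (storage[-1] - 40.0) - (sum(rain) - sum(interflow) - sum(runoff))
```

Even in this simplified form, the threshold behavior (no interflow below δ, runoff only when the bucket is full) produces many zero-valued fluxes.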
In causal theory, a very common way to represent causal interactions in a given system is to use directed graphs. Each node in the graph represents a variable or a subprocess in the system, while directed edges represent causal links with the arrow pointing towards the effect. Figure 3b shows the directed graph representation of the bucket model in equation 9. Note that each variable is a child of the arguments of its function. For example, S is a function of R and I; therefore, the node S has the two parents R and I. Since R is a forcing variable in the model, it has no parents in the causal graph of the model. In the remaining sections of this paper, we shall use graphical representations and kinship metaphors to discuss causal relations. Figure 3c shows the marginal probability distributions of the variables simulated by the model using process noise (dB [SNR] = 40 [10^{4}]). Clearly, all probability distributions have high density at zero. This is a key feature of many hydrological systems, in which a considerable number of flux observations are equal to zero.
4 Results
To evaluate and compare the efficiency of the four causal discovery methods (GC, TE, PC, and CCM) in retrieving causal structures of hydrological systems, we use synthetic data generated from the bucket model described in section 3. The analysis is separated into three parts: (1) the asymptotic performance of each causal method is assessed using a large sample length (l = 3,000); (2) sensitivity of the performance to sample size is investigated; and (3) sensitivity of the performance to presence of noise, both process and observational noise, is assessed. It is noteworthy that the four methods are applied not only to the time series of the variables R, S, I, and Q but also to their lagged time series shifted by time lags of 1, 2, …, τ_{max}, where τ_{max} is the maximum time lag; see Texts S1 to S4 for more information. This implementation allows the methods to theoretically handle the lagged feedback interactions (e.g., S ⇄ I).
Where n_{detection} refers to the number of times a causal link was either correctly or mistakenly identified by the algorithm.
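As a minimal sketch of how detections can be summarized across repeated simulations, the code below computes, for each candidate link, the fraction of simulations in which it was identified, and keeps the links identified in more than 50% of simulations (the rule applied when summarizing the results in Figure 4). Defining the rate as n_{detection} divided by the number of simulations is an assumption, and the function names and the example detections are hypothetical.

```python
def detection_rates(detected_links_per_run, candidate_links):
    """Fraction of simulations in which each candidate link was identified."""
    n_runs = len(detected_links_per_run)
    return {
        link: sum(link in run for run in detected_links_per_run) / n_runs
        for link in candidate_links
    }

def summary_graph(rates, threshold=0.5):
    """Links identified in more than `threshold` of the simulations."""
    return {link for link, rate in rates.items() if rate > threshold}

# Hypothetical detections from four simulations; links are (cause, effect).
runs = [
    {("R", "S"), ("S", "I")},
    {("R", "S")},
    {("R", "S"), ("R", "I")},
    {("R", "S"), ("S", "I")},
]
candidates = [("R", "S"), ("S", "I"), ("R", "I")]

rates = detection_rates(runs, candidates)
retained = summary_graph(rates)
```

The same rate applies to true and false links alike; whether a retained link counts as a correct or a false detection depends on comparison with the true causal graph.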
4.1 Performance in Large Samples
In this subsection, we investigate the performance of causal discovery methods in large samples. This allows us to identify the limitations of each method as the sample size becomes large. The analysis is performed over 100 simulations (see Figure S2), each with a sample length of l = 3,000, which is relatively large for hydrometeorological applications. The parameters and specifications of the hydrologic model are shown in Table 2. It should be noted that a small process noise is added to the model (see Table 2) to satisfy the causal faithfulness assumption, a prerequisite for the GC, TE, and PC methods.
Parameter | Value |
---|---|
Maximum soil storage (S_{max}) [L] | 80 |
Storage-discharge parameter 1 (K_{s}) [1/T] | 2.3 |
Storage-discharge parameter 2 (δ) [L] | 10 |
Storage-discharge parameter 3 (ξ) | 0.6 |
Process noise (dB [SNR]) | 40 [10^{4}] |
Initial soil storage (S_{0}) [L] | 40 |
Figure 4 shows the retrieved causal structure using each of the causal discovery algorithms (GC, TE, PC, and CCM). The results in Figure 4 summarize the mean behavior of the algorithms across 100 simulations; that is, a causal link is drawn only if it has been identified by the algorithm in more than 50% of the simulations. Comparing the causal structure retrieved by GC (Figure 4a) with the true causal structure of the model (Figure 3b) shows that all the true causal links were correctly identified by GC (blue links in Figure 4a). However, GC mistakenly identifies three causal links (R ⟹ I, Q ⟹ I, and I ⟹ Q; red links in Figure 4a). These false detections are attributed to the inability of the GC method to control for mediation and confounding in nonlinear relationships. For the link R ⟹ I, the relationship between R and I is mediated by S, that is, R ⟹ S ⟹ I. As for the relationship between Q and I, the two variables share a common confounder, S, that is, Q ⟸ S ⟹ I. The three links would not be mistakenly identified if the GC method conditioned properly on variable S. Although GC does condition on S, the nonlinearity of these relationships violates the linearity assumption of the GC method, resulting in false detections.
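The majority rule used to summarize the simulations can be sketched as follows. The function name `majority_graph` and the toy link sets are hypothetical; only the "more than 50% of the simulations" criterion comes from the text.

```python
from collections import Counter


def majority_graph(detected_per_sim, threshold=0.5):
    """Keep links detected in more than `threshold` of the simulations.

    `detected_per_sim` is a list with one set of (cause, effect) pairs
    per simulation.
    """
    n_sim = len(detected_per_sim)
    if n_sim == 0:
        raise ValueError("at least one simulation is required")
    counts = Counter(link for links in detected_per_sim for link in set(links))
    return {link for link, c in counts.items() if c / n_sim > threshold}


# Four made-up simulations, for illustration only.
sims = [
    {("R", "S"), ("R", "I")},
    {("R", "S")},
    {("R", "S"), ("S", "Q")},
    {("S", "Q")},
]
print(majority_graph(sims))
```

With a strict inequality, a link found in exactly half of the simulations is not drawn.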
Comparing the results of GC with the causal structures retrieved from TE (Figure 4b) and PC (Figure 4c), one can see that both TE and PC rule out the links mistakenly identified by GC. This is because both TE and PC are nonparametric methods; thus, they are able to detect nonlinear relationships. However, the two algorithms fail to detect the causal link I ⟹ S; this causal link is in fact a feedback link such that variable S causes variable I, which in turn feeds back and impacts variable S. The reason behind the underdetection of this link is that both algorithms accept the null hypothesis of the independence test; that is, variable S is independent of variable I conditioned on variable R. Specifically, variable I negatively impacts variable S while variable R has a positive impact on variable S, and both effects negate each other to maintain the mass balance. This type of relationship is a typical example of a violation of the causal faithfulness assumption, which states that “there are no precisely counterbalanced causal relationships in the system that would result in a probabilistic independence between two variables that are actually causally connected” (Andersen, 2013). As can be seen in Figure 6, the DR of this feedback link using the TE and PC algorithms is not sensitive to changes in sample size; this lends credence to the aforementioned assertion since the null hypothesis of independence is accepted regardless of the sample size.
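How counterbalanced effects can hide a real causal link is easiest to see in a toy example. The sketch below is a generic linear system, not the bucket model: X drives Y directly with weight +1 and through Z with weight -1, so the two paths cancel and X and Y are uncorrelated even though X is a cause of Y. Unlike the conditional test discussed above, the cancellation here shows up in a marginal test. All names and coefficients are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate(n, seed=0):
    """X -> Z (+1), Z -> Y (-1), X -> Y (+1): the two paths cancel exactly."""
    rng = random.Random(seed)
    x, z, y = [], [], []
    for _ in range(n):
        xi = rng.gauss(0.0, 1.0)
        zi = xi + rng.gauss(0.0, 1.0)
        yi = xi - zi + rng.gauss(0.0, 1.0)
        x.append(xi)
        z.append(zi)
        y.append(yi)
    return x, z, y


x, z, y = simulate(20000)
print(f"corr(X, Y) = {pearson(x, y):+.3f}")  # close to zero despite X -> Y
print(f"corr(Z, Y) = {pearson(z, y):+.3f}")  # clearly negative
```

Because the cancellation is exact, a larger sample does not reveal the link, which matches the insensitivity of the DR to sample size noted above.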
For the TE method, in addition to its inability to detect the causal link I ⟹ S, the algorithm also fails to detect the link S ⟹ Q. In evaluating this causal link, the TE method tests whether the variables S and Q are independent given the history of variables I, R, and Q. Because the number of conditioning variables (the histories of I, R, and Q) is relatively large, and the relationship S ⟹ Q is a weak causal relationship, the DR of this link is low. Therefore, the link can only be detected for large sample sizes; this can be seen in Figure 6, where the DR of the link S ⟹ Q using TE increases with sample size. However, even a sample length of 3,000 is insufficient for the TE method to detect the causal link S ⟹ Q.
Like GC, CCM mistakenly identifies the three links R ⟹ I, Q ⟹ I, and I ⟹ Q (see Figure 4d). This is because CCM does not control for confounding and mediation. Additionally, CCM mistakenly identifies the relationships in the pairs (S, Q) and (R, S) as bidirectional rather than unidirectional. This points to a limitation of CCM: when two variables are strongly coupled (synchronized), CCM identifies unidirectional causality as bidirectional. Sugihara et al. (2012) reported that in the case of extremely strong forcing, CCM will result in bidirectional causality between variables.
4.2 Impact of Sample Length
Here, we analyze the sensitivity of each algorithm to changes in sample size; we perform the analysis over sample sizes of 100, 300, 500, 1,000, 2,000, and 3,000. Each analysis is performed with a number of simulations (n_{sim} = 100; see Figure S2). Figures 5a and 5b show the TPR and FPR, respectively, for each of the four algorithms. As can be seen in Figure 5a, TPR consistently increases with increasing sample length. At the limit of large sample length (l = 3,000), both CCM and GC approach a TPR equal to 1 (i.e., all causal links are correctly detected by the algorithms). On the contrary, PC and TE approach a TPR < 0.8; this is because the feedback link I ⟹ S is not detected, as illustrated in the previous section. The most important finding to note in Figure 5a is the insensitivity of the TPR of CCM to sample length. It shows that a sample size as small as 100 is sufficient for CCM to identify all causal links in the model. On the other hand, GC, TE, and PC show sensitivity to sample size; this is not surprising since the three methods are based on a probabilistic framework, and the statistical estimation improves as the sample size increases.
As for the false positives in Figure 5b, the results might seem counterintuitive as one would expect the FPR to consistently decrease with sample size. However, this is not necessarily the case when the false detection is not related to sample size. For example, both CCM and GC show increasing FPR with the increase in sample size. This is because the mistakenly identified links result from the limitations of the algorithms in controlling for confounding and mediation. These types of false detection increase as the sample size increases. Similarly, the false detection of causal links due to strong forcing (synchronization) in the CCM increases with increase in sample size as can be seen in Figure 6 for DR of the causal link S ⟹ R.
4.3 Presence of Noise
To assess the sensitivity of performance to the presence of noise, we first analyze the sensitivity of performance to process noise. Unlike observational noise, which is associated with errors in measurements, process noise means that there is a stochastic component in the underlying system. In the hydrologic model (equation 9), if the variance of η_{I}, η_{s}, and η_{Q} is zero, the model is completely deterministic, and there is no process noise. As the variance takes values larger than zero, the model incorporates a stochastic component. In hydrometeorological systems, process noise can arise even in well-defined deterministic systems because of heterogeneity. For example, the rainfall-runoff process in catchments is not entirely deterministic, as it has some stochastic component due to heterogeneity associated with land properties. Figures 7a and 7b show the TPR and FPR for dB[SNR] of 3[2], 4.8[3], 6[4], 7[5], 10[10], 13[20], and 40[10^{4}]; each is averaged across 40 simulations (see Figure S2) with a sample length of 1,000. The model specifications of parameters and initial states are as shown in Table 2. As can be seen in Figure 7a, process noise has a minimal impact on the performance of the three methods (GC, TE, and PC), with a slight increase in TPR as the level of noise is increased. This is because process noise is part of the dynamics, and variables remain stochastically coupled as the noise level increases. As a result, GC, TE, and PC, which assume the underlying system to be of stochastic nature (see Table 1), are able to maintain their performance as the noise increases. On the contrary, the TPR of CCM decreases significantly for noise levels below 4.8 dB. The discernible decrease in the performance of CCM in the presence of process noise is expected since the method is based on an assumption of deterministic systems (see Table 1). However, the results also suggest that CCM can tolerate process noise down to 4.8 dB.
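The paired dB[SNR] values quoted above are consistent with the standard power-ratio conversion dB = 10 log10(SNR). A short sketch of that arithmetic follows (the function names are illustrative):

```python
import math


def snr_to_db(snr):
    """Convert a power signal-to-noise ratio to decibels."""
    return 10.0 * math.log10(snr)


def db_to_snr(db):
    """Convert decibels back to a power signal-to-noise ratio."""
    return 10.0 ** (db / 10.0)


for snr in (2, 3, 4, 5, 10, 20, 10**4):
    print(f"SNR = {snr:>5} -> {snr_to_db(snr):.1f} dB")
```

Lower dB values therefore correspond to noisier series: at 3 dB the noise power is half the signal power, whereas at 40 dB it is only one ten-thousandth of it.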
Figure 7b shows that the FPR for all the methods decreases as the noise level increases (i.e., lower dB). While this can be justified for the methods of GC, TE, and PC due to their probabilistic framework, the results appear to be counterintuitive regarding CCM. However, the reason behind this is that as the process noise increases (lower dB), the variables in the system no longer contain dynamic information about each other. Consequently, the cross-mapping ability of CCM diminishes, leading to a decrease in both TPR and FPR.
The second part of the analysis examines the sensitivity of performance to the presence of observational noise. Time series are first simulated from the model (equation 9) with a small process noise (SNR = 10^{4}; dB = 40). Observational noise is then added to the simulated time series with noise levels in dB[SNR] of 3[2], 4.8[3], 6[4], 7[5], 10[10], 13[20], and 40[10^{4}]. This type of noise represents measurement error, where the noise is not a result of the underlying system but is associated with the devices measuring the data. Figures 7c and 7d show the TPR and FPR for different levels of noise, each averaged across 40 simulations with a sample length of 1,000. The model specifications of parameters and initial states are as shown in Table 2. Unlike in the case of process noise, where the three methods of GC, TE, and PC are insensitive to changes in noise level, the results here show that these methods, in addition to CCM, are all sensitive to the presence of observational noise. Specifically, the performance of GC, TE, and PC deteriorates, as evidenced by a decrease in TPR (Figure 7c) and an increase in FPR (Figure 7d). As for CCM, both TPR and FPR decrease consistently with the increase in observational noise. This is because as the noise increases, causally related variables in the system no longer carry an information signature of each other; thus, the efficiency of cross mapping degrades.
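Adding observational noise after simulation can be sketched as below. This is a minimal illustration under two assumptions that are not stated in the text: the noise is additive white Gaussian, and SNR is the ratio of signal variance to noise variance. The function name and the sine-wave test signal are hypothetical.

```python
import math
import random
import statistics


def add_observational_noise(series, db, seed=None):
    """Return `series` plus white Gaussian noise at the requested dB level.

    Assumes SNR = var(signal) / var(noise) and dB = 10 * log10(SNR).
    """
    rng = random.Random(seed)
    snr = 10.0 ** (db / 10.0)
    noise_sd = math.sqrt(statistics.pvariance(series) / snr)
    return [v + rng.gauss(0.0, noise_sd) for v in series]


# Stand-in for a simulated series; not output of the bucket model.
clean = [math.sin(0.05 * t) for t in range(5000)]
noisy = add_observational_noise(clean, db=10, seed=1)
noise = [b - a for a, b in zip(clean, noisy)]
print(statistics.pvariance(clean) / statistics.pvariance(noise))  # roughly 10
```

Because the noise is added to the output only, it never feeds back into the dynamics, which is what distinguishes it from process noise.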
5 Causal Analysis of Environmental Drivers of Evapotranspiration
Evapotranspiration ET plays a central role in the Earth's water and energy cycles, and it is the primary process in the biosphere-atmosphere coupling. Several factors can potentially regulate the evapotranspiration rate; these include net radiation R_{n}, vapor pressure deficit VPD, soil water content SWC, air temperature T_{a}, soil temperature T_{s}, and wind speed WS. R_{n}, VPD, and SWC are considered direct drivers of ET, while the other variables (T_{a}, T_{s}, and WS) affect ET primarily through their regulation of canopy stomatal conductance. Several models with a wide range of complexity exist to understand and simulate the evapotranspiration process; however, modeling large-scale evapotranspiration remains a major source of uncertainty (Mackay et al., 2007; Sivapalan et al., 2003). Therefore, analysis of observational data is important to assess the significance of environmental controls on evapotranspiration and their seasonal and regional variations. Observational data sets were used by Vrugt et al. (2002) along with artificial neural networks to identify controlling factors of transpiration in a forested region, while Mackay et al. (2007) used observational data to quantify the differential impact of net radiation and vapor pressure deficit in regulating evapotranspiration in upland and wetland regions. In this section, we use the PC algorithm to identify the forcing environmental variables that control the evapotranspiration rate and their relative contributions during summer and winter seasons in a shrubland region.
The observational data set used in our analysis is obtained from the Santa Rita Mesquite (US-SRM) FluxNet site. This site is located in southeastern Arizona (31.82°N, 110.87°W) at an elevation of 1,118 m above sea level. The Koeppen climate classification of the site is Arid Steppe cold (BSk). The land cover is broadleaf shrubland, consisting primarily of mesquite (Prosopis velutina) trees. Mean annual precipitation and temperature are 333 mm and 19°C, respectively. Hourly time series were obtained by accumulating and averaging the native 30-min observations for the following variables: ET, R_{n}, VPD, SWC, T_{a}, T_{s}, and WS. Table 3 shows the mean and standard deviation of each variable for the summer and winter seasons. Anomalies of hourly observations were calculated by subtracting the seasonal mean to remove the effects of the diurnal cycle. Time series were then tested for stationarity (monotonic trend) using the Mann-Kendall test; Table 3 shows the p-value for each time series. When the null hypothesis of no trend was rejected at a significance level of 0.05, we removed a linear trend from the time series. The sample lengths for the summer season (JJA) and the winter season (DJF) are 24,288 and 23,832 observations, respectively. They represent hourly observations at the US-SRM FluxNet site during the period 2004–2014. We used the PC algorithm to infer the environmental drivers controlling evapotranspiration because, according to the results in section 4, when the sample length is sufficient it keeps the FPR low (at a level nearly identical to that of TE) while achieving a higher TPR than TE.
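The trend-screening step described above can be sketched as follows: compute anomalies, apply the Mann-Kendall test, and remove a linear trend only when the null hypothesis of no trend is rejected. This is a minimal pure-Python illustration using the standard Mann-Kendall statistic without tie or autocorrelation corrections. The function names and the toy series are hypothetical, and the O(n²) loop is meant for short series only.

```python
import math


def mann_kendall_p(x):
    """Two-sided p-value of the Mann-Kendall trend test (no tie correction)."""
    n = len(x)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            s += (x[j] > x[i]) - (x[j] < x[i])
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    return math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))


def remove_linear_trend(x):
    """Subtract the least-squares linear trend while keeping the mean."""
    n = len(x)
    t_mean = (n - 1) / 2.0
    x_mean = sum(x) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    slope = sum((t - t_mean) * (v - x_mean) for t, v in enumerate(x)) / sxx
    return [v - slope * (t - t_mean) for t, v in enumerate(x)]


def preprocess(x, alpha=0.05):
    """Subtract the mean, then detrend only if the trend test rejects."""
    mean = sum(x) / len(x)
    anomalies = [v - mean for v in x]
    if mann_kendall_p(anomalies) < alpha:
        anomalies = remove_linear_trend(anomalies)
    return anomalies


# Toy series with a deliberate upward trend, for illustration only.
trended = [0.01 * t + 0.1 * ((t * 7) % 5 - 2) for t in range(200)]
print(mann_kendall_p(trended) < 0.05)
```

The quoted sample lengths are consistent with complete hourly coverage over 11 years: 92 days × 24 h × 11 = 24,288 for JJA, and 90 days × 24 h × 11 plus three leap days (3 × 24 h) = 23,832 for DJF.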
Variable | Summer (JJA): mean | Summer (JJA): standard deviation | Summer (JJA): p-value (Mann-Kendall) | Winter (DJF): mean | Winter (DJF): standard deviation | Winter (DJF): p-value (Mann-Kendall) |
---|---|---|---|---|---|---|
ET (mm) | 0.07 | 0.09 | 0 | 0.02 | 0.03 | 0 |
T_{a} (°C) | 27 | 4.7 | 0.01 | 10.5 | 5.6 | 0.01 |
WS (m/s) | 2.29 | 1.22 | 0.85 | 2.51 | 1.52 | 0 |
T_{s} (°C) | 32.2 | 6.2 | 0.07 | 12.54 | 4.7 | 0.31 |
VPD (hPa) | 24.3 | 13.5 | 0.04 | 8.82 | 5.6 | 0.23 |
SWC (%) | 4.56 | 2.77 | 0 | 5.82 | 2.15 | 0.07 |
R_{n} (W/m^{2}) | 283.11 | 391.24 | 0.09 | 100 | 245.61 | 0.36 |
- Note. The p-values of the Mann-Kendall trend test are also shown; the null hypothesis of no trend is rejected if the p-value is smaller than the significance level (0.05).
Figures 8a and 8b show the two causal networks obtained from the PC algorithm using hourly observations during summer and winter seasons. During the summer season (JJA; Figure 8a), the evapotranspiration rate is regulated by, in order of importance, net radiation and soil water content (equally important), vapor pressure deficit, and soil temperature. The two remaining variables, wind speed and air temperature, are not causally related to evapotranspiration at a statistical significance level of 0.05. On the other hand, evapotranspiration during the winter season (DJF; Figure 8b) is controlled by, in order of importance, net radiation and wind speed (equally important), soil water content, and vapor pressure deficit. In order to understand the physics underlying these results, it is important to examine the dynamics of the vegetation cover at the site. Figure 9a shows the monthly variability in Gross Primary Production (GPP) averaged over the period 2004–2014. Clearly, GPP peaks during the summer because the mesquite trees, which dominate the land cover at the site, bloom and grow during the summer. On the contrary, GPP is very low during the winter. This means that during the winter season, bare soil evaporation is the predominant portion of evapotranspiration due to the limited vegetation cover. However, in the summer, transpiration represents a large portion of evapotranspiration. This suggests an interpretation of the results: soil temperature is a causal factor only during the summer season because of its role in regulating water uptake in plants, whereas it has no discernible impact on bare soil evaporation during the winter. The effect of soil temperature on water uptake and stomatal opening in plants was previously reported by Kramer (1940) and Feldhake and Boyer (1986), among others.
Figure 8b shows that wind speed plays a major role in controlling evapotranspiration in the winter season. Given that the primary effect of wind speed is to clear the air of humidity produced by evapotranspiration, it is plausible that wind speed is not a significant causal factor during the summer because the advected air is humid. Advection of moisture during the summer at low levels of the atmosphere (pressure levels greater than 800 mb) toward the southwestern U.S. is a key feature of the North American Monsoon, and it has been reported in several studies (e.g., Adams & Comrie, 1997). Furthermore, Figures 9b and 9c show the diurnal cycle of wind speed during summer and winter seasons, respectively. Clearly, in the summer, maximum wind occurs in the late afternoon (5 pm), lagging the peak of evapotranspiration (noon) by several hours. However, during the winter season (Figure 9c), the lag time is shorter; thus, wind speed and evapotranspiration are nearly in phase. Consequently, wind speed can clear the air of humidity and regulate the evapotranspiration rate. It should also be noted that wind speed during the winter has larger variability (standard deviation = 1.52 m/s) compared to the summer (standard deviation = 1.22 m/s; see Table 3).
6 Discussion and Concluding Remarks
6.1 Discussion
- Sample length: CCM is the least sensitive method to changes in sample length. The results demonstrate that a sample size as small as 100 is sufficient for CCM to identify all causal relationships in the model. On the contrary, the performance of GC, TE, and PC improves as the sample length increases. This is attributed to the fact that they are based on a probabilistic framework; thus, statistical estimation improves as the sample size increases.
- Nonlinearity: Among the four methods used in this study, GC is the only method that assumes a linear VAR model for the underlying system. Despite this assumption, the results show that GC was able to detect nonlinear interactions; for example, the causal links S ⟹ I and S ⟹ Q (see Figure 4a). This may support the argument that many nonlinear processes can be modeled as VARs (Barnett & Seth, 2014). However, the results also show that, although the linearity assumption does not hinder the detection of true causal links, it leads to an increase in false positives.
- Stochastic vs deterministic systems: While CCM assumes the underlying system to be deterministic, the three methods of GC, TE, and PC are based on the assumption of stochastic systems. In this study, the system is inherently deterministic, and the evolution of its variables is described through dynamical equations. However, we examined the performance of the algorithms by adding different levels of process noise, which adds a stochastic component to the system. The results demonstrate that CCM can tolerate process noise down to 4.8 dB (SNR = 3).
- Presence of counterbalanced relationships: Attention must be paid when the system under study is expected to maintain counterbalanced interactions to fulfill physical laws such as conservation of mass, momentum, and energy. These types of relationships are typical in hydrometeorological systems, and they represent a violation of the faithfulness assumption. Methods based on conditional independence, PC and TE, are unable to detect such relationships as evidenced by their inability to detect the causal link I ⟹ S (see Figures 4b and 4c).
- Presence of observational noise: As expected, the presence of observational noise degrades the performance of all causal discovery methods. Specifically, the impact is more significant in the case of CCM, in which the cross-mapping efficiency between variables diminishes as the observational noise increases.
- Confirmatory vs exploratory studies: The results indicate the existence of a tradeoff between TPR and FPR (see Figure 5). Therefore, if the purpose of a given study is exploratory, for example, searching for climatic teleconnections of a certain phenomenon, then one might consider using GC or CCM due to their high TPR compared to TE and PC. On the contrary, if a study is confirmatory, for example, selecting significant climatic teleconnections from a set of predefined teleconnections, then using TE and PC is more appropriate as they minimize the false detection.
The hydrological model used in this study resembles the type of relationships commonly found in hydrometeorological systems. Furthermore, the performance of causal discovery algorithms was examined for a range of process and observational noise levels that might typically be present in observations of hydrometeorological systems. However, caution must be exercised when applying the aforementioned guidelines to systems with characteristics that deviate substantially from those considered here. Moreover, due to the wide range of complexity of hydrometeorological systems, further evaluation studies are sorely needed to examine the performance of causal discovery methods in retrieving the causal structure of different hydrometeorological models.
A secondary aim of this study was to examine environmental drivers of evapotranspiration and their relative contributions during summer and winter seasons. The PC algorithm was applied as a causal discovery algorithm to observational time series of the following variables: ET, R_{n}, VPD, SWC, T_{a}, T_{s}, and WS. The results show that environmental drivers are dependent on season. While R_{n}, VPD, and SWC are key drivers in regulating evapotranspiration in both seasons, the results demonstrate that T_{s} is a significant driver only in the summer season, and WS controls evapotranspiration only in the winter season. The results obtained from causal analysis represent a hypothesis that can be either refuted or confirmed through further investigation. We provided an interpretation of the results based on the canopy seasonal dynamics and a basic understanding of the evapotranspiration process. In order to compare the results of causal analysis with the information that could have been obtained using correlation, Figures 9b and 9c show the correlation matrix for the variables ET, R_{n}, VPD, SWC, T_{a}, T_{s}, and WS during summer and winter seasons, respectively. First, all correlation values are statistically significant (p-value < 0.00001); thus, a threshold must be defined subjectively to identify variables significantly related to ET. If a threshold of 0.5 (a relatively low value for correlation) is selected, none of the variables will be considered a causal driver in either season. Only by lowering the threshold to 0.1 will correlation yield results similar to those of causal analysis during the summer season. However, using the same value for the winter season will result in all variables being identified as possible drivers of ET. This demonstrates the ambiguity of classical correlation analysis, which can lead to misleading results.
Hydrologic models commonly use a single relationship to estimate ET using a specific set of environmental drivers without prior information on which variables are dominant and significant in regulating ET. The results presented in this study highlight the importance of selecting ET models that are sensitive to the key drivers in each season. Similar causal analysis can be applied to investigate the differential impact of environmental drivers on evapotranspiration in sites across a range of climate conditions.
6.2 Limitations and Potential Applications in Hydrometeorology
Research in causal discovery techniques is growing rapidly, with applications in a wide range of fields such as economics, neuroscience, and epidemiology, to name a few. However, the application of causal discovery in hydrometeorological research is in its infancy, and considerable research must be conducted to mainstream its use. Several challenges need to be addressed in order to use causal discovery methods efficiently in hydrometeorology; these include, but are not limited to, (a) causal inference in systems with partial observability, where only a subset of system variables is observed; (b) causal inference in systems where the stationarity assumption is violated; and (c) discovering instantaneous causality when variables are causally related at the same sampling time step.
In this paper, we attempted to provide a baseline study for the effectiveness of causal discovery methods when used in hydrometeorological systems. In doing so, we made several assumptions including causal sufficiency (i.e., system variables are fully observed), stationarity of time series, and the absence of instantaneous causal interactions. However, in many hydrometeorological systems, these assumptions are seldom appropriate. Regarding the issue of partial observability, significant research has been conducted in causal discovery for systems with unmeasured latent variables, specifically in linear systems (Bollen, 2014). As for time series stationarity, it is important to test observations for stationarity prior to their use for causal discovery purposes. Several approaches exist to deal with nonstationary systems; for example, causal models can be retrieved separately for sliding time windows (Calhoun et al., 2014). Such an approach is feasible when the sample length is large; we used this approach to examine the environmental drivers of evapotranspiration using time windows each with 1,000 sample points (see Table S2). The results were similar to those presented in section 5, which were obtained using the total sample length with a linear trend removed from nonstationary variables. Zhang et al. (2017) developed a framework to infer causal relations in nonstationary systems by exploiting the distribution shifts induced by nonstationarity. This approach is very promising, and it can potentially render nonstationarity a blessing in disguise. Instantaneous causality in hydrometeorological systems is only a problem when the sampling time step is large compared to the temporal scale of causal interactions. For instance, if the infiltration process in a catchment occurs within a few hours of a rainfall event while both rainfall and soil moisture are measured on a daily timescale, then one must account for instantaneous causality. In such a case, resorting to time-delay embedding methods (e.g., CCM) might be the best option. In summary, the aforementioned issues represent key challenges that require further investigation; tackling these challenges will likely play a fundamental role in advancing the use of causal discovery in hydrology and climate research.
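The sliding-window strategy mentioned above for nonstationary series can be sketched as follows. The helper names are hypothetical, `discover` stands for any causal discovery routine supplied by the user, and the default of non-overlapping windows is an assumption, since only the window length of 1,000 points is stated in the text.

```python
def sliding_windows(n_samples, window=1000, step=None):
    """Yield (start, stop) index pairs of fixed-size windows.

    `step` defaults to `window`, i.e., non-overlapping windows (an assumption).
    A trailing remainder shorter than `window` is dropped.
    """
    if step is None:
        step = window
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    for start in range(0, n_samples - window + 1, step):
        yield start, start + window


def discover_per_window(series, discover, window=1000, step=None):
    """Apply the callable `discover` to each window of a dict of series."""
    n = len(next(iter(series.values())))
    results = []
    for start, stop in sliding_windows(n, window, step):
        chunk = {name: values[start:stop] for name, values in series.items()}
        results.append(discover(chunk))
    return results


# Placeholder "discovery" that only reports the window length.
toy = {"ET": list(range(2500)), "WS": list(range(2500))}
print(discover_per_window(toy, discover=lambda chunk: len(chunk["ET"])))
```

Comparing the graphs returned for successive windows gives a simple check on whether the inferred structure is stable over time.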
In an era characterized by an immense volume of environmental data and information provided by in situ and remotely sensed observations, field campaigns, and palaeoclimatological reconstructions, causal discovery methods have great potential to advance hydrometeorological research by providing appropriate tools for data analysis. We anticipate that in the foreseeable future, researchers in hydrology and climate will utilize the potential of causal discovery methods in addressing a wide range of problems. Recently, a hydrologic community initiative identified 23 unsolved problems in hydrology and explicitly acknowledged the importance of understanding causality: “Questions remain focused on process-based understanding of hydrological variability and causality at all space and time scales” (Blöschl et al., 2019). One straightforward application of causal discovery methods is system identification, in which the goal is to understand the causal structure of a system that is complex enough that its behavior cannot be described by equations derived from first principles. As we mentioned earlier, a handful of studies explored this topic using causal discovery methods (e.g., Goodwell & Kumar, 2017a; Ruddell & Kumar, 2009a; Sharma & Mehrotra, 2014). However, with hydrology becoming more of an integrated science incorporating ecological, social, and epidemiological aspects related to the water cycle, the problems are more complex than ever before, and much can be achieved using causal discovery methods. One very attractive application of causal discovery methods that is gaining more attention is exploratory and confirmatory analysis of climatic teleconnections (e.g., Hlinka et al., 2013; Runge et al., 2015). There also appears to be huge potential for using causal discovery methods in model diagnosis and in the efficient merging of models and observational data in hydrologic, atmospheric, and oceanic subdisciplines.
Author Contributions
M. Ombadi designed and implemented the study. M. Ombadi wrote the manuscript. P. Nguyen, S. Sorooshian and K. Hsu provided insightful feedback on the manuscript. The authors would like to thank the associate editor and four anonymous reviewers whose insightful feedback significantly enhanced the quality of the manuscript.
Acknowledgments
This research was partially supported by the Department of Energy (DoE prime award DE-IA0000018), California Energy Commission (CEC award 300-15-005), University of California (#4600010378 TO#15 Am 22), and NASA MIRO (NNX15AQ06A).
Open Research
Data Availability Statement
Synthetic data used in this study are available online at (https://github.com/mombadi/Ombadi-et-al.-2020-Evaluation-of-methods-for-causal-discovery-in-hydrometerorlogical-systems-/tree/master/Synthetic-Data-_-Hydrologic-Bucket-Model). Data from FluxNet site US-SRM used in this study is publicly available from the FluxNet data portal at https://fluxnet.fluxdata.org/.