Quantifying Climate and Catchment Control on Hydrological Drought in the Continental United States
Abstract
The evolution of hydrological drought events is a result of complex (nonlinear) interactions between climate and catchment processes. To investigate this nonlinear relationship, we integrated a machine learning modeling framework based on the random forest (RF) algorithm with an interpretation framework to quantify the role of climate and catchment controls on hydrological drought. More specifically, our framework interprets a fitted RF model to identify dominant variables and to visualize their functional dependence and interaction effects on hydrological drought characteristics using the concepts of minimal depth, interactive depth, and partial dependence. We test the proposed framework on a set of 652 continental United States catchments with minimal human interference for the period 1979–2010. Application of this framework indicated the presence of three distinct drought regimes: Regime 1, droughts of longer duration that are less frequent and less intense; Regime 2, droughts of moderate duration, frequency, and intensity; and Regime 3, droughts of shorter duration that are more frequent and more intense. The RF algorithm was able to accurately model the drought characteristics (intensity, duration, and number of events) for all three drought regimes as a function of the selected variables. The type of dominant variables, as well as their nonlinear functional relationships with hydrological drought characteristics, varies among the three regimes. Our interpretation framework indicated that catchment characteristics play a significant role in controlling hydrological drought in regime 1 catchments, whereas both climate and catchment characteristics control hydrological drought in regimes 2 and 3.
Key Points
- An integrated random forest algorithm interpretation framework was applied to investigate hydrological drought characteristics in CONUS
- This framework indicated the presence of three drought regimes, each with its own dominant climate and catchment controls
- The dominant climate and catchment controls exhibit varied functional relationships with hydrological droughts
1 Introduction
Hydrologic drought events, defined as periods with inadequate surface and subsurface water resources, are the result of multifaceted interactions between climate and catchment processes (Mishra & Singh, 2010; Van Lanen et al., 2013; Van Loon et al., 2014; Wang et al., 2011). Therefore, hydrologic drought depends not only on a decrease in precipitation or an increase in temperature but also on the interaction of various climate and terrestrial components (e.g., soil characteristics, elevation, and stream order). An inadequate understanding of this complexity is a major challenge for accurate prediction as well as efficient drought management (Cayan et al., 2010; Mishra & Singh, 2011; Narasimhan & Srinivasan, 2005; Sheffield et al., 2012). To address these complex hydrological drought processes, many studies have investigated the potential influence of terrestrial catchment characteristics on hydrological droughts using physically based models (Apurv et al., 2017; Tallaksen et al., 2009; Van Loon et al., 2014; Van Loon & Laaha, 2015; Van Loon & Van Lanen, 2012). However, the application of physically based models to catchments is often hampered by differences in spatial scale, over- or underparameterization, and model structural error, including model calibration uncertainties.
A few studies have utilized a linear regression-based framework (Saft et al., 2015, 2016; Van Loon et al., 2014; Van Loon & Laaha, 2015; Van Loon & Van Lanen, 2012) to understand the role of climate and terrestrial components in the development of hydrological drought. On the other hand, many studies suggest that the response of streamflow to meteorological conditions is predominantly nonlinear (Konapala & Mishra, 2016; Latt et al., 2015; Stahl et al., 2008). Therefore, we expect that hydrological drought characteristics derived from streamflow are likely to have a nonlinear dependence on climate and catchment variables because of the complex interaction between climate and catchment processes within a watershed. In addition, hydrological droughts in neighboring catchments often evolve in clusters because of similarities in climate and catchment characteristics (Rajsekhar et al., 2014; Zhang et al., 2012). Hence, it is important to gain a deeper understanding of the dominant linear and nonlinear controls that result in distinct drought regimes using robust nonparametric techniques. There is therefore great potential to quantify the nonlinear association between climate and catchment variables and the evolution of clustered drought events using nonparametric machine learning algorithms combined with an interpretation framework.
Machine learning algorithms are a class of nonparametric techniques that can capture subtle functional relationships between the input variables (e.g., precipitation, evaporation, and base flow) and the output variables (e.g., streamflow) of a hydrologic system (e.g., a watershed), even if the underlying mechanism producing the data is not known (Elshorbagy et al., 2010a, 2010b; Nourani et al., 2014; Raghavendra & Deka, 2014). In addition, these methods make no distributional or functional assumptions about the relation of the covariates to the response. Hence, the majority of studies in hydrology have used machine learning algorithms for prediction purposes (Shen, 2018). However, the formulation of machine learning algorithms does not make it straightforward to quantify the underlying mechanisms responsible for model behavior in the case of hydrologic processes (Gupta & Nearing, 2014; Karpatne et al., 2017). Recognizing these issues, several recent studies have introduced interpretation frameworks (see Guidotti et al., 2018, for a review) to address such limitations. Work on interpreting these black-box models has focused on understanding how a fixed machine learning model arrives at particular predictions. These interpretation frameworks can provide a deeper understanding of the functioning of machine learning models such as artificial neural networks, random forests (RF), and support vector machines (Bastani et al., 2017; Bibal & Frénay, 2016; Doshi-Velez & Kim, 2017). Although machine learning approaches are widely used in hydroclimatology (Veettil et al., 2018; Fahimi et al., 2017; Shortridge et al., 2016; Raghavendra & Deka, 2014), interpretation frameworks that quantify the causal relationship between inputs and modeled outputs are only emerging (Fienen et al., 2018; Koch et al., 2019; Schwalm et al., 2017) and, in particular, have not been applied to extreme events.
This study addresses the following research questions and objectives:
- How are the hydrological drought regimes clustered in the CONUS? What are the key climate and catchment characteristics that control hydrological drought regimes?
- To identify and extract the functional relationships and interactions among the dominant variables influencing hydrological drought characteristics using interpretive machine learning techniques (i.e., minimal depth, interactive depth, and partial dependence plots).
The remainder of the manuscript is organized as follows: section 2 provides an overview of the data and study area, section 3 presents the methods designed for this study, section 4 presents the results, section 5 discusses the findings and the outlook, and section 6 concludes the manuscript.
2 Data and Study Area Description
We selected catchments located in the CONUS because extensive open-source data are available for various catchment characteristics. In addition, to understand the dominant variables associated with different drought regimes, it is important to use data from catchments with minimal human interference. Therefore, we first identified such catchments based on the GAGES II database (Falcone, 2011), which provides geospatial data for 9,322 stream gages maintained by the USGS. This data set provides users with a comprehensive set of geospatial characteristics for many gaged catchments with long flow records. It also identifies the catchments that are least disturbed by human influences. In this database, 2,057 catchments are identified as having minimal human interference based on three criteria: (1) a quantitative index of anthropogenic modification within the catchment based on geographic information system-derived variables, (2) visual inspection of every stream gage and drainage basin from recent high-resolution imagery and topographic maps, and (3) information about human influences from USGS Annual Water Data Reports (Falcone, 2011). We selected water years 1980–2011 as our study period to represent the U.S. climate normal period and to reflect current climate conditions. Overall, we identified 652 catchments with no missing data during 1980–2011. The spatial locations of the catchments with minimal human interference and continuous streamflow data are shown in Figure 1.
2.1 Overview of Selected Climate and Catchment Variables
A lack of precipitation and an increase in evapotranspiration (i.e., meteorological drought) cause low soil moisture content (i.e., agricultural drought), which further reduces surface and subsurface water resources (i.e., hydrological drought) (Mishra & Singh, 2010; Mukherjee et al., 2018). The propagation of meteorological to hydrological drought is influenced by the interaction between climate and catchment variables (Apurv et al., 2017; Haslinger et al., 2014; Mishra & Singh, 2010; Tallaksen et al., 2009; Van Loon et al., 2014). Hydrological drought is directly related to the streamflow generated in a watershed, and it is influenced (controlled) by the climate and catchment characteristics of that watershed. In our analysis, we selected 60 variables related to climate, catchment, and morphological aspects of catchments documented by the GAGES II data set (Table S1 in the supporting information). Among them, 12 climate variables describe the annual magnitude and intraannual variability of precipitation, temperature, and potential evapotranspiration; these data are obtained from the high-resolution PRISM database (Daly et al., 2000). Fifteen hydrologic catchment variables related to stream order, base-flow index, and overland flow are derived from the U.S. National Hydrography Data Set (NHD). Four land cover variables describing the percentage of different land cover types are derived from the 2006 land cover product of the National Land Cover Database. Twenty-three soil characteristics are derived from the State Soil Geographic database for the CONUS. Finally, six topographic variables related to the elevation, slope, and geographical aspect of the catchments are included in the analysis. A brief description and the data sources of the selected variables are provided in Table S1. The interplay between these catchment characteristics is assumed to shape catchment behavior by influencing how catchments store and transfer water.
The variables selected and provided in this database are considered to significantly affect the hydrologic processes. Some of these catchment attributes have been previously used for predicting mean streamflow (Rice et al., 2015) and other streamflow signatures (Addor et al., 2018) and drought (Stoelzle et al., 2014). In addition to previous variables, we have selected multiple attributes to cover a wide range of features, such as the catchment climate, hydrology, land cover, soil, geology, topography, and river network.
3 Methodology
3.1 Hydrological Drought Characterization
Hydrological drought is often expressed as a time period with inadequate surface and subsurface water resources with respect to the normal condition of a given water resources management system (Mishra & Singh, 2010). Therefore, we applied the concept of the Standardized Streamflow Index (SSI) (Shukla & Wood, 2008; Vicente-Serrano et al., 2011) to characterize hydrological drought at the monthly time scale for the selected watersheds across the United States. SSI can be computed for multiple time scales and is flexible enough to determine drought conditions at seasonal (3 to 6 months), annual (12 months), and longer (>12-month SSI) time scales. However, in this study, we restrict our analysis to the seasonal scale because droughts usually take 3 or more months to develop. We therefore calculated the 3-month SSI by aggregating streamflow over 3 months and fitting these accumulated values to a parametric statistical distribution. The probabilities from the fitted distributions are then transformed to the standard normal distribution to create the hydrological drought index (Vicente-Serrano et al., 2011; Modarres, 2008; Shukla & Wood, 2008). SSI thus describes streamflow drought conditions relative to the long-term monthly streamflow. Positive SSI values indicate a surplus relative to the long-term streamflow conditions, whereas negative values indicate a deficit (i.e., hydrologic drought). Hydrological drought indices similar to SSI have been previously applied to understand U.S. hydrological drought characteristics (Shukla & Wood, 2008; Veettil et al., 2018). Shukla and Wood (2008) reported that the two-parameter gamma and lognormal distributions generally performed well for deriving hydrological drought in the United States. In this study, the lognormal distribution was selected for deriving the SSI from the streamflow time series. The formulation of SSI is presented in the supporting information Text S1.
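The aggregation, fitting, and transformation steps described above can be sketched as follows. This is a minimal illustration rather than the formulation in Text S1: the function name `ssi_lognormal`, its parameters, and the choice to fit a separate two-parameter lognormal for each calendar month are assumptions made here. For a two-parameter lognormal, passing the fitted probabilities through the inverse standard normal reduces to taking the Z score of the log-transformed flow.

```python
import math
from statistics import mean, stdev

def ssi_lognormal(monthly_flow, start_month=1, scale=3):
    """Return (calendar_month, SSI) for every complete `scale`-month window.

    monthly_flow: positive monthly streamflow values in time order.
    start_month:  calendar month (1-12) of the first value.
    """
    # 1. Aggregate streamflow over a moving window of `scale` months.
    sums, months = [], []
    for end in range(scale - 1, len(monthly_flow)):
        sums.append(sum(monthly_flow[end - scale + 1:end + 1]))
        months.append((start_month - 1 + end) % 12 + 1)

    # 2. Fit a two-parameter lognormal per calendar month (assumption):
    #    the fit is the mean and standard deviation of log(flow).
    params = {}
    for m in set(months):
        logs = [math.log(s) for s, mm in zip(sums, months) if mm == m]
        params[m] = (mean(logs), stdev(logs))

    # 3. Lognormal CDF followed by the inverse standard normal CDF
    #    is algebraically the Z score of log(flow).
    return [(m, (math.log(s) - params[m][0]) / params[m][1])
            for s, m in zip(sums, months)]
```

Negative values in the returned series mark streamflow deficits; each calendar month needs at least two windows with differing totals for the fit to be defined.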
3.2 Classification of Hydrological Drought Regimes
By applying multivariate clustering, we can distinguish the hydrological drought regimes exhibited by catchments based on the long-term (~30 year) statistics of drought characteristics (Rajsekhar et al., 2014; Gocic & Trajkovic, 2014; Yoo et al., 2012). The clustering approaches typically used in hydrological studies are hierarchical clustering, k-means/medoids clustering, and fuzzy partition clustering (Carrillo et al., 2011; Ley et al., 2011; Olden et al., 2012; Sawicz et al., 2011; Yadav et al., 2007). Because we aim to investigate the dominant controls of catchment characteristics within each drought regime, we applied a fuzzy partitioning algorithm that accounts for uncertainty in the classification process.
We identified drought regimes based on three drought characteristics (i.e., intensity, duration, and number of events) using a fuzzy medoid clustering algorithm. Fuzzy clustering is more general than hard clustering because it describes each point by its membership values in all the clusters. The method chosen for this study is the fuzzy k-medoids clustering algorithm introduced by Krishnapuram et al. (2001), which is usually more robust and is less affected by outliers than clustering algorithms that use mean values for classification. Data objects closer to the medoid of a cluster, as determined by Euclidean distance, are likely to have higher degrees of membership than objects scattered around the limits of the cluster. Similar to other clustering algorithms, fuzzy k-medoids follows a heuristic approach to minimize the within-cluster variance. The formulation of this approach is presented in the supporting information Text S2.
The smaller the Xie-Beni (XB) index, the more compact the clusters. Each catchment is assigned to every cluster with a certain probability, and the cluster with the highest probability is taken as the primary cluster of that catchment for subsequent analysis. The resulting clusters based on the trivariate drought characteristics (intensity, duration, and number of events) are a consequence of natural partitions identified by the clustering algorithm. The drought characteristics in each cluster indicate a distinct drought regime that can provide valuable information on the controls of climate and catchment characteristics on hydrological droughts.
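To make the membership and compactness ideas concrete, the sketch below computes fuzzy memberships for a fixed set of medoids and the corresponding Xie-Beni index. It is an illustration under stated assumptions, not the formulation of Text S2: the membership rule is the standard fuzzy-partition update with squared Euclidean distance and fuzzifier `m = 2`, the function names are hypothetical, and the search for the medoids themselves is omitted.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def memberships(points, medoids, m=2.0):
    """Fuzzy membership of every point in every cluster (each row sums to 1)."""
    U = []
    for p in points:
        d = [sq_dist(p, c) for c in medoids]
        if 0.0 in d:  # the point is itself a medoid: full membership there
            U.append([1.0 if x == 0.0 else 0.0 for x in d])
            continue
        inv = [x ** (-1.0 / (m - 1.0)) for x in d]
        total = sum(inv)
        U.append([v / total for v in inv])
    return U

def xie_beni(points, medoids, U, m=2.0):
    """Membership-weighted within-cluster spread divided by n times the
    smallest squared distance between two medoids (smaller is better)."""
    compact = sum(U[i][k] ** m * sq_dist(p, c)
                  for i, p in enumerate(points)
                  for k, c in enumerate(medoids))
    separation = min(sq_dist(a, b)
                     for i, a in enumerate(medoids)
                     for b in medoids[i + 1:])
    return compact / (len(points) * separation)

def primary_cluster(row):
    """Index of the cluster with the highest membership."""
    return max(range(len(row)), key=row.__getitem__)
```

In use, the index would be evaluated for several candidate numbers of clusters, and the partition with the smallest value retained.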
3.3 RF Model
In this study, we use the RF algorithm (Breiman, 2001) to investigate the dominant catchment and climate variables that play an important role in the evolution of clustered drought characteristics. It is important to acknowledge that the selection of an algorithm depends on the objectives and the types of data to be analyzed (Caruana & Niculescu-Mizil, 2006; Huang et al., 2015; Kotsiantis et al., 2007). The RF algorithm differs from linear regression methods: it is nonlinear, and a major advantage is that it is (mostly) unaffected by multicollinearity (Ishwaran et al., 2010; Zhang & Ma, 2012; Díaz-Uriarte & De Andres, 2006). The multicollinearity problem is alleviated because a random subset of features is chosen for each tree in an RF (Hsilch et al., 2014; Ishwaran et al., 2010; Zhang & Ma, 2012; Díaz-Uriarte & De Andres, 2006). The ability of the RF algorithm to deal with overfitting also makes it suitable for our application.
The RF algorithm draws a set of bootstrap samples (Efron & Tibshirani, 1994) and grows an independent tree model on each bootstrapped sample of the population. Each tree is grown by recursively partitioning the population with the objective of minimizing the mean square error. At each split, a subset of candidate variables is tested for the split optimization, and each node is divided into two successor nodes. Each successor node is then split again until the process reaches the stopping criterion of either maximum node purity or minimum node size, which defines the set of terminal (unsplit) nodes for the tree. The RF algorithm then assigns each training set observation to one unique terminal node per tree. The RF estimate for each observation is calculated by averaging the terminal node results across the collection of trees. A basic pseudo-algorithm explaining the RF procedure is presented in Table 1 and Figure 2. The resampling and averaging procedure circumvents the problems of overfitting and multicollinearity, making this approach suitable for our study (Cutler et al., 2007; Díaz-Uriarte & De Andres, 2006; Prasad et al., 2006; Zhang & Ma, 2012). The RF algorithm can be tuned to reduce the prediction error (Boulesteix et al., 2012; Breiman, 2001; Strobl et al., 2009). The accuracy of the RF output mainly depends on three parameters: (1) the number of trees (ntrees) to grow in the forest, (2) the number of randomly selected predictor variables (mtry) at each node, and (3) the minimal number of observations at the terminal nodes (nodesize) of the trees. We set the number of trees (ntrees) to 1,000 as suggested by Hengl et al. (2018) and Probst and Boulesteix (2017), and we randomly sampled different combinations of parameter sets, with “mtry” ranging from one to the total number of variables considered (60) and “nodesize” ranging from one to the total number of catchments in each regime. The combination of “mtry” and “nodesize” with the least out-of-bag error is considered the optimal parameter set.
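The bootstrap, split, average, and out-of-bag steps described above can be illustrated with a deliberately small pure-Python forest. This is a teaching sketch under stated assumptions, not the software used in the study: the helper names (`grow`, `oob_error`, `tune`) are hypothetical, a node is left unsplit once it holds `nodesize` or fewer bootstrap rows, and the search simply draws `mtry` and `nodesize` uniformly at random.

```python
import random

def avg(values):
    values = list(values)
    return sum(values) / len(values)

def sse(y, rows):
    mu = avg(y[i] for i in rows)
    return sum((y[i] - mu) ** 2 for i in rows)

def grow(X, y, rows, mtry, nodesize, rng):
    """Grow one regression tree by recursive binary splitting on squared error."""
    if len(rows) <= nodesize:
        return avg(y[i] for i in rows)               # terminal node
    best = None
    for j in rng.sample(range(len(X[0])), mtry):     # random candidate variables
        for cut in sorted({X[i][j] for i in rows})[:-1]:
            left = [i for i in rows if X[i][j] <= cut]
            right = [i for i in rows if X[i][j] > cut]
            score = sse(y, left) + sse(y, right)
            if best is None or score < best[0]:
                best = (score, j, cut, left, right)
    if best is None:                                  # no variable can be split
        return avg(y[i] for i in rows)
    _, j, cut, left, right = best
    return (j, cut, grow(X, y, left, mtry, nodesize, rng),
            grow(X, y, right, mtry, nodesize, rng))

def predict(tree, x):
    while isinstance(tree, tuple):
        j, cut, left, right = tree
        tree = left if x[j] <= cut else right
    return tree

def oob_error(X, y, ntrees, mtry, nodesize, seed=0):
    """Mean squared error using, for each row, only trees that did not see it."""
    rng = random.Random(seed)
    n = len(X)
    preds = [[] for _ in range(n)]
    for _ in range(ntrees):
        boot = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        tree = grow(X, y, boot, mtry, nodesize, rng)
        for i in set(range(n)) - set(boot):           # out-of-bag rows
            preds[i].append(predict(tree, X[i]))
    return avg((avg(p) - y[i]) ** 2 for i, p in enumerate(preds) if p)

def tune(X, y, ntrees, draws, seed=0):
    """Random search: return (oob_error, mtry, nodesize) with the lowest error."""
    rng = random.Random(seed)
    trials = []
    for _ in range(draws):
        mtry = rng.randint(1, len(X[0]))
        nodesize = rng.randint(1, len(X))
        trials.append((oob_error(X, y, ntrees, mtry, nodesize, seed), mtry, nodesize))
    return min(trials)
```

Because each observation is scored only by trees whose bootstrap sample excluded it, the out-of-bag error acts as a built-in validation estimate, which is why it can rank parameter combinations without a separate hold-out set.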
3.4 Framework for Interpreting RF Algorithm
We interpreted the RF model by examining three aspects: variable importance, variable interaction, and partial dependence. Variable importance and interaction are based on the concepts of maximal subtrees and minimal depth (Ishwaran et al., 2010), whereas partial dependence is estimated by integrating out the effects of all the variables other than the covariate of interest (Breiman, 2001). The concept of minimal depth allows us to identify the dominant variables, whereas partial dependence quantifies the approximate relationship between each dominant variable and the drought characteristic. The concept of interactive depth allows us to understand the interaction among the dominant climate, catchment, and morphological variables related to a particular drought characteristic.
3.4.1 Minimal Depth
The concept of minimal depth (Diller et al., 2012; Hsisch et al., 2011; Ishwaran et al., 2010) is useful for assessing variable importance and variable interactions within an RF modeling framework. The minimal depth of a variable in an RF can be formulated precisely in terms of a maximal subtree. The maximal subtree for a variable v is the largest subtree whose root node is split on variable v. The shortest distance from the root of the tree to the root of the closest maximal subtree of v is the minimal depth of v.
Averaged over the forest, the minimal depth of variable v can be written as

$$MD(v) = \frac{1}{N_{RF}} \sum_{T=1}^{N_{RF}} \lVert T(v) \rVert$$

where $N_{RF}$ is the total number of trees (i.e., ntrees = 1,000) and $\lVert T(v) \rVert$ represents the distance of variable v from the root of tree T. To illustrate the concepts of maximal subtree and minimal depth of variable v, we show three separate randomized trees (Figure 3) that mimic the behavior of an RF. The depth of variable v is identified for all the maximal subtrees and averaged across all the randomized trees to calculate the minimal depth MD(v). A smaller MD(v) value indicates that the corresponding variable v is more influential. Variables whose averaged minimal depth exceeds the average minimal depth threshold are treated as noisy and are therefore removed from the final model.
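The averaging and thresholding described above can be sketched on toy trees. The representation is hypothetical: each internal node is a tuple `(variable, left, right)` and a leaf is `None`. Assigning the full tree depth to a variable that never splits in a tree is an assumption made here for illustration, as are the variable names in the usage note.

```python
def minimal_depth(tree, v, depth=0):
    """Depth of the shallowest node that splits on v, or None if v is unused."""
    if tree is None:
        return None
    var, left, right = tree
    if var == v:
        return depth
    found = [d for d in (minimal_depth(left, v, depth + 1),
                         minimal_depth(right, v, depth + 1)) if d is not None]
    return min(found) if found else None

def tree_depth(tree):
    """Depth of the deepest leaf, counting the root as depth 0."""
    if tree is None:
        return 0
    return 1 + max(tree_depth(tree[1]), tree_depth(tree[2]))

def forest_minimal_depth(forest, variables):
    """Average minimal depth of each variable over all trees."""
    md = {}
    for v in variables:
        depths = []
        for tree in forest:
            d = minimal_depth(tree, v)
            # Assumption: an unused variable is assigned the depth of the tree.
            depths.append(tree_depth(tree) if d is None else d)
        md[v] = sum(depths) / len(depths)
    return md

def select(md):
    """Keep variables whose minimal depth does not exceed the mean of all."""
    threshold = sum(md.values()) / len(md)
    return [v for v in sorted(md, key=md.get) if md[v] <= threshold]
```

For three toy trees such as `("bfi", ("elev", None, None), ("precip", None, None))`, `("bfi", ("precip", ("elev", None, None), None), None)`, and `("elev", ("bfi", None, None), ("slope", None, None))`, the variable that splits closest to the root most often receives the smallest average depth and is retained.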
3.4.2 Interactive Depth
The interactive depth of variable w with respect to variable v is the depth of w inside the maximal subtree of v, normalized by the depth of that subtree and averaged over the forest. Consistent with the definitions below, it can be written as

$$ID(v,w) = \frac{1}{N_{RF}} \sum_{MT} \frac{\lVert MT(v,w) \rVert}{MTD(v)}$$

where $N_{RF}$ is the total number of trees (i.e., ntrees = 1,000), $\lVert MT(v,w) \rVert$ represents the distance between variables v and w measured from the root of any maximal subtree MT, and MTD(v) is the depth of the maximal subtree MT(v). Based on this formulation, ID(v,w) ranges between 0 and 1, and interactive depth (ID) values closer to zero indicate a stronger interaction between the two variables considered. Figure 3c illustrates such an interaction between variables v and w, where the right maximal subtree of variable v splits further on w inside the subtree. If this pattern is observed across all the randomized trees, then there is a significant interaction between variables v and w, and they collectively influence the prediction outcome.
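One possible reading of the interactive depth is sketched below on toy trees stored as `(variable, left, right)` tuples with `None` leaves. The details are assumptions made for illustration rather than the authors' implementation: the depth of w is measured from the root of the shallowest v-subtree, it is set to the full subtree depth when w never splits inside that subtree, and trees in which v never splits are skipped.

```python
def shallowest(tree, v, depth=0):
    """Return (depth, subtree) of the shallowest node splitting on v, or None."""
    if tree is None:
        return None
    if tree[0] == v:
        return (depth, tree)
    hits = [h for h in (shallowest(tree[1], v, depth + 1),
                        shallowest(tree[2], v, depth + 1)) if h is not None]
    return min(hits, key=lambda h: h[0]) if hits else None

def depth_of(tree):
    """Depth of the deepest leaf, counting the root as depth 0."""
    return 0 if tree is None else 1 + max(depth_of(tree[1]), depth_of(tree[2]))

def interactive_depth(forest, v, w):
    """Average normalized depth of w inside the maximal v-subtree."""
    ratios = []
    for tree in forest:
        hit = shallowest(tree, v)
        if hit is None:
            continue                       # v never splits in this tree
        sub = hit[1]
        size = depth_of(sub)
        # Look for w strictly below the root of the v-subtree.
        below = [h for h in (shallowest(sub[1], w, 1),
                             shallowest(sub[2], w, 1)) if h is not None]
        rel = min(h[0] for h in below) if below else size
        ratios.append(rel / size)
    return sum(ratios) / len(ratios) if ratios else None
```

Values near 1 arise when w rarely or never splits beneath v, whereas smaller values arise when w tends to split soon after v.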
3.4.3 Partial Dependence
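The partial dependence of the response on a variable $x_k$ is estimated by averaging the RF predictions over the observed values of all the other variables:

$$\bar{f}(x_k) = \frac{1}{n} \sum_{i=1}^{n} \hat{f}(x_k, x_{i,-k})$$

where $\hat{f}$ represents the outputs based on the RF models, n is the number of observations, and $x_{i,-k}$ denotes the values of all variables other than $x_k$ for observation i. This partial dependence estimate can be visualized to understand the functional relationship between the variables ($x_k$) and their potential influence on hydrological droughts. Because the RF algorithm randomly resamples the variables when bagging the trees, we run each model 1,000 times and then average the minimal and interactive depth values to interpret and identify the dominant variables.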
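The partial dependence estimate can be sketched for any fitted model that maps a list of predictor values to a prediction. The function name, argument names, and toy model below are hypothetical; the averaging itself is the standard partial dependence calculation.

```python
def partial_dependence(model, rows, k, grid):
    """Average model output over the data with predictor k fixed at each grid value.

    model: callable taking one list of predictor values and returning a number.
    rows:  observed predictor rows (one list per catchment).
    k:     index of the predictor of interest.
    grid:  values of predictor k at which the curve is evaluated.
    """
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            x = list(row)      # copy so the observed data are not modified
            x[k] = value       # overwrite only the predictor of interest
            total += model(x)
        curve.append(total / len(rows))
    return curve

# Toy model for illustration only: quadratic in x0, linear in x1.
toy_model = lambda x: x[0] ** 2 + x[1]
curve = partial_dependence(toy_model, [[0.0, 1.0], [5.0, 3.0]], k=0, grid=[0.0, 1.0, 2.0])
```

Plotting the returned curve against the grid shows the shape of the relationship (for example, monotonic, saturating, or parabolic) with the influence of the other predictors averaged out.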
4 Results and Discussions
4.1 Classification of Drought Regimes
The fuzzy k-medoids clustering approach was applied to the 652 catchments to classify the drought regimes based on drought intensity (DI), drought duration (DD), and number of drought events (ND). First, we identified the optimal number of regimes based on the fuzzy silhouette (FS) index and the XB index. Figure 4a shows the behavior of the FS and XB indices with respect to the number of regimes. The optimal number of clusters appears to be three, where FS reaches its maximum and XB its minimum. Therefore, we use three clusters for further analysis.
The drought characteristics (i.e., DI, DD, and ND) for the three selected drought regimes are shown in Figures 4b and 4d. Because the units of DI, DD, and ND are different, we applied the concept of the Z score to standardize and compare the drought characteristics across the three regimes. The Z score measures how many standard deviations a data point lies from the population average. Boxplots of the Z scores are plotted in Figure 4b so that the drought characteristics can be compared among the identified regimes. The number of catchments representing each regime is shown in Figure 4c. The absolute values of the drought characteristics for each drought regime are plotted as probability distributions in Figure 4d. Regime 1 is represented by 142 catchments with longer droughts (median DDzs~1), lower drought intensities (median DIzs~−0.75), and fewer occurrences (median NDzs~−1). The magnitude of DD for regime 1 varies between 7 and 20 months, whereas the magnitudes of DI and ND vary from 0.4 to 0.8 and from 5 to 20, respectively. Regime 2 is represented by 242 catchments that exhibit relatively moderate drought characteristics, with median Z scores close to 0 (Figure 4b). For the catchments in regime 2, DD varies between 7 and 12 months, DI between 0.5 and 0.9, and ND between 15 and 25. The largest number of catchments (268) is located in regime 3, which represents droughts of short duration (median DDzs~−0.8) that occur frequently (median NDzs~0.75) with higher intensity (median DIzs~1). The catchments in regime 3 experience droughts with durations between 5 and 8 months, intensities between 0.7 and 1.2, and frequencies between 20 and 30 (Figure 4d).
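The standardization used to compare DI, DD, and ND across regimes can be sketched as below. The function names and the sample numbers are hypothetical, and the population standard deviation is assumed.

```python
from statistics import mean, median, pstdev

def z_scores(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def regime_medians(values, regimes):
    """Median Z score of one drought characteristic within each regime."""
    z = z_scores(values)
    return {r: median(zi for zi, ri in zip(z, regimes) if ri == r)
            for r in sorted(set(regimes))}

# Hypothetical drought durations (months) for eight catchments and their regimes.
durations = [2, 4, 4, 4, 5, 5, 7, 9]
regimes = [3, 3, 3, 2, 2, 2, 1, 1]
medians = regime_medians(durations, regimes)
```

Because each characteristic is standardized separately, the resulting medians are unitless and can be placed on a common axis.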
The spatial locations of the catchments in the three drought regimes are shown in Figure 5. The catchments located in the Pacific Northwest and in parts of the northeastern and central United States represent regime 1, with droughts that are longer but less intense and less frequent. The catchments representing regime 2, with moderate drought characteristics, are found in different parts of the CONUS, and the catchments representing regime 3 are mostly located in the north central and eastern United States, including watersheds in the Pacific Northwest region. Overall, the spatial proximity between catchments plays a considerable role in the clustering of regimes, which is probably due to similar climatological variability and catchment response characteristics (Brutsaert & Nieber, 1977; Knapp et al., 2002; Serrano, 2006).
4.2 RF Model Performance
Multicollinearity in data analysis can make it difficult to obtain appropriate linear coefficient estimates with small standard errors (Achen, 1982). Our analysis is different because it applies the RF algorithm, which is nonlinear, and we do not rely on any regression coefficients. Therefore, even though the predictors are linearly correlated, the correlation does not interfere with our analysis. In addition, the multicollinearity problem is alleviated because a random subset of features is chosen for each tree in an RF (Hsilch et al., 2014; Ishwaran et al., 2010; Zhang & Ma, 2012; Díaz-Uriarte & De Andres, 2006).
As highlighted before, the primary purpose of our study is to identify the key climate, catchment, and morphological variables using a machine learning interpretation framework. Although our machine learning application is not focused on prediction, we performed a preliminary analysis to evaluate the performance of the optimized RF model by splitting the data into training (75%) and testing (25%) sets. The model performed well in terms of root-mean-square error, and the corresponding plots are presented in the supplementary text.
We evaluated the performance of the RF algorithm in modeling the variations in drought characteristics (DI, DD, and ND) in each regime with respect to the selected catchment and climate variables by applying it to the entire data set. As highlighted in section 3, the optimal RF parameters (i.e., mtry and nodesize), which were derived based on the least out-of-bag error, are listed in Table 2. The coefficient of determination (R2), percentage bias (PBIAS), and Nash-Sutcliffe efficiency (NSE) for the corresponding optimal model configurations are also listed in Table 2. R2 is greater than 0.9 for every RF model, which indicates that the adopted RF algorithm can explain more than 90% of the variance in the drought characteristics. The PBIAS values, expressed in percent, remain close to 0, indicating little bias in any of the RF models. Finally, the NSE values range from 0.77 to 0.85. An NSE of 1 corresponds to a perfect match between the modeled and observed data points, and NSE values greater than 0 indicate that the model predicts better than the mean of the observations. Hence, the NSE values also point toward an efficient model. In summary, all the models have a high coefficient of determination (R2 > 0.9), low PBIAS values, and NSE values close to 1.
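For reference, the three goodness-of-fit metrics can be computed as in the sketch below. The conventions are assumptions made here: R2 is taken as the squared Pearson correlation between observed and modeled values, and PBIAS uses the observed-minus-modeled sign convention, so positive values indicate underestimation.

```python
def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit; 0 is no better than the observed mean."""
    mo = sum(obs) / len(obs)
    residual = sum((o - s) ** 2 for o, s in zip(obs, sim))
    return 1.0 - residual / sum((o - mo) ** 2 for o in obs)

def pbias(obs, sim):
    """Percentage bias (observed minus modeled, relative to total observed)."""
    return 100.0 * sum(o - s for o, s in zip(obs, sim)) / sum(obs)

def r2(obs, sim):
    """Coefficient of determination as the squared Pearson correlation."""
    mo, ms = sum(obs) / len(obs), sum(sim) / len(sim)
    cov = sum((o - mo) * (s - ms) for o, s in zip(obs, sim))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_s = sum((s - ms) ** 2 for s in sim)
    return cov ** 2 / (var_o * var_s)

# Hypothetical observed and modeled values for illustration.
obs = [1.0, 2.0, 3.0, 4.0]
sim = [1.1, 1.9, 3.2, 3.8]
scores = (r2(obs, sim), pbias(obs, sim), nse(obs, sim))
```

A model that always predicts the observed mean has an NSE of exactly 0, which is why positive NSE values are read as skill relative to that baseline rather than as absence of bias.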
4.3 Application of Interpretation Framework to Understand Drought Characteristics
4.3.1 Application to Drought Duration
Figure 6 shows the ranking of the climate and catchment variables that have a potential influence on hydrological DD for each drought regime. As discussed earlier, variables with the smallest minimal depth are likely to exert the strongest control, whereas variables with larger minimal depth have less influence on drought duration within each regime. The dashed line (Figure 6) indicates the average minimal depth of all the variables, which can be used as a threshold to determine the significant variables of interest (Ishwaran et al., 2011). Based on this threshold, the significant variables are highlighted in green and the noninfluential variables in orange (Figure 6).
Overall, 20 variables are identified as significant by the average minimal depth threshold for regime 1, which represents catchments with higher drought duration (median DDzs~1). For regime 2, which represents catchments with average drought duration, 14 variables are identified as significant. Finally, for regime 3, which consists of catchments with lower drought duration (median DDzs~−0.8), a total of 11 variables are identified as significant. The number of influential climate and catchment variables therefore varies among the three drought regimes. For instance, catchment variables are the most numerous among the dominant controls on drought duration for catchments with low drought durations, whereas soil and climate variables dominate for catchments with high and medium drought durations, respectively. For catchments with high drought durations (regime 1), the base flow index (BFI_AVE) has a significantly smaller minimal depth than the other variables, suggesting its dominant role in that regime. The base flow index is a key variable because it captures the interaction between the climate and catchment variables that generate streamflow in a given watershed.
To further understand how these dominant variables interact with each other to influence drought duration, the normalized interactive minimal depth was plotted for the top 5 variables (Figure 7). As highlighted before, the normalized interactive minimal depth varies from 0 to 1, where 0 indicates strong interaction and 1 indicates no interaction between the selected variables. For regimes 1 and 2, the interactive minimal depth between the variables is close to 1, indicating little interaction between the dominant variables. However, the base flow index (BFI_AVE) appears to interact with the other variables, especially with the mean relief ratio (RR_MEAN) and the aspect with respect to geographical north (ASPECT_NORTHNESS), in regime 1. For regime 3, the maximum number of days in a month with nonzero precipitation (WDMAX_BASIN) interacts with other variables, particularly with the length of streams per square kilometer (STREAMS_KM_SQ_KM) within the catchments. Overall, these results suggest no significant interaction between the dominant variables, although each has a direct influence on drought duration.
We further assessed the partial dependence of drought duration on the top 5 dominant variables (Figure 8). For regime 1 (Figure 8a), the base flow index (BFI_AVE) controls drought duration through a power law behavior. The relation between the base flow index and drought is often complicated. A higher base flow index can result in low-duration drought events, and as the magnitude of the base flow index increases, it follows a power law function with drought duration. In addition, the power law behavior extends over the entire range of drought duration, which suggests a greater control of base flow on higher drought durations. Mean elevation and the percentage of soils with low infiltration rate (HGC) exhibit nonlinear relationships; however, unlike the base flow index, they explain only part of the variability of drought duration, over a range of 12 to 13 months. For regime 2 (Figure 8b), the base flow index predominantly controls drought duration through a nonlinear relationship. However, it is interesting that the underlying functional relationship does not follow a power law, as it does in regime 1. Other variables, such as basin compactness (BASIN_COMPACTNESS), percentage of soils with low infiltration rate (HGC), aspect with respect to geographical north (ASPECT_NORTHNESS), and temperature variability (TMEAN_SD), also exhibit a nonlinear and inversely proportional functional dependence with drought duration. For regime 3, the maximum number of days in a month with nonzero precipitation (WDMAX_BASIN) plays a more important role than the base flow index. A left-truncated parabolic relationship can be observed, which indicates a nonlinear control of precipitation intensity on drought duration.
4.3.2 Application to Number of Drought Events
Figure 9 shows the ranking of catchment and climate variables that have potential influence on ND events within each regime. Overall, 23 variables have more than average minimal depth for regime 1, which represents catchments with fewer drought occurrences (median NDzs~−1). A total of 13 variables with more than average minimal depth are selected for regime 2, which represents catchments with a moderate number of drought events, whereas 14 variables have more than average minimal depth for regime 3. Although the dominant variables identified as controlling drought duration differ between regimes, similar variables within each regime dominate both drought duration and drought event occurrences.
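The ranking in Figure 9 rests on the minimal depth of each variable, that is, the depth of the shallowest node in a tree that splits on that variable (root at depth 0), averaged over the forest. The sketch below illustrates this bookkeeping on two hand-built toy trees. It assumes the usual convention that variables splitting closer to the root are more influential; the nested-dictionary tree layout, the label `ELEV`, and the `unused_depth` value assigned when a tree never splits on a variable are all hypothetical choices made for the example.

```python
def minimal_depths(tree, depth=0, found=None):
    """Return {variable: depth of the shallowest node splitting on it} for one tree."""
    if found is None:
        found = {}
    if tree is None:  # a leaf has no split
        return found
    variable = tree["split_on"]
    if variable not in found or depth < found[variable]:
        found[variable] = depth
    minimal_depths(tree.get("left"), depth + 1, found)
    minimal_depths(tree.get("right"), depth + 1, found)
    return found


def forest_minimal_depth(trees, variables, unused_depth):
    """Average minimal depth per variable; unused_depth is an assumed fallback."""
    per_tree = [minimal_depths(tree) for tree in trees]
    return {
        v: sum(d.get(v, unused_depth) for d in per_tree) / len(per_tree)
        for v in variables
    }


# Two toy trees; ELEV is a hypothetical label for mean elevation
tree_a = {
    "split_on": "BFI_AVE",
    "left": {"split_on": "ELEV", "left": None, "right": None},
    "right": {"split_on": "HGC", "left": None, "right": None},
}
tree_b = {
    "split_on": "ELEV",
    "left": {"split_on": "BFI_AVE", "left": None, "right": None},
    "right": None,
}

average_depth = forest_minimal_depth(
    [tree_a, tree_b], ["BFI_AVE", "ELEV", "HGC"], unused_depth=3
)
ranking = sorted(average_depth, key=average_depth.get)
print(average_depth, ranking)
```

In practice these quantities are read from the fitted forest rather than from hand-built trees; the example only shows how the per-tree depths are combined into a ranking.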
The interaction between the five most dominant variables within each regime for the number of drought events is illustrated in Figure 10. Similar to the case of drought duration, the lowest interactive depth was observed for the base flow index (BFI_AVE), which has some interaction with the mean relief ratio (RR_MEAN) in regime 1. However, no such significant interactions were observed in regime 2, where the ID values are relatively high.
Figure 11a illustrates the partial dependence between variables specific to regime 1. It can be observed that the variables that have potential influence on drought duration also influence drought event occurrences. BFI_AVE is inversely proportional to drought occurrences, following an exponential relationship. An increase in base flow is likely to increase the groundwater contribution to streamflow, resulting in fewer droughts. Elevation exhibits an inverse relationship up to 2,000 m and then a directly proportional relationship up to 3,000 m, whereas HGC exhibits a semi parabolic relationship. The other variables do not explain much of the variability of drought event occurrences.
Overall, it was observed that the RF modeling framework is flexible enough to accommodate different functional relationships between the dominant variables and the number of drought events. In regime 2, the variability of temperature (TMEAN_SD) is the more dominant control on drought event occurrences, whereas BFI_AVE shares an inversely proportional relationship for the same regime. The other three selected variables exhibit dominant and different functional relationships, as shown in Figure 11b. Bulk density of the soil (BD_AVE) is a key variable that has potential influence on drought event occurrences in regime 3 (Figure 11c). However, it does not explain the variability of the entire range of drought event occurrences, as the dominant variables do in the other two regimes. WDMAX_BASIN was able to explain the variability of drought occurrences at the higher end, which BD_AVE does not capture. This highlights the complementary behavior of the climate and catchment characteristics in controlling drought event occurrences.
4.3.3 Application to Drought Intensity
Climate and catchment variables are ranked based on their potential influence on DI (Figure 12). A total of 24 variables have more than average minimal depth for regime 1, which represents catchments with lower drought intensity (median DIzs~−1). In comparison to regime 1, fewer influencing variables were observed for regimes 2 and 3. A total of 10 and 11 variables have more than average minimal depth for regimes 2 and 3, which represent catchments with moderate and higher drought intensity, respectively. The types of variables that dominate drought intensity are mostly similar to those for drought duration and number of drought events. Overall, it was observed that the majority of the variables with potential influence on drought intensity are related to soil, climate, and catchment characteristics in regimes 1, 2, and 3, respectively.
The interaction between the top five dominant variables within each regime for drought intensity is illustrated in Figure 13. In regime 1, none of the dominant variables shows any significant interaction, as the ID values are close to 1. However, in regime 2, temperature variability (TMEAN_SD) exhibits potential interaction with other dominant variables, with an ID value of around 0.7. In regime 3, the average clay content (CLAYAVE) in the basin exhibits significant interactive effects. Among the other variables, the percentage of soil with high infiltration (HGA) exhibits significant interaction with CLAYAVE. As with the other drought properties, the interaction effects for drought intensity are significant but to a lesser extent, as shown by the relatively high ID values.
Figure 14a illustrates the partial dependence specific to regime 1. The percentage of soils with high infiltration capacity (HGA) has a dominant role and shares a directly proportional relationship with drought intensity. The presence of soils with high infiltration is likely to create a competition between groundwater recharge and streamflow. Hence, drier antecedent conditions may result in more intense droughts. On the other hand, forest cover (FOREST) has an inversely proportional relationship with drought intensity. Base flow index also influences the drought intensity, but not as significantly as in other cases. Mean aspect degree (Aspect_Degrees) shares a directly proportional relationship with drought intensity.
In regime 2, the temperature variability (TMEAN_SD) exhibits the most dominant control on drought intensity, similar to the case of number of drought events. However, the functional relationship is opposite in nature. The percentage of streams in Strahler's fourth order (PCT_4th_ORDER) exhibits an inverse exponential relationship with the drought intensity. Additional variables, such as rainfall factor (R_FACT), silt content (SILT_AVE), and precipitation variability (PRCP_CV), also influence the intensity of drought. In regime 3, basin compactness (BAS_COMPACTNESS) and base flow index (BFI_AVE) exhibit dominant control on drought intensity. However, the remaining variables are not as dominant as in the case of other clusters.
5 Discussion and Outlook
Hydrological drought in a catchment is controlled by the climate characteristics (recharge) and catchment characteristics (storage). Based on our interpretation framework using the MD, ID, and partial dependence metrics, the important climate and catchment characteristics that control hydrological drought characteristics (number of events, duration, and intensity) are provided in Table 3. It was observed that the catchments in regime 1 are mostly located at higher elevations or in mountainous regions characterized by steeply sloping terrain (Figure 3). The hydrological drought characteristics for regime 1 are mostly influenced by catchment characteristics, which include base flow index, elevation, and soil characteristics (infiltration rates). Baseflow is influenced by natural factors such as climate, geology, relief, soils, and vegetation. Factors that promote infiltration and recharge of subsurface storage will increase baseflow, while factors associated with higher evapotranspiration will reduce baseflow. Therefore, in these catchments, groundwater drainage moves slowly, which results in prolonged baseflow following rainfall events and thus makes baseflow more influential in generating hydrological droughts. Interestingly, elevation is a standalone catchment characteristic that plays an important role for drought characteristics in regime 1. Most of these catchments receive snow during winter months; therefore, the hydrological drought can be influenced by a combination of rain and snow, and the timing and intensity of drought can vary among the watersheds depending on the difference between their elevation ranges.
In addition to elevation, the soil characteristic (e.g., infiltration and hydraulic conductivity) is a key variable for hydrological droughts in regime 1. In these catchments, subsurface flow generation is directly proportional to the hydraulic conductivity of soils, which thus controls the discharge rate specific to soil types (e.g., Armbruster, 1976; Musiake et al., 1984; Smith, 1981). In addition, soil properties are known to affect infiltration, rooting depth/restrictions, available water capacity, soil porosity, and soil microorganism activity, which influence the streamflow discharge rate (Bennie et al., 2008; Moeslund et al., 2013; Strachan & Daly, 2017). The moisture storage capacity in soil decreases due to reduced precipitation and high evapotranspiration, which further reduces baseflow and leads to the evolution of hydrological droughts in different segments of the hydrological system. Streamflow generation in base flow dominant streams is therefore strongly influenced by the subsurface hydrogeologic configuration, the saturated permeabilities of the component formations, and the unsaturated soil characteristics of the soil types (Freeze, 1972). Hence, in addition to the topography, the soil features may control the hydrologic drought properties through these physical processes. Further, the forest cover influences drought intensity more than duration and number of events.
Some of the catchment characteristics that control hydrologic drought in regime 1 also influence drought characteristics in regimes 2 and 3. However, there is a clear difference between regime 1 and regimes 2 and 3 in terms of climate control on hydrological droughts. Climate factors such as precipitation may have less direct influence on hydrological droughts in regime 1, which can be attributed to the limited time available to store water in comparatively higher gradient watersheds as well as the possible contribution of snow for the watersheds located in snowy regions (e.g., northeast and central north watersheds). The climate variables that have a potential influence on hydrological drought in regimes 2 and 3 include temperature and precipitation. For example, temperature can have a direct influence on the development of hydrological drought in snow dominated regions. The combined effect of elevation and temperature on triggering hydrological droughts can vary because snow dominated regions are located in mountain regions. The role of precipitation characteristics in the propagation of hydrological drought is well recognized (Mishra & Singh, 2010; Mukherjee et al., 2018; Wan et al., 2017). The amount of rainwater held in storage differs among the three regimes; for example, higher elevation areas can hold less rainwater than low lying forested areas. The rainfall pattern in semiarid regions (typically western USA) is very irregular, leading to very low storage and an increase in hydrological drought.
However, for the humid catchments located in regimes 2 and 3, the soils are mostly saturated due to the antecedent climate conditions, which results in a more direct relationship of precipitation, potential evapotranspiration, and temperature with hydrologic drought characteristics. In addition to precipitation and temperature, lower relative humidity can influence rainfall patterns, leading to the evolution of hydrological drought in regime 3. The role of relative humidity in the evolution of drought is complex in nature. During dry hydrologic conditions, moisture depletes from the upper soil layers, leading to a decrease in evapotranspiration and atmospheric relative humidity (Mishra & Singh, 2010). Further, the reduced relative humidity reduces the probability of rainfall, which further triggers hydrological drought (Mishra & Singh, 2010).
In regimes 2 and 3, the stream networks defined by Strahler number have a potential influence on the hydrological drought. An increase in Strahler order can be related to a decline in the catchment's general slope (Haidary et al., 2015) and a potential increase in the storage capacity, groundwater recharge, and baseflow of the watershed. The first order streams, which represent the outermost tributaries, are therefore typically located on steeper slopes than fourth order streams. This suggests that the higher storage capacity likely to be observed in fourth order streams provides better control on hydrological drought than in first order streams. A first order stream, which is usually located at the upper end of a channel network (Strahler, 1952), has a comparatively larger slope and is likely to drain excess water immediately following a precipitation event (McMahon & Finlayson, 2003). As a result, if there is a deficit in precipitation or an increase in evapotranspiration in the catchment, the first order streams are likely to facilitate a more direct propagation of meteorological drought to hydrological drought with no buffer (Godsey & Kirchner, 2014; McMahon & Finlayson, 2003; Pinna et al., 2004). Consequently, the presence of first and fourth order streams might influence the drought properties in contrasting ways.
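For readers less familiar with stream ordering, the following sketch computes Strahler numbers on a small, invented channel network. A headwater stream with no tributaries is first order, and the order increases by one only where two or more tributaries of the same highest order join. The dictionary layout (each reach mapped to its upstream tributaries) and the reach names are assumptions made for the illustration.

```python
def strahler_order(network, reach):
    """Return the Strahler order of a reach given its upstream tributaries."""
    tributaries = network.get(reach, [])
    if not tributaries:
        return 1  # headwater (first order) stream
    orders = [strahler_order(network, t) for t in tributaries]
    highest = max(orders)
    # Order increases only when at least two tributaries share the highest order
    return highest + 1 if orders.count(highest) >= 2 else highest


# Invented network: two second order streams join at the outlet
balanced = {
    "outlet": ["j1", "j2"],
    "j1": ["h1", "h2"],
    "j2": ["h3", "h4"],
}

# Invented network: a second order stream is joined by a single headwater
uneven = {
    "outlet": ["j1", "h3"],
    "j1": ["h1", "h2"],
}

balanced_order = strahler_order(balanced, "outlet")
uneven_order = strahler_order(uneven, "outlet")
print(balanced_order, uneven_order)
```

The second network shows why a catchment can contain many first order streams without reaching a high order at the outlet, which is relevant when the percentage of streams in a given order is used as a predictor.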
In regimes 2 and 3, the mean aspect degree was found to be an important variable in controlling the hydrologic drought properties. Mean aspect degree is often associated with variability in microclimate, including near-surface temperatures, evaporative demand, soil moisture content, and vegetation (Strachan & Daly, 2017; Srinivasan et al., 2015; Moeslund et al., 2013). As a result, the mean aspect degree, which is a topographic metric, controls the microclimatic and vegetation features that are likely to control drought characteristics. It was observed that, in addition to the common processes that control the hydrologic drought characteristics, there are distinct processes specific to each regime. This highlights the differential nature of climate and catchment control on hydrological droughts. Under humid conditions, the evolution of hydrological droughts in small catchments can be attributed more to climate characteristics, whereas for larger watersheds the storage capacity and baseflow associated with catchment characteristics can play a dominant role. For catchments under severely dry conditions, the climate signal can have less predictive power than the storage properties of the watershed.
The application of the RF algorithm can provide a better understanding of how the climate and catchment controls differ for a specific hydrological drought characteristic. For instance, previous studies highlighted that the BFI, which represents the storage characteristics, plays a key role in controlling the drought duration (Van Lanen et al., 2013; Van Loon & Laaha, 2015). Our framework reconfirms the role of BFI in regimes 1 and 2; however, we further identified two important distinctions. First, the relationship between BFI and drought characteristics can be nonlinear, and as a result, the increase in drought duration with BFI cannot be generalized. Second, base flow acts as a dominant process mostly for the catchments that witness medium and high duration drought events, whereas it has less influence for the catchments with lower drought durations. A linear regression approach may not capture such phenomena, as the model parameters are more biased toward high magnitude variables (Hastie et al., 2009).
Our empirical analysis suggests lack of prominent interactions between dominant variables on hydrological drought propagation. This highlights the fact that dominant drivers of drought characteristics are more additive and independent in nature. For example, the percentage of soils with high infiltration rate (HGA) and percentage of fourth order streams (PCT_4th_ORDER) both have dominant control on drought intensity but have minimal interaction effect. As a result, even though both of these variables control the propagation of drought independently, the underlying processes for drought propagation have minimum interaction with each other.
Our results also indicated that even though similar dominant controls exist across different regimes, their functional relationship with drought characteristics might be different, as highlighted in Jencso and McGlynn (2011) and Knapp et al. (2015). For instance, we identified that the base flow index controls drought duration for both regimes 1 and 2. However, the functional relationship of base flow with respect to drought duration is different in regimes 1 and 2. In regime 1, the base flow index exhibits an exponential relationship with drought duration, whereas in regime 2, a different form of nonlinear relationship exists, which does not fit into the traditional exponential functions. Therefore, even though the catchment and climate characteristics exhibit a nonlinear relationship with respect to drought characteristics, the relationships across regions with different hydrologic characteristics should not be generalized.
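One simple way to examine which functional form a dependence curve resembles is to compare straight-line fits in transformed coordinates: a power law is linear in log-log space, whereas an exponential is linear when only the response is log-transformed. The sketch below is an illustrative diagnostic under that assumption and was not part of the analysis reported here; the sample curves and coefficients are invented, and all values must be positive for the logarithms to exist.

```python
import math


def linear_r2(x, y):
    """Coefficient of determination of a least-squares straight line through (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def compare_forms(x, y):
    """Return R^2 of the linearized power-law and exponential forms."""
    log_x = [math.log(v) for v in x]
    log_y = [math.log(v) for v in y]
    return {
        "power_law": linear_r2(log_x, log_y),  # log y = log a + b log x
        "exponential": linear_r2(x, log_y),    # log y = log a + b x
    }


# Invented curves: predictor values and two synthetic responses
x = [0.1, 0.2, 0.4, 0.8]
power_curve = [12.0 * v ** -0.5 for v in x]
exp_curve = [15.0 * math.exp(-2.0 * v) for v in x]

power_fit = compare_forms(x, power_curve)
exp_fit = compare_forms(x, exp_curve)
print(power_fit, exp_fit)
```

A curve for which neither linearized fit is close to 1 would correspond to the situation described for regime 2, where the relationship does not match either traditional form.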
Our interpretative modeling framework also highlighted that the influence of dominant variables can vary over the range of a drought characteristic. In other words, an individual climate (catchment) characteristic can have a higher (lower) influence on the variability in the upper (lower) range of a drought characteristic. For instance, the bulk density may affect the soil features which control the drought intensity in regime 1. However, it does not explain the variability of the entire range of drought event occurrences, as the dominant variables do in the other two regimes. In contrast, WDMAX_BASIN is able to explain the variability of drought occurrences at the higher end, which BD_AVE does not capture. This highlights the complementary behavior of the climate and catchment characteristics in controlling drought event occurrences.
The framework presented in this study introduces valuable interpretability components of the RF algorithm in the context of understanding hydrologic processes. Although we applied this framework to understand drought characteristics, other frameworks exist for interpreting black box models. These include individual conditional expectations (Goldstein et al., 2015; Guidotti et al., 2018), local interpretable model-agnostic explanations (Ribeiro et al., 2016), and influence functions (Koh & Liang, 2017), which were recently introduced in the machine learning literature. There is potential scope to compare these interpretability frameworks and the quality of machine learning algorithms. We believe our approach can serve as a preliminary avenue to delve deeper into the application of interpretative machine learning frameworks for understanding not only droughts but also other hydrologic processes. Therefore, further work along these directions might improve our understanding of hydrologic processes using interpretative machine learning algorithms.
6 Conclusions
In this study, we applied machine learning methods by integrating fuzzy clustering and the RF algorithm to develop an interpretation framework (i.e., minimal and interactive depth and partial dependence) to quantify the role of climate and catchment controls on hydrological drought for 652 catchments located in CONUS. The RF algorithm can adequately capture the functional relationship between climate and catchment characteristics and hydrological droughts. The proposed framework based on the MD, ID, and partial dependence metrics can identify the important climate and catchment characteristics, which can further improve our understanding of their dominant role in the propagation of hydrological droughts.
Using a large number of catchments under different climatic regimes enabled us to explore the dominant landscape controls on CONUS hydrological droughts. We conclude that the RF-based interpretative approach is a simple, robust, and yet powerful way to gain insights into the drivers of hydrological droughts. The applied framework can provide useful information to understand the different combinations of climate and catchment characteristics that can either attenuate or intensify hydrological droughts. The following conclusions can be drawn from this study: (i) Three drought regimes are identified based on their duration, frequency, and intensity, which include Regime 1: droughts with longer duration, less frequent, and lesser intensity; Regime 2: droughts with moderate duration, moderate frequency, and moderate intensity; and Regime 3: droughts with shorter duration, more frequent, and more intense; (ii) among the identified regimes, even though some common hydrologic processes control the drought characteristics, there are some distinct processes specific to each regime; (iii) similar climate, catchment, and morphological characteristics may exhibit varied functional relationships (i.e., exponential, hyperbolic, and linear) with drought characteristics in different regimes; and (iv) the dominant variables may not explain the variability of the entire range of drought characteristics. From the above insights, we propose that these issues deserve more attention by integrating the knowledge obtained from the application of machine learning algorithms to hydroclimatic processes (e.g., hydrological drought) with the hydrological models used for such analysis. Although hydrologic models are able to capture streamflow with reasonable accuracy, they often overestimate or underestimate extreme events such as extreme droughts.
This implies that a better understanding of the role of climate and catchment characteristics in the evolution and propagation of hydrological drought events is essential. The results obtained from our proposed machine learning framework can complement the ongoing research related to hydrological droughts by better exploiting the value of nonclimatic attributes (such as soil, land cover, and geology). In addition, a more systematic characterization of the uncertainties in catchment attributes needs to be performed.
Acknowledgments
We very much appreciate the valuable comments of the Associate Editor and three reviewers, which helped us improve our manuscript. This study was supported by the NSF award 1653841. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF. The authors used the GAGES II data set in this study, and these data are publicly available at https://water.usgs.gov/GIS/metadata/usgswrd/XML/gagesII_Sept2011.xml.