Toward Improved Predictions in Ungauged Basins: Exploiting the Power of Machine Learning

Long short‐term memory (LSTM) networks offer unprecedented accuracy for prediction in ungauged basins. We trained and tested several LSTMs on 531 basins from the CAMELS data set using k‐fold validation, so that predictions were made in basins that supplied no training data. The training and test data set included ∼30 years of daily rainfall‐runoff data from catchments in the United States ranging in size from 4 to 2,000 km2 with aridity index from 0.22 to 5.20, and including 12 of the 13 IGBP vegetated land cover classifications. This effectively “ungauged” model was benchmarked over a 15‐year validation period against the Sacramento Soil Moisture Accounting (SAC‐SMA) model and also against the NOAA National Water Model reanalysis. SAC‐SMA was calibrated separately for each basin using 15 years of daily data. The out‐of‐sample LSTM had higher median Nash‐Sutcliffe Efficiencies across the 531 basins (0.69) than either the calibrated SAC‐SMA (0.64) or the National Water Model (0.58). This indicates that there is (typically) sufficient information in available catchment attributes data about similarities and differences between catchment‐level rainfall‐runoff behaviors to provide out‐of‐sample simulations that are generally more accurate than current models under ideal (i.e., calibrated) conditions. We found evidence that adding physical constraints to the LSTM models might improve simulations, which we suggest motivates future research related to physics‐guided machine learning.


Introduction
Science and society are firmly in the age of machine learning (ML; McAfee & Brynjolfsson, 2017). ML models currently outperform state-of-the-art techniques at some of the most sophisticated domain problems across the natural sciences (e.g., AlQuraishi, 2019; He et al., 2019; Liu et al., 2016; Mayr et al., 2016). In hydrology, the first demonstration of ML outperforming a process-based model that we are aware of was by Hsu et al. (1995), who compared a calibrated Sacramento Soil Moisture Accounting Model (SAC-SMA) against a feed-forward artificial neural network across a range of flow regimes. More recently, Nearing et al. (2018) compared neural networks against the half-hourly surface energy balance of hydrometeorological models used operationally by several international weather and climate forecasting agencies, and showed that the former generally outperformed the latter at out-of-sample FluxNet sites. In a companion paper to this one, Kratzert et al. (2019) showed that a regionally trained long short-term memory (LSTM) network outperforms basin-specific calibrations of several traditional hydrology models and demonstrated that LSTM-type models were able to extract information from observable catchment characteristics to differentiate between different rainfall-runoff behaviors in hydrologically diverse catchments. The purpose of this paper is to show that we can leverage this capability for prediction in ungauged basins.
There is a long-standing discussion in the field of hydrology about the relative merits of data-driven versus process-driven models (e.g., Klemeš, 1986). In their summary of a recent workshop on "Big Data and the Earth Sciences," Sellars (2018) noted that "Many participants who have worked in modeling physical-based systems continue to raise caution about the lack of physical understanding of machine learning methods that rely on data-driven approaches." It is often argued that data-driven models might underperform relative to models that include explicit process representations in conditions that are dissimilar to training data (e.g., Kirchner, 2006; Milly et al., 2008; Vaze et al., 2015). While this may or may not be true (we are unaware of any study that has tested this hypothesis directly), in any case where an ML model does outperform a given process-based model, we can conclude that the process-based model does not take advantage of the full information content of the input/output data (Nearing & Gupta, 2015). At the very least, such cases indicate that there is potential to improve the process-based model(s).
One of the situations where the accuracy of out-of-sample predictions matters is prediction in ungauged basins (PUB). PUB was the decadal problem of the International Association of Hydrological Sciences (IAHS) from 2003 to 2012 (Hrachowitz et al., 2013; Sivapalan et al., 2003). State-of-the-art regionalization, parameter transfer, catchment similarity, and surrogate basin techniques (e.g., Parajka et al., 2013; Razavi & Coulibaly, 2012; Samaniego et al., 2017) result in streamflow predictions that are less accurate than those from models calibrated individually in gauged catchments. Current community best practices for PUB center around obtaining detailed local knowledge of a particular basin (Blöschl, 2016), which is expensive for individual catchments and impossible for large-scale (e.g., continental) simulations like those from the U.S. National Water Model (NWM; Salas et al., 2018) or the streamflow component of the North American Land Data Assimilation System (NLDAS; Xia et al., 2012). Moreover, Vrugt et al. (2006) argued that reliable streamflow predictions from lumped catchment models typically require at least 2 to 3 years of gauge data for calibration (even this is likely an underestimate of the amount of data necessary for reliable model calibration). PUB remains an important challenge because the majority of streams in the world are either ungauged or poorly gauged (Goswami et al., 2007; Sivapalan, 2003), and the number of gauged catchments, even in the United States, is shrinking (Fekete et al., 2015).
In this technical note, we demonstrate an ML strategy for PUB. Our results show that out-of-sample LSTMs outperform, on average, a conceptual model (SAC-SMA) calibrated independently for each catchment, and also a distributed, process-based model (NWM). The purpose of this demonstration is twofold. First, to show that there is sufficient information in the available hydrological data record to provide meaningful predictions in ungauged basins, at least a significant portion of the time. Second, to show that ML offers a promising path forward for extracting this information, and for PUB in general. The current authors are unaware of any existing model that performs as well, on average, as the LSTMs that we demonstrate here. At the end of this technical note we offer some thoughts, both philosophical and practical, about future work that could be done to advance the utility of ML in a complex systems science like hydrology.
To reemphasize our primary findings succinctly: ML in ungauged basins outperforms, on average (i.e., in more catchments than not), a lumped conceptual model calibrated in gauged basins, and also a state-of-the-art distributed process-based model. This rapid correspondence is intended to highlight initial results that might motivate continued development of these and similar techniques; it is not intended to be a comprehensive analysis of the application of LSTMs, or deep learning in general, to PUB.

Data
Experimental data for our analysis came from the publicly available Catchment Attributes and Meteorology for Large-Sample Studies (CAMELS) data set curated by the National Center for Atmospheric Research (NCAR; Newman et al., 2014; Newman et al., 2015). CAMELS consists of 671 catchments in the continental United States ranging in size from 4 to 25,000 km2. These catchments were chosen from the available gauged catchments in the United States because they are largely natural and have long gauge records available from the United States Geological Survey National Water Information System. CAMELS includes daily forcing from Daymet, Maurer, and NLDAS, as well as several static catchment attributes related to soils, climate, vegetation, topography, and geology (Addor et al., 2018). It is important to point out that these catchment attributes were derived from maps, remote sensing products, and climate data that are generally available over the continental United States and, either exactly or in close approximation, globally. For this project, we used only 531 of the 671 CAMELS catchments; these were the same basins that were used for model benchmarking by Newman et al. (2017), who removed basins from the full CAMELS data set with (i) large discrepancies between different methods of calculating catchment area and (ii) areas larger than 2,000 km2.

A Brief Overview of LSTM Networks
LSTMs are a type of recurrent neural network (RNN) first proposed by Hochreiter and Schmidhuber (1997). LSTMs have memory cells that are analogous to the states of a traditional dynamical systems model, which makes them useful for simulating natural systems like watersheds. Compared with other types of recurrent neural networks, LSTMs avoid exploding and/or vanishing gradients, which allows them to learn long-term dependencies between input and output features. This is desirable for modeling catchment processes like snow accumulation and seasonal vegetation patterns that have relatively long timescales compared with input-driven processes like direct surface runoff. LSTMs have previously been applied to the problem of rainfall-runoff modeling, and it was later demonstrated that the internal memory states of the network were highly correlated with observed snow and soil moisture states without the model seeing any type of snow or soil moisture data during training. The input x[t] is a vector containing features (model inputs) at time step t, so the LSTM is used like any standard hydrological simulation model (i.e., it is not a one-step-ahead forecast model). The LSTM model structure is described by the following equations:

i[t] = σ(W_i x[t] + U_i h[t-1] + b_i)
f[t] = σ(W_f x[t] + U_f h[t-1] + b_f)
g[t] = tanh(W_g x[t] + U_g h[t-1] + b_g)
o[t] = σ(W_o x[t] + U_o h[t-1] + b_o)
c[t] = f[t] ⊙ c[t-1] + i[t] ⊙ g[t]
h[t] = o[t] ⊙ tanh(c[t])

where x[t] is the cell input, and h[t-1] and c[t-1] are the hidden and cell states from the previous time step. At the first time step, the hidden and cell states are initialized as vectors of zeros. W, U, and b are calibrated parameters; these are specific to each gate, and subscripts indicate which gate the particular weight matrix/vector is associated with. σ(·) is the sigmoid activation function, tanh(·) the hyperbolic tangent function, and ⊙ is element-wise multiplication. The intuition is that the cell states (c[t]) characterize the memory of the system. These are modified by (i) the forget gate (f[t]), which allows attenuation of information in the states over time, and by (ii) a combination of the input gate (i[t]) and cell update (g[t]), which can add new information.
In the latter case, the cell update contains information to be added to each cell state, and the input gate (which is a sigmoid function) controls which cells are "allowed" to receive new information. Finally, the output gate (o[t]) controls the flow of information from states to model output.
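The gate equations above can be sketched in a few lines of NumPy. This is an illustrative single-time-step implementation only (the models in this study were trained with a deep learning framework, with 256 cell states and dropout); the stacked-parameter layout is our own convention for compactness:

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W (4n x m), U (4n x n), and b (4n,) hold the
    parameters for all four gates stacked in the order i, f, g, o."""
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    z = W @ x + U @ h_prev + b        # pre-activations for all gates, shape (4n,)
    n = h_prev.size
    i = sigmoid(z[0 * n:1 * n])       # input gate: which cells receive new info
    f = sigmoid(z[1 * n:2 * n])       # forget gate: attenuates old memory
    g = np.tanh(z[2 * n:3 * n])       # cell update: candidate new information
    o = sigmoid(z[3 * n:4 * n])       # output gate: controls flow to the output

    c = f * c_prev + i * g            # new cell state (the system "memory")
    h = o * np.tanh(c)                # new hidden state (the cell output)
    return h, c
```

With all weights set to zero, the sigmoid gates evaluate to 0.5 and the cell update to 0, so each step simply halves the previous cell state; nonzero trained weights are what let the network store and release information over long horizons.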

Experimental Design
The LSTMs used in this study took as inputs at each time step the NLDAS meteorological forcing data listed in Table 1. Additionally, at each time step, the meteorological inputs were augmented with the catchment attributes also listed in Table 1. These catchment attributes (described in detail by Addor et al., 2018) remain constant in time throughout the simulation (training and testing). In total we used 32 LSTM inputs at each daily time step: 5 meteorological forcings and 27 catchment characteristics. All LSTMs were configured to have 256 cell states with a dropout rate of 0.4 applied to the LSTM output before a single regression layer.
We trained and tested three types of LSTM models:
1. Global LSTM without static features: LSTMs with only meteorological forcing inputs, and without catchment attributes, trained on all catchments simultaneously (without k-fold validation).
2. Global LSTM with static features: LSTMs with both meteorological forcings and catchment characteristics as inputs, trained on all catchments simultaneously (without k-fold validation).
3. PUB LSTM: LSTMs with both meteorological forcings and catchment characteristics as inputs, trained and tested with k-fold validation (k = 12).
The third model is the one we want to test: it simulates in basins that are different from the ones the models were trained on. Out-of-sample testing was done by k-fold validation, which splits the 531 basins randomly into k = 12 groups of approximately equal size, uses all basins from k-1 groups to train the model, and then tests the model on the single group of holdout basins. This procedure is repeated k = 12 times so that out-of-sample predictions are available from every basin. The second model sets an upper benchmark for our PUB LSTMs. In particular, comparison between the second and third models tells us how much information was lost due to prediction in out-of-sample basins versus in-sample basins. Similarly, a comparison between the first and second models lets us evaluate the value of adding catchment attributes to the model inputs, since these are what will, at least potentially, allow the model to be transferable between catchments.
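The basin-wise holdout procedure described above can be sketched as follows (basin IDs and the seed are illustrative; the actual splitting code is in the repository linked in the availability section):

```python
import random

def kfold_basin_splits(basin_ids, k=12, seed=0):
    """Split basins into k roughly equal random groups and yield
    (train, test) pairs so that every basin appears in exactly one
    holdout (test) group across the k folds."""
    basins = list(basin_ids)
    random.Random(seed).shuffle(basins)
    folds = [basins[i::k] for i in range(k)]   # k groups of ~equal size
    for i in range(k):
        test = folds[i]
        train = [b for j, fold in enumerate(folds) if j != i for b in fold]
        yield train, test
```

With 531 basins and k = 12, each fold holds out 44 or 45 basins, and concatenating the 12 test groups recovers every basin exactly once, which is what allows out-of-sample metrics to be reported for all 531 catchments.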
For each model type we trained and tested an ensemble of N = 10 LSTM models to match the 10 SCE restarts used to calibrate the SAC-SMA models. All metrics reported in the Results section were calculated from the mean of the 10-member ensembles, except for the NWM reanalysis.
All LSTM models were trained on the first 15 years of CAMELS data (water years 1981-1995); this is the same data period that Newman et al. (2015) used to calibrate SAC-SMA. All models (LSTMs, SAC-SMA, and NWM) were evaluated on the last 15 years of CAMELS data (water years 1996-2010). LSTMs were trained and evaluated using a k-fold approach (k = 12). The training loss function was the average NSE over all training catchments; this is a squared-error loss function that, unlike a more traditional MSE loss function, does not overweight catchments with larger mean streamflow values (i.e., does not overweight large, humid catchments) (Kratzert et al., 2019).
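One way to implement such a basin-normalized squared-error loss is sketched below; the variable names and the ε stabilizer are illustrative (see Kratzert et al., 2019, for the exact formulation used in training):

```python
import numpy as np

def basin_averaged_nse_loss(sim, obs, basin_std, eps=0.1):
    """Squared-error loss where each basin's errors are normalized by the
    variability of its observed flow, so large, humid basins with high
    flow volumes do not dominate the gradient.

    sim, obs  : arrays of shape (n_basins, n_timesteps)
    basin_std : per-basin standard deviation of observed flow,
                computed from the training period
    """
    sq_err = (sim - obs) ** 2                     # (n_basins, n_timesteps)
    weights = 1.0 / (basin_std + eps) ** 2        # (n_basins,)
    return float(np.mean(weights[:, None] * sq_err))
```

The contrast with plain MSE is that an error of one standard deviation costs roughly the same in a small arid basin as in a large humid one, which is the property the paragraph above attributes to the NSE-based training loss.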

Results
A comparison between interpolated frequency distributions over the NSE values from 531 CAMELS catchments from all three LSTM models and both benchmark models (SAC-SMA, NWM) is shown in Figure 2. Mean and median values of several performance statistics are given in Table 2. Interpolation was done with kernel density estimation using Gaussian kernels and an optimized bandwidth.
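For reference, such interpolated distributions can be reproduced with a Gaussian-kernel density estimate; in this sketch, Silverman's rule of thumb stands in for the optimized bandwidth used in the paper, and the sample values are synthetic:

```python
import numpy as np

def gaussian_kde_pdf(samples, grid, bandwidth=None):
    """Kernel density estimate with Gaussian kernels evaluated on a grid.
    If no bandwidth is given, Silverman's rule of thumb is used."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    if bandwidth is None:
        bandwidth = 1.06 * samples.std() * n ** (-1 / 5)  # Silverman's rule
    # Sum of Gaussian bumps centered at each sample, normalized to a density.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))
```

Evaluating this on a grid of NSE values (e.g., from -1 to 1) yields a smooth curve that integrates to one, which is the form of the frequency distributions compared in Figure 2.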
The primary result is that the out-of-sample PUB LSTM ensemble performed at least as well as both of the in-sample benchmarks in more than half of the catchments against all four performance metrics that we tested, except that the basin-calibrated SAC-SMA had a slightly lower average difference between the 95th percentile flows (both SAC-SMA and the PUB LSTM underestimated peak flows to some extent). The PUB LSTM had a higher NSE than SAC-SMA in 307 of 531 (58%) catchments, and higher than the NWM in 347 of 531 (66%) catchments. The PUB LSTM ensemble also had higher mean and maximum NSE scores than the benchmark models; however, SAC-SMA tended to outperform the PUB LSTM in catchments with low NSE values (see the CDF plot in Figure 2).
There is some amount of stochasticity associated with training the LSTMs, due especially to the random weight initialization, but also to the weight optimization strategy (we used the Adam optimizer; Kingma & Ba, 2014). Because of this, the LSTM-type models give better predictions when used as an ensemble. If one particular LSTM model performs poorly in a catchment, it is not necessarily the case that a different LSTM trained on exactly the same data will also perform poorly there. In our case, we used an ensemble of N = 10 (the same size as the SAC-SMA ensemble developed by Newman et al., 2015, that was used here for benchmarking). Figure 3 shows the NSE values for each ensemble member for the PUB LSTM models. In total, there were 103 basins with at least one PUB LSTM ensemble member with an NSE score below zero. Only 9 of these 103 basins had all N = 10 ensemble members with NSE < 0, while 55 of the 103 had at least one ensemble member with NSE > 0.5. As an example, one of the basins (USGS basin ID: 01142500, which is basin number 232 in Figure 3) had 9 of 10 ensemble members with NSE < 0, but one ensemble member with NSE > 0.7. This indicates that a substantial portion of the uncertainty in these LSTM models is due to randomness, rather than to systematic model structural error.
The Global LSTM model with static catchment attributes performs better than all other models against the metrics that we tested. Figure 4 compares the performance of the Global LSTM with the other benchmark models (SAC-SMA and the Global LSTM without static catchment attributes). The Global LSTM with catchment attributes performs better in most, but not all, catchments. This indicates two things. First, the comparison between the Global LSTM with and without static catchment attributes indicates that although there is useful information in the catchment attributes, in some catchments having these data actually degrades performance. We explored this relationship briefly, but did not find any patterns in terms of which catchment attributes might tend to lead to underperformance. Specifically, Figure 5 shows that there is generally no correlation between the values of individual catchment attributes and whether the Global LSTM with versus without statics performs better. Our initial conclusion is that underperformance in the basins where the LSTM without catchment attributes does better is likely an indication of error or uncertainty in the catchment attributes data. Nonetheless, these data did generally add significant skill to the model (the difference in NSE scores was statistically significant at p < 1e-9). Future work might use a more sophisticated sensitivity analysis (e.g., sequential model building or a Sobol'-type analysis) to test which specific catchment attributes cause this underperformance when added to the model.
The second thing that we want to highlight from the comparison between the Global LSTM and SAC-SMA (Figure 4) is that there is substantial room to improve SAC-SMA overall. This clearly shows that the LSTM finds rainfall-runoff relationships in individual catchments that SAC-SMA cannot emulate. However, the fact that SAC-SMA performs better in some catchments indicates the potential value of having physical constraints in a hydrological model. The LSTMs in these cases are either overfit or are not able to simulate behaviors of certain similar catchments in the training data set.

Discussion
The results illustrated in the previous section tell us three things:
1. The process-driven hydrology models that we used here as benchmarks could be improved. The LSTM finds a better functional representation of rainfall-runoff behavior in most catchments than either SAC-SMA or the NWM.
2. The argument that process-driven models may be preferable in out-of-sample conditions may not hold water. Modern ML methods are quite powerful at extracting information from large, diverse data sets under a variety of hydrological conditions.
3. The comparison between models with and without static catchment attributes as inputs demonstrates that there is sufficient information contained in catchment attribute data to distinguish between different rainfall-runoff relationships in at least most of the U.S. catchments that we tested.
Related to the third conclusion, the challenge going forward is how to extract the useful information from catchment attributes data for regional modeling. One of the historical reasons why this has been a hard problem is that the usual strategy is to use observable catchment attributes or characteristics to identify or "regionalize" parameters of conceptual or process-based simulation models (e.g., Prieto et al., 2019; Razavi & Coulibaly, 2012). This is hard because of strong interactions in high-dimensional parameter spaces. There are many methods for this, notably a family of regionalization methods that Razavi and Coulibaly (2012) called "model independent"; however, we are unaware of any approach that is as effective as LSTMs at extracting this information for streamflow simulation. This is also in line with the recent results by Kratzert et al. (2019), where similar LSTMs were compared against models calibrated with a parameter regionalization strategy (Samaniego et al., 2010). That paper additionally showed that the response of LSTM-type models was relatively smooth with respect to perturbing catchment attributes, indicating a robust fit (i.e., that the models were not overfit or simply remembering different catchments).
The results presented here show that the LSTM is able to extrapolate on catchment attributes to new catchments. Taken together, these results indicate that the catchment attribute data set contains a significant amount of useful information about the differences between rainfall-runoff behaviors across (eco)hydrological regimes, and that machine learning is effective at extracting and using these patterns.
Related to the first conclusion, this is yet another example where traditional hydrological models do not take full advantage of the information available from the Earth-observation data record. In this case, neither SAC-SMA nor the NWM are able to directly use the catchment attribute data that we use here, but even if those models could leverage this information, they still could not compete with the LSTM, since the LSTM outperforms even when the conceptual model is calibrated in-basin. This means that not only is there useful information in catchment attributes data, but also that there is more information in meteorological forcing data than is used by the traditional models. Several recent experiments have shown the same thing for a number of operational terrestrial hydrology models (e.g., Nearing et al., 2016, 2018). Hrachowitz et al. (2013) and others suggest that better process-based understanding of catchment behaviors should result in better out-of-sample predictions. In reality, it is data-driven models that have consistently given increasingly better predictions. From a more optimistic perspective, ML benchmarking experiments like the one in this paper show that there are probably organizing theories about watersheds yet to be discovered, since machine learning models are able to find informative patterns in multibasin data sets that our current models do not reproduce.
The power of big data and machine learning for problems like this is that such techniques can synthesize information from multiple sites and situations into a single model. As an example, if we wanted to simulate catchment behavior under nonstationary conditions (e.g., evolving climate), then a single LSTM trained to recognize and distinguish different types of hydrological behavior (as shown here) will have a larger range of conditions where it can be expected to remain realistic than a model calibrated to past conditions in only a single basin.
In our opinion, the most effective strategy moving forward will probably be theory-guided data science (Karpatne et al., 2017). There are now numerous strategies across scientific disciplines that allow for meaningful fusions of domain knowledge with machine learning and other algorithms for learning and predicting directly from data. Adopting approaches like this will be critical moving forward.

Code and Data Availability
CAMELS data, including SAC-SMA simulations, are available from NCAR (https://ral.ucar.edu/solutions/products/camels). National Water Model reanalysis data are available from the NOAA Big Data Repository (https://registry.opendata.aws/nwm-archive/). All code used for this project is available on GitHub (https://github.com/kratzert/lstm_for_pub).