Using Machine Learning to Identify Hydrologic Signatures With an Encoder–Decoder Framework
Abstract
Hydrologic signatures are quantitative metrics that describe a streamflow time series. Examples include annual maximum flow, baseflow index and recession shape descriptors. In this paper, we use machine learning (ML) to learn encodings that are optimal ML equivalents of hydrologic signatures, and that are derived directly from the data. We compare the learned signatures to classical signatures, interpret their meaning, and use them to build rainfallrunoff models in otherwise ungauged watersheds. Our model has an encoder–decoder structure. The encoder is a convolutional neural net mapping historical flow and climate data to a lowdimensional vector encoding, analogous to hydrological signatures. The decoder structure includes stores and fluxes similar to a classical hydrologic model. For each timestep, the decoder uses current climate data, watershed attributes and the encoding to predict coefficients that distribute precipitation between stores and store outflow coefficients. The model is trained endtoend on the U.S. CAMELS watershed data set to minimize streamflow error. We show that learned signatures can extract new information from streamflow series, because using learned signatures as input to the processinformed model improves prediction accuracy over benchmark configurations that use classical signatures or no signatures. We interpret learned signatures by correlation with classical signatures, and by using sensitivity analysis to assess their impact on modeled store dynamics. Learned signatures are spatially correlated and relate to streamflow dynamics including seasonality, high and low extremes, baseflow and recessions. We conclude that processinformed ML models and other applications using hydrologic signatures may benefit from replacing expertselected signatures with learned signatures.
Key Points

We built an encoder–decoder network to learn an optimal equivalent of hydrologic signatures

Using learned signatures as input to a processinformed ML model improves prediction accuracy over benchmark configurations

We interpret learned signatures by correlation with classical signatures and by sensitivity analysis of their impact on model store dynamics
1 Introduction
Hydrologic signatures are quantitative metrics that describe statistical or dynamic features of a streamflow time series (McMillan, 2021). Examples include annual flood, baseflow index, and descriptors of recession shapes. Signatures have multiple uses in hydrology and ecohydrology, including assessing instream habitat suitability (Poff et al., 1997; Yarnell et al., 2015), monitoring hydrologic change (Archer & Newson, 2002; Richter et al., 1996), inferring runoff generation processes in the upstream watershed (McMillan, 2020), and classifying watersheds (Sawicz et al., 2011; Wagener et al., 2007). In classical hydrologic modeling, signatures are used in calibration and evaluation to ensure that models properly reproduce flow dynamics (Gupta et al., 2008; Kavetski et al., 2018). Signatures can be used to calibrate hydrologic models in ungauged basins, by regionalizing the signatures (estimating signature values based on watershed attributes) and then calibrating the model against the regionalized signatures (Prieto et al., 2019).
A wide range of potential signatures are available (Olden & Poff, 2003). Authors tend to choose signatures that were used in previous studies, that are uncorrelated with each other, that capture a variety of aspects of the flow regime, or that can be related to specific processes in the watershed (McMillan et al., 2016). However, many signatures are not well predicted based on catchment attributes (Addor et al., 2018), and they can have high uncertainty which reduces their value in model calibration or evaluation (Westerberg & McMillan, 2015). Optimal signature choice is not well understood, and therefore there is an opportunity to use machine learning techniques to identify an optimized equivalent of hydrologic signatures, which contribute the most information to model calibration problems.
Machine learning techniques, including deep learning, have gained prominence in hydrologic research during recent years (Shen, 2018). Researchers have developed ML alternatives to classical rainfallrunoff models that predict streamflow from precipitation and climate time series. Long Short Term Memory networks (LSTMs) are a popular ML model structure that process time series data sequentially and allow for storing and forgetting of information. These models have been popularized by the neuralhydrology Python library (Gauch et al., 2021; Kratzert et al., 2022). Using an LSTM in a joint learning framework across many watersheds improves prediction accuracy (Kratzert, Klotz, Shalev, et al., 2019), and in Kfold validation where the LSTM is tested in ungauged mode it outperformed a locally calibrated traditional model (Kratzert, Klotz, Herrnegger, et al., 2019). These successes across large regions demonstrate how ML techniques can extract and extrapolate general relationships from large samples of data, in a way that traditionally structured models have failed to do (Nearing et al., 2021).
More recently, hydrologists have developed differentiable, parameterlearning (dPL) models that use an LSTM to predict static and dynamic parameters of an HBVlike processbased model (Feng et al., 2022). dPL models are close to matching stateofthe art LSTM performance, and allow users to interpret and simulate unobserved variables as in a classical model. Deep learning models can estimate the parameters of a global processbased model, improving calibration speed and spatial coherence of parameters over classical calibration (Tsai et al., 2021). These advances merge the interpretability and physical principles of classical hydrology with the crosswatershed information transfer and predictive power of machine learning (Frame et al., 2021, 2022; Hoedt et al., 2021; Razavi et al., 2022).
Instead of using a single ML model to predict runoff from rainfall, an alternative approach in multistepahead flood forecasting applications is to use a pair of ML models. The first model uses meteorological data to predict the current catchment state (i.e., catchment wetness), and the second model uses this current state, plus the new meteorological data, to forecast flow. Various structures have been proposed for the paired models, including LSTMs (e.g., Kao et al., 2020) and Convolutional Neural Networks (CNNs; Lin et al., 2020; Van et al., 2020). CNNs use a sliding window that passes across the input series to extract relevant patterns and are widely used to extract features from time series (Wang et al., 2017). Paired ML model structures are known as encoder–decoder systems, with the first model being the encoder and the second model being the decoder. The encoding refers to the intermediate state that is calculated by the encoder and then used by the decoder to make the prediction. In the examples above, the encoding is a representation of the catchment wetness. However, encodings can represent any useful properties of the input signal (Cho et al., 2014).
In this paper, we propose a novel use of an encoder–decoder system in hydrology, to learn encodings that are optimal ML equivalents of hydrologic signatures, and that are derived directly from the data. In our framework, the encoder is a CNN that learns the signatures (encodings) from historical streamflow and meteorological time series. The decoder is a differentiable, processinformed ML model that uses the learned signatures and current meteorological data to predict streamflow. By jointly training the encoder (to learn the signatures) and decoder (to predict streamflow), we ensure that the learned signatures are optimized to increase the model's predictive power. Our aim is to use ML to learn optimized signatures and compare them to classical signatures. We will interpret the physical meaning of the new, learned signatures by correlation with classical signatures and by investigating their effects on the stores and fluxes in the decoder model. We will determine whether the learned signatures can extract new information from streamflow series by testing whether they improve decoder model performance over control scenarios that use classical signatures or no signatures.
2 Materials and Methods
2.1 Data
We used the U.S. CAMELS data set to conduct our analysis (Addor et al., 2017; Newman et al., 2015). This data set is widely used in hydrology machine learning studies, enabling easy performance comparison. CAMELS includes 671 watersheds in the contiguous U.S. that are minimally impacted by human activities, with areas from 4 to 25,000 km^{2}. For each watershed, CAMELS provides continuous daily meteorological and streamflow data for 35 years from 1980 to 2014. Streamflow records are from U.S. Geological Survey gauges, meteorological forcing is from Daymet and includes precipitation, temperature maximum and minimum, snow water equivalent, solar radiation, vapor pressure, and day length (Thornton et al., 2012). CAMELS provides 30 static indices describing the topography, land cover, soil and geological characteristics of each watershed, and 13 precalculated hydrological signatures describing the flow characteristics. Two of the precalculated signatures use precipitation as well as streamflow data (runoff ratio, streamflow elasticity). Of the static indices, we removed categorical indices and nearduplicates, leaving 19. Of the hydrologic signatures, we removed the slope_fdc (slope of the flow duration curve) signature, which is known to contain errors. Of the watersheds, we removed Gauge number 03281100 for which no signatures were available. For comparison with learned signatures, in addition to the CAMELS signatures we also used a catalog of signatures that target baseflow and overland flow processes and are available via the TOSSH Matlab toolbox (Gnann et al., 2021a). All signatures used are listed in Table 1.
Signature name  Signature description 

Water balance signatures  
q_mean  Mean daily discharge (mm/d) 
runoff_ratio  Runoff ratio (ratio of mean daily discharge to mean daily precipitation) () 
hfd_mean  Mean halfflow date (date on which the cumulative discharge since October first reaches half of the annual discharge) (day of year) 
stream_elas  Streamflow precipitation elasticity (sensitivity of streamflow to changes in precipitation at the annual time scale) () 
EventRR  Event runoff ratio, average of runoff ratios calculated for all precipitation events () 
EventRR_TotalRR_Ratio  Ratio between event and total runoff ratio (EventRR/runoff_ratio) () 
RR_seasonality  Runoff ratio calculated for summer months divided by runoff ratio calculated for winter months; winter defined as the 6 months following the start of the water year () 
Low flow signatures  
baseflow_index  Baseflow index (ratio of mean daily baseflow to mean daily discharge, hydrograph separation performed using digital filter) () 
Q5  5% Flow quantile (low flow) (mm/d) 
low_q_freq  Frequency of lowflow days (<0.2 times the mean daily flow) (days/year) 
low_q_dur  Average duration of lowflow events (number of consecutive days <0.2 times the mean daily flow) (d) 
zero_q_freq  Frequency of days with Q = 0 mm/day (%) 
RecessionParametersAlpha  Recession parameters (alpha, beta) approximate storagedischarge relationship. Alpha is a timescale parameter derived by fitting a line to the dQ/dt versus Q point cloud in loglog space for each individual recession and returning the median intersect (m^{1−b} d^{b−2}) 
RecessionParametersBeta  Recession parameters (alpha, beta) approximate storagedischarge relationship. Beta is a shape parameter derived by fitting a line to the dQ/dt versus Q point cloud in loglog space for each individual recession and returning the median slope () 
BaseflowRecessionK  Exponential recession constant fitted to master recession curve (1/d) 
FirstRecessionSlope  Fits a straight line to the first segment of the logtransformed master recession curve and returns its slope (1/d) 
MidRecessionSlope  Fits a straight line to the second segment of the logtransformed master recession curve and returns its slope (1/d) 
Variability_Index  Standard deviation of logtransformed discharge values determined at 10% intervals from 10% to 90% of the cumulative frequency distribution 
High flow signatures  
Q95  95% flow quantile (high flow) (mm/day) 
high_q_freq  Frequency of highflow days (>9 times the median daily flow) (days/year) 
high_q_dur  Average duration of highflow events (number of consecutive days >9 times the median daily flow) (days) 
Overland flow signatures  
IE_effect  Infiltration excess importance; average of the standardized zscore coefficients for mean and maximum event intensity in regression equations to predict event peak magnitude and quickflow volume () 
IE_thresh_signif  Infiltration excess threshold significance; Pvalue for the significance of a change in slope in a plot of event quickflow volume versus event maximum intensity () 
IE_thresh  Infiltration excess threshold depth (intensity of precipitation needed to produce quickflow); value of the threshold identified in the IE_thresh_signif signature (mm) 
SE_effect  Saturation excess importance; average of the standardized zscore coefficients for total event precipitation and 3 and 7day antecedent precipitation totals in regression equations to predict event peak magnitude and quickflow volume () 
SE_thresh_signif  Saturation excess threshold significance; Pvalue for the significance of a change in slope in a plot of event quickflow volume versus event total precipitation () 
SE_thresh  Saturation excess threshold location (depth of precipitation needed to produce quickflow); value of the threshold identified in the SE_thresh_signif signature (mm) 
SE_slope  Abovethreshold slope in a plot of quickflow volume versus total precipitation (mm/mm) 
Storage_thresh_signif  Storage threshold significance; Pvalue for significance of a change in slope in a plot of event quickflow volume versus event precipitation + antecedent precipitation index () 
Storage_thresh  Storage threshold location (storage depth needed to produce quickflow); value of the threshold identified in the Storage_thresh_signif signature (mm) 
2.2 Model Structure
2.2.1 Encoder–Decoder Networks
An encoder–decoder is a pair of neural networks that can automatically extract salient information from highdimensional input data. The first network is the encoder, which maps the input to a lowerdimensional encoding vector. The second network is the decoder, which uses the encoding as an input and performs some downstream task. The encoder and decoder are trained endtoend to perform the downstream task, and in the process, the encoder learns to extract the information from the input that maximizes the decoder’s performance at the downstream task.
Encoder–decoder frameworks are widely used, for example, in machine translation to find simple representations of the meaning of sentences, and in computer vision to describe the appearance of a patch of an image that is invariant to changes in viewpoint (Cho et al., 2014; DeTone et al., 2018). In both examples, these learned descriptors outperform humandesigned descriptors. Encoders are key parts of wellknown networks including the language model GPT3 and the texttoimage synthesis network DallE (OpenAI, 2022). In these networks, the output is estimated from an encoding of the input rather than directly from the input.
2.2.2 An Encoder–Decoder Model to Extract Hydrological Signatures
In this paper, we use an encoder to extract features from a window of streamflow and meteorological data. These features are one input to a decoder that predicts streamflow from meteorological data (over a different time window). We train the encoder–decoder endtoend to predict streamflow, and in the process the encoder learns the features of its input most useful for predicting streamflow.
The encodings are analogous to hydrological signatures. Both encodings and signatures are numerical values that describe a streamflow series. Because the encodings of historical streamflow data feed into the decoder, they perform a similar function to that envisaged by Gupta et al. (2008) for hydrological signatures, that they can be used to improve the structure or parameters of the hydrological model. It is worth noting that our system differs from an autoencoder, where the decoder reverses the encoding process to recreate the input. In that case, the encodings would form a compressed representation of the streamflow such that the streamflow could be reconstructed from the encoding. However, this encoding would not serve the same purpose as a hydrological signature, because it would be specific to the particular year of streamflow (e.g., it would contain information about the timing of flood events or low flows within the year). Instead, our decoder is a hydrological model that uses a rainfall series, together with the encoding, to predict streamflow in a different year. This structure forces the encoding to encapsulate general information about streamflow dynamics that applies to any year of streamflow.
The model structure is shown in Figure 1, and the encoder and decoder structures are described in more detail in the following sections.
2.2.3 Encoder
We require the encoder to map a window of streamflow and meteorological data (a multichannel time series of N years of daily precipitation, temperature and streamflow values) to a feature vector. Onedimensional convolutional layers are a common approach to mapping spatial or temporal data (images or time series) to higherlevel representations of patterns in the data, with some invariance to where the patterns occur (Goodfellow et al., 2016, Chapter 9). They are therefore a good choice to learn hydrologic signatures from meteorological and streamflow time series. Each convolutional layer applies filters (kernels) to the input data to produce an output of reduced size but with multiple channels. By using filters, convolutional neural networks can learn complex behavior while reducing the number of learned parameters compared to alternative ML model structures. Each layer has a kernel size of 9, with 32 or 16 output channels, and rectified linear (ReLU) activation. Between layers, a Maximum Pool (MaxPool) with stride length three reduces the input data (N years of daily precipitation, temperature and streamflow values) to a smaller array. We append an average pool layer to allow the network to support multiple years of input, although in this paper we use 1year windows (see Section 11 Sensitivity Analysis for experiments with different length windows). Splitting the flow record into subsamples and calculating signatures for each sample to give a distribution of values follows recommendations from previous authors (Schaefli, 2016; Vogel & Fennessey, 1994). This approach enables a more accurate representation of typical hydrologic years and reduces the sensitivity of signatures to hydrologic extremes present in the period of record (Vogel & Fennessey, 1994). The final sigmoid activation maps each channel from its unconstrained range to a value in the range [0,1], normalizing the 16 learned signatures for easier analysis.
The dimensions of the encoder are chosen such that each layer has an intermediate feature dimension larger than the final encoding size so it can retain sufficient information (32 channels for 16 encodings), so that the reduction in size between each layer (the stride) is not too large, and so that the network reduces each year of data to a single feature vector (the encoding). Sizes in powers of two allow for fast computation. These are all typical choices for CNN design (Goodfellow et al., 2016). Similarly, use of ReLU activations and MaxPool layers are standard choices with high efficiency in training. Given the extreme flexibility of ML models such as CNNs, structural choices such as which activation function to use are less important than in processbased models, as model parameters can adapt equally easily to any standard choice. We choose a 16dimensional encoding because it is larger than the number of independent signatures that can typically be derived from streamflow data (2–4, Olden & Poff, 2003), while allowing for some redundancy in the common case that multiple channels learn the most dominant features. Further, the encoding size is similar to existing sets of signatures used to summarize streamflow dynamics, such as the 13value CAMELS signature set that we used for comparison (Addor et al., 2017). To evaluate the number of independent signatures that can be learned by our model, we conducted a sensitivity test of the model performance by varying the encoding size from 2 to 32 signatures.
To verify that this network structure is suitable for learning signatures, we conducted a separate test (not part of the ML encoder training described in the rest of the paper) of whether the structure could learn the existing CAMELS signatures. The structure was able to learn the signatures, with typical median relative errors of 10%, showing that our overall structure choices are suitable for our research objective. We did not evaluate other kinds of encoder that could perform the same task, for example, recurrent encoders or transformers (Jaderberg et al., 2015).
2.2.4 Decoder
We require the decoder to use the encodings and a window of meteorological data (a 7channel time series of N years of daily precipitation and climate values) to a predicted Nyear streamflow time series. Two common types of model have emerged in the hydrological community to perform this type of streamflow prediction: LSTM models (e.g., Kratzert, Klotz, Herrnegger, et al., 2019; Kratzert, Klotz, Shalev, et al., 2019) and processinformed models (e.g., Feng et al., 2022). LSTM models can learn the dynamics of the target catchment in their parameters, and might represent information about dynamics in their hidden state. The networks also learn hidden state variables that are equivalent to stores (Lees et al., 2021). However, interpreting state and weights in an LSTM to hydrological processes is not straightforward. In contrast, processinformed models have structures similar to a conventional hydrological model and use recurrent state variables representing stores that are more straightforward to interpret. Given that interpretation of the new learned signatures and their relationship to model dynamics is a key aim of this research, we chose to use a processinformed structure.
Our decoder runs on a daily timestep and uses eight stores to represent water storage. Eight stores offer flexibility in simulating runoff generation pathways, and are sufficient to represent the stores we expect in a classical model without too much redundancy, simplifying the interpretation of the model's results. We conducted a sensitivity analysis to test model performance with alternative numbers of stores. As with a classical hydrologic model, at each timestep the decoder uses its inputs (the encoding and current meteorological data) and the current state of the catchment (the store volumes from the previous timestep) to predict outflows (streamflow) and to update the store volumes.
The model uses a neural network to learn how to distribute water between stores and how much outflow occurs from each store, the equivalent of learning dynamic model parameters. The neural network uses fully connected layers with ReLU activations, an efficient and flexible structure that is the default choice recommended by Goodfellow et al. (2016; Chapter 11). Two layers are used to combine the 16value encoding with the 19 static CAMELS attributes to generate a 16value attribute vector. This step prevents overfitting of the model and ensures that the complexity of the model remains the same with or without signatures, allowing a fair test of the model performance improvement gained by using signatures. For tests where the 16value encoding is not used (see Section 9), the 19 CAMELS attributes are passed through an equivalent twolayer network to generate a 16value attribute vector. Four further layers are used to combine the attribute vector with the daily meteorological data and store values from the previous timestep, to predict the dynamic model parameters. An alternative to this structure would be to use an LSTM to learn the dynamic model parameters (e.g., Feng et al., 2022). However, this would make it less easy to interpret the effect of learned signatures on model behavior, because the LSTM maintains memory in its hidden states rather than in the stores.
The dynamic parameters predicted by the decoder at each timestep are the actual evapotranspiration (scalar AET value), the distribution of precipitation between stores (eightvalue vector a), and outflow coefficient for each store (8value vector b). All these values (AET, a, b) are predicted once per time step as outputs from the four fully connected layers described above. We use softmax activation to enforce that precipitation fractions sum to one, and softplus activations to enforce that AET and outflows are positive. The streamflow prediction for each timestep is the dot product of the b coefficients with the store volumes. An update of the store volumes is performed as follows: precipitation fraction minus AET (a coefficient*precipitation  AET) is added to the stores, and outflow (b coefficient*store volume) is subtracted from the stores. As explained in Section 8, the decoder model parameters that predict the a, b, and AET coefficients are trained such that they optimize model ability to predict observed streamflow.
2.3 Training Strategy
Equation 1: Loss function L constructed of Huber Loss applied to square root of 1NSE. Minimizing 1NSE is equivalent to maximizing NSE.
The loss function ignores the first ⅛ of the time series as a warm up period, in our case equal to 6 weeks with 1 year of input. During training, we add a regularization term to the loss function, equal to the difference in summed store volumes between the beginning and end of the model run. This term helps to maintain the water balance by penalizing the model if water is accumulated in stores. Requiring mass conservation has been shown in LSTM trials to produce a small reduction in performance, but the difference is minor (<0.01 NSE loss) (Frame et al., 2022).
The training phase used 470 watersheds with 100 reserved for validation and 100 for testing. Watersheds in the test sample were never used for training, which is a stricter criterion than several previous studies that allowed different years from the same watersheds to occur in both training and testing samples for example, (Feng et al., 2022; Frame et al., 2021). Evaluating hydrological models under different climate conditions, or (harder) in different watersheds to those in the training set is considered best practice to produce a robust model (Klemeš, 1986; Razavi, 2021). We evaluate the model in a new set of watersheds to ensure that our results reflect how well the decoder can represent rainfallrunoff dynamics based on the encoding, and not how well it has learned the behavior of individual watersheds. Each sample used for training or testing consists of two parts. One randomly chosen year of historical meteorology and streamflow data is input into the encoder to calculate the 16value encoding (signatures). A second year of meteorology and streamflow data is used to test the decoder, by inputting the meteorology data and testing the model output against the streamflow data. The decoder has no access to historical streamflow data except for the 16 signature values. In future, the decoder could be used to predict streamflow in completely ungauged watersheds by estimating signature values in those watersheds. Signature values could be estimated using regionalization methods, for example, by transferring signatures from watersheds with similar climate, topography and geology (Westerberg et al., 2016). We conducted additional experiments to test the sensitivity of the model to the length of the training data (from 1 year samples in the default case up to 5 years).
We train in batches of 128 samples, using the Adam optimizer (Kingma & Ba, 2014) and a learning rate of 0.0003, for 200 or more epochs that is, we pass through all training data. This takes 2 days on a PC, running on CPU. As with any neural network, our model has a large number of fitted parameters (10,000s to millions are typical; our encoder has 25,900 parameters and the decoder has 41,945 parameters). Therefore, care is needed to prevent overfitting of the training data to the detriment of testing performance. The primary method we use to prevent overfitting is early stopping, whereby after each epoch of training data the model performance is evaluated against the validation data set (Razavi, 2021). When the validation error stops decreasing or starts to increase, overfitting has started and therefore the training is halted. Figure 2 shows an example of training and validation performance for our model training using learned signatures. After approximately 2000 epochs, validation performance no longer increases and therefore training is halted. Further strategies that we used to prevent overfitting were batching (Goodfellow et al., 2016) and regularization by maintaining a water balance (see details above). Limited experiments with other regularizations (weight decay and dropout; not shown) showed no effect on the difference between training and validation error. After training finishes, the model was evaluated on the (separate) test set of watersheds.
2.4 Analysis

No signatures. The decoder input comprises meteorological data and CAMELS physical attributes, but no hydrologic signatures. No encoder is used.

CAMELS signatures. The decoder input comprises meteorological data, CAMELS physical attributes, and standard CAMELS hydrologic signatures. No encoder is used.

Learned signatures. The decoder input comprises meteorological data, CAMELS physical attributes, and hydrologic signatures learned by the encoder.
All tests are evaluated using the mean and quantiles of the NSE score distribution over the testing set of watersheds. We expect that configuration [2] CAMELS signatures should outperform configuration [1] No signatures, showing that providing the model with knowledge of desired streamflow dynamics will improve the simulation performance. By determining whether configuration [3] learned signatures outperforms configuration [2] CAMELS signatures, we will test whether learned signatures outperform classical humandesigned signatures. This will test whether there is extra information in streamflow dynamics that is not captured by the humandesigned signatures, but that is important for model simulation. We further benchmarked our model performance by comparing the mean and median NSE scores against recent processinformed and LSTMbased deep learning models, and traditional hydrological models that have been tested on CAMELS watersheds. The main aim of our paper is to analyze the usefulness and interpretation of the learned signatures, rather than purely striving for the best NSE score. Therefore, the objective of the benchmarking was to show that our models provided reasonable scores in the same range as previous, wellrespected models.
In our second research question, we aim to interpret the physical meaning of the learned signatures. This will expose the features of the streamflow dynamics that are most important for model simulation. To answer this question, we use two main strategies. Our first strategy is to plot a correlation matrix for the learned signatures against classical hydrologic signatures. We will use the CAMELS signatures and sets of signatures for baseflow and overland flow processes (Table 1), where learned signatures correlate with one or more classical signatures, and we infer their meaning as similar to those combinations of signatures. Where learned signatures do not correlate with any classical signatures, they have learned new dynamics of the flow series. We use maps of the learned signatures to evaluate their spatial patterns and spatial correlations and compare these with patterns of classical signatures, as spatial correlation is a key metric of signature behavior (Addor et al., 2018).
Our second strategy to interpret the meaning of learned signatures is to analyze the ML model behavior, and how the learned signatures affect its dynamics. We analyze the seasonal dynamics of the eight model stores and interpret their similarity to classical model stores. To do this, we examine two parameters: the a parameter which is the proportion of precipitation routed into the store, and the b parameter which is the linear outflow coefficient. For each day of the year, we calculate the average parameter values across all years and all watersheds, to visualize and interpret typical behavior. We then use a Method of Morris sensitivity analysis (Morris, 1991) to quantify the effect of each learned signature on the store dynamics. Method of Morris is an efficient global sensitivity analysis method that selects multiple starting locations in the parameter space, followed by “oneatatime” parameter adjustments to measure the effect on the output variable of interest. In our case, the parameters are the learned signature values, and the output variables are the a and b parameters for each store.
3 Results
The aim of this section is to assess the utility of the learned signatures and how they compare to classical signatures. We first tested the decoder model performance in predicting streamflow, and compared it against a decoder that uses no signatures or classical signatures. This tested whether the learned signatures extracted new and useful information from the streamflow signal. We then interpreted the physical meaning of the learned signatures, through their correlation with classical signatures, their spatial patterns, and their effect on the decoder stores and fluxes.
3.1 Model Performance Comparison
We trained the model for the three configurations described in Section 9 ([1] No signatures, [2] CAMELS signatures, [3] Learned signatures). Boxplots of the test set performance in each case are shown in Figure 3a, and performance statistics in Table 2.
Configuration  NSE (median)  NSE (mean) 

[1] No signatures  0.574  0.376 
[2] CAMELS signatures  0.685  0.513 
[3] Learned signatures  0.693  0.547 
The results confirm that as expected, using signatures as input improves model performance. The results further show that learned signatures improve performance over classical CAMELS signatures, implying that the learned signatures capture additional features of the streamflow dynamics that are not included in the classical signature set.
In Table 3, we compared the median and mean NSE for our model using learned signatures, against other machine learning and classical hydrological models that have been tested for streamflow prediction on the CAMELUS data set. Overall, LSTM deep learning models typically produce the highest scores, followed by processinformed deep learning models. Classical hydrological models typically have slightly lower scores, even when they are locally calibrated on each watershed. Model performance in all categories varies according to several methodological choices such as how data is split between training/testing sets and whether ensembles of models are used. This comparison shows that our model performance is comparable with other recent models and therefore our model provides a trustworthy framework in which to evaluate and interpret learned signatures.
Reference  CAMELS (median)  CAMELS (mean)  Network size—#hidden states  Model description and notes 

Processinformed models  
This paper  0.69  0.55  dPL model using fully connected network to learn dynamic parameters of an eightstore linear model. Test basins not included in training set, but signatures for test set are provided.  
Feng et al. (2022)  0.732  256 (in LSTM)  dPL model using LSTM to learn dynamic parameters of 16 replicates of a HBVlike model, each with five state variables. All watersheds used in training.  
Tsai et al. (2021)  0.44  64–1,280 (in LSTM)  dPL model using LSTM to learn parameters of VIC or a ML surrogate. All watersheds used in training.  
LSTM models  
Kratzert, Klotz, Shalev, et al. (2019)  0.71*  0.67*  256  EntityawareLSTM single model (could be improved using ensemble). All watersheds used in training. 
Frame et al. (2021)  0.75*  0.67*  256  LSTM as an NWM postprocessor including atmospheric forcing. All watersheds used in training. 
Kratzert, Klotz, Herrnegger, et al. (2019)  0.69*  0.54*  256  LSTM, mean of ensemble of 10 models. Test basins not included in training set. 
Frame et al. (2022)  0.76 *  64  Massconserving LSTM. All watersheds used in training.  
Benchmark models (as described by Kratzert, Klotz, Shalev, et al., 2019)  
Newman et al. (2017)  0.603  0.564  SACSMA model (locally calibrated)  
Seibert et al. (2018)  0.676  0.631  HBV (locally calibrated, ensemble mean of 100 models)  
Newman et al. (2017)  0.551  0.518  VIC (locally calibrated)  
Mizukami et al. (2017)  0.307  0.167  VIC (regionally calibrated) 
 Note. All the ML models here are regionalized—not locally trained for individual basins. All performance values are given for test data, but the studies differ in whether test watersheds are included in the training set, this information is provided in the Model description and notes column.
 *These studies use 531 of the CAMELS watersheds, excluding basins >2,000 km^{2} and those with discrepancies between different methods of calculating basin area.
Figure 3b shows the spatial pattern of NSE scores for all watersheds for the [3] Learned Signatures configuration. We included all watersheds in the map because the testing set contains only 100 watersheds, which is insufficient to show the regional pattern. The pattern is typical of hydrological model performance in the U.S., that is, performance is higher in humid watersheds and lower in arid watersheds (Newman et al., 2015). Poor performance can be caused by multiple factors, including rainfall errors, nonclosed water balance and other unusual flow dynamics not well represented by the model structure. The successive introduction of CAMELS signatures and learned signatures notably improves the most poorly performing watersheds and those with belowaverage performance.
3.1.1 Sensitivity Analysis
We conducted additional experiments to test the sensitivity of the validation NSE performance to model structure choices. We varied the length of the encoding input meteorology and streamflow data used to calculate the signature values, from 1 year (default) to 5 years. We varied the number of signatures learned by the encoder from 2 to 32 (default was 16). We varied the number of stores used in the processinformed decoder model from 2 to 32 (default was 8). The results are shown in Figure 4.
Figure 4 shows that none of these structural choices make large differences to the validation NSE performance. For models with at least six learned signatures and six model stores, validation NSE is in the range 0.685–0.710. Validation NSE increases with longer encoding input length and number of decoder stores. For number of learned signatures, validation NSE does not substantially increase above six signatures. All such choices offer some tradeoffs, for example, a larger number of decoder model stores increases performance but makes the function of each store less easy to interpret, while a longer encoding length increases performance but reduces the number of samples on which to train the model and therefore may decrease robustness to future changes in climate. Future work could investigate whether performance continued to increase above six signatures for more complex hydrological models (i.e., greater number of decoder model stores) or for larger numbers of catchments. Given the small range in validation performance in all cases, we retained the default structure choices for the remainder of the analysis below.
3.2 Interpretation of Learned Signatures: Correlations With Classical Signatures
In this section, we interpret the meaning of the learned signatures by examining their correlation with classical signatures (Figure 5) and their spatial patterns (Figure 6). Classical signatures that have high correlations with learned signatures represent basic properties of the watershed relating to its water balance and ability to store water (Figure 3, section “Water balance”). These include runoff_ratio (partitioning of water to flow and evapotranspiration), hfd_mean (day of year when half of flow volume has passed, related to snow storage), and EventRR_TotalRR_ratio (related to division of flow into fast and slow pathways). Therefore, learned signatures similarly represent these properties.
Low flow metrics have moderate correlations with learned signatures. Among the more highly correlated are BFI and variability index (both related to baseflow magnitude), q5 (fifth percentile low flow magnitude), low_q_freq (how often low flow periods occur), and RecessionParametersBeta (shape of hydrograph recessions). High flow metrics also have moderate correlation with learned signatures, with q95 (high flow magnitude) having the strongest of those. These examples show that the machine learning model is learning both low and high flow characteristics of the data, despite the emphasis of the NSE loss function on high flow periods where errors are typically larger. Further, correlations of learned signatures to classical signatures that describe flow dynamics (e.g., half flow date, baseflow index, recession shape) show that the model learns characteristics of the streamflow as well as the meteorology data.
Overland flow metrics with moderate or high correlations to learned signatures include SE_effect (importance of saturation excess flow), SE_thresh_signif and Storage_thresh_signif. The threshold significance signatures measure whether the watershed has a strong threshold for the depth of rainfall needed over shorter/longer timescales to generate flow. This result may partly reflect that neural networks are good at learning threshold relationships.
Some classical signatures do not have a learned signature equivalent. This can occur for multiple reasons. For example, metrics related to infiltration excess processes have low correlations with learned signatures, which could be because these processes occur at subdaily timescale while this analysis used data at a daily timestep. Other signatures may classify an aspect of the streamflow time series that is not important to model performance as measured by NSE, and therefore these classical signatures are not useful for improving model NSE.
The maps in Figure 6 show the spatial patterns of each learned signature. Note that high and low values can be reversed without change in meaning, and that the numbering 1–16 of the learned signatures is arbitrary as the signatures could be learned in any order. Some encodings reproduce key features of the hydroclimate, for example, maps 5, 6 and 9 are all similar to the pattern of aridity across the U.S. with the most arid areas in the Great Plains and the Southwest. These encodings correlate with the low_q_freq signature that measures the frequency of low flow periods. Even where there are strong correlations between an encoding and a physical signature, there may be notable differences in patterns. For example, learned signature 2 has a strong correlation of 0.67 with mean half flow date, but the spatial pattern doesn't differentiate between Florida and the wider Southern and Appalachian regions as strongly as the physical signature (see Figure 4c in Addor et al., 2017).
Some of the learned signatures such as 8, 15, and 16 do not correlate with any of the classical signatures in Figure 3 or with physical attributes (Addor et al., 2017). The lack of correlation shows that the ML framework has found new information in the streamflow data that was not previously used in flow descriptions or predictions. We can distinguish this new information from noise because it shows clear spatial patterns (Figure 6). All the learned signatures show high spatial correlations (smooth areas of high or low values), in contrast with classical signatures which sometimes show speckled maps with no clear spatial pattern (Addor et al., 2017). High spatial correlation is a desirable feature of signatures, as it implies lower uncertainty and stronger relationships to physical attributes (Addor et al., 2018).
3.3 Interpretation of Learned Signatures: Behavior of ProcessInformed Model
We further interpret the meaning of the learned signatures by evaluating their impact on the ML model dynamics. We first plotted the seasonal dynamics of the eight stores in the ML model (Figure 7), to understand their role. The figure shows mean values of store parameters a the proportion of precipitation routed into the store, and b the linear outflow coefficient. Values are averaged across all years and all watersheds. Store 1 (dark blue) has the highest outflow coefficient (Figure 7b), so we interpret it to represent a quickflow store. Water is routed into this store during the spring and summer, that is, snowfree periods (Figure 7a). Store 5 (lime green) has similar seasonality to Store 1, but has moderate outflow coefficients, interpreted as an interflow store. Water is routed into Stores 4 and 7 (cyan, red) during the winter, interpreted as snow or winter precipitation stores. Store 4 has higher outflow coefficients that peak in the spring, interpreted as a fastmelt store, while Store 7 has low outflow coefficients and is interpreted as delayed contributions of snowmelt to runoff. The interpretation as snow stores is further confirmed by mapping the average proportion of precipitation entering each store, showing highest values in Northern watersheds (not shown). A high proportion of precipitation is routed into Store 8 (brown; high values in Figure 7a) but its outflow coefficients are low and stable, interpreted as a baseflow store. The remaining stores 2, 3 and 6 absorb little water on average (low values in Figure 7a) so play a less important role in most catchments. This interesting result shows that five stores are usually sufficient to represent the flow dynamics. However, a greater proportion of water is routed into stores 2, 3, and 6 in Great Plains watersheds where model performance is lower (not shown), and therefore the additional stores may represent unusual flow dynamics that help improve performance in difficulttomodel watersheds.
Based on the above interpretation of the role of each store, we analyze the impact of each learned signature on the stores. The results of the Method of Morris sensitivity analysis are plotted in Figure 8, which shows the mean variation in store inflow (a parameter) and outflow coefficient (b parameter) caused by a unit change in each learned signature value. Overall, the largest variations in store inputs are for quickflow (Figure 8a). For inflows and outflows, the learned signatures tend to impact groups of similar stores. This occurs particularly for the two slow stores (slow snow release and baseflow), which are usually simultaneously increased or decreased. This shows that the learned signature values make meaningful changes to watershed behavior, rather than trading off one slow store for another. For learned signatures 8 and 15 that have no clear correlation to classical signatures, we investigated their behavior by perturbing the signature and examining the changes in stores volumes and hydrographs in specific watersheds. This analysis showed that the signatures changed the relative volumes of water lost to AET and entering the slowsnow store, which altered the hydrograph volume at midrange flows.
We synthesized the results of the correlation analysis and the sensitivity analysis into Table 4, and added our interpretation of the meaning of each learned signature. The table shows that four learned signatures relate to hydrograph recessions, three relate to seasonality and three to extremes (low or high), two relate to wetness and baseflow and two to ET. The variety of signature interpretations shows that despite the endtoend training judging model performance using Nash Sutcliffe efficiency, the signatures did not solely focus on high flows.
#  Notable correlations with classical signatures, other learned signatures and physical attributes  Sensitivities: Impact of learned signature on store inflows and outflows  Interpretation of learned signature 

1  Lowmoderate with water balance signatures, silt fraction, vegetation  Increases quickflow input, slows all output, increases ET  Use of water by vegetation 
2  High correlation with half flow date, and also EventRR_TotalRR_ratio  Increases slow snow and baseflow input, speeds up quickflow and interflow output  Seasonality of flow 
3  Moderate with waterbalance, high flows and overland flow  Little change in input, speeds up output especially fast stores  Flashiness 
4  Strongly correlated with #2  Some increase in baseflow input, little change in output  Seasonality of flow 
5  High correlation with recession shape beta  Increases slow snow and baseflow input, speeds up all output except fast snow  Shape/curvature of recessions 
6  High correlations with runoff ratio, BFI and variability index  Increases quickflow input, slight speed up in fast stores output; increases ET  Wetness and baseflow volume 
7  Moderately correlated with #6, but emphasizes high flow (q95) over low (q5)  Increases slow snow and baseflow input, speeds up fast stores output  Wetness and baseflow volume 
8  Low correlations with all classic signatures  Increases quickflow input, little change in output  Winter ET and midrange flow volumes 
9  Moderate correlations with low flow frequency, recession shapes beta, K  Increases quickflow input, increase in interflow and fastsnow output  Speed and shape of recessions 
10  Moderately correlated with #7  Some increase in fastsnow input, speeds up all output especially slow stores  Frequency of extremes, especially of lowflows and zeroflows 
11  High correlations with saturation excess and storage threshold significance, elevation  Increases slow stores input, speeds up all output especially fast stores  Threshold change in runoff with precipitation depth 
12  Strongly correlated with #7, #10  Some increase in fast stores input, speeds up snow and baseflow output  Speed of recessions 
13  Strongly correlated with #2, #4  Increases fast stores, especially fast snow input, speeds up all output especially slow stores  Seasonality of flow 
14  High correlation with runoff ratio  Increases fast stores input, speeds up all output especially slow stores; decreases ET  ET and runoff ratio 
15  Low correlations with all classic signatures  Increases quickflow input, little change in output  ET and midhigh flow volumes 
16  Moderate correlations with EventRR_TotalRR_Ratio, First recession slope  Some increase in fast snow input, speeds up all output; decreases ET  Speed of early recessions and midrange flow volumes 
4 Discussion
4.1 Implications for Signature Use
The choice and design of signatures for applications in hydrology and ecohydrology is widely discussed. Questions include how to choose signatures among multiple options (Olden & Poff, 2003), what criteria to use for signature choice (McMillan et al., 2016) and the impact of uncertainty in signature values (Westerberg & McMillan, 2015; Westerberg et al., 2016). The results from this study show that signatures can be learned using an encoder–decoder framework, and that the learned signatures represent a wide variety of aspects of streamflow dynamics including seasonality, recession speed and shape, wetness and baseflow importance, ET magnitude and high and low extremes. Using these signatures as input to a model in a watershed without other gauging data improves the model performance over equivalent models that use classical signatures or no signatures.
In summary, we demonstrated how to replace expertdesigned signatures with MLbased signatures, and we showed that it improves performance in a flow prediction application. The implication of our results is that for hydrological applications that use signatures, the user should consider ML signatures as an alternative. In applications where signatures are selected based on their ability to explain data or improve performance, ML signatures are likely to outperform classical signatures. This includes ecohydrological applications where signatures are selected based on their ability to explain species abundance or biological disruption in large samples of sites (Carlisle et al., 2017), or applications that use signatures to predict processes in the upstream watershed (Wlostowski et al., 2020). In modeling, it includes applications where signatures are selected as calibration targets based on their relationship to individual model parameters. The ML framework used to optimize the learned signatures should be adapted based on the application.
In contrast, where signatures are handselected by the user based on the theoretical underpinnings of the signature, or to manually emphasize one aspect of the flow dynamics over another, ML signatures cannot yet replace classical signatures. This includes ecohydrological applications that require a mechanistic or processbased explanation between the signature and the ecological response (Poff et al., 2010). It further includes catchment classification and model calibration applications where the user chooses signatures to balance the emphasis on high and low flows, or on magnitudefrequencyduration aspects of the flow dynamics (Kennard et al., 2010; Yadav et al., 2007). These applications rely on an expert understanding of the signature interpretation and require signatures to be chosen rather than optimized.
4.2 Machine Learning Model Structures in Hydrology
We used a processinformed ML model that tracks water volumes in stores, similar to other recent differentiable, parameterlearning approaches in hydrology (Feng et al., 2022). We showed that the stores learn recognizable hydrological dynamics such as losses to ET, snow storage and melt, and quickflow and baseflow stores. Our framework uses modern ML to learn these complex processes from the data, while enforcing the fundamental knowledge that catchments store and release water. The combination of machine learning for difficulttomodel aspects and classical methods for wellunderstood processes has been successful in computer vision. In that area, hybrid systems are stateoftheart for 3D reconstruction tasks, combining ML for appearancebased data association with classical geometric methods (Choy et al., 2020; Ranftl & Koltun, 2018). In hydrology, the potential to combine “datadriven discovery” with ”knowledgedriven regularization” offers an emerging pathway to bring the ML and geoscience communities together (Razavi et al., 2022).
As the hydrology community makes more extensive use of ML models, we will develop a knowledge base on ML model structure choices. As for classical hydrologic models, an ML hydrologic model should be chosen such that it can reproduce the hydrologic dynamics of interest, and can be calibrated (trained) using the data available. There are many choices in ML model design, including the model structure (e.g., convolutional neural net, fully connected network or LSTM), number and sizes of model layers, activation functions, regularization methods, network initialization, training strategies including data length and batch size and the use of ensembles. These choices are known as the hyperparameters of the network. As for classical hydrological models, much effort can be put into tuning these hyperparameters where the aim of this study is to optimize performance. However, in this study, our primary aim was to discover which hydrologic signatures an ML model would learn, and interpret them in terms of their relationships to classical signatures and to model dynamics. Therefore, there were many model structure choices that we did not fully investigate (including input data length and use of ensembles) in order to focus our attention on the learned signatures. Future work could test these choices in more depth, to meet joint aims of learning signatures and maximizing model predictive performance.
4.3 Challenges and Future Work
In our model, we chose to use eight stores, which is sufficient to model fast, slow and snow storage, and 16 learned signatures, which is sufficient to capture important catchment dynamics. Previous works represent the state of the catchment with 64 or 256dimensional LSTM hidden state (Feng et al., 2020; Lees et al., 2021). We achieve comparable model performance with a simpler model state, which is also easily interpretable. Alternatively, a model with more signatures and stores might have the capacity to learn more complex processes.
One concern is that the type of stores and signatures learned by the model might not be repeatable, which introduces uncertainty into the interpretation of the ML process. We conducted exploratory tests to train the model with different random seeds, which showed that the five most important model stores have similar dynamics, and learned signatures have similar behavior. However, further work is still needed to understand why certain signatures are learned, and how they should be interpreted.
An important application of hydrological signatures is that they can be regionalized (predicted for ungauged watersheds). The high spatial correlations of the signatures suggest that they are suitable for prediction from watershed attributes, which would enable this framework to be applied in ungauged catchments. Further analysis of the error in predicted signatures and its impact on forecast quality would be required.
5 Conclusions
We trained an ML network with an encoder–decoder structure to simultaneously learn the ML equivalent of hydrologic signatures, and predict streamflow dynamics. The decoder used a processinformed ML model structure that learned inflow distributions and outflow coefficients for eight model stores that simulated water storage and release. Using endtoend training, we optimized the learned signatures to improve performance when used as input to the processinformed ML model. We interpreted the meaning of the learned stores and signatures by comparing them with classical stores and signatures. The hybrid model structure combines classical components such as model stores with ML components such as learned inflow and outflow parameters. This promising approach benefits from an ML model's ability to simulate complex relationships and learn across watersheds, while also allowing the user to interpret model behavior.
Our results showed that the learned signatures have many positive characteristics that are desired in a set of hydrological signatures. Model configurations that used learned signatures as input in otherwiseungauged watersheds showed improved performance over those that used classical signatures or no signatures. Interpretation of the learned signatures showed that they cover a variety of streamflow dynamics, including seasonality, recession speed and shape, wetness and baseflow importance, ET magnitude and high and low extremes. Most signature applications aim for such variety in signatures (McMillan et al., 2016). The maps of learned signatures showed that they are spatially correlated. Signatures with high spatial correlations are less sensitive to flow uncertainty, are more highly related to physical attributes and are better predicted by hydrologic models (Addor et al., 2018). Therefore, we found that ML learning for hydrologic signatures provided a successful approach. In applications where signatures are selected based on their ability to explain data or improve performance, we recommend that hydrologists consider replacing expertselected signatures with learned signatures.
Acknowledgments
Support to H. McMillan was provided by the NSF Hydrologic Sciences Program, NSF Division of Earth Sciences, Grant/Award 2124923. This work used the COMET supercomputer, which was made available by the Atmospheric River Program Phase 2 and 3 supported by the California Department of Water Resources (awards 4600013361 and 4600014294, respectively) and the Forecast Informed Reservoir Operations Program supported by the U.S. Army Corps of Engineers Engineer Research and Development Center (award USACE W912HZ1520019).
Open Research
Data Availability Statement
The CAMELS U.S. data set is available at https://dx.doi.org/10.5065/D6MW2F4D (Addor et al., 2017; Newman et al., 2015). The Toolbox for Streamflow Signatures in Hydrology toolbox (Gnann et al., 2021a) used to calculate hydrologic signatures is available at https://github.com/TOSSHtoolbox/TOSSH and archived at https://zenodo.org/record/6462813 (Gnann et al., 2021b). The code to reproduce our analysis is available at https://github.com/mcmillanhk/HydroML and archived at https://doi.org/10.5281/zenodo.6712302 (Botterill & McMillan, 2022).