A Brief Analysis of Conceptual Model Structure Uncertainty Using 36 Models and 559 Catchments

The choice of hydrological model structure, that is, a model's selection of states and fluxes and the equations used to describe them, strongly controls model performance and realism. This work investigates differences in performance of 36 lumped conceptual model structures calibrated to and evaluated on daily streamflow data in 559 catchments across the United States. Model performance is compared against a benchmark that accounts for the seasonality of flows in each catchment. We find that our model ensemble struggles to beat the benchmark in snow‐dominated catchments. In most other catchments model structure equifinality (i.e., cases where different models achieve similar high efficiency scores) can be very high. We find no relation between the number of model parameters and performance during either calibration or evaluation periods nor evidence of increased risk of overfitting for models with more parameters. Instead, the choice of model parametrization (i.e., which equations are used and how parameters are used within them) dictates the model's strengths and weaknesses. Results suggest that certain model structures are inherently better suited for certain objective functions and thus for certain study purposes. We find no clear relationships between the catchments where any model performs well and descriptors of those catchments' geology, topography, soil, and vegetation characteristics. Instead, model suitability seems to relate strongest to the streamflow regime each catchment generates, and we have formulated several tentative hypotheses that relate commonalities in model structure to similarities in model performance. Modeling results are made publicly available for further investigation.


Introduction
There is an ongoing debate in hydrology whether a "one model fits all" approach should be pursued, based on the assumption that the fundamental hydrological processes are the same everywhere (e.g., Fenicia et al., 2011;Linsley, 1982;Perrin et al., 2003;Savenije, 2009). This assumption has led to development of rainfall runoff models that are designed to be applied across a wide range of catchments (see, e.g., discussion of the GR4J model in Fenicia et al., 2011, and consider more recent applications of this model in 142 catchments in the United States Oudin et al., 2018). This assumption is contrasted by the concept of "uniqueness of place" (Beven, 2000), the idea that in a practical sense every catchment is unique because there are limits to our understanding of fundamental processes and the availability of sufficiently detailed measurements. As a result of this uniqueness, many hydrological models have been developed that all aim to represent the dominant processes in a given catchment (e.g., Singh & Woolhiser, 2002). While theoretically we should be able to use a single model based on fundamental hydrologic principles, in practice there are many different models available that all represent a certain view of which hydrologic processes are important and how these should be mathematically represented. Choosing an appropriate model out of all possible options is critical to obtain accurate simulations that are the result of plausible representations of the hydrology in a given catchment (Kirchner, 2006). Knowing how much uncertainty is associated with the choice of model structure is also important for quantifying the reliability of model predictions (e.g., Biondi et al., 2012).
Conceptual hydrologic models are the focus of this study. Many different conceptual models exist, and there is much variety in how these models work. Models such as GR4J (four parameters Perrin et al., 2003) use a process-aggregated approach where model fluxes represent the aggregated results of all possible processes. Others such as MODHYDROLOG (15 parameters Chiew, 1990;Chiew & McMahon, 1994) follow a more process-explicit approach where fluxes and states explicitly relate to specific processes such as interception, Given the current incomplete knowledge on between-model differences, an opportunity now exists to study similarities and differences in the behavior of multiple different models across a wide variety of places.
A challenge of such a study is that it can be difficult to keep analysis and visualization manageable due to the large number of results involved. Investigating every interesting individual case is infeasible, and instead lessons must be learned from emergent patterns across the full sample (Hrachowitz & Clark, 2017). Large-sample emergent patterns can provide unique insights into how well models function across a variety of different catchment types and inform understanding of hydrologic similarity between different places. Potential benefits of such studies include more thorough understanding of the places where a given hydrologic model can be used with some measure of confidence and improved ability to model ungauged catchments through regionalization approaches. The aim of this paper is thus to explore the performance and associated model structure uncertainty of 36 conceptual hydrologic models across 559 catchments, covering a wide range of climatic and catchment conditions. Our research objectives are further specified in section 2.

Rationale, Research Questions, and Methodology
In this study, we calibrate 36 different model structures for streamflow simulations in 559 catchments using three different objective functions and evaluate model performance during a separate time period (details in section 3). This gives a total sample of 60,372 model application test cases. This section defines four research questions and describes how the modeling results are analyzed to answer these questions.

Defining a Lower Level of Expected Model Performance
Our approach to model calibration and evaluation expresses model performance as Kling-Gupta efficiency scores (KGE Gupta et al., 2009). A score of 1 indicates perfect agreement between simulations and observations. Scores lower than 1 are difficult to interpret beyond "higher is better," and KGE does not include a built-in benchmark that can be used to distinguish "good" and "bad" scores (Gupta et al., 2009;. Therefore, we first specify a lower benchmark that provides the necessary context to interpret model KGE scores (Garrick et al., 1978;Pappenberger et al., 2015;Schaefli & Gupta, 2007;Seibert, 2001;Seibert et al., 2018). The lower benchmark is the minimum score we expect each model to obtain before we consider the model a plausible choice for the catchment under consideration. Research Question 1 is thus What level of model performance do we expect any model to obtain before we consider the model a plausible option for a given catchment?
A traditional benchmark (i.e., the "score to beat") in hydrology is the mean annual flow due to its inclusion in the Nash-Sutcliffe efficiency score (NSE Nash & Sutcliffe, 1970). This choice is both quite simplistic (e.g., Garrick et al., 1978) and not equally difficult to beat in different catchments (Schaefli & Gupta, 2007), depending on how seasonally variable the flow in any given catchment is. The interannual mean for every calendar day has been proposed as a benchmark that can account for seasonality in the flow regime (Garrick et al., 1978;Schaefli & Gupta, 2007), provided that this seasonality is stable between years. This is not the case in catchments where the flow observations on any given calendar day are heavily skewed, as might be the case in catchments with very irregular occurrences of high flow peaks. In such cases the interannual mean might be far away from many of the sample values and the interannual median will be closer and more representative of the typical flow regime.
We therefore calculate both the interannual mean and median flow per calendar day for each individual catchment, using data from the calibration period only. We then evaluate the performance of both of these data-based models during the evaluation period and choose the highest KGE score as our benchmark for that particular catchment. This benchmark KGE score gives a sense of how predictable the flow in each catchment is using only streamflow observations at the same temporal resolution as the models use. It represents the minimum accuracy score we expect from any of the conceptual models before considering them as plausible model structures for a given catchment.

Model Structure Equifinality of Plausible Model Structures
The set of plausible model structures for each catchment contains only those models that beat the daily flow benchmark in that location. These models are different, but all are potentially realistic descriptions of the relevant hydrologic processes in the catchment. However, not all plausible models will beat the benchmark by an equally large margin. We assume that models that outperform the benchmark by a larger margin are better choices to use in a given catchment than models that beat the benchmark by a smaller margin and use this concept to quantify the extent of model structure equifinality. If the best model is joined by multiple other models that exceed the benchmark by a similar amount, model equifinality is considered to be high. Research Question 2 is thus How many of the 36 model structures in our sample can be considered plausible in a given catchment, and how high is model equifinality within this subset?
Differences in objective function values are commonly used to quantify model structure uncertainty and equifinality (e.g., Fenicia et al., 2008;Hogue et al., 2006;Winter & Nychka, 2010). We therefore report (1) the KGE score of the best model in each catchment, (2) the difference between the best model's KGE score and the benchmark score in each catchment, (3) the total number of plausible models in each catchment (i.e., the number of models with KGE scores above the benchmark score), and (4) the number of plausible models that fall within 0.01, 0.05, 0.10, and 0.25 KGE value [-] of the best model expressed as cumulative distribution functions (CDFs) across all catchments.

Relating Differences Between Models to Number of Model Parameters
The number of model parameters is often used to explain differences in model performance. In a study of 19 conceptual models and 429 catchments Perrin et al. (2001) find a tendency for models with more parameters to better fit calibration data but not evaluation data and decreasing robustness (defined as the decrease of mean model performance between calibration and verification periods) for models with more parameters. They suggest that models with more degrees of freedom (i.e., calibration parameters) tend to reproduce errors or noise in the calibration data, a phenomenon called overfitting that is also described by other authors (e.g., Beven, 2012;Schoups et al., 2008;Shaw et al., 2011). Overfitting is related to but not the same as parameter identifiability, which is here used to refer to the ability to identify a unique optimal value for a given calibration parameter. Parameter overfitting leads to an increase in calibration performance combined with a decrease in performance robustness. This is a common expectation when statistical models are used (see, e.g., Figure 12 in Lute & Luce, 2017). This behavior also manifests through evaluation performance that decreases in bias but increases in random scatter for models with a higher number of free parameters (Höge et al., 2018;Lute & Luce, 2017). In conceptual hydrologic models parameters are used as part of equations intended to describe hydrologic behavior, and ideally models that are appropriate descrip-10.1029/2019WR025975 tions of the dominant processes in a catchment would perform well in such a catchment, regardless of the number of parameters used. Research Question 3 is thus What are the differences in model performance, and how do these relate to the number of model parameters?
The analysis for this research question is divided into two parts. First, we investigate the extent to which each model outperforms the benchmark and if any differences between models can be attributed to the number of model parameters. To this end we report (1) the number of catchments in which a model beats the benchmark, (2) the margins by which the model beats the benchmark, (3) how the model ranks compared to the other models in our sample, and (4) the difference in performance between the model and the best model in a given catchment, all in the context of the number of model parameters. A full sensitivity analysis for each combination of model, catchment, and objective function is outside the scope of this work, and we therefore use the total number of calibrated model parameters as a basic surrogate for the effective number of parameters.
Second, we investigate the tendency of models toward overfitting by comparing model performance during calibration, evaluation, and the change between the two periods. We assess this visually with boxplots and use the statistical Mann-Whitney test (Mann & Whitney, 1947) to quantify any differences in a pair-wise comparison of all models. The Mann-Whitney test tests the null hypothesis that two different samples are taken from a single distribution, that is, that 1 = 2 . If models with more parameters do indeed better fit the data during calibration, we expect that the Mann-Whitney test results show a tendency to reject the null hypothesis for model pairs with a very different number of parameters, combined with a tendency to not reject the null hypothesis for model pairs with a similar number of parameters. The expectations that models with more parameters have lower robustness is tested in the same manner.
The KGE can be decomposed into its three constituent parts, which reflect the similarity between simulations and observations in terms of the correlation between the two, the ratio of standard deviations, and the ratio of means (Gupta et al., 2009). The latter two components can be seen as indications of the scatter and bias of the simulations and investigate these as a second test for overfitting, both visually as box plots and through Mann-Whitney tests (Mann & Whitney, 1947). Our expectations for the Mann-Whitney test are as outlined in the previous paragraph.

Model Suitability for Different Catchments
Model development generally takes places on geographically small scales, such as one or a few research catchments where in-depth knowledge of catchment conditions can inform the choice of model structure (e.g., Ambroise et al., 1996;Fenicia et al., 2016;McGlynn et al., 2002;Peters et al., 2003). Across larger scales, the relation between climatic conditions and conceptual model performance has been well studied (e.g., Dakhlaoui et al., 2017;Fowler et al., 2018;Lidén & Harlin, 2000;Merz et al., 2011;Van Werkhoven et al., 2008;Van Esse et al., 2013). Catchment-averaged attributes beyond climatic data have proven useful to to assess conceptual model strengths and weaknesses across the United Kingdom . Such studies are generally conducted with a limited number of models, a limited number of catchments, or within a geographically small and thus relatively similar area (such as Austria, Merz et al., 2011, andFrance, Van Esse et al., 2013). Until recently no data were available to allow such studies to also investigate the relationship with catchment structure across a large and varied domain. The CAMELS data set provides catchment attributes spread across six main categories: climate, geology, topography, soil, land cover, and streamflow (Addor et al., 2017). We attempt to use these descriptors to clarify the relation between model performance and catchment type. Research Question 4 is thus How does relative model performance relate to known catchment attributes?
Because efficiency scores are not easily compared between places (e.g., Schaefli & Gupta, 2007), we instead rank the plausible model structures in each catchment based on their KGE scores and use model ranks for this analysis. Given the size of the data sample, we limit this aspect of the study to an exploratory analysis based on Spearman rank correlations between model ranks and catchment attributes only.

CAMELS Catchment Data
This study uses the CAMELS data set (Addor et al., 2017), which provides time series of meteorological variables and streamflow (Newman et al., 2015), and tables with catchment attributes for 671 catchments Figure 1. Catchments in the CAMELS data set. Five hundred fifty-nine catchments were used in this study (blue), after removing those catchments with uncertain area estimates (purple; >10% differences between two geospatial data sets and the USGS reported value) or water balance errors (yellow) or both (red). (a) Geographical location and reason for exclusion from this study. (b) Catchment area distribution of the 559 selected catchments. (c) Aridity index against 1 − runoff ratio for the 559 selected catchments.
in the contiguous United States. We perform several basic data checks and remove those catchments with large (>10%) discrepancies between catchment area as used for averaging of the meteorological time series and area as published by the USGS (US Geological Survey, 2018) or the higher resolution GAGES II data set (provided as part of the CAMELS data set). We use preliminary screening (e.g., Martinez & Gupta, 2011) to remove those catchments that fall outside the energy limit and water limit on the Budyko curve (Budyko, 1974). This leaves 559 catchments for use in this study, distributed across the contiguous United States (Figure 1). The CAMELS data provide three different forcing products (Newman et al., 2015) at a daily resolution. This study investigates lumped models (catchments are treated as a single entity) and thus uses catchment-averaged forcing data. We follow Newman et al. (2015) and Addor et al. (2017) in using the Daymet product, which is based on the highest spatial resolution of all three products (1 km × 1 km compared to 12 km × 12 km for Maurer and NLDAS products) and is more likely to provide accurate data for smaller catchments and locations with complex topography. Time series of daily precipitation and temperature are part of the CAMELS data, and time series of potential evapotransiration (PET) are estimated using the Priestley-Taylor method (details in supporting information Text S1 Priestley & Taylor, 1972).

MARRMoT Modeling Framework
This study uses 36 conceptual model structures from the Modular Assessment of Rainfall-Runoff Models Toolbox (MARRMoT) v1.0 (Knoben et al., 2018a, which organizes conceptual model code in a single uniform framework. This has the main advantage that the implementation of models and fluxes is consistent and any differences in simulation are thus solely due to differences in model structure. The MARRMoT models used in this work are all based on published literature and cover a wide range of possible structures, from a simple one-parameter model to structures with up to 6 stores or 15 parameters. The toolbox is provided with literature-based parameter ranges for each model to support parameter sampling or optimization. These standardize the parameter ranges as much as possible, so that models have the same amount of parameter freedom (e.g., in the case of interception capacity, all models that simulate the interception process use a range of 0-5 mm). The differential equations that express each model's changes in storage(s) with time are numerically approximated with a fixed-step implicit Euler method, which uses the same step size as the forcing data (detailed settings for reproducibility can be found as part of the data package that accompanies this paper). The implicit Euler method provides better accuracy and stability compared to the Explicit Euler method, at the cost of increased computational times (Kavetski et al., 2006;Schoups et al., 2010). Figure 2 provides an overview of the 36 models used in this work.

Model Setup
Data for each catchment are divided into two 10 year periods covering 1 January 1989 to 31 December 1998 (calibration) and 1 January 1999 to 31 December 2009 (evaluation), respectively. Average climate characteristics are approximately constant between these periods with the exception of regions with high mean precipitation (P ≥ 5 mm/day; precipitation has decreased somewhat) and regions with low mean temperatures (T ≤ 5 • C; temperatures have increased). Estimated potential evapotranspiration rates are approximately constant between the two periods (see Figure S1). Streamflow records are complete during this period for 546 catchments. For the remaining 13 catchments, days with missing streamflow values are ignored during the calculation of objective function values. Missing values account for 0.013% to 4.8% of all observations in these catchments.
The 36 models in this study are calibrated for 559 catchments using three different objective functions, each based on the KGE (Gupta et al., 2009): where subscripts obs and sim refer to observed and simulated time series of flow, respectively, r is the linear correlation coefficient between observed and simulated flow, denotes the standard deviation of flows, and the mean of flows. We aim to compare model performance for a variety of flow conditions; therefore, our choice of objective functions emphasizes higher flows, lower flows, and a combination of both. The objective functions used are the KGE calculated on time series of flow (KGE(Q)), the KGE of inverse flows (KGE(1/Q), and the mean of KGE(Q) and KGE(1/Q). Inverse flows are shown to be more appropriate than a log transform to emphasize low flows (e.g., Pushpalatha et al., 2012;Santos et al., 2018). Pushpalatha et al. (2012) add a constant e to observed and simulated streamflow to avoid problems with inverting zero flow values. They recommend e to be set at 1% of the mean observed flow, because this limits the impact of the added constant on the resulting NSE(1/Q) values in their study. Because no such guidance yet exists for KGE and because NSE and KGE are conceptually based on the same three components (Gupta et al., 2009), we assume that this value is an appropriate choice for KGE(1/Q), too.
We use the Covariance Matrix Adaptation Evolution Strategy (CMA- ES Hansen, 2016;Hansen & Ostermeier, 1996Hansen et al., 2003) to calibrate model parameters. CMA-ES is a single-objective optimizer that compares favorably to various other methods for finding the global optimum of difficult functions and in rugged objective function landscapes (Arsenault et al., 2014;Hansen et al., 2003Hansen et al., , 2010. The algorithm has seen successful application in hydrology (e.g., Arsenault et al., 2014;Fowler et al., 2018;Peterson & Western, 2014), as well as many other fields (Hansen, 2009). The algorithm is allowed to run either until the change in objective function of all members in the current generation and the range of objective function values in at least the preceding 10 generations is below 1E−3 or until the standard deviation of the normal distribution used to sample parameter values for the new generation drops below 1E−3. These are problem-dependent algorithm settings (Hansen, 2016) that we consider an acceptable compromise between accuracy and speed for the 60,372 combinations of models, catchments, and objective functions. Details about CMA-ES stopping criteria and algorithm exit flags can be found in supporting information Text S2.
Model warm-up periods are used to reduce the impact of uncertain initial conditions on model performance. Recent studies have attempted to provide guidelines for warm-up period length in conceptual models (Kim et al., 2018), but these studies are limited in number of models (1 and 2, respectively) and catchments (18 and 1, respectively), and it is therefore difficult to generalize their findings to a large-sample study such as this. Instead of using a fixed number of warm-up days, we determine the initial storages in an iterative procedure by letting the model repeat Year 1 of the data period until the stores reach an equilibrium for the first day of the year (<1% change in storage value[s] between runs). Storage values might not converge for certain parameter sets (e.g., when a store of unlimited depth has very low outflow), in which case the procedure is stopped after 50 iterations.

Results
Results presented here are based on data for the KGE(Q) objective function obtained from the evaluation period, unless specifically indicated as being calibration results or relating to one of the two other objective functions.
Where applicable, findings for each model only include results from those catchments where the model exceeds the minimum benchmark level of expected performance. Each section also includes a brief summary of findings for the other two objective functions, KGE(1/Q) and 1/2*[KGE(Q) + KGE(1/Q)]. Figures for these objective functions are part of the supporting information for brevity. score that a model must beat to be considered a plausible model structure in each catchment. The benchmark is treated as any other model in the sense that the benchmark simulations are calculated from flow observations in the calibration period, and the benchmark KGE(Q) score is calculated by comparing these simulations to observations from the evaluation period. (b) Type of benchmark (mean or median calendar day flow) that gives the higher benchmark KGE(Q) score.

Defining a Lower Level of Expected Model Performance
The performance of the benchmark time series (i.e., the mean or median daily flow regime) varies across space and is subject to strong spatial organization (Figure 3a). KGE scores are lowest (KGE ≤ −0.5) in very arid areas and highest in snow-dominated areas (with values up to KGE = 0.89). Approximately 80% of these benchmarks are obtained by using the mean calendar day flow, the remainder being obtained from the median calendar day flow (Figure 3b). These results set a baseline for minimum expected model performance by indicating how predictable the flow regime is.
This spatial organization is also visible for the KGE(1/Q) and 1/2*[KGE(Q) + KGE(1/Q)] objective functions ( Figures S3 and S4). Benchmark values for KGE(1/Q) are generally higher than those for KGE(Q) and are in 96% of cases obtained by using the median calendar day flow. As expected, the results for the  Figure 4a shows that the maximum achieved evaluation efficiency in each catchment (i.e., what the best model out of 36 achieves) is subject to strong spatial organization, although exceptions to the pattern exist. Maximum efficiency ranges from −0.11 to 0.93. Note that these values are raw KGE scores and are not yet adjusted by the benchmark score. In geographical terms, maximum model performance tends to be lowest in the central United States (plains areas east of the Rocky Mountains) and certain parts of the southwest. These areas share a tendency to be very arid (see Figure 3c in Addor et al., 2017). Figure 4b shows the number of models that fall within certain performance thresholds. Curves that stay closer to the bottom indicate that fewer models have a KGE value within 0.01/0.05/0.10/0.25 of the best model in a given catchment. For example, the blue line indicates that in approximately 350 catchments, no model has a KGE value within 0.01 of the best model, while in the remaining 200 catchments at least one and up to eight models have performance within 0.01 of the best model for each catchment. These results show that model structure equifinality can be very high: In many catchments several models can be virtually indistinguishable (within 0.01 KGE of each other) in terms of efficiency scores, and in the vast majority of catchments up to 28 different models can be close (within 0.05 KGE) to the best model.

Model Structure Equifinality of Plausible Model Structures
Comparing maximum model efficiency to the predefined benchmark values in each catchment provides context for the maximum model efficiency scores (compare Figures 4a and 4c). The difference between maximum model performance and benchmark is smallest in mountainous regions and generally high in arid regions. In 11 catchments no model outperforms the benchmark ( KGE < 0), all characterized by a high fraction of precipitation occurring as snowfall. In these places, no model in our sample is able to simulate the persistent features of the hydrograph (i.e., features that recur every year) better than the benchmark does. These 11 catchments are excluded from further analysis, because our model ensemble contains no structures that meet our plausibility criterion. Figure 4d shows that in mountainous regions the number of models that beat the benchmark is low. This can partly be explained by not all models having a snow module (only eight models do), but even having a snow module is no guarantee that a model can beat the benchmark. In contrast, in wet, nonsnowy regions the vast majority of models beats the benchmark and in 289 out of 559 catchments every single model in our sample provides more accurate simulations than the benchmark gives (although that does not automatically imply that all model simulations are equally close to observations in these catchments, see Figure 4b). Model choices matters most in the arid regions where maximum model efficiency is lowest. Models do exist that can provide reasonable simulations here, but they must be carefully selected.
Spatial patterns of maximum model KGE and KGE distributions are roughly similar for the three objective functions (see Figures S5 and S6). Maximum evaluation efficiency ranges from −0.74 to 0.96 for the low flow objective function (KGE(1∕Q)) and −0.15 to 0.91 for the combined flow objective function ( 1 2 [KGE(Q) + KGE(1∕Q)]). Model equifinality is lower (i.e., fewer models are close to the best model in each catchment), especially for the combined objective function. There are more catchments (19 and 24, respectively) where no model beats the benchmark and fewer catchments where most models can beat the benchmark (in only 189 and 200 catchments, respectively, do more than 30 models beat the benchmark). Many models struggle to achieve accurate low flow simulations but the maximum model KGE scores for the KGE(1∕Q) objective show that this is not impossible, only that it requires careful model selection. Adopting a multiobjective approach reduces model equifinality by the largest degree.  Figure 5 compares performance of individual models during evaluation. With the exception of the simplest Model m01, models beat the benchmark in approximately equal numbers (Figure 5a). Models that include a snow module (shown as blue bars) tend to beat the benchmark in more catchments than models without a snow module for obvious reasons. There is substantial variety in the margin by which models beat the benchmark (Figure 5b), showing that certain models are much better suited to flow simulation with the KGE(Q) objective function than other models are. This is reflected in the ranks these models obtain (Figure 5c). Number of parameters is a poor predictor of how a model will perform. Certain models, such as m36 (15 parameters), m28 (12 parameters), and m13 (7 parameters), tend to rank better (toward Rank 1), whereas other models, such as m26 (10 parameters), m21 (9 parameters), m25 (6 parameters), and m06 (4 parameters), tend to rank much worse (toward Rank 36). Many models with a snow module show bimodal distributions, indicating that in certain catchments (i.e., those with snow) they are one of the best options available, but this does not imply they are among the better choices in other catchments. Differences in 10.1029/2019WR025975 model suitability for the KGE(Q) objective function can also be seen in how far away each model tends to be from the best model in each catchment (Figure 5d). Models such as m28 (12 parameters), m13 (7 parameters), and m05 (6 parameters) tend to perform similar to the best model in any catchment, whereas models such as m34 (12 parameters), m21 (9 parameters), and m06 (4 parameters) tend to be much further from the best model in each catchment. These results suggest that the choice of model parametrization (i.e., which equations are used and how parameters are used within them) is more important to dictate a model's strengths and weakness than how many parameters the model has.

Relating Differences Between Models to Number of Model Parameters 4.3.1. Differences Between Models During Evaluation
There is greater variety in the number of catchments in which a model beats the benchmark for the other two objective functions (see Figures S7 and S8) and broadly speaking these results support the conclusion that certain model structures are much better suited for certain objective functions. Of particular note are Models m28, m27, and m21 because they showcase three different model types: m28 performs well under all three objective functions; m27 performs reasonably well with the KGE(Q) objective function but performs substantially worse for the other two objectives; m21 performs poorly on the KGE(Q) objective function but performs very well on the KGE(1/Q) objective, moving from being consistently one of the worst choices to consistently one of the best.

Tests for Overfitting
Contrary to expectations, a higher number of model parameters does not necessarily lead to higher efficiency values during calibration (Figure 6a). In other words, models with higher degrees of freedom (more parameters) are not consistently better at fitting the calibration data. In fact, several of the models that show lower calibration efficiency ranges (e.g., m21 and m26) have a relatively high number of free parameters (9 and 10, respectively). Both simpler (e.g., m02, four parameters) and more complex models (e.g., m35, 15 parameters) show higher ranges of efficiency values during calibration.
Evaluation performance shows a similar pattern (Figure 6b): There are certainly differences between the ranges of efficiency values obtained by different models, but this seems unrelated to the number of parameters each model has. Overall, evaluation efficiency ranges are somewhat lower than calibration ranges, which indicates either a change in catchment conditions that the models insufficiently account for (e.g., change in climatic forcing), or a degree of overcalibration (i.e., the models are calibrated to a certain amount of data noise). Analysis of each model's performance change between calibration and evaluation periods ( Figure 6c) shows that, whatever the cause, distributions of performance change are similar across all models. Figure 6c also shows that performance decline during evaluation does not always occur, and in approximately a quarter of all catchments model performance instead increases during evaluation (note that these are not necessarily the same catchments for each model). While this may seem like a high number, we note that many studies conducting similar analyses deliberately choose periods with contrasting climate between calibration and evaluation periods and find declining model performance under contrasting conditions, whereas here climatic conditions are relatively similar in both periods (see Figure S1).
Pair-wise Mann-Whitney statistical tests (see Figure S9) confirm that there are certainly differences between the distributions of model performance but that there are no clear patterns that relate to the number of model parameters. For example, the null hypothesis that calibration performance of Models m17 (4 parameters) and m37 (15 parameters) are drawn from the same distribution cannot be rejected (p > 0.95). Analysis on a per-catchment basis (not shown for brevity) also shows that no significant (p < 0.05) relation exists between either calibration performance, evaluation performance or robustness, and the number of model parameters.
Analysis of the constitutive KGE components (correlation, scatter, and bias; Figure S10) shows that the expectation that bias decreases while scatter increases for models with a higher number of parameters (see, e.g., Höge et al., 2018) is not a general rule that can be applied to these conceptual models. Collectively, models show a tendency to overestimate the bias component ( sim > obs , although exceptions such as m01, m17, m14, and m22 exist; Figure S10c). Models also show a tendency to overestimate the scatter component ( sim > obs ; Figure S10b) but again exceptions exist. The clearest variability can be seen in the correlation component ( Figure S10a) where certain models score substantially lower than others, but here too no relation with number of parameters can be seen.
Pair-wise Mann-Whitney tests (see Figure S11) indicate that distributions of values for KGE components can be different for different models but that results for models with more parameters are not necessarily statistically different from results from models with fewer parameters in a consistent way. These results support the findings in the previous section and again suggest that the number of calibrated model parameters is not an effective measure for explaining differences in conceptual model performance.
The main conclusions can be found for the other two objective functions as well: Models perform quite differently, but this cannot be consistently explained by an increasing number of parameters (see Figures S12-S19

Correlations Between Model Ranks and CAMELS Catchment Features
The 52 CAMELS catchment features are divided into six categories: climatic conditions, observed streamflow signatures, geologic properties, topographic properties, vegetation properties, and soil properties. The Figure 7. Spearman rank correlation between model ranks in the evaluation period and the different categories of CAMELS catchment data (separated by dotted lines, these are properties of vegetation, topography, soil, streamflow, geology, and climate). For each model, correlations are only calculated for catchments where the model beats the benchmark. Models are ranked from best to worst, with Rank 1 indicating the best model. Only correlations with p value <0.05 are shown, and color intensity corresponds to the strength of the correlation. Models are sorted manually in an attempt to place models with similar correlation patterns close together. Example interpretation: See the dark green points for the combination of models with a snow module (indicated with an asterisk (*)) and "fraction P(recipitation) as snow" that indicates a strong negative correlation (bottom left of the figure). As the observed snowfall fraction of catchments increases, models with a snow module tend to achieve lower rank numbers, that is, toward Rank 1, and thus rank better than the other models in our sample.
categories are not fully independent. The connection between climatic conditions and streamflow regimes on continental to global scales is well established (e.g., Addor et al., 2018;Berghuijs et al., 2014;Knoben, Woods, and Freer 2018;Kuentz et al., 2017), and this shows in the CAMELS data as high correlations between climatic conditions and observed streamflow signatures , see also Figure S20 for cross correlations). Climatic conditions and to a lesser extent observed streamflow signatures also correlate strongly with vegetation attributes. Geologic and soil properties contain the most independent information. Scatter plots of model ranks and CAMELS catchment attributes (not shown for brevity) indicate that empirical relationships exist between model ranks and certain types of catchments but also that substantial scatter around any main relationship is present. Figure 7 summarizes these relationships using the Spearman rank correlation coefficient. Note that for each model only those catchments are included where the model beats the benchmark.
The strongest correlations for most models can be found with observed streamflow signatures (for a description of these signatures, see Addor et al., 2018) and to a slightly lesser extent with climatic conditions. If streamflow signatures are seen as a way to describe flow regimes, this suggest that certain models are relatively more or less suited for certain flow regimes (compared to the other models in our study). The established connection between streamflow regimes and climatic conditions explains why model ranks also correlate with climatic conditions. Correlations with those CAMELS attributes that describe the catchments' geology, topography, soil, and vegetation are generally the weakest, except in cases where those attributes also correlate with climatic conditions or observed streamflow signatures. For example, mean elevation (topography) correlates strongly with fraction precipitation as snowfall (climate). This can imply several things: (1) Uncertainty in the attributes data is too high to find any clear relationships with model performance; (2) we are not looking at the right catchment attributes because these do not seem to explain the hydrologic and model behavior; (3) models work better/worse for certain streamflow regimes, but regimes are not a unique result from a certain arrangement of catchment attributes.
These conclusions are similar for the other two objective functions. Correlations for the KGE(1/Q) objective function (see Figure S21) are generally lower than those in Figure 7, but the strongest correlations can be seen between model ranks and observed streamflow signatures. Correlations for the 1 2 [KGE(Q)+KGE(1∕Q)] objective function (see Figure S22) are similar to those in Figure 7, in terms of both pattern and strength.

Model Structure Similarity
In Figure 7, models are sorted manually along the x axis, such that models with similar correlation patterns in the y direction are placed close to one another. This allows us to define model groups that contain model structures with similar performance ranks across the sample of catchments.
An obvious relation exists between models that include a snow component (m12 to m34, leftmost on the x axis) and catchments where a larger fraction of the annual precipitation occurs as snowfall, where these models naturally achieve higher efficiency scores than models without the capability to simulate snow accumulation and melt. The next group consists of Models m34, m21, m26, m28, and m29. These perform relatively better in baseflow-dominated catchments without flashy streamflow behavior. These particular models share a structural feature that consists of a soil moisture routine that simulates a variable contributing area, which then drains into a linear reservoir. Models m01, m08, m14, m17, m20, m22, and m25 (at the right on the x axis) share a tendency to rank better in catchments with low precipitation, low mean flows and low flows (Q5) and a larger number of high precipitation events. This suggests drier catchments with low flows punctuated by the occasional high flow event. These models all have a mechanism that allows incoming precipitation to reach the stream quickly (either saturation excess or a bypass mechanism) and also contain a mechanism that ensures that very low (up to 0) flows can be generated. This mechanism is either threshold-based, where no flow is generated unless a storage threshold is exceeded, or evaporation-based, where evaporation can occur from multiple stores and can thus be used to prevent water from reaching the stream. Model m01 is the most extreme member of this group, containing only a saturation excess mechanism. The models in this group suggest that very different model structures are capable of reproducing similar flow regimes. The remaining models can be roughly divided into two groups: Models m02 to m36 and m03 to m40. These models do not share any obvious characteristics and do not show many pronounced correlations.
Similarity of correlation patterns can be seen for the other two objective functions too (see Figures S21 and S22) but model groupings are different. This might imply that different parts of the model structure influence whether model structures behave similarly on a given objective function. Further analysis for the other two objective functions is considered out of scope for this work.

Synthesis
Large-sample analysis such as this study can provide unique insights into our ability to model a wide variety of catchments and how models differ from one another in a practical sense. Here, we return to the research questions posed in section 2. We first answer the question "What level of model performance can reasonably be expected in a given catchment?" by defining benchmarks based on daily flow observations in the calibration period (Figure 3). This results in benchmark evaluation KGE scores that range between −0.65 and 0.87, with 98.8% of catchments having benchmark values that are higher than what would be obtained by using the mean annual flow (i.e., KGE = 1− √ 2; . The benchmark scores gives us the necessary context to evaluate model performance, by showing what the typical seasonal signal in the data is and which efficiency scores can easily be obtained in a given catchment. Benchmark scores show strong spatial organization and are typically highest in snow-dominated catchments and lowest in arid catchments. Answers to our second question, "How many of the 36 model structures in our sample can be considered plausible in a given catchment, and how high is model equifinality within this subset?" are dependent on where we draw the line between plausible and implausible models. Without an explicit statement about the benchmark that we expect our models to beat we might have concluded that our model sample performs worst in arid catchments (see the strong negative gradient in Figure 8c). In fact, this would have been in line with existing literature that states that it is harder to obtain high efficiency scores in arid locations than it is to obtain such scores in more humid regions (e.g., Fowler et al., 2018;Krysanova et al., 2017;Melsen et al., 2018;Newman et al., 2017;Van Esse et al., 2013). With our specified benchmark arid regions do not stand out as places where our model ensemble does poorly. Despite the challenges to hydrologic modeling in arid regions (Pilgrim et al., 1988), our model ensemble is able to beat the benchmark in arid and humid regions by approximately equal margins (Figure 8d). Instead, we have reason to doubt the ability of the models in our sample to simulate cold-region hydrology: Models tend to achieve higher KGE evaluation scores as fraction snowfall increases but improvement over benchmark drastically lowers. Of course, when the benchmark scores increases there is less potential for improvement to be achieved by any model and lower KGE values might be expected (e.g., with a benchmark KGE = 0.95, the potential for improvement is only 1−0.95 = 0.05). However, in 11 snow-dominated catchments our model ensemble is unable to reproduce the persistent seasonal streamflow signal identified by the benchmark model and no model in the ensemble can beat the benchmark score. This might suggest any combination of missing or inappropriate process representations, issues with the input data or problems with model calibration. In the vast majority of nonsnowy catchments many to all of the models can beat the benchmark (Figure 4). They do not do so by equal margins but in approximately 200 catchments up to 8 model structures achieve practically the same efficiency score as the best model (<0.01 KGE difference) and in approximately 500 catchments up to 28 models can be close (<0.05 KGE difference) to the best model. Logically, in cases where model structure equifinality is high, not 10.1029/2019WR025975 every model can be an equally good representation of the catchment under consideration and that these models produce hydrograph simulations of similar accuracy does not mean that they do so for the right reasons (Kirchner, 2006). These results provide evidence from a very large sample of model structures and catchments that better assessment of model structural adequacy (e.g., Gupta et al., 2012) and process fidelity in models (e.g., Clark et al., 2016;Kirchner, 2006) should become the norm. Relying on aggregated efficiency scores alone is insufficient to determine which models are appropriate choices for which catchments.
We next answer the question "What are the differences between models, and how do these relate to the number of model parameters?" The expectation that models with more parameters are vulnerable to overfitting cannot be seen in our results. This contrasts with findings by Perrin et al. (2001), who reported a tendenc for conceptual hydrology models with more parameters to better fit calibration data but not evaluation data, and an inverse relation between model robustness (defined in their paper as the decrease of mean model performance between calibration and verification periods) and number of model parameters. This pattern is not visible in our sample (Figures 6, S12, and S14) and not found by our use of statistical tests ( Figures S9, S13, and S15). Equally, the expectation that models with more degrees of freedom generate errors with reduced bias and increased scatter (Höge et al., 2018;Lute & Luce, 2017) is not seen in the constitutive components of the KGE objective function (Figures S10, S11, and S16-S19). While overfitting (i.e., performance loss in evaluation due to noise fitting during calibration, e.g., Beven, 2012;Schoups et al., 2008;Shaw et al., 2011) is a clear issue with high-degree polynomials (Grayson & Blöschl, 2001;Schoups et al., 2008), these principles do not seem to apply to our sample of conceptual hydrologic models and catchments. Consequently, the number of parameters might only be a good way to quantify model complexity in the restricted case that the models with more parameters contain the models with fewer parameters as a special case (e.g., a fifth-order polynomial contains fourth-order polynomials as a special case). This condition is generally not met in model intercomparison studies which typically aim for diversity, not similarity, in the models included (e.g., see the models used in Franchini & Pacciani, 1991;Perrin et al., 2001;Seiller et al., 2012;and this study).
Whereas most models show a tendency to perform well on specific objective functions but not on others (a common finding in studies comparing multiple models across multiple objectives; see, e.g., Fowler et al., 2018;Perrin et al., 2001;Seiller et al., 2012), certain models seem to display more well-rounded behavior and tend to rank better regardless of the objective function used. It is therefore noteworthy and somewhat unexpected to find that Model m28 is consistently among the best, if not the best, model structure in the majority of catchments and for all three objective functions. This model is the MARRMoT version of the Xinanjiang model Zhao, 1992), modified with a unique feature not seen in any other model in our sample, namely a double parabolic curve that is used to represent the fraction of the catchment that contributes to free drainage (Jayawardena & Zhou, 2000). Nonlinear treatment of saturated area representation has been linked to more flexible model performance within groundwater-dominated catchments before  and we can speculate that this specific double parabolic formulation gives the model a unique capability that allows it to perform well in a wide variety of catchments. Interestingly, it is difficult to generalize these findings because for every model (including m28) certain catchments can be found where that model is one of the best structures (in terms of efficiency scores during evaluation) and equally catchments can be found where that model is one of the worst options (Figure 9 Perrin et al., 2001, presents a similar finding). This shows a critical weakness of model comparison studies that use a small number of basins: Results are conditional on the choice of catchments and thus very difficult to generalize to other places.
Large-sample studies allow patterns to emerge, of which Figure 9a provides an example: Some model structures seem relatively unsuitable for flow simulation with the KGE(Q) objective function, but of the models that do tend to rank better on this objective, models with more parameters appear to have more flexibility and tend to rank better in larger numbers of catchments than models with fewer parameters. This leaves modelers working with conceptual models in large numbers of catchments facing a dilemma: Parsimonious models are preferable because their parameters will be better identifiable (e.g., Jakeman & Hornberger, 1993;Nash & Sutcliffe, 1970;Wagener et al., 2003), but, as our results suggest, models with more parameters seem to have the flexibility to accurately reproduce hydrographs in a much wider range of catchments without any obvious risk of being overfitted to the calibration data.  (b, d, f) in the bottom three models during evaluation. Note that these instances are only counted in cases where the model at least beats the catchment-specific benchmark. For example, there are only 548 counts of any model being ranked first on the KGE(Q) objective, while our sample contains 559 catchments, because in 11 catchments not a single model beats the benchmark and in these catchments no model is assigned a rank. Models that include a snow module are indicated with an asterisk (*).
Last, we attempt to answer "How does relative model performance relate to known catchment attributes?" We found the strongest relation between relative model rank and observed streamflow signatures, while relations with climatic attributes and catchment descriptors were less clear (Figures 7, S21, and S22). The lack of correlation with some attributes may be due to low information content in these attributes but might also be the results of our experimental design. Given that models were calibrated to streamflow data, it is perhaps not surprising to see model performance correlate most strongly with streamflow signatures, because the models are naturally sorted by their ability to replicate certain regimes as a consequence of the calibration procedure. It is possible that the relation between model performance and other catchment attributes becomes clearer when those attributes are specifically used during calibration (e.g., calibrating against streamflow, evaporation and soil moisture observations simultaneously might clarify relations between model performance and climate and/or soil characteristics), but such calibration approaches also carry new challenges with them such as the commensurability between model states and real-world observations. Currently, our findings reinforce the idea that certain model structures are better suited for simulation of certain flow regimes and our results suggest that models that share certain structural elements show similar suitability for certain flow regimes. Our results give rise to several hypotheses about conceptual model behavior (see section 4.4.2 and the note on Model m28 in this section), but all were formulated after we calibrated, evaluated, ranked, and grouped the models. Strict testing of the hypotheses is thus necessary (see, e.g., Beven, 2000Beven, , 2018Clark et al., 2011;Fenicia et al., 2014;Kirchner, 2006;Pfister & Kirchner, 2017) before these ideas can be used to guide model development.

The Need to Select an Appropriate Hydrological Model
This study provides large-sample evidence for the need for more thorough and process-based model evaluations (e.g., Clark et al., 2016;Gupta et al., 2012;Kirchner, 2006;Wrede et al., 2015). There are too many models that are superficially similar in terms of efficiency scores but internally different in terms of process representation. To increase hydrologic understanding and generate robust long-term projections of future water resources, more effort needs to be devoted to understanding (1) which hydrologic processes are dominant where, (2) which model structures contain appropriate representations of these dominant processes and should thus be used in a given catchment (for a given definition of "appropriate"), and (3) how dominant hydrologic processes and consequently the criteria for what constitutes an "appropriate" model might change in the future in a given catchment. Only with such understanding can we confidently select a model structure for a given catchment and study purpose.
It has been long known that models with fewer calibrated parameters can compete with more parameter -heavy models in terms of model performance (e.g., Jakeman & Hornberger, 1993;Perrin et al., 2001). Overfitting and the inability to properly identify parameter values through calibration to streamflow data are often cited as a reason for this (e.g., Beven, 1989;Kuczera & Mroczkowski, 1998). Parsimonious models with few calibration parameters are often preferred over more complex models to avoid these issues. However, such simple models cannot contain all potentially relevant hydrologic processes, because this would require more parameters than can be identified from streamflow data alone. This leads to a dilemma succinctly stated in Kuczera and Mroczkowski (1998): "A simple model cannot be relied upon to make meaningful extrapolative predictions, whereas a complex model may have the potential but because of information constraints may be unable to realize it." Yet, such complex models are required if predictions under changing conditions are to be made (Kirchner, 2006). Our results suggest that models with a larger number of parameters (up to 15 in this study) are less vulnerable to parameter overfitting than might be expected (although parameter identifiability might remain an issue; see section 5.4). Therefore, when a choice must be made between several model structures for prediction under changing conditions there is no clear justification for selecting the model with the fewest calibration parameters as the preferable alternative (cf. Oreskes et al., 1994;Reichert & Omlin, 1997). Instead, analysis of the dominant hydrologic processes, possible changes in these processes and each models' ability to reflect both current and future processes should form the core of such a decision.

How Representative Are the Catchment and Model Sample?
We have remarked on the fact that results from modeling studies are conditional in the sample of models and catchments used. We therefore briefly summarize our findings about the representativeness of our catchment and model sample and refer the interested reader to supporting information Text S3 for more details.
We use the hydrological climate classification of Knoben, Woods, and Freer (2018) to quantify where the CAMELS catchments fall in relation to the global distribution of hydroclimates. This classification uses three axes that describe annual average aridity, the within-year variability in the water-energy balance and the fraction of precipitation that occurs as snow. There are few CAMELS catchments on either end of the aridity scale, few catchments with low within-year aridity seasonality, and snow-dominated CAMELS catchments cover a fairly narrow range out of all possible snow-dominated conditions. In geographical terms, care should be taken when extrapolating our findings to climates with more extreme aridity values (e.g., deserts and tropical rain forests), to regions with less seasonally varied aridity values (e.g., climatic transition zones on the edges of deserts and rain forest), and to places with a low mean temperatures combined with a less pronounced summer-winter temperature cycle (e.g., taiga).
It is commonly assumed that models in an ensemble are sufficiently varied if the model simulations bracket the observations (Clark et al., 2008). This is the case for the majority of catchments during evaluation. Clear exceptions are mountainous snow-dominated catchments and several catchments on the Pacific Northwest, which can indicate a lack of diversity in our model ensemble (likely the case for snow-dominated catchments) or the presence of bias in the forcing data (a likely explanation for the Pacific Northwest). Despite mostly bracketing the observations, the model ensemble shows a strong seasonal bias with a tendency toward underprediction in late spring and summer and overprediction in late autumn and winter. This could be due to the spacing of the MARRMoT models in the overall model space. Skewed sampling of model structures can bias model comparison studies but the extent to which this is an issue is currently difficult to quantify. Metrics that define model similarity and model spread in the total model space are needed to address this question in more depth.

Study Limitations
This section briefly describes various limitations in the current study set up and possible ways to address these in future work. First, our analyses are mostly based on general performance metrics that aggregate model performance into a single efficiency score. Higher resolution diagnostics such as seasonal or time step based performance metrics (see, e.g., Coxon et al., 2014), or signature-based calibration and evaluation (see, e.g., Westerberg et al., 2014) might provide insight into why there is considerable equifinality in aggregated model performance across our sample.
Second, we use lumped models, catchment-averaged daily forcing data and average catchment attributes. Spatial heterogeneity is not accounted for beyond parametrizations of contributing catchment area in certain model structures, in an attempt to keep the analysis manageable. This lack of spatial explicitness makes it challenging to get clear answers to questions that require more nuanced analysis of the hydrograph and/or detailed process consideration, and much work remains to be done.
Third, this work focuses on model structure uncertainty and leaves data and parameter uncertainty mostly unaccounted for. Our experimental design is constrained by a need to limit computational times but ignoring data uncertainty (see, e.g., McMillan et al., 2012McMillan et al., , 2018 can force the calibration procedure to compensate for errors in the measured rainfall-runoff relationship and thus influence results. Equally, although we see no evidence of parameter overfitting, it is still possible that parameters are poorly identifiable. We have performed a short investigation of the impact of using only a single parameter set per model per catchment and believe that patterns across all catchments and models can be inferred from this approach (see supporting information Text S4 for details). We caution against using our data to investigate a single catchment without accounting for parameter uncertainty, because differences between calibrated parameter sets can be substantial in a given basin, even if patterns across the sample of all basins remain relatively stable. Our results do suggest that models can be divided into groups, where models within each group are similarly suited toward particular flow regimes. These groups can be used to select a small number of promising models within our ensemble, with each selected model being representative of several others. This reduces the computational load and is a potential way to allow future studies more room to account for data and parameter uncertainties.

Fostering Further Work
The computational demands of studies such as this can be high. To facilitate further research, calibration results of all models (parameters, simulated model storages and fluxes and obtained efficiency values) are made available on the University of Bristol data repository (dx.doi.org/10.5523/bris.2zutxh2qeep6y2cy 6scwgk9eqj). The CAMELS data set and MARRMoT modeling toolbox are also freely available and can be found through their respective references.

Conclusions
We calibrated 36 lumped conceptual models for streamflow simulation in 559 catchments across the United States, using three different formulations of the KGE as objective functions. We used a benchmark based on the mean or median calendar day flow (depending on the catchment) to define a baseline of expected model performance for each catchment. This benchmark proved hardest to beat in mountainous snow-dominated catchments: In 11 of these catchments no model managed to beat the benchmark, indicating that persistent features of the hydrograph are systematically poorly simulated in these places. In wet nonsnowy catchments, the majority of models managed to beat the benchmark. In arid catchments model choice seemed to matter most: Models do exist that beat the benchmark (and by similar margins as models in wetter catchments do), but these must be carefully selected. In nearly all catchments model equifinality can be high. For approximately 500 catchments, between 1 and up to 28 models can be within 0.05 KGE from the best model in each catchment. Our results indicate that there is little relation between model performance and number of parameters and there is no evidence of increased risk of overfitting of models with more parameters compared to models with fewer parameters. Instead, our results suggest that the choice of model 10.1029/2019WR025975 parametrization (i.e., which equations are used and how parameters are used within them) is more important to dictate its suitability for flow simulation with a given objective function and the flow regimes the model is capable of simulating well. In fact, our results suggest that if the model is suitable for a given objective function, models with more parameters tend to have increased flexibility compared to models with fewer parameters. This flexibility allows them to perform well outside the calibration period in larger numbers of catchments. It remains difficult to explain the type of catchments where a model might do well with attributes that quantify the catchments' geologic, topographic, soil, and vegetation attributes. Instead, model suitability seems to relate strongest to the streamflow regime each catchment generates, and we show an initial assessment that relates commonalities in model structure to similarities in model performance. Given our catchment-averaged approach to model use, data, and analysis and the fact that our hypotheses about model structure similarity were formulated after we calibrated and evaluated our models, more detailed investigation of between-model differences is needed, and care should be taken when applying our findings to future modeling efforts that extend beyond the limits of our approach.

Data Availability Statement
CAMELS data can be downloaded from https://ncar.github.io/hydrology/datasets/CAMELS_attributes (Addor et al., 2017). The latest MARRMoT model code and supporting information can be downloaded from https://github.com/wknoben/MARRMoT (Knoben et al., 2018c); MARRMoT v1.0 which was used for this work is available online (from https://dx.doi.org/10.5281/zenodo.2482542). A data package containing calibrated parameter values for the models used in this work, obtained efficiency values, and time series of simulated flows, internal fluxes, and model states can be downloaded from the University of Bristol data repository (dx.doi.org/10.5523/bris.2zutxh2qeep6y2cy6scwgk9eqj).