Simulating Runoff Under Changing Climatic Conditions: A Framework for Model Improvement

Rainfall‐runoff models are often deficient under changing climatic conditions, yet almost no recent studies propose new or improved model structures, instead focusing on model intercomparison, input sensitivity, and/or quantification of uncertainty. This paucity of progress in model development is (in part) due to the difficulty of distinguishing which cases of model failure are truly caused by structural inadequacy. Here we propose a new framework to diagnose the salient cause of poor model performance in changing climate conditions, be it structural inadequacy, poor parameterization, or data errors. The framework can be applied to a single catchment, although larger samples of catchments are helpful to generalize and/or cross‐check results. To generate a diagnosis, multiple historic periods with contrasting climate are defined, and the limits of model robustness and flexibility are explored over each period separately and for all periods together. Numerous data‐based checks also supplement the results. Using a case study catchment from Australia, improved inference of structural failure and clearer evaluation of model structural improvements are demonstrated. This framework enables future studies to (i) identify cases where poor simulations are due to poor calibration methods or data errors, remediating these cases without recourse to structural changes; and (ii) use the remaining cases to gain greater clarity into what structural changes are needed to improve model performance in changing climate.


Introduction
For effective water resource planning under climate change, it is essential to understand how catchments respond to changes in climatic forcing. Future climatic changes may go beyond the variability of the past (Covey et al., 2003;Forster et al., 2007;Meehl et al., 2007;Milly et al., 2008) and be amplified by hydrologic systems (e.g., Saft et al., 2015;van Dijk et al., 2013;Vogel et al., 1999), so it is important to learn what we can from past climate sequences and ensure hydrological models are improved by this learning. Such models are key tools for quantifying rainfall-runoff responses to climate model projections (e.g., Bergström et al., et al., 2014;, informing planning of possible responses and/or adaptation. It is therefore critical to ensure rainfall-runoff models can provide robust simulations under changing climatic forcing, in line with the wider current International Association of Hydrological Sciences emphasis on change in hydrology and society (Montanari et al., 2013).
In practice, rainfall-runoff models often show significant reductions in performance when applied in climatic conditions different to the calibration data. Vaze et al. (2010) tested four rainfall-runoff model structures on 61 catchments in Australia and reported that model performance tended to decline in proportion to the change in climatic variables between calibration and validation period, which has been confirmed by many other studies (e.g., Broderick et al., 2016;Coron et al., 2012Coron et al., , 2014Freer et al., 2003;Refsgaard & Knudsen, 1996;Saft et al., 2016;Seiller & Anctil, 2015;Seiller et al., 2012). A related problem is that model parameters, classically thought of as representing time-invariant properties of river catchments, usually vary depending on which time period a model is calibrated to. Merz et al. (2011) reported that these changes are strongly related to the average climatic conditions (e.g., temperature) in the calibration period (see also Brigode et al., 2013;de Vos et al., 2010;Freer et al., 2003;Wilby, 2005). The nonstationarity of calibrated parameter values may be a form of compensation for models that are missing key processes (or constraining them poorly) occurring in catchments that are subjected to long-term drying or wetting, with poor validation performance the ultimate outcome (Beck, 2002;Wagener, 2003).
Faced with rainfall-runoff models that are often deficient under changing climatic conditions, there is a clear need for new or improved models (and modeling methods) that are more robust. However, very little progress has been made toward improving models, or creating new ones, for better robustness under changing climatic conditions. To illustrate this, a literature review was conducted to identify every study with a DOI that cited both of the studies mentioned in the previous paragraph-namely, Vaze et al. (2010) and Merz et al. (2011). Of the 55 studies fulfilling these criteria, only one (Westra et al., 2014) resulted in a new model structure. The most common topics were climate change impact assessments and associated methods (10 studies), papers examining input or output uncertainty or sensitivity (7 studies), model intercomparison (6 studies), and papers proposing improvements to modeling practice (e.g., improved calibration or validation procedures; 6 studies). Each of these topics are individually important and worthy contributions to the literature. However, the collective lack of studies proposing new or improved models is both striking and concerning and is particularly surprising given the proliferation of software schemes that facilitate comparison of alternative model structures, such as FUSE , SUPERFLEX  and SUMMA (Clark et al., 2015), and others that can choose between candidate structures (e.g., Marshall et al., 2007). Note that under the phrase new or improved models we include systematic approaches to temporal changes in parameters and explicit inclusion of these in modeling frameworks (e.g., Kelleher & Shaw, 2018).
One reason for the paucity of progress in model development is that structural inadequacy is only one cause of poor simulations, with confounding factors such as deficient calibration methods and/or data errors also contributing to simulation errors Beven & Westerberg, 2011;Coron et al., 2014;Kavetski et al., 2006;McMillan et al., 2012;Seibert, 2003;Singh et al., 2011). Thus, the problem is initially one of diagnosing the salient cause of model failure, and this paper presents a framework to do this. While recognizing that any given case of poor performance may be due to a mixture of reasons, the tests described herein provide a rational basis to prioritize remedial action, which could include (i) improving the calibration method; (ii) rectifying data errors; and (iii) model structural improvements (i.e., changes to model equations). Our view is that changes to model equations are appropriate only after conducting basic tests to detect data errors, and assessing the possibility of poor parameterization, and we suggest methods herein for each purpose.
This framework is significant because it complements existing literature concerning rainfall-runoff model assessment and improvement. We argue that the following three types of framework are required to improve rainfall-runoff simulations in changing climate: (1) frameworks to test model validation performance (examples include Coron et al., 2012;Klemeš, 1986;Seibert, 2000 andThirel et al., 2015b); (2) frameworks to determine the salient cause of poor validation performance-this is a gap in the current literature, which this paper seeks to address; and (3) if the cause is confirmed to be deficient model structure, frameworks to detect and fix model structural issues (examples include Gupta et al., 2008;Westra et al., 2014). Therefore, the framework presented herein sits between existing schemes, and together with them it completes the workflow that modelers require to move from split sample testing through to model structural improvement.

Rationale of Framework
In this section the logical basis of the framework is explained. First, we define notation: Let model performance (by whatever measure) be denoted by E, and let T be some threshold of performance that defines model adequacy-that is, T is a minimum acceptable value of E. E and T are defined separately for the calibration period and for an evaluation period with contrasting climate (Klemeš, 1986, Test 2a). Note we hereafter avoid the term validation because models of open systems can never be validated (Oreskes et al., 1994); we use the term evaluation instead. We recognize that there is more to split sample testing than a simple comparison of performance metrics (see section 5.3 and Gupta et al., 1998;Yapo et al., 1996) but for the present section we seek to present the rationale for the framework in the simplest possible terms.
In the literature, the most common outcome is that models fail split sample testing during evaluation. This means that the parameter set chosen during the split sample test meets the criteria E cal > T cal but fails the other criteria (i.e., E eval < T eval ). Expressed diagrammatically, this parameter set is the triangle in Figure 1a. However, the parameter set(s) chosen during split sample testing does not necessarily represent the full capabilities of the model structure (Fowler et al., 2016). This can be revealed by considering an ensemble of randomly generated parameter sets, shown in Figure 1a as gray dots. For example, parameter sets are available with E eval > T eval (e.g., the diamond). Also, other parameter sets fulfill both E cal > T cal and E eval > T eval simultaneously (e.g., the star), which indicates that no changes to the model structure are required to meet both acceptance thresholds, but the wrong parameter set was initially chosen during the split sample test. In this case, the mathematical optimum during calibration (triangle, highest E cal value) is not a hydrological optimum (e.g., star) where more consistent performance in both time periods is found (e.g., Andréassian et al., 2012). Figure 1a is a conceptual example only-in practice, the coverage of parameter sets in the space will vary for each case, and a key task of the framework is to determine this coverage. The framework uses the geometry of the coverage to prioritize remedial action in cases of split sample failure. T cal and T eval are used as dividing lines through the 2-D space (Figure 1b) to create regions β 1 and β 2 , respectively. An additional region α is defined as β 1 ∩ β 2 . Coverage over these areas provides valuable information to guide remedial action, such as the following: • In the case of coverage of α (Figure 2a), no changes to the model structure are required to meet both acceptance thresholds. The priority should be to improve the calibration method, so that when the Differential Split Sample Test (DSST) is repeated a more suitable parameter set (or sets) is identified; • For coverage of β 1 and β 2 but not α (Figure 2b), the model structure is flexible (it can fit data it is calibrated to) but not transferable (good parameter sets in one period are poor in others), a situation best remedied by model structural changes; Figure 1. (a) Conceptual diagram of model performance E for a randomly generated ensemble of parameter sets, along with modeling performance acceptance thresholds T for calibration and evaluation periods (with contrasting climate).

Water Resources Research
• If β 1 coverage exists but β 2 does not (or vice versa; Figures 2c and 2d), something is impeding performance in one of the periods more than the other, which may indicate a temporal trend in data errors or a model structure more suited to one set of climatic conditions than another. Further tests are required to distinguish between these causes (see below); and • In the case of coverage of neither β 1 nor β 2 (Figure 2e), something is impeding performance over both climatic periods, and there is a relatively higher chance that data errors are present (which can be tested separately as described below) or the model may be inadequate in a way that varies little with climatic conditions.
The core tasks of the framework are to (a) categorize the coverage as per the above and (b) undertake further tests aimed at confirming and supplementing the diagnosis suggested by the coverage, namely, • Multicatchment cross checks that repeat the above test in similar nearby catchments. A comparable coverage strengthens the initial diagnosis, whereas dissimilar coverage may reveal one-off error sources such as observation errors that affect one catchment but not others; and • Data-focused tests to identify data errors, for example, based on temporal consistency of different data.
Together, these tests have sufficient diagnostic power to prioritize remedial action in most cases of DSST failure. For each test, further detail is provided in subsequent sections.

3.
Step-by-Step Description of Framework Figure 3 gives a flow chart showing the steps needed to apply the framework. This paper focuses on the diagnostic steps (shown in red), rather than the corrective actions (shown in green). Many corrective actions can be undertaken using existing techniques, as noted in the text where relevant. The framework is most applicable to cases of DSST failure-to make this explicit, we include the DSST as the first step.

Diagnostic Steps
1. DSST: This is Test 2a from Klemeš (1986), involving these steps: (a) divide the historic record into two or more periods with contrasting climate; (b) calibrate the rainfall-runoff model on one of the periods and evaluate on the other(s); (c) quantify model performance with some numerical measure E; and (d) evaluate performance for each period by checking whether E exceeds acceptance thresholds T, for each period. As discussed in the supporting information, Text S1: • E, the acceptance metric, should reflect the modeling purpose. Multiple acceptance metrics may be used if appropriate (e.g., Thirel et al., 2015); • T, the thresholds of acceptance, should be defined a priori and preferably reflect the modeler's understanding of the level of accuracy required for decision making; • The calibration objective function need not correspond to E. Robust simulation performance in changing climatic conditions depends on fidelity of process representation, so calibration objective function(s)  Assuming the existence of such, since nearby catchments may be hydrologically different. 2 That is, cases where an entire data collection regime is systematically flawed.
3 Generally poor in this case means poor regardless of climatic period.

Water Resources Research
should be chosen to extract information relevant to these processes from the calibration data (Fowler et al., 2018;Gupta et al., 1998); • More than one evaluation period may be used (Coron et al., 2012;Thirel et al., 2015); • The calibration algorithm could be optimization or an ensemble method (e.g., Beven & Binley, 1992;Vrugt et al., 2008). The latter may be used so long as the experimental design yields an objective DSST result.
All subsequent steps of the framework are for cases where the model fails the DSST.

2.
Determine coverage over regions α, β 1 , and β 2 : Coverage can be determined by either generating a random ensemble of parameter sets, as per Figure 1a, or using a multiobjective optimizer to generate the Pareto front (dotted line in Figure 1) between E cal and E eval . While a random ensemble is simpler, an impractically large number of parameter sets may be needed to achieve sufficient sampling density (Fowler, 2017, Figure C.4). Thus, a multiobjective optimizer is recommended (e.g., Hadka & Reed, 2013;Vrugt & Robinson, 2007). The full range of possible coverage results, along with associated diagnoses, are given in Figure 4. 3. The α acceptance test: The model structure passes this test if the model structure has coverage of the α region (i.e., if parameter sets exist there; Figure 4a) and fails otherwise. A pass means that no changes to the model structure are required to exceed all acceptance thresholds, even though the parameter set chosen at step 1 did not do so. For an α pass, the suggested remedial action is to improve the calibration method used in the DSST so that a more suitable parameter set (or sets) can be identified in a repeat DSST. Model structures failing the α test cannot meet the thresholds no matter how they are calibrated, so it is pointless to focus on DSST calibration method, and the only options are to consider possible data errors or structural deficiency.

Inappropriate PET formulation
Using a formulation that ignores wind in an area subject to trends in wind v Part 1: Is there a long-term trend in a variable that is not included in the PET formulation? If yes, revise PET formulation accordingly, and Part 2: Does trialing the revised PET formulation allow coverage of the α region? Or at least, significantly improve performance?

10.1029/2018WR023989
Water Resources Research 4. The β acceptance tests: One β test is undertaken for each acceptance metric. The result is a pass if the model structure has coverage of the corresponding β region, and a fail otherwise. β results are interpreted in step 7 (cf. Figure 4). 5. Cross checks with other catchments: In this step, the α and β acceptance tests are repeated in nearby similar catchments. Although highly recommended, this step depends on data availability and is not always possible, particularly if nearby catchments are not hydrologically similar or are ungauged. The confirmation (or not) of acceptance test results in nearby similar catchments allows considerable diagnostic insight (see Figure 4 and step 7), and analysis across larger samples may lead to improved process understanding in its own right (Gupta et al., 2014). 6. Data checks: Table 1 lists checks that can be undertaken at this step. For tests with two parts (iii and v), only the first part may be required. Although some tests involve simulation, these tests are still categorized as data checks because of their focus on data quality/suitability. 7. Diagnostic synthesis: In this step, the various test outcomes are interpreted and used jointly to deliver a diagnosis; that is, the β coverage test outcomes are used together with multicatchment cross checks and data checks to develop greater weight of evidence.

Corrective Actions
The corrective actions are listed here using the step numbers from Figure 3. Note that, unlike earlier steps, steps 8-11 are not sequential. Rather, they are undertaken based on results of steps 1-7, as per the arrows in Figure 3 and the triangular diagrams in Figure 4 and Table 1.
8. Structural improvement (triangular diagram with arrow facing down and to the right in Figure 4 and Table 1) means changing one or more of a model's equations. Methods to formulate structural improvement are not covered here because, as noted, the present focus is on diagnostic tests rather than remedial actions. However, structural improvements are discussed by, for example, Ambroise et al. (1996), Beck (2005), Gupta et al. (2008), de Vos et al. (2010), Westra et al. (2014), and Kelleher and Shaw (2018). 9. Changes to forcing or training data (triangular diagram with arrow facing down and to the left in Figure 4 and Table 1): An important clarification is that this step does not involve changing the type of forcing or training data. Rather, this step involves fixing, if possible, a problem identified with the existing data. Examples include reinterpolating precipitation inputs after removing a problematic rain gauge, excluding from calibration/evaluation those parts of the record known to contain significant errors in flow gauging, or adopting a more appropriate formulation of potential evapotranspiration (PET). 10. Revising calibration method (triangular diagram with arrow facing up in Figure 4 and Table 1): Having confirmed coverage of the α region, the focus should be on finding a parameter set in this region using data from the calibration period only, via an improved DSST calibration method. Potential options here include changes in objective function (Fowler et al., 2018), a different calibration algorithm, multiobjective methods including subperiod analysis (e.g., Choi & Beven, 2007;Freer et al., 2003;Gharari et al., 2013), and limits of acceptability methods (Beven, 2006). Trial and error should be avoided by choosing methods rationally with recourse to hydrological theory (i.e., why would we expect an improved outcome?).
When seeking to improve calibration methods, modelers should be mindful that there may not be sufficient information in the calibration data to parameterize the model. If processes that are inactive or unimportant during the calibration period become important when climatic conditions change, it may be unreasonable to assume that the model parameters that govern such processes are identifiable based on prechange data alone Ljung, 1998;Reichert & Omlin, 1997; see van Werkhoven et al., 2008 for a demonstration of this issue across space). Alternatively, information regarding these processes may be present in the calibration period in nondischarge data types (e.g., groundwater, soil moisture, and remotely sensed data), so calibration strategies that use these data may aid identifiability of key parameters. This is difficult to know a priori, so it is suggested to initially seek an improved calibration method with the data at hand (step 10), introducing additional data types later as required and if available (step 11).
11. Additional data types in calibration: Additional data types could include observations of groundwater, water quality, soil moisture, vegetation data, actual evapotranspiration (AET), snow data, qualitative or fuzzy measures that reflect subjective understanding of dominant processes, or any other data type that may aid model identification (see supporting information Text S2). This step may be also undertaken to increase the realism of the model (Clark et al., 2011;Freer et al., 2004;Seibert & McDonnell, 2002).

Example Catchment and Data
This section describes a single-catchment case study, taking the perspective of a climate change impact assessment. Although the framework is also applicable to large catchment samples (supporting information Text S10), for clarity of presentation we focus on one catchment only. The aim is a calibrated rainfall-runoff model suitable for water resources assessment under projected future climatic conditions. The selected catchment is Harvey River in the south west of Australia (Figure 5a), upstream of the gauge at Dingo Road (station 613002, area 148 km 2 , mean annual rainfall 1,000 mm/year, runoff ratio 0.2) which is one of Australia's Hydrologic Reference Stations (Turner, 2012). Downstream of the study area, the river flows into the 57 GL Stirling Reservoir, used for metropolitan supply. The catchment has a strongly seasonal climate with hot dry summers and cool, wet (but snow-free) winters, with >80% of flow occurring in the months of July to November. High sensitivity of runoff to slight differences in long-term rainfall is typical of this area due to nonlinear groundwater-surface water interactions (Hughes et al., 2012;Kinal & Stoneman, 2012). Interannual variability in rainfall is high, and a run of low-rainfall years in the 2000s led to a cumulative rainfall deficit (Figure 5b), relative to earlier decades. The associated region-wide decline in streamflow led to construction of a desalination plant for Perth and a greater dependence on groundwater (Petrone et al., 2010). Human impacts are minimal since the study catchment is entirely forested (mostly Jarrah hardwoods, Eucalyptus marginata), with no temporal changes to land use. No major bushfires have affected the catchment over the hydrometeorological record.
A lumped modeling approach with a daily time step is adopted, with daily catchment average rainfall derived from the interpolated (~5 km) gridded rainfall product of Jones et al. (2009), and PET estimates derived according to the Wet Environment method from Morton (1983), using the gridded data set produced by Jeffrey et al. (2001) extracted at the catchment centroid. The extracted PET has an average annual value of 1,340 mm/year, approximately 35% greater than rainfall. Catchment boundaries are derived using D8 flow analysis on postprocessed Shuttle Radar Topography Mission data published by Gallant et al. (2011), on a grid size of 1 s (approximately 30 m).

Example Model Structures
Since the focus is on the framework itself, this paper presents no new model structures. Instead, we choose a model structure which has two preexisting variants that provide a case study in model improvement, namely, IHACRES. Here both versions have a daily time step and are spatially lumped. The first version, termed IHACRES-A in this paper, follows the descriptions of Jakeman et al. (1990) and Jakeman and Hornberger (1993) with six free parameters. A schematic is provided in the supporting information Figure S1. IHACRES-A implements an index of catchment wetness that increases linearly in response to rainfall (as a function of parameter c) and decreases nonlinearly in response to PET (as a function of parameters c, T w , and f). The wetness value determines the proportion of rainfall converted to runoff. Runoff is routed through two parallel linear stores with different time constants (parameters T q and T s ). The split between the routing stores is determined by parameter V s . Model parameters and thresholds are provided in supporting information Table S1.
The second version, IHACRES-B, includes changes by Ye et al. (1997), who retained the index of catchment wetness but allowed for a nonlinear relationship (described by new parameter p, p ≥ 1) between the index and runoff proportion. They also enforced a threshold on index values (set by new parameter l, l ≥ 0) which must be exceeded before any rainfall can be converted to runoff. Since they worked in ephemeral catchments, Ye et al. (1997) removed the slower of the two routing storages. However, given Harvey River flow is relatively sustained and historically perennial, it is appropriate to retain both routing storages in IHACRES-B, giving eight free parameters. Thus, IHACRES-B is an extension of IHACRES-A and IHACRES-A is a special case of IHACRES-B with p = 1 and l = 0.
In the following demonstration of the framework, we begin with IHACRES-A, switching to IHACRES-B in later steps.

Application of Framework 4.3.1. DSST (Step 1)
First, we select a reference metric E and thresholds T. Given the water resources context of the case study, E should quantify the ability of the model to replicate runoff volumes, at time scales relevant to the water supply system response (days-weeks). We adopt the daily Kling-Gupta efficiency ( Evaluate performance for each period by checking whether E exceeds acceptance thresholds T. In this case, E cal > T cal , but E eval < T eval . Thus, IHACRES-A fails the DSST.

Determine Coverage (Step 2)
The evolutionary multiobjective optimizer AMALGAM (Vrugt & Robinson, 2007) is used to test the coverage of IHACRES-A in the E cal -E eval space. Algorithm settings are listed in supporting information Text S4. Results are shown in Figure 6.

The α Acceptance Test (Step 3)
The AMALGAM output shows that IHACRES-A has no coverage of the α region. IHACRES-A thus lacks robust parameter sets-it is incapable of meeting both acceptance thresholds with the same parameter set, no matter how it is calibrated, so there is little use attempting to improve the DSST calibration method. Thus, as per Figure 3, we follow steps 4-7.

The β Acceptance Tests (Step 4)
The AMALGAM output shows IHACRES-A has coverage of both the β 1 region (as also confirmed in step 1) and the β 2 region. IHACRES-A is thus flexible (it can fit data it is calibrated to) but not transferable (good parameter sets in one period are poor in the other) and falls into category (b) from

Test result: No
In the double mass curves in Figure S3, rainfall time series derived for this catchment is compared with rainfall derived for the four closest Hydrologic Reference Station catchments. Each curve maintains an approximately constant gradient, despite known changes in density of gauge network around 2000. ii Can missed events (or added events) be detected via time series/single mass curve comparisons between precipitation and runoff?

Test result: No
Single mass curves for each year, along with time series comparisons, are shown in Figure S4. With one minor exception, all major rain events are accompanied by commensurate runoff, accounting for seasonality. No major runoff events occur without associated rainfall. iii Part 1: Review available information on the precipitation or runoff data. If errors seem likely conduct Part 2. Part 2: Does adding a calibratable, temporally constant scaling factor to precipitation allow coverage of the α region? Or at least, significantly improve performance?
Test result: Significant errors considered to be relatively unlikely (Part 2 not required) Part 1: Analysis of flow ratings and uncertainty (Text S6) reveals that 613002 has relatively high-quality flow data relative to other Hydrologic Reference Stations. Analysis of rainfall data (Text S6) indicates a relatively dense rain gauge network and relatively low spatial rainfall gradients. iv Does a double mass curve (with nearby independent runoff data) have a change in gradient?

Test result: No
The runoff time series is compared among the four closest Hydrologic Reference Stations ( Figure S5). Three of the four curves maintain an approximately constant gradient. The curve for station 610008 has an inflection, but further investigation reveals this is likely due to errors for 610008, not the study catchment. v Part 1: is there a long-term trend in a variable that is not included in the PET formulation (e.g., wind)? If yes, revise formulation and conduct part 2. Part 2: Does trialing the revised PET formulation allow coverage of the α region? Or at least, significantly improve performance?
Test result: PET deemed unlikely to be salient cause of error Part 1: Yes, there is a local long-term increasing trend in wind (cf. Donohue et al., 2009, p27; note that this contrasts with the general results reported by McVicar et al., 2012) and thus possible trends in evaporative demand not characterized in the adopted Wet Environment method (Morton, 1983). However, IHACRES-A was also deficient in other catchments with the opposite wind trend (text S11), so we did not proceed to Part 2.

10.1029/2018WR023989
Water Resources Research rainfall 1030 mm/year, runoff ratio 0.25) and 614044 (~50 km distant; area 73 km 2 , mean annual rainfall 930 mm/year, runoff ratio 0.03). Both have broadly similar geology and land use to the study catchment, but 614044 has 8% less rainfall and 90% less runoff so is less similar hydrologically. AMALGAM results (supporting information Text S5) for 613146 match those from the study catchment relatively closely, with the same category (b) in Figure 4. Results for 614044 indicate considerable differences, with IHACRES-A having coverage of neither the β 1 region nor β 2 region, but a broadly similar shaped curve. These results are discussed further in step 7 (section 4.3.7). 4.3.6. Data Checks (Step 6) Each data check from Table 1 is applied, with results given in Table 2 (with plots provided in the supporting information). All test results are negative, with one exception regarding the PET formulation (Morton's Wet Environment). This PET formulation does not consider wind, and a positive temporal trend in wind was reported over 1981-2006 by Donohue et al. (2009;see also McVicar et al., 2008). This is discussed further in the following step, and in supporting information Text S6.

Diagnostic Synthesis (Step 7)
This step considers the results of all tests in steps 4-6 and prioritizes which cause of poor performance should be remediated, be it data errors or model structural inadequacy. In this case, model structural inadequacy seems the likely salient cause. This conclusion is arrived at via multiple lines of evidence, starting with the coverage test results (Figure 6; cf. Figure 4). The replication of similar coverage results in a nearby catchment (step 5) means that one-off data errors (e.g., a deficient gauge) are not a probable cause for the poor simulations, and this is further confirmed by the uniformly negative results in tests i-iv in step 6 ( Table 2). It is possible that the poor performance could be partly due to the PET formulation-note that IHACRES-A underestimates flow in the evaluation period, which is consistent with this hypothesis, as discussed in subsequent steps and supporting information Text S6. However, in this case we have access to IHACRES-A results from a wide sample of catchments (supporting information Text S11), some of which have neutral or deceasing wind trend, yet have similar coverage results. Thus, it is reasonable to prioritize model structural improvement, undertaking step 8, not step 9.

Structural Improvement (Step 8)
Having diagnosed model structural inadequacy, a method is now required to improve the structure. Multiple suitable methods are available in the literature (e.g., Gupta et al., 2008;Wagener et al., 2003;Westra et al., 2014). However, the focus here is on diagnosis (i.e., red steps, not green, from Figure 3), so we simply note that a multifaceted analysis of the model residuals (presented in supporting information Text S7) reveals that the two extra parameters proposed by Ye et al. (1997)-the first related to rainfall-runoff nonlinearity, and the second a threshold of runoff production-are supported by the simulation error characteristics for this case study. Thus, we adopt IHACRES-B for subsequent steps.   Figure 7, along with DSST results relevant to subsequent steps. 4.3.11. The α Acceptance Test (Step 3, Repeat) The AMALGAM output shows IHACRES-B has coverage of the α region. Thus, despite the poor DSST result, no changes to the IHACRES-B model structure are needed to meet both acceptance thresholds.

Revise Calibration Method (Step 10)
Having confirmed coverage of the α region, the aim is now to improve the DSST calibration method so that a parameter set in this region can be identified by a repeat DSST. The new calibration method must be able to identify an α-region parameter set using only data from the calibration period, ie. without reference to . Fowler et al. (2018 recommended two objective functions for use in drying climates; having already adopted the first one (Index of Agreement), we now test the second one (Split KGE). In addition, the subperiod (SuPer) calibration method of Gharari et al. (2013) is tested (details in supporting information Text S8). Both methods work by examining multiple subperiods within the calibration period and seeking a balance between model performance across the subperiods. Calibration results are shown in Figure 7. It is seen that still E val < T val for both methods. Results using the KGE as the objective function-a common calibration method-are also shown for reference. With so many calibration methods unable to identify a parameter set in the alpha region, it seems increasingly likely that there is insufficient information in the 1970-2004 streamflow data to correctly parameterize IHACRES-B. Thus, it is reasonable to trial an additional data type in calibration.

Additional Data Types (Step 11)
Finally, an extra data type is added to the DSST calibration method-namely, remotely sensed AET. Of the different data types that could be chosen here, AET is selected because a review of model fluxes indicates significant differences in the seasonal pattern of AET depending on which objective function is used, as shown in Figure 8. Calibration to KGE (parameter set 4) results in AET being highest in September, versus November for calibration to split KGE (parameter set 3). In terms of physical processes, Hughes et al.'s (2012) findings of local groundwater decline suggest sustained summer AET is possible because of plant access to groundwater, a situation more consistent with parameter set 3. Aiming for a model of maximum plausibility and hopefully increased robustness, we recalibrate IHACRES-B using a metaobjective function composed of two equally weighted components, one of which relates to AET. Specifically, the first component is the match in streamflow quantified by the Split KGE (the best performing objective function from the previous step), and the second component is the match with Moderate Resolution Imaging Spectroradiometer (MODIS)-based AET estimates of Guerschman et al. (2009), where the closeness is quantified in terms of the relative seasonal AET pattern (i.e., ignoring overall magnitude; see supporting information Text S9). The new DSST results are favorable, with E eval increasing from 0.70 to 0.76 (Figure 7). Model realism increases in two ways ( Figure 8): (1) a closer match with AET seasonal patterns; and (2) more plausible time series of catchment wetness, exhibiting a downward trend during the historic record that is qualitatively consistent with reported decline in groundwater (Hughes et al., 2012;Kinal & Stoneman, 2012). Trends are less realistic for the flow-only calibrations (Figure 8, parameter sets 3 and 5) because the time series oscillate close to 0 during the calibration period, giving little space for decline during the subsequent drier (evaluation) period (or a future, drier GCM scenario).
This final calibration method almost, but not quite, meets the acceptance thresholds. Thus, in this case study, we are left with the knowledge that the model structure can meet the acceptance thresholds, but we do not yet have a calibration method that can identify the acceptable parameter sets using 1970-2004 data alone.

Interpretation of DSST and Coverage Tests
As already emphasized, this study makes a clear distinction between different types of model failure, and we recommend that this distinction be carried on in future practice. DSST failure (the failure of a parameterized

10.1029/2018WR023989
Water Resources Research model to fulfill acceptance thresholds) was contrasted with coverage failure (the failure of every parameter set in a model structure to fulfill acceptance threshold(s)). DSST failure is contingent on the chosen calibration method, whereas coverage failure is independent of calibration method, so that failure in the latter must be due to model structural failure or data errors. In the literature, separate tests to distinguish these failure types (such as the coverage tests) are uncommon, and DSST failures are often interpreted as implying model structural failure. This can be profoundly misleading. For example, consider possible interpretations from the following DSST results from sections 4.3.1 and 4.3.9:  Interpreted under the false notion that DSST failure implies model structural failure, two conclusions would be as follows: (i) Both structures are unable to meet acceptance thresholds and (ii) the structural changes from Ye et al. (1997) made minimal difference to performance. As demonstrated, both conclusions are incorrect. For this reason, we strongly recommend that future studies conduct targeted tests to distinguish between failure types, such as the coverage tests suggested in this paper.

On Independence in Split Sample Testing
This framework touches on questions regarding split sample testing and what constitutes acceptable use of evaluation (often called validation) data. A dilemma arises because steps 10-11 (revising calibration method) involve entering a feedback loop that chooses the calibration method based (at least partially) on model performance in the evaluation period, thus compromising strict adherence to the idea of independence in split sample testing. On the other hand, not undertaking these steps when it is known that the model structure has α coverage means knowing that the model structure is capable but not allowing oneself to use this capability. As hydrologists continue in the quest for more robust simulations, this dilemma may arise more frequently and should be further debated. Furthermore, this issue should be a strong motivating factor for a well-informed initial choice of calibration method (Fowler et al., 2018;Krause & Boyle, 2005;Seibert, 2000;Yapo et al., 1996).

Are Models That Pass the DSST Adequate?
A common goal of DSSTs is to assess the adequacy of a model for simulation in the future, possibly under change. Is a model that passes the DSST adequate for simulation under projected future climatic conditions? Below we present multiple levels where success of a model at one level is necessary but not sufficient for success at the next level ( Figure 9). Each level has distinct causes of inadequacy, as discussed below. Note that issues of uncertainty and quality in forcing data (e.g., precipitation) for future scenarios are not discussed here but are significant (e.g., Cloke et al., 2013).
Consider the set of all models (orange in Figure 9) that pass the DSST by fulfilling acceptance thresholds. No acceptance metrics can perfectly reflect the model purpose, so simulations from models in this set may be deficient over the calibration period in ways not captured by the criteria (Krause & Boyle, 2005). Various strategies may help to detect such deficiencies, including referring to a wider set of criteria (Gupta et al., 1998(Gupta et al., , 2008Thirel et al., 2015) and using a variety of checks and visualizations (Bennett et al., 2013;Thirel et al., 2015). It is important to select criteria that are closely matched to the model purpose, as models that are adequate for one type of application may not be adequate for other types, since they are often unable to match different aspects of the flow regime simultaneously (e.g., high flows and low flows with the same parameter set).

Water Resources Research
Moving to the next level in Figure 9, models that display an adequate match (however this is defined) in numerical outputs may not do so for the right reasons (Beven, 2006;Kirchner, 2006;Klemeš, 1986). For example, a model may activate a process that experimental data or site experience reveals is the wrong mechanism for the catchment (e.g., Hortonian flow in a catchment with high infiltration). Strategies for revealing this kind of inadequacy include site familiarization/seeking local knowledge (Holländer et al., 2014), more difficult tests through extra data (e.g., Mroczkowski et al., 1997), and dialogue between modelers and experimental hydrologists (Freer et al., 2004;Seibert & McDonnell, 2002). All of these may help to discriminate between parameter sets that are equifinal with respect to numerical output. We note that the meaning of right reasons may vary depending on the underlying perceptual model and philosophical viewpoint of the modeler (e.g., Gupta et al., 2012).
Finally, even models that match historic data for the right reasons (green) may not be adequate to simulate under projected change (blue) because the future may be so different from the past as to change the mechanisms that govern the rainfall-runoff relationship (cf. Saft et al., 2015). New processes may become dominant, and living components of the system (e.g., vegetation) may respond unexpectedly due to complex feedbacks (Curtis & Wang, 1998;D'Odorico & Porporato, 2004;Rodriguez-Iturbe et al., 1999) possibly leading to less resilience following disturbances (Peterson et al., 2009). The concept known as trading space for time (Peel & Blöschl, 2011) may be useful, whether for model parameterization (Singh et al., 2011) or to test parameterized models in climatic conditions beyond the study area's historic record by using another catchment entirely (e.g., proxy basin split sample tests; Klemeš, 1986;Refsgaard et al., 2014). One risk with such methods is that, assuming catchments coevolve, they reflect the actions of processes over centuries or millennia, in contrast to climate change which is likely to occur relatively quickly. An additional confounding factor is that records of catchment hydroclimatology are themselves subject to increasing CO 2 concentrations (Roderick et al., 2015). In general, our ability to successfully transition from the green category to the blue is limited because our knowledge of the future is fundamentally incomplete. Thus, although the DSST may be the best possible evaluation method (Refsgaard et al., 2014), the adequacy of models that pass the DSST is far from guaranteed.

Further Research Regarding Calibration Methods
The case study demonstrates the importance of calibration methods in modeling outcomes, and further research is recommended in this area. As stated, choice of calibration method (including selection of objective function) should be grounded in hydrological theory and aim to select parameters with high fidelity to dominant processes. Achieving this will likely require consideration of typical errors present in training and forcing data (e.g., Coxon et al., 2014;McMillan et al., 2012;Schoups & Vrugt, 2010;Sorooshian & Dracup, 1980), and such considerations were shown by Fowler et al. (2018) to yield significantly improved DSST results. Practically, one potential problem for the modeler is that there are many studies suggesting improved calibration and sampling methods but little guidance to choose between them. For example, a water resource-focused climate change impact study could potentially use many methods, each with strong supporting theory which have been developed or demonstrated for changing climates. Methods include (i) those that use calibration data differently by defining wetter and drier subperiods of the calibration period (Choi & Beven, 2007;Freer et al., 2003;Gharari et al., 2013); (ii) trading space for time approaches which incorporate information from other catchments in the same region to predict hydrologic response (e.g., Singh et al., 2011); (iii) alternative objective functions that place different focus on hydrological behaviors, for example, applying transforms on flow values before sum of squares calculations or using sum of absolute errors rather than sum of squares (Fowler et al., 2018, see also Krause & Boyle, 2005;Willmott et al., 2012). Furthermore, methods that identify ensembles that are likely to be more robust to change such as (iv) data depth approaches (Bárdossy & Singh, 2008) and (v) limits of acceptability approaches (Liu et al., 2009) may also be relevant to the context of changes in climatic conditions. In response to this prolific variety of methods, we recommend three classes of study to increase the value of existing work: (a) intermethod comparison, to guide method choice; (b) studies combining ideas from different methods-for example, methods (iii) and (iv) could be gainfully combined; and (c) studies testing calibration methods in regions with relatively high hydroclimatic variability (e.g., the crash tests of Coron et al., 2012), to provide assurance that the methods work successfully on the most challenging available data, while acknowledging that this does not guarantee adequacy for future conditions, as discussed above.

Generality of the Framework
It is noted that the framework can be easily adapted for multiple catchments, to serve the needs of largesample hydrology studies (Gupta et al., 2014). For example, supporting information Text S10/ Figure S9 give an example across multiple catchments from Australia, showing how analysis can be meaningfully summarized across large samples of catchments. Further, we argue that most elements of the framework are applicable beyond simple conceptual models and climate change studies. For example, the principle of the coverage check as a first step following split sample failure (followed by updating the calibration method in the case of alpha coverage) is generally relevant to all applications of split sample testing, including changes across space rather than time, for changing land use rather than changing climate, and/or to process-based as well as conceptual models. We are currently applying the framework to stress test a gridded continentalscale hydrological model, using consecutive coverage tests while sequentially adding catchments to discover at which point the model equations can no longer perform across the required spatial domain, and we look forward to reporting these results in a future article.

Conclusions and Recommendations
This paper presents a framework to improve rainfall-runoff simulations under a changing climate. Model evaluation based solely on the DSST is hampered due to contingency on the chosen calibration method, and it is difficult to distinguish which cases of DSST failure are truly caused by model structural inadequacy. The proposed framework addresses this problem by diagnosing the salient cause of poor model performance, be it structural inadequacy, poor calibration method, or data errors. We demonstrated methods to explore the limits of model robustness and flexibility over multiple climatic periods, which can be used to discriminate between these causes and prioritize remedial action.
Using a case study catchment from Australia, improved inference of structural failure and clearer evaluation of model structural improvements were demonstrated. Interpreted under the false notion that DSST failure implies model structural failure, the modeler would have wrongly concluded from our results that the model structural improvements made minimal difference to model performance in the case study. In contrast, the coverage tests in the framework demonstrated an enhanced robustness previously hidden because the wrong parameter set was initially chosen in the DSST.
In the literature, DSST failure is often assumed to imply model structural failure, and the above result shows that this assumption can be profoundly misleading, possibly leading to erroneous comparisons between candidate model structures and inability to properly assess the benefit of structural improvements. It is recommended that future studies use this framework to (i) identify cases where poor simulations are due to poor parameterization or data errors, remediating these cases without recourse to structural changes and (ii) use the remaining cases to gain greater clarity into what structural changes are needed to improve model performance in changing climate.