The Added Value of Different Data Types for Calibrating and Testing a Hydrologic Model in a Small Catchment

Abstract This study investigated the added value of different data for calibrating a runoff model for small basins. The analysis was performed in the 66 ha Hydrological Open Air Laboratory, in Austria. An Hydrologiska Byråns Vattenbalansavdelning (HBV) type, spatially lumped hydrologic model was parameterized following two approaches. First, the model was calibrated using only runoff data. Second, a step‐by‐step approach was followed, where the modules of the model (snow, soil moisture, and runoff generation) were calibrated using measurements of runoff and model state variables and output fluxes. These measurements comprised laser‐based measurements of precipitation, satellite and camera observations of snow, ultrasonic measurements of snow depth, eddy covariance measurements of evapotranspiration, time domain transmissometry‐based soil moisture measurements, time‐lapse photography of overland flow, and groundwater level measurements by piezometers. The two model parameterizations were evaluated on annual, seasonal, and daily time scales, in terms of how well they simulated snow, soil moisture, evapotranspiration, overland flow, storage change in the saturated zone, and runoff. Using the proposed step‐by‐step approach, the relative runoff volume errors in the calibration and validation periods were 0.00 and −0.01, the monthly Pearson correlation coefficients were 0.92 and 0.82, and the daily logarithmic Nash Sutcliffe efficiencies were 0.59 and 0.18, respectively. By using different sources of data besides runoff, the overall process consistency improved, compared to the case when only runoff was used for calibration. Soil moisture and evapotranspiration observations had the largest influence on simulated runoff, while the parameterization of the snow and runoff generation modules had a smaller influence.


Introduction
Observed runoff represents the overall, aggregated catchment behavior. Therefore, runoff observations are the most common information used for identifying the parameters of hydrologic models. However, parameters of conceptual and physically based hydrologic models usually cannot be identified accurately using observed runoff alone as it is difficult to decide, whether the model performs well for the right reasons (Beven & Freer, 2001;Savenije, 2001;Viglione et al., 2018). One way of dealing with this issue is to use additional measurements of input and output fluxes and model states in hydrologic modeling. Additional information on hydrologic processes helps to constrain and validate hydrologic models and testing whether they get the right answers for the right reasons (Grayson et al., 1992).
Most of the studies use other measurements of fluxes and states in multiple objective calibration as a part of the objective function. Previous studies used ground-based or alternatively remote sensing products or their combination as such additional information on hydrologic processes. Soil moisture (Kundu et al., 2017;Kunnath-Poovakka et al., 2016;Parajka et al., 2006;Rajib et al., 2016;Shahrban et al., 2018), evapotranspiration (Gui et al., 2019;Herman et al., 2018;Immerzeel & Droogers, 2008;Kunnath-Poovakka et al., 2016), and groundwater level data (Demirel et al., 2019;Seibert, 2000) were often used for model calibration to improve the model's internal consistency. These studies showed the added value of different observations besides runoff, for example, for soil moisture, evapotranspiration, and groundwater levels. But only a few studies combined the different type of observations (e.g., Avanzi et al., 2020;Kuras et al., 2011). In the past, these data were used mainly in connection with calibration of physically based hydrologic models, where measurements could be more explicitly linked to the simulations than for conceptual hydrologic models.
The performance of the distributed hydrology soil vegetation model was evaluated by Thyer et al. (2004) and Kuras et al. (2011) using field data. Their study area was located in British Columbia, Canada. While Thyer et al. (2004) focused mainly only on the micrometeorological part of the process-based model (such as snowmelt and energy balance) and also used observed hydrograph from another, nearby catchment, Kuras et al. (2011) also evaluated the subsurface and surface runoff dynamics with a spatially extensive database. Thyer et al. (2004) and Kuras et al. (2011) achieved a daily Nash Sutcliffe efficiency for runoff of approximately 0.90, and Thyer et al. (2004) stated that runoff simulations were most sensitive to snowmelt characteristics as runoff was driven by spring snowmelt in their high elevated, forested study region. In another study, Wei et al. (2016) used measurements of snow water equivalent, snow depth, transpiration, stomatal feedback to vapor pressure, soil and forest properties, and soil moisture to parameterize a physically based model without a flow routing module to simulate the water balance in a 4 km 2 catchment in the United States. Without using runoff data for model calibration, they could reproduce the annual and monthly variabilities of potential runoff (combined outputs of surface runoff and deep drainage) with a Nash Sutcliffe efficiency of 0.62 and 0.56, respectively. A similar study was performed by Kuppel et al. (2018) in the Scottish Highlands with a fully distributed ecohydrological model. They could simulate daily runoff reasonably well without using runoff observations, but the model performance was substantially better, when runoff was also included in the calibration. Kuppel et al. (2018) argued that the spatiotemporal footprint of the observations involved in model calibration had to be carefully considered. They found that certain variables could only be well simulated, when the model was calibrated to measurements of that variable, for instance, soil moisture in gley soils and transpiration in Scots pine stands.
When ground-based monitoring data are not available, remote sensing products may be a useful alternative for parameter estimation (López et al., 2017;Nijzink et al., 2018;Silvestro et al., 2015). López et al. (2017) found on a Moroccan catchment with Mediterranean and semiarid climate that runoff could be better estimated when both remotely sensed evapotranspiration and remotely sensed soil moisture products were involved in calibrating a large-scale hydrologic model compared to a scenario, when these products were used independently. Silvestro et al. (2015) also found that using data from both ground stations and remotely sensed products, that is, land surface temperature and surface soil moisture estimates, improved the models internal consistency in two Italian catchments with temperate climates. Nijzink et al. (2018) comprehensively tested nine remotely sensed products on 27 European catchments with diverse landscapes and climates. The products included remotely sensed soil moisture, evaporation, total water storage, and snow accumulation. They found that two soil moisture products, the Gravity Recovery and Climate Experiment (GRACE) total water storage anomalies, and in snow-dominated catchments the Moderate Resolution Imaging Spectroradiometer (MODIS) snow cover products helped the most in constraining model parameters, when runoff data were not available. Remotely sensed surface water extent and water levels have also been found to be useful proxies on large river basins (Liu et al., 2015;Revilla-Romero et al., 2015;Sun et al., 2012Sun et al., , 2015, while Corbari et al. (2015) used satellite land surface temperature data for distributed hydrologic model calibration. Ruiz-Perez et al. (2017) used only MODIS Normalized Difference Vegetation Index (NDVI) data to calibrate an ecohydrological model and obtained good runoff estimates at the catchment outlet. The resolution of these remote sensing products, both in time and space, is usually too coarse for small catchments. Therefore, small catchment scale processes require field observations (e.g., Avanzi et al., 2020). However, there are only a few extensively monitored research catchments in the world, where such long-term field observations are available.
An alternative is to use stepwise calibration. Most of the studies used runoff signatures, for example, low flows, high flows, annual runoff, and so forth to calibrate their models step-by-step (e.g., Fenicia et al., 2007;Gelleszun et al., 2017;Hogue et al., 2000;Lu & Li, 2015). For instance, Hogue et al. (2000) separated the parameterization of the low and high flow simulations. Gelleszun et al. (2017) separately calibrated the parameters influencing runoff volume and peaks, seasonality and low flows, and the shape of the hydrograph. Fenicia et al. (2007) compared two multiobjective model calibration approaches. One of these approaches was a stepped calibration approach, where they separately calibrated certain parameter sets associated with different processes. These processes influenced distinct aspects of the system response, low flows, high flows, and lag time. Lu and Li (2015) proposed a different calibration strategy, grouping the model parameters according to time scales (annual, seasonal, and daily) where they are the most sensitive. Only a few studies used measurements on different fluxes and states in a stepwise fashion. These studies performed stepwise model calibration by looking at the internal state variables and processes of the model (e.g., Arheimer et al., 2020;Avanzi et al., 2020;Hay et al., 2006;Kuras et al., 2011;Ning et al., 2015). Hay et al. (2006) calibrated solar radiation, potential evapotranspiration, water balance, and daily runoff using measurements of fluxes and states and runoff, while Ning et al. (2015) calibrated the water storage and runoff generation. Inspired by these studies, which used stepwise model calibration approaches, in this study we aimed to link model simulations with field observations focusing on the different hydrologic processes. We used a large set of field observations of input and output fluxes and states besides runoff to calibrate our model in a step-by-step way.
The objective of this study was to investigate the added value of different data types of hydrologic processes for calibrating and testing a hydrologic model in a small catchment. Our hypothesis was that, by using additional data apart from runoff for calibrating a lumped conceptual hydrologic model, process consistency would improve. Our goal was to propose a stepwise approach for constraining hydrologic model by using runoff data and observations of snow, soil moisture, evapotranspiration, overland flow, and groundwater levels. We aimed at linking field observations with lumped, conceptual hydrologic model simulations by using all the available data in a stepwise mode. The model performance was evaluated at the annual, seasonal, and daily time scales. The analysis was performed in the Hydrological Open Air Laboratory (HOAL) in Austria, a 66 ha experimental catchment, where a large variety of long-term field observations are available .

Study Area
The study was conducted in a small experimental catchment, the Hydrological Open Air Laboratory (HOAL) in Petzenkirchen, located in the western part of Lower Austria (Figure 1), approximately 100 km west of Vienna (48°9′ N, 15°9′ E) . The drainage area of the catchment is 66 ha at the catchment outlet. The natural surface water outlet of the catchment is the Seitengraben stream. The elevation of the catchment ranges from 257 to 323 m above sea level, with a mean slope of 8%. The stream is approximately 620 m long and has a medium slope of 2.4% (Eder et al., 2010(Eder et al., , 2014Exner-Kittridge et al., 2016;Széles et al., 2018).
The climate is humid. Mean annual  air temperature, precipitation, runoff and evapotranspiration, and storage change estimated from the water balance are 9.6°C, 782 mm/year, 184 mm/year, and about 598 mm/year, respectively. Seasonal maxima of air temperature, rainfall amount, and intensity occur in the summer. Mean monthly runoff tends to peak in winter or early spring. The amount of snow falling in winter is small and quickly melts in the catchment. On average (2013)(2014)(2015)(2016)(2017), snow is observed on less than 10% of the days in a year. A significant amount of snow was only observed in 2016 and 2017 in the catchment, while the winters of 2014 and 2015 were almost snow-free.
The catchment is dominated by agricultural land use. Eighty-seven percent of the catchment area is arable land; the rest is forested, paved, or used as pasture. Main crops are winter wheat, maize, winter barley, and winter oilseed rape.

Data
Rainfall has been recorded by a tipping bucket at 7, 14, and 19 hr between 1986 and 1991 situated approximately 700 m away from the catchment outlet, then by an automated weighing rain gauge (Kroneis Pesa) since 1991. Rainfall has been measured with high temporal resolution (1 min) by a weighing rain gauge (OTT Pluvio) situated 200 m from the catchment outlet since 2002. In 2012, four additional weighing rain gauges (OTT Pluvio) were installed in the HOAL. Since 2013, a laser-based present weather sensor located at the weather station (Campbell PWS100) has measured the size and velocity of water droplets in the air with 1 min temporal resolution, and time-lapse photographs have been taken by the weather station camera (Sanyo VCC-MCH5600P) every minute during daylight. Since 1986, air temperature (Pt 1000) has been recorded at 7, 14, and 19 hr by a thermometer. Since October 2012, air temperature and relative humidity at 2 m height (HMP 155), air pressure (Campbell EC100), and snow depth (SR50AT) have been measured at the HOAL weather station with half hourly temporal resolution ( Figure 1). Twenty-nine time domain transmission soil moisture stations, 19 permanent and 10 temporary stations, with sensors at 0.05, 0.10, 0.20, and 0.50 m depths have measured the water content in the unsaturated zone since 2013 with hourly temporal resolution. Since August 2012, grass evapotranspiration has been measured by a closed-path eddy covariance station (Campbell EC155) at the weather station, and crop evapotranspiration has been measured by two open-path eddy covariance stations (Campbell IRGASON) at various locations according to the agricultural crop rotation.
Nineteen piezometers (SWS Mini Driver) located in the riparian forest close to the stream have monitored the groundwater level at a resolution of 5 min since 2013.
Runoff has been monitored at the outlet of the catchment by a calibrated H-flume with a pressure transducer since 2001 and additionally with an ultrasonic probe since mid-2010 with 1 min temporal resolution ( Figure 1). Details on the sensors are given in Blöschl et al. (2016).
While we could have modified the model structure to tailor it to the runoff processes in the HOAL, we chose not to do this. Instead, we used a more general model structure that could be used in a wider range of catchments.
Within the snow module according to (1)-(5) in Table 1, precipitation P (mm/day) is separated into P S solid and P R liquid precipitation depending on the wet bulb temperature parameter T wb (°C) (Bergström, 1976;Blöschl et al., 1991;Jennings et al., 2018;Steinacker, 1983). The catch deficit of precipitation gauges during snowfall is corrected by a snow correction factor SCF (−). Snowmelt M (mm/day) is simulated based on the degree-day concept, using a degree day factor DDF (mm/°C/day) and a melt temperature parameter T m (°C).
Within the soil moisture module according to (6)-(8) in Table 1, the fraction of precipitation producing runoff and evapotranspiration is simulated as a function of the soil moisture state SM (mm) of the catchment. If the soil moisture storage exceeds a threshold parameter FC (mm), all rainfall and melt contribute to runoff. The characteristics of runoff production are controlled by the nonlinearity parameter β (−). If the soil

Water Resources Research
SZÉLES ET AL. moisture state exceeds the limit for potential evapotranspiration LP (mm), which is the product of FC and parameter LPrat (−), actual evapotranspiration ET (mm/day) reaches its potential rate ET P (mm/day). The potential evapotranspiration ET P was calculated with the modified Blaney-Criddle method (Parajka et al., 2003;Schrödter, 1985).
Within the runoff generation module according to (9)-(15) in Table 1, an upper and a lower reservoirs represent hillslope routing. The proportion of rainfall and melt contributing to runoff enters the upper reservoir and leaves it through three paths. The first path is very fast runoff q 0 (mm/day) with very fast storage time k 0 (day), if a threshold of the storage state LS UZ (mm) is exceeded in the upper reservoir. The other two paths are an outflow from the upper reservoir q 1 (mm/day) with a fast storage time k 1 (day) and percolation to the lower reservoir with a constant percolation rate c P (mm/day). Water leaves the lower zone as baseflow q 2 (mm/day) with a slow storage time k 2 (day). The outflow from the two reservoirs is routed by a triangular transfer function representing runoff routing in the stream, where B MAX (day) is the maximum base at low flows and c R (day 2 /mm) is a free scaling parameter.
Following a sensitivity analysis (supporting information Text S1 and Table S1), the study proceeded along two modeling approaches. In the first approach, the model was calibrated in one step on a daily temporal resolution, using only runoff data in the objective function. In the second approach, the model was parametrized step-by-step using additional data besides runoff, starting with (1) snow accumulation and melt processes, (2) evapotranspiration and soil moisture changes, and (3) runoff generation and storage changes in the saturated zone. The abbreviations of the scenarios are listed in Table 2. The calibrated model parameters are found in Table 3 for each scenario. For simulating snow accumulation, a half-hourly temporal resolution was used, while a daily time step was used to simulate snowmelt. For the other two modules, a daily time step was used for model calibration. The model performance was evaluated at the annual, seasonal, and daily time scales.  Snow module Separation of solid and liquid precipitation: Model input: (3) -T m (°C) Threshold temperature, above which melt starts -DDF (mm/°C/d) Degree day factor -SCF (−) Snow correction factor Snow melt: Soil moisture accounting module Change in soil moisture: Model input: Fraction of precipitation generating runoff: Model parameter: -FC (mm) Field capacity, maximum soil moisture storage -β (−) Nonlinear parameter for runoff production -LPrat (−) Parameter related to the limit for potential evapotranspiration Actual evapotranspiration: Routing in the rivertransfer function: Model output:

Water Resources Research
SZÉLES ET AL.

Model Calibration With Only Runoff Data
According to the first calibration approach (Scenario R), the model was calibrated to observed runoff by maximizing Z 1 (−) multiobjective function 16. The main idea was to capture the water balance by minimizing the relative volume error and to mimic the recession parts of the hydrographs. Z 1 was maximized for calibrating all 14 model parameters, using the DEoptim R package for parameter optimization (Ardia, Boudt, et al., 2010;Ardia, Ospina Arango, et al., 2010;Ardia et al., 2016;Mullen et al., 2011).
where lNash Q (−) is the logarithmic Nash Sutcliffe efficiency for daily runoff according to 17 where Q (mm/day) is daily simulated runoff, Q o (mm/day) is daily observed runoff, and Q o (mm/day) is the mean of the observed runoff. Estimation of all model parameters using runoff and precipitation phase data R + Snowmelt Estimation of all model parameters except wet bulb temperature T wb using runoff and snow cover data R + ET Estimation of soil moisture accounting and runoff generation parameters using runoff and actual evapotranspiration data R + SM Estimation of soil moisture accounting and runoff generation parameters using runoff and soil moisture data R + 50ET + 50SM Estimation of soil moisture accounting and runoff generation parameters using runoff, actual evapotranspiration (w ET = 50%), and soil moisture (w SM = 50%) data R + ET + G Estimation of runoff generation parameters using runoff and runoff generation data; the soil moisture accounting parameters were fixed in Scenario R + ET R + SM + G Estimation of runoff generation parameters using runoff and runoff generation data; the soil moisture accounting parameters were fixed in Scenario R + SM R + 50ET + 50SM + G Estimation of runoff generation parameters using runoff and runoff generation data; the soil moisture accounting parameters were fixed in Scenario R + 50ET + 50SM Note. Parameters which were fixed at a certain scenario are shown in bold.

SZÉLES ET AL.
VE Q (−) (Criss & Winston, 2008) is the relative volume error for runoff according to 18 The modeling results were evaluated on three time scales: annual, seasonal, and daily. On the annual time scale, the volumes of observed and simulated runoff were compared, and the relative volume error was calculated according to 18. On the seasonal time scale, the monthly average observed and simulated daily runoff was compared. The monthly Kling-Gupta efficiency KGE Q,m (−) (Gupta et al., 2009) for runoff was calculated according to 19 using the hydroGOF R package (Zambrano-Bigiarini, 2020) where r Q,m (−) is the monthly Pearson correlation coefficient for runoff according to 20 where σ Qm (mm/day) and σ Qm,o (mm/day) are the standard deviation of simulated and observed monthly average runoff, Q m and Q m,o , respectively. Beta Q,m (−) is the bias, that is, the ratio between the mean of the simulated monthly average runoff and the mean of the observed monthly average runoff. Beta Q,m is ideally 1. Alpha Q,m (−) is the variability ratio, that is, the ratio between the standard deviation of the simulated monthly average runoff and the standard deviation of the observed monthly average runoff. Alpha Q,m is ideally 1.
On the daily time scale, daily logarithmic Nash Sutcliffe efficiency for runoff lNash Q was calculated according to 17.

Model Calibration With Runoff and Additional Data
We divided the model parameter estimation procedure into separate steps, looking at the processes associated with the three modules of the model (snow, soil moisture, and runoff generation), which were linked to field observations. Using runoff and additional data, the 14 free parameters were gradually fixed, step-by-step, proceeding along the three modules of the model. This sequence was chosen to follow the main direction of the movement of water in the water cycle. First, all 14 parameters were calibrated using runoff and snow data. In the next step, the snow parameters were fixed, and only the soil moisture and runoff generation parameters were calibrated using runoff and actual evapotranspiration and/or soil moisture data. In the last step, the snow and soil moisture parameters were fixed, and only the runoff generation parameters were calibrated using runoff, overland flow, and storage change data.

Snow Parameters
Simulation of the snow accumulation, that is, the phase of precipitation (Scenario R + Snowacc), and snowmelt (Scenario R + Snowmelt) was optimized using observations of runoff, four precipitation gauges, the present weather sensor, time-lapse photos of snow cover in the catchment, MODIS snow cover images, and data of the snow depth sensor (Figure 4).
Depending on the size and velocity of the precipitation measured by the present weather sensor, the observed precipitation P, which was calculated as the average of the four gauges, was assigned an output code describing its type (Figure 4). The output codes (PWS100 SYNOP 4680 codes) with 1 min temporal resolution were resampled to half-hourly resolution, using the median, to match the half-hourly temperature measurements. The resampled output codes were assigned to one of the four categories: no data (0), no precipitation (1), rain (2), and snow (3).

Water Resources Research
All 14 free parameters were optimized to fit the modeled runoff and phase of the precipitation to the observed one. Z 2 (−) multiobjective function was maximized using the DEoptim R package for parameter optimization, according to 21 where Z Snowacc (−) is the number of half hours with poor phase simulations according to 22 where n false (−) is the number of those half hours, when the model simulated precipitation phase (rain, snow, or no precipitation) did not agree with the observed precipitation phase. The weights on the single objectives were found with sensitivity analysis.
Out of the optimum parameter set, the wet bulb temperature parameter T wb was kept constant in the following optimization steps.
In order to optimize the snowmelt simulations, an observed snow cover index SCI o was set up, showing if snow was observed (1) or not (0) in the catchment on a daily basis, based on three types of observations ( Figure 4):

10.1029/2019WR026153
Water Resources Research 1. Time-lapse photos were manually checked: If snow cover was visible in the catchment, SCI o was assigned 1 (Snow), otherwise 0 (No Snow). 2. During periods when time-lapse photos were not available due to camera malfunction or power outage, daily MODIS Normalized Difference Snow Index (NDSI) images were checked (Hall & Riggs, 2016a, 2016b. The territory of the HOAL was extracted and reprojected (MRT, 2004) from the h19v06 MODIS tile with 500 m spatial resolution. Out of the eight pixels, five cover more than 50% of the catchment area; therefore, the average of these five pixels was calculated. The catchment average MODIS NDSI values were reclassified to three categories: snow (40 ≤ NDSI ≤ 100), no snow (0 ≤ NDSI < 40), and no data (100 < NDSI), choosing 40 as a threshold based on Dozier (1989). If the NDSI values were classified to category snow or no snow, the composite snow cover index was assigned 1 or 0, respectively. 3. If the NDSI values were classified as no data, the snow sensor data were checked. If the recorded snow depth was above 0 cm, the composite snow cover index SCI o was assigned 1, otherwise 0.
The simulated snow cover index SCI was assigned 1 (snow observed in the catchment), if the modeled snow water equivalent SWE (mm) was larger than 2 mm. Otherwise, it was assigned 0. The 2 mm threshold was chosen based on sensitivity analyses.
The 13 free parameters (all parameters except T wb wet bulb temperature, which was fixed in the previous step) were optimized to fit the modeled runoff and snow cover index SCI to the observed one. Z 3 (−) multiobjective function was maximized using the DEoptim R package for parameter optimization, according to 23 where Z Snowmelt (−) is the number of days with poor snow cover index simulations similarly to 22, that is, the simulated snow cover index did not agree with the observed one.
Out of the optimum parameter set, the remaining three snow parameters (SCF, DDF, and T m ) were kept constant in the following optimization steps.
The modeling efficiency in terms of simulating snow accumulation and snowmelt was evaluated by analyzing Z Snowacc and Z Snowmelt for the scenario, when only runoff was used for model calibration (Scenario R) and for the scenarios when additional information on snow was used besides runoff (Scenarios R + Snowacc and R + Snowmelt, respectively).

Soil Moisture and Evapotranspiration Parameters
Simulation of catchment average soil moisture and evapotranspiration was optimized using runoff data and observations of the soil moisture sensors, the eddy covariance systems, and results from a soil survey ( Figure 5).
For actual evapotranspiration, a catchment average evapotranspiration was calculated for each day. Measurements of the closed path eddy at the weather station (measuring grass evapotranspiration) and the two mobile eddies (measuring evapotranspiration from different crop types, such as maize, winter wheat, winter barley, or bare soil) were weighted with the area of the different land use types. For the riparian forest close to the stream, due to the lack of evapotranspiration measurements, a crop coefficient was used. The grass evapotranspiration from the weather station eddy was then multiplied with this crop coefficient. A similar method was used in case of data gaps or in case the evapotranspiration rates from a certain main crop type were not measured in a certain year.
For soil moisture, the average of all stations over all depths was considered. To make the simulated soil moisture comparable with the measurements, the soil moisture time series were standardized according to 24 where SMs (−) is the simulated standardized soil moisture, SM (mm) is the simulated soil moisture, and SM (mm) and σ SM (mm) are the average and the standard deviation of the simulated soil moisture over each period (calibration and validation periods separately). The observed standardized soil moisture SMs o (−) was calculated similarly to 24.

Water Resources Research
A multiobjective function Z 4 (−) 25 was maximized for calibrating 10 model parameters, using the DEoptim R package. The snow module parameters (optimized according to section 3.3.1) were not changed in this step. Z 4 combined runoff efficiency, the daily Nash Sutcliffe efficiency for evapotranspiration Z ET , and standardized soil moisture Z SM with different weights according to 25 where w ET (−) is the weight on the evapotranspiration objective, ranging between 0 and 1, and w SM (−) is the weight on the soil moisture objective according to 26 Nash ET is the daily Nash Sutcliffe efficiency for actual evapotranspiration 27 where ET (mm/day) is the simulated actual evapotranspiration, ET o (mm/day) is observed actual evapotranspiration, and ET o (mm/day) is the mean of the observed actual evapotranspiration. The Nash Sutcliffe efficiency for the daily standardized soil moisture Nash SMs (−) was calculated similarly to 27 with SMs o ¼ 0.
Results of a soil survey performed on a 50 × 50 m raster were used to constrain the field capacity FC. The upper boundary of FC was set to 450 mm considering that the maximum of the observed, depth-averaged field capacity was 430 mm in the catchment (Murer et al., 2004).

Water Resources Research
This optimization step was repeated 10 times, to check the stability of the optimized model parameters.
Out of the optimum parameter set, the three soil moisture parameters (LP rat , FC, and β) were kept constant in the following optimization steps. Three main scenarios were chosen for further analysis. In Scenario R + ET, w ET was chosen as 1; therefore, only evapotranspiration information was used besides runoff in the multiobjective function Z 4 . In Scenario R + SM, w ET was chosen as 0; therefore, only soil moisture information was used besides runoff in the multiobjective function Z 4 . In Scenario R + 50ET + 50SM, w ET was chosen as 0.5; therefore, both evapotranspiration and soil moisture information were used besides runoff in the multiobjective function Z 4 . These parameters were not changed in the following optimization step.
The modeling efficiency in terms of simulating soil moisture and actual evapotranspiration was evaluated by analyzing the relative volume error for actual evapotranspiration VE ET similarly to 18 and the monthly Kling-Gupta efficiency according to 19 for actual evapotranspiration and standardized soil moisture, KGE ET,m and KGE SMs,m , respectively, as a function of weight on the evapotranspiration objective w ET . Furthermore, the annual (relative volume error for actual evapotranspiration) and monthly performances (monthly average actual evapotranspiration rates and standardized soil moisture) were also compared for selected scenarios (Scenarios R, R + ET, R + SM, and R + 50ET + 50SM).

Runoff Generation Parameters
The model parameters related to the subsurface dynamics were optimized using runoff data, time-lapse photos of saturation patterns, and piezometer measurements of groundwater levels ( Figure 6).
In order to optimize the very fast runoff q 0 simulations, saturation excess runoff events were identified in the valley bottom located in the centroid of the catchment by a digital camera at the weather station according to Silasari et al. (2017). The days when the model simulated very fast runoff (q 0 ) were calibrated to the dates when overland flow events were observed by the camera.
Monthly average storage change in the lower zone reservoir dS LZ (mm/month) was optimized using piezometer observations of groundwater levels GWL, located in the riparian zone. Generally, groundwater levels are higher in the riparian zone close to the stream compared to the catchment average groundwater levels. This might introduce bias and potentially different temporal dynamics. However, based on more recent data, that is, piezometric data from deep boreholes, which are located not only in the riparian zone but also on the slopes and ridgelines of the catchment, this bias was likely small. The dynamics of the deeper riparian piezometers located on the right side of the stream matched the dynamics of the deep borehole piezometers. Riparian groundwater level data with 5 min temporal resolution were averaged to daily values, and missing data were spline interpolated for each riparian piezometer. The daily, gap-filled time series were aggregated to monthly values. The monthly average groundwater level data were differentiated in time, and the difference was multiplied by drainable porosity 0.036 to get the monthly storage change values dS o (mm/month). The storage change calculated for each piezometer was then spatially interpolated using Thiessen polygons, and a catchment average storage change was calculated from the interpolated raster map. The simulated monthly average standardized lower zone storage change dSs (−) was calibrated to the observed catchment average standardized storage change dSs o (−). The observed and simulated standardized storage change time series were calculated analogously to 24.
A multiobjective function Z 5 (−) 28 was maximized for calibrating seven model parameters, using the DEoptim R package. The snow module and soil moisture accounting module parameters (optimized according to sections 3.3.1 and 3.3.2) were not changed in this step. Z 5 combined runoff efficiency, the relative number of days with correctly simulated very fast runoff Z OF (−), and the relative number of months with correctly simulated sign of the standardized storage change Z dSs (−) with different weights where w OF (−) is the weight on the overland flow OF objective, ranging between 0 and 1, and w dSs (−) is the weight on the storage change objective according to 29 10.1029/2019WR026153

Water Resources Research
Z OF (−) is the relative number of days, when very fast runoff q 0 was correctly simulated 30 where n q0 (−) is the number of those days when the model simulated very fast runoff q 0 and overland flow was simultaneously observed by time-lapse photos taken by the camera located at the weather station and n OF (−) is the total number of days, when overland flow was observed in the catchment.
Z dSs (−) is the relative number of months, when the model correctly simulated the sign of the standardized storage change 31 where n sgn(dSs) (−) is the number of those months when the sign of the model simulated standardized lower zone storage change was the same as the sign of the observed standardized storage change and n dSs,o (−) is the total number of months when storage change was observed.
This optimization step was repeated 10 times to check the stability of the optimized model parameters. Three main scenarios were chosen for detailed analysis: R + ET + G, R + SM + G, and R + 50ET + 50SM + G. In each scenario, equal weights were put on the overland flow (w OF = 0.5) and storage change (w dS = 0.5) objectives.
The modeling efficiency in terms of simulating overland flow was evaluated by analyzing Z OF as a function of w OF . The modeling efficiency in terms of simulating storage change in the saturated zone was assessed by Z dSs as a function of w OF .

Runoff Simulation
The modeling efficiency in terms of simulating runoff was evaluated on annual, monthly, and daily time scales. On the annual time scale, the volumes of observed and simulated runoff were compared. On the seasonal time scale, monthly average observed and simulated runoff time series were compared. On the daily time scale, daily runoff time series were compared.

Model Calibration With Only Runoff Data
When the model was calibrated using only runoff data, the performance of runoff simulations was very good during the calibration period on each time scale (Table 4). During model validation, the annual and seasonal performances of runoff simulations were still good. The relative volume error was below 20% and the monthly Pearson correlation coefficient was around 0.75, the monthly bias Beta Q,m was close to 1.0, and the monthly variability ratio Alpha Q,m was also close to 1.0 during the long validation period, and it was around 0.7 during the shorter, 2 year long validation period. But the daily performance was very low, the daily logarithmic Nash Sutcliffe efficiency for runoff was around 0.2 during both periods.

Model Calibration With Runoff and Additional Data 4.2.1. Snow Simulation
The first step of the step-by-step model parameter estimation was to find the optimal snow parameters.
Calibrating the model parameters to runoff and the observations from the present weather sensor (Scenario R + Snowacc) gave 0.31% and 0.40% of poor simulation times steps, that is, the phase of the model simulated precipitation differed from the observed one, in the calibration and validation periods, respectively. This was slightly better than the simulations that used only runoff data for calibration instead (Scenario R) ( Table 5). Compared to Scenario R, in Scenario R + Snowacc, the calibrated wet bulb temperature T wb increased from −0.1°C to 0.6°C (Table 3). This value is more realistic considering that it is closer to 1.0°C, which is the median of the wet bulb temperature observed in the catchment during precipitation events with a shift in precipitation phase (calculated according to Text S2).
Calibrating the snowmelt parameters to the observed snow cover index SCI o (Scenario R + Snowmelt) gave 4.4% and 6.3% of poor simulation time steps which, again, was better than Scenario R. In Scenario R, the calibrated snow correction factor SCF indicated 30% increase in snowfall precipitation, which was higher than expected for lowland catchments (Table 3). The lower SCF values in Scenarios R + Snowacc and R + Snowmelt were much more realistic (Table 3). The higher snow correction factor SCF in Scenario R was then compensated by more intense snowmelt, that is, higher degree day factor DDF model parameter, which was much higher than expected and found in flatland catchments in Austria (Merz et al., 2011;Sleziak et al., 2020).
The trade-off between simulating runoff and snowmelt accurately was not large, that is, using additional information on snow did not have a large influence on runoff simulations during the calibration period (Table 4), but it changed the simulation results during the 22 year long validation period by causing larger relative volume error, higher monthly variability ratio, and lower logarithmic Nash Sutcliffe efficiency for runoff. During the second, 2 year long validation period, the runoff simulation performance on the seasonal and daily time scales generally improved (Table 4).

Soil Moisture and Evapotranspiration Simulation
The second step of the step-by-step parameter estimation was to fix the parameters of the soil moisture accounting module.
The smallest relative volume errors of evapotranspiration were achieved for soil moisture weights of w SM = 0.3 and w SM = 0.0 in the calibration and validation periods based on the median of 10 model runs (Figures 7a and 7b). Generally, the model tended to overestimate evapotranspiration (Figures 7c and 7d). This could be a consequence of using the Nash Sutcliffe efficiency for evapotranspiration in the objective function, where the model was fitted to the peaks and not lower values of evapotranspiration rates. Furthermore, there is a mismatch between field observations and the HBV type, soil moisture-dependent evapotranspiration calculations. For example, during precipitation events, measured evapotranspiration drops to zero, while model simulations increase due to the higher soil moisture content. During model 10.1029/2019WR026153  (2013)(2014)(2015) and First and Second Validation (1991-2012

10.1029/2019WR026153
Water Resources Research validation, the relative volume error for evapotranspiration was closer to zero, especially when using only evapotranspiration information in the objective function (Figures 7b and 7d).
The monthly Kling-Gupta efficiency for evapotranspiration was above 0.6 both during model calibration and validation (Figures 8a and 8b). The monthly Kling-Gupta efficiency for standardized soil moisture was slightly lower, if only evapotranspiration was used in the objective function (Figures 8c and 8d). When both evapotranspiration and soil moisture objectives were involved in the objective function, the proposed approach outperformed Scenario R (Figures 8a-8d). The root mean squared error between simulated and observed monthly average evapotranspiration and standardized soil moisture was generally lower, when besides runoff extra information on evapotranspiration and/or soil moisture were involved in the model calibration (Figures 8e-8h).
For further analysis, calibrated model parameters from three main scenarios were chosen, when a weight of w SM = 0.0 (R + ET), w SM = 1.0 (R + SM), and w SM = 0.5 (R + 50ET + 50SM) was used on the soil moisture objective in the objective function when calibrating the soil moisture accounting module.
In Scenario R, the parameter related to the limit for potential evapotranspiration LPrat was very close to zero, meaning that actual evapotranspiration was almost always reaching its potential rate, even when the soil was very dry (Table 3). If evapotranspiration and soil moisture were used for model calibration, LPrat was closer to 1.0. This means that a certain wetness in the soils, that is, more water, was needed to reach potential evapotranspiration (Table 3). The nonlinear parameter for runoff production β was below 1.0, if only runoff was used for model calibration (Scenario R; Table 3), meaning that more water was allocated for runoff and less for soil moisture storage. Considering the clayish soil types in the HOAL, this was highly unrealistic. β was well above 1.0, if actual evapotranspiration and/or soil moisture were also used for model calibration (Scenarios R + ET, R + 50ET + 50SM, and R + SM). This refers to a more nonlinear runoff generation, which would be expected in the catchment considering the soil types. According to the soil survey, the catchment  Table 2. Numbers in legend indicate root mean squared error.

10.1029/2019WR026153
Water Resources Research average field capacity FC was around 400 mm. This means that the calibrated field capacities for Scenarios R + ET and R seemed more realistic. However, when checking the simulated storage time series, the Scenarios R + 50ET + 50SM and R + SM stood much closer to reality. In Scenario R, when only runoff was used for model calibration, the root zone soil storage remained well below saturation, and it even reached 0 mm in the summer months. This was not realistic, and it was not supported by the soil moisture observations in the catchment. When soil moisture and actual evapotranspiration data were used together for model calibration, the root zone soil storage never dried out. In Scenario R + ET, most of the water was stored in the soil, the upper zone storage was only a few mm during events, and the lower zone storage was around four times smaller. When also soil moisture was used for model calibration, the water was more evenly distributed between the storage elements (Scenarios R + SM and R + 50ET + 50SM).
The results indicate a clear trade-off between accurately simulating actual evapotranspiration, soil moisture, and runoff. Using additional information unequivocally improved evapotranspiration and soil moisture simulations, especially on the daily time scale (Table 6). Using additional information on evapotranspiration and soil moisture besides runoff generally improved runoff simulations during the 2 year long validation period, while it only slightly deteriorated the results during model calibration and the 22 year long validation periods (Table 4). The difference between the scenarios in terms of runoff efficiency was generally small during model calibration and the 22 year long validation period. Compared to the 22 year long validation, the volume error reduced by order of magnitude, and the daily runoff simulation performance improved for the 2 year validation (Table 4). While soil moisture information improved runoff efficiency, evapotranspiration information deteriorated it during the 2 year long validation period (Table 4).

Runoff Generation and Routing Simulation
The third step of the step-by-step parameter estimation was to fix the parameters of the runoff generation and routing module.
Using the parameters from Scenario R + 50ET + 50SM from the previous step, the model was run 10 times ( Figure 9). The median of the relative number of days with good overland flow simulations, that is, days when the model simulated very fast runoff (q 0 ) and overland flow were simultaneously observed, immediately exceeded 0.5 as soon as the weight on the overland flow part in the compound objective function was larger than zero (Figures 9a and 9b). In terms of overland flow simulations, the results outperformed Scenario R for each w OF weight, as the model did not simulate overland flow at all when only runoff was used for model calibration. The median of the relative number of months with good storage change simulations, that is, months when the sign of the model simulated storage change agreed with the sign of the observed storage change, became gradually 10-20% lower as the weight on the overland flow objective increased, and during calibration, the results also underperformed Scenario R (Figures 9c and 9d).
Compared to Scenario R, when runoff generation data were also used for model calibration, the calibrated very fast storage time k 0 became smaller (1.1 day for Scenario R and below 0.5 day if runoff generation data were also used for model calibration; Table 3), while the fast and slow storage times, k 1 and k 2 , respectively, increased (Table 3). Faster overland flow and slower subsurface runoff correspond well with field observations. Overland flow events usually last a few hours according to camera observations, while the outflow from the subsurface reservoirs take several months due to the heterogeneous subsurface properties of the catchment.
For further analysis, the runoff generation parameters of the three main scenarios (R + ET + G, R + SM + G, and R + 50ET + 50SM + G) were calibrated by choosing the weight on the overland flow objective as w OF = 0.5.
Regarding the trade-off between simulating runoff and runoff generation processes accurately, by using additional information on overland flow and storage change besides runoff further improved runoff simulations during the 2 year long validation period, while it further deteriorated the daily performance during model calibration and the 22 year long validation (Table 4). Generally, the scenarios when either only soil moisture or both soil moisture and evapotranspiration were used in the model calibration performed the best (Table 4).

Runoff Simulation
The final step was to evaluate the different scenarios in terms of simulating runoff on annual, seasonal, and daily time scales (Figures 10-12).
When comparing the evolution of the runoff simulation efficiencies through the different scenarios during the 2 year long validation period (2016)(2017), it is clear that using additional data besides runoff for model calibration improved the runoff simulation efficiencies ( Figure 10). The relative volume error slightly increased when snow data were also used for model calibration. But it definitely decreased, when soil moisture, evapotranspiration, overland flow, and storage change information was also used for model calibration.
The monthly and daily runoff performance improved when additional data were also used for model calibration. Evapotranspiration and soil moisture data had the largest influence on runoff simulations, which also agreed with the results of the sensitivity analysis (Table S1). The sensitivity analysis showed that the most sensitive parameter belonged to the soil moisture accounting module. During the 2 year long validation period, the results improved the best if either both soil moisture and evapotranspiration or only soil moisture information was used during model calibration besides runoff.
On the annual time scale, the performance of the proposed step-by-step approach was similar to the performance of the traditional Scenario R, when only runoff was used for model calibration. The proposed method even outperformed Scenario R during the shorter, 2 year long validation period (Val2, 2016(Val2, -2017. Runoff was overestimated during the longer, 22 year long validation period both by Scenario R and the proposed calibration approach (Figures 11a-11c).
Similarly, on the seasonal time scale, the proposed method could efficiently model runoff. The root mean squared error between simulated and observed monthly average runoff was lower during the validation periods (Figures 11d-11f).
On the daily time scale, the model performed worse during model calibration and the 22 year long validation, if additional data besides runoff were used for model calibration. The peaks were underestimated,

10.1029/2019WR026153
Water Resources Research and the recession was more delayed, when additional data were involved in the model calibration. But the daily performance slightly improved during the 2 year long validation period by using additional data and not only runoff ( Figure 12).

Discussion
Actual field measurements are of utmost importance to understand and model catchment processes. In this study, we showed that by using a variety of different field observations, we were able to simulate not only runoff but also other hydrological processes generally efficiently using a lumped conceptual hydrologic model and a stepwise model calibration approach. This means that the model simulated runoff generally well for the right reasons (Grayson et al., 1992) on the annual and seasonal time scales. Avanzi et al. (2020) drew similar conclusions, when they found that ground-based measurements allowed identifying more hydrologic model parameters than runoff alone. Rakovec et al. (2016) also noted that although it was necessary to constrain the model against runoff, this was not sufficient to simulate other variables, for instance, soil moisture, accurately.
In previous studies, which used additional data, not only runoff, to calibrate a hydrologic model, the simulation of certain state variables usually improved. But generally, these were only the targeted state variables, such as snow water equivalent only, or soil moisture only, or evapotranspiration only. For instance, Kundu et al. (2017) reported that using remotely sensed soil moisture improved the rainfall-runoff response, but not the routing simulations. Rajib et al. (2016) also pointed out that using remotely sensed soil moisture improved the simulation of surface soil moisture, but runoff, evapotranspiration, and root zone soil moisture were less affected. In this study, we managed to improve snow accumulation, snowmelt, soil moisture, evapotranspiration, overland flow, and storage change simulations at the same time.

Water Resources Research
While the inclusion of data on state variables and fluxes can improve their simulations, this usually comes at a cost of slight deterioration of runoff simulations (e.g., Gui et al., 2019;Parajka et al., 2007;Rientjes et al., 2013;Seibert, 2000). Seibert (2000) explained this effect by problems with the model structure.
When he modified the model structure of the model, the runoff simulation efficiencies increased. Gui et al. (2019) explained the deterioration of runoff simulations when including remotely sensed evapotranspiration in the calibration by a lack of emphasis on evapotranspiration in the model compared to runoff processes, and Rientjes et al. (2013) suggested that small errors in actual evapotranspiration can cause large errors in runoff due to the difference in the volume of water. Rientjes et al. (2013) also pointed out a mismatch between the definition of satellite-based actual evapotranspiration estimates and the HBV-based estimates. According to Parajka et al. (2007), "similar values of the runoff objective function do not necessarily imply a similar hydrological response of the catchment." In the present study, the runoff calibration efficiency also deteriorated slightly when including additional data besides runoff, mainly on the daily time scale. The change was smaller during the validation periods. Moreover, we could improve runoff simulations during the second, 2 year long validation period, on all time scales, that is, annual, monthly, and daily. Overall, using additional information besides runoff gave more realistic parameter combinations (Table 4), and the simulation of model internal fluxes and states improved. The deterioration of runoff efficiencies during model calibration and the first validation periods can be explained by model structural errors or mismatches between the conceptual model and field observations, which are both discussed below.
The sensitivity of runoff simulations to changing model parameters depended on the module of the model, as shown by the sensitivity analysis (Table S1). Although parameterizing the snow module in this catchment

Water Resources Research
is not as relevant as in an Alpine catchment due to the small amount of snow, the proposed method might be useful for other studies, where the precipitation partitioning is a crucial modeling step but usually lacks validation (e.g., Jennings et al., 2018). Parameters describing the soil moisture accounting routine of the model were the most sensitive according to the sensitivity analysis. Therefore, in this study observations of soil moisture and evapotranspiration played the most important role in parameter estimation. Using either both evapotranspiration and soil moisture or only soil moisture observations, the model's most sensitive routine could be parameterized well. Previous studies are mixed in terms of whether soil moisture and/or evapotranspiration observations improve the efficiency of runoff simulations. Nijzink et al. (2018) found that soil moisture satellite products were more effective than evaporation products for deriving more constrained parameter distributions. López et al. (2017) showed that estimating runoff was more efficient, if both soil moisture and evapotranspiration satellite products were involved in the model calibration. Bergström and Lindström (2015) argued that the relative importance of these observations of course depends on the time step of the model considering that evapotranspiration volumes are less than storage in the unsaturated zone on a daily basis. Similarly, Baroni et al. (2019) pointed out that different data collection strategies should be considered for different variables, depending on their degree of coupling to the atmosphere. Considering that changes in the atmosphere are faster, evapotranspiration observations are necessary on a daily basis, while changes in soil moisture and groundwater levels are more consistent over time, as these are more decoupled from the atmosphere.
Linking observations with model simulations is not a straightforward task, and local experience with the catchment processes may be a substantial advantage (Avanzi et al., 2020;Holländer et al., 2009). For physically based and spatially distributed models, modelers can often directly and explicitly compare measured and simulated volumes of water and energy fluxes (e.g., Kuras et al., 2011;Thyer et al., 2004). For conceptual hydrologic models, especially if the model is spatially lumped, a balance between the lumped conceptual model concept and spatial and temporal variability of processes has to be found. The spatial organization of soil moisture and other variables within the catchment may matter for runoff generation which cannot be captured by lumped models as they aggregate spatial processes variability (Blöschl et al., 1995;Viglione et al., 2018;Western et al., 1998). In this study, our aim was to use such objective functions and compare  Table 2. 10.1029/2019WR026153

Water Resources Research
such quantities that helped bridging the gap between the model and observations. Therefore, instead of comparing volumes, we compared the standardized values of observed and simulated soil moisture and storage change. For the snow module, we used a binomial snow cover index based on different types of measurements to decide if there was snow in the catchment and a threshold for simulated snow water equivalent, only above which the catchment was considered snow covered. For the subsurface module, we compared the sign of the standardized storage change. When different objectives were tested instead of using these, the model often gave unrealistic results.
It is important to note that the aim of this study was not to modify or optimize the structure of the model but to use a conceptual model structure that has been used in various climatic regions in previous studies both in gauged and ungauged basins (e.g., Blöschl et al., 2013;Parajka et al., 2007;Sleziak et al., 2018), and that is general enough so that it could be used to simulate runoff in arbitrarily chosen basins. Previous studies proved that this type of model structure works well in Austrian catchments, both in Alpine and lowland environments (e.g., Parajka et al., 2007;Sleziak et al., 2018). Based on our current understanding on the HOAL, that is, how this catchment works conceptually, the HBV model structure is a very good representation of the catchment. Regarding the runoff generation in the catchment, the saturation excess overland flow events in the valley bottom identified by the weather station camera according to Silasari et al. (2017) could be well matched with the very fast runoff response of the HBV-based TUWmodel. If a certain storage state threshold LS UZ is reached in the upper reservoir of the model, that is, filling up from the bottom, which corresponds to saturation excess runoff mechanisms, very fast runoff starts. Regarding the subsurface mechanisms, Exner-Kittridge et al. (2016) distinguished between two main aquifers in the HOAL, a shallow and a deep one. Their contribution to runoff depends on the hydrologic conditions, for example, low or high flow conditions, seasonality, and so forth. The runoff generation module of the HBV-based TUWmodel consists of two subsurface reservoirs, the upper reservoir contributes to rainfall events, while the outflow from the lower reservoir takes place on much longer, monthly time scales.
Although we had a certain understanding how the catchment works conceptually, on small catchment scale, the role of inhomogeneous surface and subsurface properties (e.g., complex geology, cracks, and earthworm paths resulting in preferential flow paths) which result in nonlinear responses and processes with different thresholds (e.g., flow paths which are only activated above a certain groundwater level) can be more dominant than on large catchment scale where the response is averaged out. On small catchments, there are several exceptional cases, where the model does not work. The shorter the time scale, the more pronounced and visible these exceptions. The model in this study also underperformed on the daily time scale for simulating runoff during the validation periods. This underperfomance slightly improved during the second, 2 year long validation period where additional data were also used for model calibration. A possible reason for this daily underperformance for runoff could be that the model was trained on 3 years (2013-2015), which were extreme years including a very wet year (2013) and two very dry years (2014)(2015). During extreme hydrologic conditions, that is, wet or dry years, the area contributing to runoff, the active flow paths, and the catchment rainfall response can be very different. Furthermore, the validation period was characterized by much more snow, while the calibration period was almost snow-free. A lumped model often cannot handle these differences. Other reasons could be further structural errors of the model, for instance, in the evapotranspiration module. In the evapotranspiration module the simplified representation of actual evapotranspiration often does not match the measurements. During rainfall events, actual evapotranspiration measured by the eddy covariance systems drops to zero, while HBV-based model simulations increase due to the higher soil moisture content. During the drier summer season, water for evapotranspiration might be extracted from deeper soil layers and not the surface layers monitored by the soil moisture sensors. A possible solution might be to use a distributed model, where a difference in model structure could better take into account the inhomogeneities of a small catchment, such as different runoff generation mechanisms, groundwater level-dependent switches in flow paths. In this study we used only the lumped version of the model, which performed well on the annual and seasonal time scales. In order to improve the daily performance of the model, a spatial distribution and modified model structure might be necessary. A spatially distributed model structure may better take advantage of the higher information content of the measurement locations within the catchment. This suggests that an increased density of measurements that capture the spatial distribution of the variables may not necessarily benefit lumped model performance.

Water Resources Research
This study showed that by using field measurements of input and output fluxes and states besides runoff, we were able to simulate these fluxes and states (snow, sail moisture, evapotranspiration, and storage), moreover also the annual and seasonal runoff more efficiently. This finding suggests that hydrologic models which are constrained only by runoff might be simulating runoff well for the wrong reasons, that is, with wrong parameters. But this can be only revealed if other state variables of the model are tested against observations. For small basins, where satellite information may be too coarse, field observations play a crucial role. For instance, if data on evapotranspiration, soil moisture, or saturation areas are available, these may be complementary to existing runoff data or surrogates of runoff data if no runoff measurements exist. They can all help in constraining a hydrologic model.

Conclusions
This study presented a new framework for estimating the parameters of runoff model components in a stepwise fashion from field observations of input and output fluxes and states besides runoff and investigated the value of these data for simulating runoff well for the right reasons in a small agricultural catchment. Our results showed the following: 1. By using the proposed step-by-step model calibration approach with different field observations of input and output fluxes and states besides runoff for parameter estimation, we were able to efficiently simulate these fluxes and states correctly. This means that we simulated runoff well for the right reasons on the annual and seasonal time scales. 2. For this catchment, field observations of soil moisture and evapotranspiration had the largest influence on runoff simulations. 3. Future research and possibly a spatially distributed and modified model structure might be necessary on a small catchment to take into account the role of small-scale inhomogeneities and to improve the daily performance of the model. Distributed and more physically based models may better reflect the catchment dynamics and may be more compatible with the spatially distributed measured dynamics.

Data Availability Statement
The data used are listed in the tables, and the data necessary to reproduce the reported findings are available on the following site: https://owncloud.tuwien.ac.at/index.php/s/uQfz97O14srsW50 (password: 20cWRs19).