In Quest of Calibration Density and Consistency in Hydrologic Modeling: Distributed Parameter Calibration against Streamflow Characteristics

Conventional basin‐by‐basin approaches to calibrate hydrologic models are limited to gauged basins and typically result in spatially discontinuous parameter fields. Moreover, the consequent low calibration density in space falls seriously behind the need from present‐day applications like high resolution river hydrodynamic modeling. In this study we calibrated three key parameters of the Variable Infiltration Capacity (VIC) model at every 1/8° grid‐cell using machine learning‐based maps of four streamflow characteristics for the conterminous United States (CONUS), with a total of 52,663 grid‐cells. This new calibration approach, as an alternative to parameter regionalization, applied to ungauged regions too. A key difference made here is that we tried to regionalize physical variables (streamflow characteristics) instead of model parameters whose behavior may often be less well understood. The resulting parameter fields no longer presented any spatial discontinuities and the patterns corresponded well with climate characteristics, such as aridity and runoff ratio. The calibrated parameters were evaluated against observed streamflow from 704/648 (calibration/validation period) small‐to‐medium‐sized catchments used to derive the streamflow characteristics, 3941/3809 (calibration/validation period) small‐to‐medium‐sized catchments not used to derive the streamflow characteristics as well as five large basins. Comparisons indicated marked improvements in bias and Nash‐Sutcliffe efficiency. Model performance was still poor in arid and semiarid regions, which is mostly due to both model structural and forcing deficiencies. Although the performance gain was limited by the relative small number of parameters to calibrate, the study and results here served as a proof‐of‐concept for a new promising approach for fine‐scale hydrologic model calibrations.


Introduction
Land surface models (LSMs) and distributed hydrologic models have been the major tool to study various hydrologic variables of interest at large scales from basin to globe and provide many critical monitoring/forecast services and risk analysis in water resources management, hazard mitigation (e.g., flood and drought), and climate adaptation (Asadieh & Krakauer, 2017;Hundecha et al., 2016;Madsen et al., 2014). Some models estimate river streamflow directly and others provide grid-cell or catchment level runoff, which is later fed to river hydrodynamic models for streamflow calculations.
The parameterization of the underlying processes in different models may vary, but almost all of them need some degree of parameter calibration against observations to achieve reliable predictions (Beven, 1989;Gong et al., 2015;Troy et al., 2008;Wagener & Gupta, 2005;Yadav et al., 2007). Most of the time, the models are calibrated against observed river streamflow, given its better availability than other training targets like soil moisture or evapotranspiration (Demaria et al., 2007;Gao et al., 2018;Troy et al., 2008;Yadav et al., 2007). Since the beginning, calibration has been an essential effort along with modeling and a continuously evolving process. This is due to the fact that models continue to grow, not only in terms of better parameterization, but also improved spatial resolution, better soil/vegetation data, better/longer meteorological inputs, as well as better training data. On one hand, such new changes can make existing model parameters less optimal. On the other, new developments in river hydrodynamic modeling have been pushing the size of runoff input units to increasingly finer scales, for instance, the Hillslope River Routing (HRR) model (Beighley et al., 2009) uses 74,075 unit catchments over the conterminous United States (CONUS) and the National Water Model (NWM; http://water. noaa.gov/about/nwm) uses the National Hydrography Dataset Plus Version 2 (NHDPlusV2; http:// www.horizon-systems.com/nhdplus/NHDPlusV2_data.php) with~2.7 million unit river reaches defined over the same region. Recent 1-D hydrodynamic models like the Routing Application for Parallel computatIon of Discharge (RAPID; David et al., 2011); WRF-Hydro (Gochis et al., 2018), and CaMa-Flood (Yamazaki et al., 2011;Yamazaki et al., 2013) are all implemented at very high spatial resolution, not to mention the 2-D hydrodynamic models which work at even finer scales (Bates et al., 2010;Schumann et al., 2013;Yin et al., 2016;Yin et al., 2017;Yu & Lane, 2006). So there is an urgent need to improve the calibration density of hydrologic models in order to match the need of river hydrodynamic models.
We will illustrate the issues related to calibration further by focusing on the Variable Infiltration Capacity (VIC) model (Liang et al., 1994;Liang et al., 1996), which has been widely implemented over the past decades in numerous large scale studies Nijssen et al., 2001;Wu et al., 2018;Zhang et al., 2014) and one major source of runoff inputs to river hydrodynamic modeling (e.g., Lohmann routing of Lohmann et al., 1998; Wen routing of Guo et al., 2004;MOSART of Li et al., 2013;DRTR of Wu et al., 2014;HRR of Ray et al., 2016;CaMa-Flood of Willner et al., 2018). The initial effort to calibrate the VIC model at continental scale started in Maurer et al. (2002), where the VIC soil parameters describing soil depth, baseflow drainage and infiltration capacity were adjusted to best match the monthly streamflow records at the outlets of 12 major river basins/sub-basins in the CONUS. The resulting VIC parameters were used in the North American Land Data Assimilation System phase 1 (NLDAS-1) project , and they are still being used in the operational part of the North American Land Data Assimilation System phase 2 (NLDAS-2) project and many other current practices of continental-scale hydrologic assessments (Brekke et al., 2013;Leng et al., 2016;Mahat et al., 2017;Mizukami et al., 2016). This continental calibration is rather sparse (12 basins) and the parameters form large chunks on the map with strong discontinuities between basin boundaries (Figure 1 (a1), (b1) and (c1), Mizukami et al., 2016;Mizukami et al., 2017), which is a result of the limitations on computing power and data availability at the time. Troy et al. (2008) tried to improve the calibration by training the model over more basins. Assuming the travel time is negligible for small basins at monthly scale and streamflow is fully equivalent to runoff, the monthly runoff data over 1131 small and less-regulated basins collected by United States Geological Survey (USGS) were used to estimate (by interpolation) monthly runoff ratio time series every 1°x1°box over the CONUS. The VIC model was calibrated at every 1°x1°box (~800 of them) during a 4-year period and then the calibrated parameters were downscaled to 1/8°resolution. Troy et al. (2008) considerably improved the spatial coherence of parameters by eliminating the discontinuities between large basins, though artifacts (e.g., bull's eyes, see Figure 1 (a2), (b2) and (c2)) exist due to the coarse sampling (1°spacing) that is insufficient to capture the full spatial variability of runoff. More recently, Oubeidillah et al. (2014) made a very large-scale effort to calibrate the model at a higher resolution of 1/24°against USGS WaterWatch monthly runoff dataset for 2,107 hydrologic sub-basins (8-digit hydrologic units, HUC8s, detailed hydrologic maps can be found at https://water.usgs.gov/GIS/huc.html). This effort significantly increased the density of training runoff datafrom 12 (Maurer et al., 2002) and~800 (Troy et al., 2008) to 2,107 over the CONUS, though parameter discontinuity still exists among HUC8 basins (Figure 1 (a3), (b3) and (c3)). Further increases in the spatial density of training runoff data beyond HUC8s seem unlikely until new techniques like Inverse Streamflow Routing (Fisher et al., 2018;Pan & Wood, 2013) become more mature. Therefore, more recent efforts attempted to relate parameters to soil properties through transfer functions such that the parameter maps can be made at high resolution with good spatial patterns (consistent with soil property maps), for example, the Multiscale Parameter Regionalization (MPR) developed in Samaniego et al. (2010). Mizukami et al. (2017) applied an MPR-type method to VIC calibration over the CONUS using USGS observations over 531 basins and it was found that the skill of MPR was very reasonable but relatively lower than basin-by-basin calibration.
Roughly speaking, two types of efforts have been made to increase calibration density (given available river gauges) and estimate spatially distributed hydrologic parameters: 10.1029/2018WR024178
The major difference between the two is which one to regionalizemodel parameters or training data, and the skill of calibration will be highly dependent on the efficacy of such regionalization. In some cases, the behavior of the model parameters is better understood (e.g. their relationship with soil properties), and in others the parameter behavior is too elusive for reliable parameter transfer. Recent studies in streamflow climatology analysis (Beck et al., 2015) suggest that the streamflow characteristics (e.g. long-term mean, high/low flows, baseflow decay) can be fairly well interpolated (regionalized) with the help of ancillary data (climate/soil/terrain/vegetation characteristics) even though the actual streamflow time series remain hard
Motivated by the pressing need to significantly improve the calibration density (and calibration period length) and inspired by Beck et al. (2015) work, this study will calibrate VIC model at 1/8°resolution against streamflow characteristics over the CONUS, increasing the calibration density from 2,107 HUC8 basins (Oubeidillah et al., 2014) to 52,663 grid-cells over the CONUS, for a 20-year training period and 10-year validation period. The paper is structured as follows: Section 2 describes the data source and methodology used to calibrate and validate the parameters. The calibration performance and the improvements to calibration strategy are presented in Section 3. Finally, conclusions are provided in Section 4.

Parameter calibration strategy
We set up the VIC model across the CONUS at the same 1/8°resolution as the GSCD data and calibrated each 1/8°grid-cell (a total of 52,663 grid-cells) independently on streamflow characteristics. The GSCD provides global maps of 17 streamflow statistics including the mean flow, different high/low flow percentiles (flow duration curve -FDC), various baseflow indices, and timing of half flow volume of the year. Such characteristics, though less informative than the full time series, can help constrain many important aspects of the hydrograph. To derive the GSCD, daily streamflow observations from 3,000 to 4,000 catchments (10-10,000 km 2 ) worldwide were used to train artificial neural network ensembles to estimate streamflow characteristics based on 20 climatic and physiographic characteristics for the catchments. The trained neural network ensembles were subsequently applied spatially over the entire ice-free land surface, resulting in global maps of the streamflow characteristics. More details can be found in Beck et al. (2013Beck et al. ( , 2015. Note that the GSCD uses the term "streamflow" for the 1/8°gridded dataset, while the modeling literature normally use the term "runoff" for the same gridded quantity and "streamflow" for the flow values routed to the outlets of river basins (i.e. runoff integrated in time and space). The two terms are equivalent at the grid-scale in this study since the routing processes within the grid-cells can be ignored at the daily scale, and thus they will be used interchangeably here for grid-scale quantities. The term "streamflow" will be used for flows routed to outlets of larger basins.
There are a variety of typical streamflow characteristics proven to be useful for model calibration, including the FDC, baseflow indices, runoff ratio, master recession curve (Addor et al., 2018;Fenicia et al., 2018;Troy et al., 2008;Winsemius et al., 2009;Yilmaz et al., 2008). Moreover, the experiments conducted by Fenicia et al. (2018) showed that calibration against several FDC quantiles and baseflow index can produce virtually indistinguishable streamflow prediction from calibration operated directly on the streamflow time series. Therefore, the four most significant streamflow characteristics were selected as the calibration targets, namely, mean annual streamflow per unit area (QMEAN; mm/yr), baseflow index (BFI in this study, BFI3 in the GSCD; dimensionless), high flow (Q10, the daily flow exceeded 10% of the time; mm/d), and low flow (Q90, the daily flow exceeded 90% of the time; mm/d), to constrain different aspects of the hydrograph.
Given the selected streamflow characteristics and VIC model parameter sensitivity analysis by Demaria et al. (2007), three VIC parameters controlling the generation of surface and subsurface flow were selected for calibration (Table 1), including the variable infiltration curve parameter (b i ), thickness of soil layer 2 (thick 2 ), and fraction of the maximum velocity of base flow at which nonlinear base flow begins (Ds). This is a limited set of parameters to tune, as constrained by the number of calibration targets (four streamflow characteristics), but they are still powerful enough to help adjust most model behaviors.
The Shuffled Complex Evolution (SCE-UA) algorithm (Duan et al., 1992(Duan et al., , 1994 was employed to find the optimal parameter set for each grid-cell. The objective function for each grid-cell was:

Water Resources Research
were GSCD values, and QMEAN m , BFI m , Q 10m , Q 90m are calculated from the model simulation. w i (i = 1,2,3,4) represents the weighting factor, indicating the importance of the corresponding characteristic during calibration. Since the uncertainty behaviors and their definitions in GSCD are different for the four characteristics, they cannot be directly used to determine the weights. Moreover, the weight allocation itself is inevitably a more subjective process, which should be customized according to the specific focus or needs of the study. With considerations on the focus of this work on improving model overall performance and a simple assessment of their relative importance (subjective), w i (i = 1,2,3,4)was set to {0.6, 0.2, 0.1, 0.1} for the four characteristics in this study.

Model set-up, calibration, and evaluation
The VIC model (version 4.0.6, https://github.com/yuanyangthu/VIC4.06_netcdf) simulation was performed at daily time steps in water balance mode (Liang et al., 1994), forced by the meteorological data from Livneh et al. (2013), including daily minimum and maximum temperature, precipitation and wind speed. The initial values for all model soil and vegetation parameters (whether to calibrate or not) were obtained from gridded data sets developed for the NLDAS-2 .
The model ran at two different resolutions during the calibration and evaluation. The calibration was performed at 1/8°for the 20-year period of 1982-2001, with 2 years of model spin-up during 1980-1981.
The model evaluation was performed at a slightly higher resolution of 1/16°for the 10-year period of 2002-2011. The model was also evaluated for the calibration period 1982-2001. This higher resolution was used for two reasons: (1) the meteorological forcing data (Livneh et al., 2013) has its original resolution at 1/16°, (2) finer spatial resolution of the runoff can better match the unit catchments used by the routing model. After the parameters were calibrated at 1/8°, they were resampled to 1/16°using bilinear interpolation before evaluation.
To evaluate the VIC performance against observed streamflow at the outlets of larger basins, the modeled grid-cell runoff (surface and subsurface) was re-mapped and routed using a topographically-based routing model, the HRR model (Beighley et al., 2009). The model is built on an irregular model unit defined from the river network and associated catchment boundaries. For each model unit (catchment over a river reach), the landscape is approximated as an open book with two planes draining laterally to a main channel. The kinematic wave method is used for lateral transport of surface and subsurface runoff, while the diffusion wave (Muskingum-Cunge) method is used for river routing. The model has been applied in many regions, such as the Amazon (Beighley et al., 2009), Congo (Beighley et al., 2011), Susquehanna (Ray et al., 2016;Seyyedi et al., 2014Seyyedi et al., , 2015 and Ohio River Basin (Zhao & Beighley, 2016). In this application, the NLDAS domain was delineated into 74,075 river reaches and catchments based on a derived river network using a threshold area of roughly 100 km 2 . The resulting drainage area for the model units ranges from 0.1 to 2,700 km 2 , with a median value of 120 km 2 . The hillslope, channel and floodplain length and gradients required by the HRR model were determined using the 3 arc-sec (~90 m) elevation and flow direction dataset derived from NASA's Shuttle Radar Topography Mission (SRTM; Lehner et al., 2008). The main routing parameters, hillslope surface roughness, subsurface horizontal conductivity, and channel roughness were derived following Zhao and Beighley (2016).
The observed daily streamflow data in this study were part of the Geospatial Attributes of Gages for Evaluation Streamflow, version II (GAGES II; Falcone et al., 2010), and were downloaded from USGS website (https://waterdata.usgs.gov/nwis/dv/?referred_module=sw). Two groups of catchments were used for the evaluation. The first group of catchments (Group 1) were those used in deriving the characteristics in the GSCD, and the second group of catchments (Group 2) were those not used in training GSCD, which were used to test the robustness of the calibration method, especially in ungauged regions. In addition, these catchments were required to have more than 365 days of valid data (not necessarily consecutive) during calibration/validation period, resulting in 704 (calibration period)/648 (validation period) catchments in Group 1 and 3941 (calibration period)/3809 (validation period) in Group 2 that were suitable for the evaluation. These catchments are small-to-medium-sized catchments ranging from 70 to 26,000 km 2 . Additionally, we also selected five large and not heavily regulated basins which are larger than 26,000 km 2 and have a 30year complete data record from 1982-2011 to complement the model evaluation.
Four skill metrics, the relative bias (RB), alpha, correlation coefficient (CC) and Nash-Sutcliffe Efficiency (NSE; Nash & Sutcliffe, 1970) coefficient, were used to evaluate model performance. RB provides a measure on the total streamflow volume error. Alpha is a measure of relative variability in the simulated and observed values. CC tests the agreement between the simulated streamflow dynamics and gauge observations. NSE measures the overall predictive skill of the model as compared to the mean daily observed streamflow. Besides, another four similar metrics, including histogram intersection (histo match), spatial correlation coefficient (spatial CC), spatial variability error (CV ratio) and SPAtial Efficiency (SPAEF), were introduced to evaluate the spatial pattern of parameters and streamflow characteristics Koch et al., 2018). The metrics are summarized in Table 2.  Figure 1 (c4) was the Arno Ds converted from Nijssen baseflow parameters. The calibration methods to derive these five sets of parameters can be divided into two categories: (1) non-MPR approach, which directly calibrates parameters against observations; and (2) MPR approach, which relates parameters to basin predictors (e.g., elevation, slope, soil texture, or vegetation characteristics) through transfer function and then calibrates these "transfer function parameters". The parameters from Mizukami et al. (2017), derived by the MPR approach, were functions of topography and soil properties and showed very good spatial patterns (Figure 1 a4, b4, and c4). Among the other four sets of parameters, which were derived by non-MPR approach, the parameters calibrated in this study exhibited a much better spatial coherence, without discontinuities across basin boundaries or bull's eye patterns from simple parameter interpolation. For example, the topography, climate and soil characteristics are similar in the Rocky Mountains, which indicates that parameters would not vary significantly. However, the parameters from NLDAS-2 located in the same region or subregion of

Spatial patterns of calibrated parameters
Note. N is the number of days, Q O and Q m are observed and simulated daily streamflow. A and B represent two spatial fields. μ, σ and cov represent mean value, standard deviation and covariance. K and L are the histograms of A and B, each containing n bins, i.e. herein 10 bins.

Water Resources Research
No. 10, 14, 16 and 17 HUC2 basins changed abruptly from one region/subregion to another. Such patchiness also existed in calibrated parameters from Oubeidillah et al. (2014), but across HUC8 basins. Parameters in Troy et al. (2008) exhibited some clusters with high values in the center and low values around. Compared with the former three sources of parameters, the calibrated parameters in this study varied much more consistently in space. The parameter b i controls the shape of the variable infiltration curve in VIC, and effectively dictates the partitioning of rainfall into infiltration and surface runoff. Smaller b i results in an increase in the infiltration capacity and thus a decrease in the amount of surface runoff. Humid regions typically had a higher b i while arid regions had a lower b i , as shown in Figure 1 (a5). In contrast, the parameter thick 2 showed an opposite pattern with higher values in arid regions and lower values in humid regions (Figure 1 (b5)) since thicker soil depths can retain more water and increase the loss due to evapotranspiration. The spatial patterns of the calibrated b i and thick 2 match well with the spatial patterns in aridity and runoff ratio from the CAMELS (Catchment Attributes and MEteorology for Large-sample Studies) data set (Addor et al., 2017), as shown in Figure 2. b i was significantly and negatively correlated with aridity, with a spatial correlation of -0.587 (P<0.001), while positively correlated with runoff ratio, with a correlation of 0.416 (P<0.001). In contrast, thick 2 was significantly and positively correlated with aridity (spatial CC = 0.434, P<0.001), negatively correlated with runoff ratio (spatial CC = -0.418, P<0.001). Ds is a more conceptual parameter controlling the baseflow rate. It showed no obvious relationship with climate, topographic, soil, or vegetation characteristics and was more likely to be the result of more complex interactions between model parameters and streamflow characteristics.

Assessment of the grid-level distributed calibration strategy
The performance of the calibrated model was compared to that of the pre-calibrated model (i.e., the same model setups except using NLDAS-2 parameters in Figure 1 (a1), (b1) and (c1)) to assess the effectiveness of this calibration. NLDAS-2 parameters were used as benchmark for skill evaluation since they have been used for many studies. Figure 3 shows cumulative density function (CDF) plots for the RB, alpha, CC, and NSE values for the catchments of Group 1 for both the calibration (solid) and validation (dashed) periods. All metrics were computed based on daily VIC model simulations before calibration (blue lines) and after calibration (red lines). Tables in Figure 3 show the statistics of Kolmogorov-Smirnov test (K-S test; Eghbali, 1979;Smirnov, 1948), which was used to validate whether the improvements were significant. Significant improvements can be seen in RB as evidenced by steeper CDFs as well as median values that were closer to zero (Figure 3a). For the calibration period, the percentages of RB falling in ranges < -25%, [-25%, 25%] and > 25% were 7.2%, 75.5% and 17.3%, improved from 16.6%, 56.8% and 26.6%, respectively. This indicated that the large underestimation and overestimation was reduced after calibration. The calibration tended to underestimate the flow variability to a greater degree, which, at the same time, reduced the overestimation of the flow variability in pre-calibration ( Figure 3b). Therefore, higher percentages of alpha fell in ranges around 1.0, such as the range [0.5, 1.5]. As expected, there was no obvious improvement in terms of CC since the streamflow characteristics used to calibrate model did not provide much information about the temporal dynamics ( Figure 3c). In terms of NSE, the model performances were improved as indicated by the  Figure 3d. The NSE changed from negative to positive for approximately 10% of catchments for the calibration period. About 50% of the catchments after calibration had an NSE greater than 0.4.
To study the spatial pattern of model performance, the skill metrics were plotted at each catchment of Group 1 for both the calibration and validation periods in Figure 4. The improvements in model performance varied among different parts of the CONUS. The relative bias has been reduced, especially for the eastern part of the CONUS. The high positive streamflow biases in the eastern Great Plains and the southeastern CONUS have been reduced to less than ±30% for the majority of the catchments. The negative biases around the Great Lakes region have been reduced to less than ±10%. Improvements can also be seen elsewhere, for example in the northeast. Basins west of 100°W showed roughly the same strong positive biases after calibration, and thus exhibited little improvement. The change of the flow variability error showed a different spatial pattern, with overestimation being largely reduced in the western CONUS and underestimation being worsened in some regions, such as Texas, Kansas and Missouri. In terms of CC, there were no obvious changes after calibration. The change of NSE did not show an obvious spatial pattern.
Similar results can be seen for catchments which are not used to derive the GSCD ( Figure S1 and S2). For example, the improvements in RB were still concentrated in the eastern CONUS ( Figure S2). The strong overestimation in the southeastern CONUS and the eastern Great Plains has been reduced. The underestimation in the northeast has also been reduced.

Water Resources Research
Clearly, while the calibrated parameter set may provide satisfactory results for wetter regions, it is challenging to capture the streamflow in drier regions (Figure 4), which is consistent with previous studies Naz et al., 2016;Oubeidillah et al., 2014;. Figure 5 shows model metrics as a function of runoff ratio (RC, obtained from GSCD) for each small-tomedium sized catchment. It shows a tendency for VIC to perform worse for catchment with a lower runoff ratio than catchments with higher runoff ratios. Possible reasons for poor performance in the dry region included errors in forcing data (Beck et al., 2016;Lohmann et al., 2004;Mizukami et al., 2017;Naz et al., 2018;Oubeidillah et al., 2014;Troy et al., 2008;, anthropogenic effects on streamflow (Maurer et al., 2002;Troy et al., 2008; as well as limitations in model physics (Demaria et al., 2007;Naz et al., 2018;Newman et al., 2017;Oubeidillah et al., 2014). Issues in dry regions are being addressed in newer versions of VIC model. For example, the "Big Leaf" model, which leads to evapotranspiration (ET) underestimation due to lack of soil evaporation between plants, has been replaced by the "Clumped" model that considers such soil evaporation (see VIC model documentation at https://vic.readthedocs.io/en/master/Overview/ModelOverview/).
Three typical catchments from the southeastern CONUS, the southwestern CONUS and the northern Rocky Mountains were selected to showcase the region-specific model performance and efficacy of the calibration method. Figure 6 shows daily hydrographs for the three catchments for one year (2001) in the calibration period and one year (2002) in the validation period. The catchment Little River near Silverstreet, South Carolina (Figure 7), representative of the southeastern CONUS, exhibited much better model performance after calibration (Figure 6a). The simulation before calibration could capture the daily variability, but produced too much streamflow, especially for the dry season. The calibration significantly reduced the overestimation and showed a good match with the observations for this humid subtropical catchment. Arid regions have always been a challenge for the VIC model. For many catchments in these regions, the calibration was

Water Resources Research
only able to reduce the large wet bias in the model to a certain extent and the post-calibration bias was still too high. For example, Gila River near Gila, New Mexico, which locates in the southwestern CONUS and is a typical arid catchment, exhibited much higher streamflow simulations than the observations both before and after calibration (Figure 6b). The calibration in this study, which did not calibrate snow parameters, did not work well in snowmelt-dominated region. Figure 6c shows the observed and modeled streamflow at Bull Lake Creek above Bull Lake, Wyoming. The station is 1790 m above sea level and therefore the runoff is highly influenced by snowmelt. The modeled snowmelt seemed to have started too early, causing peak flows during April, while the observed peaks arrived in May. This early snowmelt is mainly affected by the one-layer snow band set in VIC and indicates that multiple snow bands should be calibrated and implemented. The VIC model both before and after calibration also underestimated streamflow at this station. This negative bias can be attributed to forcing errors Sheffield et al., 2003) and inefficiency in snowpack related parameterizations and parameters (Xia et al., 2018).
The GSCD was derived entirely using data from small basins. To complement the previous evaluation using the small basins, we performed a fully independent evaluation of the calibration strategy over five large basins (Figure 7). Table 3 shows the model performance for the five large basins at daily, monthly and yearly scales. To give an emphasis on relative bias, which measures the model skills in simulating total amount of

10.1029/2018WR024178
Water Resources Research runoff, annual streamflow is shown in Figure 8. As shown in Table 3 and Figure 8, the total amount of runoff was well simulated after calibration. The Potomac River, which showed a good performance prior to the calibration, continued to perform well. The Alabama River, for which the model overestimated streamflow, showed marked improvements during both the calibration and validation periods, with relative bias reduced to -4.11% and 8.51% from 24.74% and 40.32%, respectively. The calibrated model accurately reproduced the magnitude and variability of annual streamflow. Notable improvements were also observed in the Ohio River and Mississippi River, where the underestimation was strongly reduced. The Arkansas River, which is located in the arid region, showed slightly larger relative biases after calibration. Since the pre-calibration

Water Resources Research
parameters values (NLDAS-2) were already calibrated to maximize NSE in large basins (Maurer et al., 2002), the calibration here did not show much improvements in NSE, especially at daily and monthly scales. The improvements were mostly witnessed by basins which did not perform well prior to the calibration, such as the Alabama River and Mississippi River. Overall, the model performance in large basins after calibration was good, which is inspiring as it is an independent validation of our grid calibration strategy and shows that it is comparable to the large basin calibration approach.
The model performance in the validation period was comparable to that in the calibration period in most catchments (Figure 3, 4, 8 and Table 3). Overall, this indicated that the model parameters were robust enough across different periods with comparable skills. Figure 9 shows spatial maps of the four streamflow characteristics used for the calibration. For each characteristic, we compared four sources, including the GSCD and maps derived from the three simulations (2002-2011) based on NLDAS-2 parameters, Oubeidillah et al. (2014) parameters and parameters calibrated in this study. Note that we directly used the daily runoff simulations from Oubeidillah et al. (2014), considering that the model performed best and their parameters were most valid in the original setup (model version, forcing data and spatial resolution). The spatial pattern of QMEAN after the calibration performed in this study visually corresponded best with the GSCD. The overestimation in the southeastern CONUS and Great Plains and the underestimation in the northeastern CONUS and Great Lakes region by NLDAS-2 have been reduced. In terms of BFI, the GSCD and the map derived from the calibration in this study showed good agreement and a much better spatial coherence than the maps derived from NLDAS-2 and Oubeidillah et al. (2014) calibrations, which showed pronounced discontinuities across large basins and HUC8 basins, respectively. The BFI overestimation by NLDAS-2 in the east has been reduced significantly. The spatial patterns in BFI from the GSCD and the map derived here were quite similar to those from the CAMELS dataset (see Figure 4e of Addor et al., 2017), although different methods were used for the BFI calculation. The baseflow in CAMELS dataset was derived by the Ladson et al. (2013) digital filter, while the baseflow in the GSCD was the 7-day moving minimum streamflow. Compared with the GSCD, the BFI map after calibration had higher BFI in the most regions of the southwestern CONUS, which means the parameters calibrated in this study produced too much slow runoff. The spatial distribution of Q10 and Q90 followed closely the pattern of QMEAN. As was the case of QMEAN, calibration raised both the high flow and low flow in the northeast and Great Lakes region, lowered them in the southeast and Great Plains. However, there were still some large differences between the maps of GSCD and calibrated simulation. For example, the calibrated simulation produced higher Q10 in the area covering Nevada, Arizona, New Mexico and Texas, and higher Q90 in Missouri and Illinois. The limited improvement in Q10 and Q90 may due to the lower weights assigned to them in the objective function. Table 4 further shows the quantitative spatial comparison of the four streamflow characteristics before and after calibration. All the four streamflow characteristics derived here showed higher spatial correlations with those from GSCD. The CV ratio of QMEAN, BFI, and Q90 increased after calibration, showing the calibration improved the spatial variability of these streamflow characteristics. Except the slight decrease in QMEAN (0.988 to 0.980), the streamflow characteristics after calibration witnessed increases in histo match, which means that flow value distributions in different ranges were better reproduced. In terms of SPAEF, a comprehensive metric considering the above three metrics together and providing a more holistic and balanced assessment for spatial performance, the SPAEF of QMEAN, BFI, Q10 and Q90 increased to 0.916, 0.571, 0.807 and 0.785 from 0. 878, 0.233, 0.780, and 0.489, respectively, showing that the calibration contributed to better spatial patterns for the four streamflow characteristics.

Spatial distribution of QMEAN, Q10, Q90, BFI
Overall, the streamflow characteristics after calibration were much closer to the GSCD, demonstrating the effectiveness of the calibration strategy. But it should be noted that perfect agreement between the characteristic maps of GSCD and the VIC-based maps was unlikely, since (1)

Advantages and disadvantages of the calibration strategy
The grid-level distributed calibration against GSCD streamflow characteristics provided a highest possible calibration density in space for large-scale hydrologic modeling. This calibration strategy can also be applied in sparsely gauged or ungauged regions. It has been applied in Lin et al. (2019) for global discharge reconstruction. It is applicable to any hydrologic model and does not try to establish relationships between conceptual model parameters and watershed characteristics, which need to be reestablished when calibrating different models. Moreover, directly constraining the model to reproduce several hydrologically important characteristics of the hydrograph can make the calibration more robust to deficiencies in simple objective functions of streamflow time series since these characteristics are less sensitive to features poorly reproduced by the hydrologic model, such as the errors in the timing and magnitude of several individual flood peaks (Fenicia et al., 2018).
Although this calibration method offers appealing benefits, limitations also exist. First, the use of streamflow characteristics is usually associated with a loss of information in lieu of the streamflow time series, which may have a negative impact on parameter estimation (Fenicia et al., 2018;Shafii & Tolson, 2015). The "adequacy" of available characteristics requires further investigation. This study relied on only four streamflow characteristic values provided by GSCD. Not much information in time dimension except the high flow (Q10) and low flow (Q90) was involved, which may be responsible for the lack of improvements in daily dynamics (Figure 3c and Figure 4). While monthly streamflow characteristics could help to solve this issue, more uncertainties would be encountered when developing the monthly streamflow characteristics. More streamflow characteristics, such as the flashiness index used in Fenicia et al. (2018), could be considered in the calibration process to improve the variability and dynamics of the streamflow modeling.
Additionally, the quality of streamflow characteristics data matters too. GSCD was shown to be very reliable over CONUS thanks to the dense streamflow monitoring networks in the region (Beck et al., 2015), while extra caution should be exercised in global applications.
Second, this calibration strategy and resulted parameters are inevitably forcing-dependent, because the calibrated parameters can compensate forcing biases and model structure problems (Blöschl et al., 2007;Bock et al., 2016;Elsner et al., 2014;Merz et al., 2011;Weiland et al., 2015). This would reduce the skill of the model in replicating results outside of calibration conditions and increase the potential for equifinality of parameter sets and higher model uncertainty. For example, Elsner et al. (2014) demonstrated that the choice of forcing dataset resulted in different parameter combinations, although with similar objective function values. While such forcing dependency is acceptable in the CONUS, given the high quality of the forcing data, this can be an issue in regions with low quality forcing data.
Third, the potential performance improvement will depend on the number of calibrated parameters (Beck et al., 2016). The most sensitive parameters for runoff generation in VIC are mostly "empirical" parameters, i.e., curve shapes, layer thickness and baseflow linearity threshold. Newman et al. (2017) has pointed out that increasing the number of calibrated parameters could improve VIC model performance. Three soil-related parameters were calibrated here. The other soil-related conceptual parameters, such as the maximum base flow velocity and thickness of soil layer 3, could also affect the hydrologic responses and be calibrated rather than estimated a priori using soil texture data or expert opinion. Parameters related to vegetation characteristics could be calibrated as well. Vegetation parameters allow substantially greater  (2017) calibrated minimum stomatal resistance and monthly varying LAI to achieve better model performance. However, at the same time, we need to bear in mind that increasing the number of free calibration parameters may also increase the equifinality problem. Extra attention is needed while calibrating more parameters.
Fourth, the resolution of calibration targets (streamflow characteristics) limits calibration density and thus parameter resolution, a disadvantage against MPR. It would be more difficult to calibrate with this method at a higher resolution because of higher computation cost and potentially higher uncertainty in estimates of gridded streamflow characteristics. Here we calibrated VIC model at 1/8°and then applied the calibrated parameter set at 1/16°after bilinear interpolation. Figure S3 shows model performance for catchments of Group 1 at the two resolutions. Significant differences were found in the distribution of relative bias ( Figure S3 (a)). In general, downscaling parameters from 1/8°to 1/16°produced more runoff. This is because at coarser resolutions, the precipitation intensity is lower due to averaging (Troy et al., 2008)a typical scaling issue. This can be amended if we can downscale runoff properly. The relative biases between 1/8°a nd 1/16°differed a lot in the western mountainous region (Figure 10), where model with the downscaled 1/16°parameters showed much larger positive biases. The actual runoff differences between 1/8°and 1/16°were not so large in this region, but the small runoff magnitude inflated the relative bias. As for alpha and correlation, their distributions at the two resolutions were very close. Overall, there were no significant differences in the distribution of NSE, indicating that scaling parameters from 1/8°to 1/16°had little effect on model performance in this study. More efforts are needed in the future to improve the scale consistency of distributed parameter estimations.

Conclusions
As large-scale hydrologic modeling at fine resolutions is becoming more common and necessary in hydrologic applications, spatially distributed parameters need to be improved. This study introduced a grid-level distributed calibration method using machine learning-based streamflow characteristics for the VIC model over CONUS. The main findings are as follows: 1. A set of distributed parameters at 1/8°resolution was produced with good spatial coherence. The spatial patterns in calibrated parameter values corresponded well with spatial patterns in climate characteristics, such as aridity and runoff ratio. 2. The quality and applicability of the calibrated parameters were validated by better model performance with respect to the observed streamflow from two groups of small-to-medium-sized catchments

Water Resources Research
(70-26,000 km 2 , 704/648 (calibration/validation period) in the group used to derive the GSCD, 3941/3809 (calibration/validation period) in the group not used to derive the GSCD) as well as five large basins. Considerable improvements can be seen in relative bias, with large overestimation and underestimation in the east of the CONUS being reduced. However, the catchments in the central and southwest US tended to improve less, suggesting additional issues to solve that are beyond the reach of parameter calibration, e.g., forcing error, anthropogenic effects and model physics. 3. The comparisons on streamflow characteristics during validation period showed that the calibration was able to match these characteristics from the model to GSCD, which further confirmed the efficacy of the calibration scheme.
Overall, this study offered a proof-of-concept for the grid-level distributed parameter calibration at continental scale. As the new calibration strategy enabled a much higher calibration density, the benefits are obvious for high-resolution hydrologic and river hydrodynamic modeling. Besides the intrinsic limitations of calibration measures, the lack of temporal dynamics information also limits the power of this strategy. Future work will extend the experiments to global scale.