Streamflow Generation From Catchments of Contrasting Lithologies: The Role of Soil Properties, Topography, and Catchment Size

Understanding streamflow generation and its dependence on catchment characteristics requires large spatial data sets and is often limited by convoluted effects of multiple variables. Here we address this knowledge gap using data‐informed, physics‐based hydrologic modeling in two catchments with similar vegetation and climate but different lithology (Shale Hills [SH], shale, 0.08 km2, and Garner Run [GR], sandstone, 1.34 km2), which influences catchment topography and soil properties. The sandstone catchment, GR, is characterized by lower drainage density, extensive valley fill, and bouldery soils. We tested the hypothesis that the influence of topographic characteristics is more significant than that of soil properties and catchment size. Transferring calibration coefficients from the previously calibrated SH model to GR cannot reproduce monthly discharge until after incorporating measured boulder distribution at GR. Model calibration underscored the importance of soil properties (porosity, van Genuchten parameters, and boulder characteristics) in reproducing daily discharge. Virtual experiments were used to swap topography, soil properties, and catchment size one at a time to disentangle their influence. They showed that clayey SH soils led to high nonlinearity and threshold behavior. With the same soil and topography, changing from SH to GR size consistently increased dynamic water storage (Sd) from ~0.12 to ~0.17 m. All analyses accentuated the predominant control of soil properties, therefore rejecting the hypothesis. The results illustrate the use of physics‐based modeling for illuminating mechanisms and underscore the importance of subsurface characterization as we move toward hydrological prediction in ungauged basins.


Introduction
Forecasting streamflow and extreme hydrological events (e.g., flooding and droughts) is essential for our society as the pace of climate change accelerates (Hrachowitz et al., 2013;Montanari et al., 2013;Sivapalan, 2003;Vorosmarty et al., 2010). Although forecasting capability has progressed significantly in the recent decades with rapid data accumulation and model development, hydrological prediction in ungauged basins (PUB) remains a grand challenge (Hrachowitz et al., 2013). Model Transferability tests directly from gauged to ungauged watersheds have yielded mixed results (Fenicia et al., 2016;Heuvelmans et al., 2004;Li et al., 2012;Smith, Hayes, et al., 2016;van der Linden & Woo, 2003), underscoring challenges in understanding how and how much catchment characteristics (e.g., lithology, land cover, topography, and size) influence streamflow generation.
Studies on streamflow generation have revolved around the dynamics of connectivity, storage (S) and discharge (Q) relationships, and threshold behaviors that emerge at the catchment scale. Storage-discharge relationships have been explored since the 1930s (Horton, 1936) and are generally recognized as highly nonlinear, often taking a power law form (Wittenberg, 1999). Water storage is often conceptualized as partitioning between distinct reservoirs: a small, active reservoir that rapidly responds to hydroclimatic forcing (dynamic storage S d ) and a passive reservoir that is characterized by longer residence times before water reemerges in streams and rivers (Dunn et al., 2010;Kobierska et al., 2015;McNamara et al., 2011;van der Velde et al., 2015). Transit time estimates using water isotopes or tracers have shown that S d can be more than an order of magnitude smaller than the total storage (S t ) inferred from soil porosity and depth (Birkel et al., 2011;Bishop et al., 2011;McNamara et al., 2011). Such dramatic reduction in S d has been attributed to limited hydrologic connectivity between hillslopes and streams (Bracken et al., 2013;Jencso et al., 2010;McGlynn & McDonnell, 2003;Wlostowski et al., 2016), large water storage, and the existence of thresholds that must be surpassed to initiate hydrological response (Ali et al., 2015;James & Roulet, 2007;Lehmann et al., 2007;Seibert et al., 2011;Tromp-van Meerveld & McDonnell, 2006).
Streamflow generation is often influenced by a multitude of competing factors including external forcings (e.g., climate) and internal structure characteristics (vegetation, land use, topography, and lithology). Among these, the role of lithology and topographic characteristics has been extensively studied. Brantley et al. (2017) argued that weathering affects subsurface permeability, water infiltration, and flow partitioning between shallow and deep subsurface. Kuentz et al. (2017) concluded that the base flow index correlates more strongly with lithology whereas topography primarily controls streamflow flashiness. Hale and McDonnell (2016) observed that catchments derived from permeable sandstone bedrock have longer mean transit times than those with tight volcanic bedrock (Hale & McDonnell, 2016). Pfister et al. (2017) showed that catchment storage decreases whereas streamflow flashiness increases as the percentage of impermeable bedrock increases (Pfister et al., 2017). On the other hand, topographic characteristics, including relief and riparian versus hillslope areas, have long been demonstrated as governing streamflow. Catchment topography and topology have been observed to correlate to hillslope-stream connectivity and runoff source area, therefore exerting a first-order control on streamflow (Jencso et al., 2009). Mean residence time has been shown to correlate strongly with topographic characteristics that determine flow path distance and gradients (McGuire et al., 2005).
In addition to lithologic and topographic properties, soil properties generally regulate water holding and storage capacity (Van Genuchten, 1980), base flow indexes (Zimmer & Gannon, 2018), and streamflow dynamics (Zimmer & McGlynn, 2017). Shi et al. (2015) showed that soil moisture is controlled primarily by soil properties and secondarily by topography and depth to bedrock. Fenicia et al. (2016) found that models with geology-based hydrological response units (HRUs) are more robust in reproducing spatial variations in streamflow compared to those using topography-based HRUs. This observation supports the idea that geology and soil properties exert a stronger control than topography. Topography and soil properties can also drive runoff generation differently under different conditions. In the Tenderfoot watershed in Montana, runoff is topographically driven with lateral redistribution of water and hydrologic connectivity under wet conditions (Jencso et al., 2009); under dry conditions, the influence of geologic controls is more pronounced (Payn et al., 2009).
Streamflow generation also varies with catchment size (Pilgrim et al., 1982). For example, in wet catchments at Maimai, New Zealand, with relatively similar geology, topography, and soil depths, riparian zone groundwater levels and runoff correlate strongly in small headwater catchments but not in large catchments (McGlynn et al., 2004). In nested headwater catchments in New York, USA, estimated event-water contribution during intensive storms inversely correlate to catchment size (Brown et al., 1999). Those authors argued that shallow subsurface flow contributes to summer stormflow substantially in small catchments but not in large catchments. A geomorphology-based model of runoff routing was used to argue that small catchment response is governed primarily by hillslope processes whereas large catchment response is governed by the structure of the stream network (Robinson et al., 1995).
The above-mentioned examples highlight the challenges of disentangling the effects of interdependent controlling variables. While general relationships may exist, their relative influence on streamflow generation remains equivocal. Often, catchments are carefully chosen to highlight the role of one variable (e.g., topography, soil properties, and size) while keeping others relatively similar. Nonetheless, large differences in natural catchments inevitably exist. An additional challenge is that streamflow generation is a catchment-scale emergent behavior and investigating its mechanisms often requires intensive spatial measurements of soil moisture and groundwater levels. Previous field studies on connectivity involve >147 storm analyses (Tromp-van Meerveld & McDonnell, 2006), 13 spatial distributions of soil moisture each with >500 measurements (Western et al., 1999), water table analyses in 84 recording wells distributed across 24 transects (Jencso et al., 2009), and~30 soil moisture monitoring sites (Lin et al., 2006). One way to reduce and circumvent these challenges is to compare catchments of distinctive characteristics 10.1029/2018WR023736

Water Resources Research
in controlled virtual experiments while changing only one variable at a time. In that way, the role of structural variables can be distinguished at the same time that general, emergent patterns can be highlighted. Such an approach could be more effective than focusing on the "idiosyncrasies of yet another experimental catchment" (McDonnell, 2003).
Here we combined data and model approaches to understand the relative influence of soil properties, topography, and catchment size on streamflow generation in two first-order, monolithological catchments in central Pennsylvania experiencing the same climate (temperate) and land use (forestland). One is the shale-underlain Shale Hills (SH, 0.08 km 2 ); the other is the sandstone-based Garner Run (GR, 1.34 km 2 ). The two catchments derive from different parent materials and differ in topography (relief, size of riparian zones, and slope), soil and macropore properties, and catchment size, all of which emerge from landscape evolution and ultimately depend on lithology (Reinhardt & Ellis, 2015). Streamflow monitoring data show that GR discharge is less flashy than SH (Hoagland et al., 2017;Shi et al., 2013). Given that GR has more permeable, bouldery sandy soils that often lead to flashy discharge, we hypothesize that streamflow response and storage-discharge relationships are more affected by topography than by soil properties and catchment size. In other words, we hypothesize that the influence of topographic characteristics (a flatter slope, longer slope length, larger riparian zone) is more significant than that of soil properties and catchment size, leading to a dampened streamflow response and a linear S-Q relationship at GR compared to SH.
This hypothesis was tested using three analyses. The first was a model transferability test, treating GR as if it was a new, ungauged catchment to which we transferred model calibration information from the previously modeled SH using the physically based, spatially distributed land surface hydrologic model Flux-Penn State Integrated Hydrologic Model (PIHM; Shi et al., 2013). The data used in GR were restricted to those that would be typically available for ungauged catchments. In this way, we explored the idea of transferring calibrated model parameters from one measured catchment to a nearby, ungauged one. If the hypothesis is true, the model calibration from SH should be directly transferable and should produce satisfactory discharge because topography is explicitly represented in Flux-PIHM. The second analysis included model calibration and sensitivity analysis that identified key process parameters. If the hypothesis is true, the soil property parameters should have relatively small influence on reproducing discharge and soil moisture data. Third, Figure 1. A simplified geological map of Shavers Creek watershed (the Susquehanna Shale Hills Critical Zone Observatory), showing the location, dominant lithology, and topography (elevation contours) for Shale Hills (SH) and Garner Run (GR) catchments. The catchments experience similar land use, climate, and tectonic histories but differ in lithology. These lithological differences have resulted in large differences in terms of relief, catchment size, drainage density, riparian/ hillslope ratios, and so on. Both today's climate and past periglacial conditions have influenced the underlying aquifer and surficial soil properties.
swap experiments were carried out where catchment characteristics were swapped one at a time, to tease apart the relative influence of topography, soil properties, and size. These virtual experiments assessed the influence of topography compared to soil properties and catchment size. The hypothesis suggests that the largest differences are expected for virtual catchments with different topography. The mechanistic understanding gleaned from these virtual experiments can facilitate the development of parsimonious models, reproduce streamflow dynamics, and advance our forecasting capabilities in PUB .

Study Sites
SH and GR are first-order catchments that are nested within the Susquehanna Shale Hills Critical Zone Observatory (SSHCZO). The CZO consists of the entire Shavers Creek watershed, a hydrologic unit code 10 watershed ( Figure 1 and Table 1). The streams in SH and GR are intermittent and perennial, respectively. SH is steeper (mean hillslope angle = 14-21°) than GR (mean hillslope angle = 12-17°), and GR has a much larger riparian zone with an extensive valley fill up to 300 m wide (~30% of the catchment width), compared to about 25 m (10% of catchment width) at SH. Both catchments are temperate forests dominated by deciduous broadleaf species with a small component of evergreen conifers and understory shrubs (Brubaker et al., 2018;Smith, Eissenstat, et al., 2016). SH generally has denser vegetation cover (60% more biomass and 19% larger maximum leaf area index [LAI]) than GR. Aspect plays a very different role between the catchments. While there is an overall similar pattern across slope position, both sites have higher biomass values at the valley floor compared to the ridge top position (Brubaker et al., 2018). Colocated measurements have shown that soil moisture covaries with leaf production and LAI, suggesting that trees influence soil moisture across space and time (Naithani et al., 2013). Detailed, high-resolution measurements of temporal LAI data however are not available such that we assume uniform LAI in the simulations.
SH is nearly 100% situated on the Rose Hill Shale Formation (Brantley et al., 2016). The Weikert soil dominates hillslopes while the Berks, Blairton, Rushtown, and Ernest soils prevail in swales and valley floor (Lin et al., 2006). Given that these soils are formed on the same rock in the same climate and with the same vegetation, the difference in soil types is largely a function of landscape position (soils in valley are wetter because

Water Resources Research
water drains downhill and organic matter accumulates in the valley) and aspect (one side is sunnier than the other). GR overlies the Tuscarora Formation (Flueckinger, 1969), which consists of almost pure quarzitic sandstone with minor interbedded shales (Hettinger, 2001). The Natural Resources Conservation Service Soil Survey Geographic Database (SSURGO; http://www.nrcs.usda.gov/wps/portal/nrcs/main/soils/survey/) and CZO observations (Brantley et al., 2016;Del Vecchio et al., 2018;Hoagland et al., 2017) show that GR bouldery soil has higher sand content and less silt and clay content compared to SH, which is consistent with higher hydraulic conductivity, larger pore size and porosity in GR. The residence times of the soils at the two sites differ because of differing erodibility and reactivity of the parent rock (Brantley et al., 2018;Del Vecchio et al., 2018;Li et al., 2018). Jin et al. (2010) reported that the SH soil has 38 wt.% (ridge top) to 64 wt.% (valley floor) clay (summation of illite, chlorite, and kaolinite; Table 2 in Jin et al., 2010). In contrast, the clay content in GR soils varies from 8% to 32%.
Given that we wanted to treat GR as if it was a new, ungauged catchment to which we transfer model parameters and knowledge from the well-studied SH, the model setup was mostly based on satellite and national data that were available even for ungauged river basins, allowing a test of transferability. These included data from the U.S. Geological Survey (USGS), National Elevation Dataset (NED, elevation map, https:// lta.cr.usgs.gov/NED), the National Land Cover Database (NLCD, vegetation map, https://www.mrlc.gov/ nlcd2011.php), the National Hydrography Dataset (hydrographic data, https://www.usgs.gov/corescience-systems/ngp/national-hydrography/national-hydrography-dataset), the North American Land Data Assimilation Systems Phase 2 (hourly meteorological forcing, https://ldas.gsfc.nasa.gov/nldas/ NLDAS2forcing.php), and the Moderate Resolution Imaging Spectroradiometer (MODIS, LAI every 8 days, https://modis.gsfc.nasa.gov/data/dataprod/mod15.php). For precipitation, although we had local measurements at SH, we did not have those measurements at GR. To compare the response of SH and GR to the same precipitation forcing, we used the same North American Land Data Assimilation Systems Phase 2 precipitation forcing data for the two catchments. The discharge was measured using stream stage (HOBO Pressure Transducer, Onset Computing) and a Parshall. The averaged surface soil moisture was measured using the cosmic-ray soil moisture observing system (COSMOS) that counts neutron intensities at the vicinity of the ground surface. Discharge and COSMOS data in 2015 were used as model constraints. The averaged model output of the topsoil moisture (<10 cm) within the effective footprint radius of~300 m were compared to the COSMOS data. The COSMOS measurements are affected by water content in soil organic matter (e.g., O horizons) and vegetation. The top 10 cm was chosen to be consistent with the calculation that~86% of the fast neutron counts was found to within the top 10 cm of the soil (Zreda et al., 2008).
At GR, boulders cover a fraction of the hillslopes; in contrast, at SH, rock fragments are observed to emerge only near trees or tree throws. The spatial distributions of boulders at GR were mapped at a resolution of 5 m and were grouped into four categories of volume fractions (<0.10, 0.10-0.67, >0.67 with trees, and >0.67 without trees; Del . At GR, depth to refusal with respect to hand augering (midslope and ridge) or depth to inferred bedrock (excavated with a jack hammer in the valley floor pit) ranges from 0.58 to 2.42 m with an average of 1.8 m

Model Processes
Flux-PIHM is a physically based, spatially distributed model  that couples a land surface scheme adapted from the Noah Land Surface Model with the PIHM (Qu & Duffy, 2007). In addition, there is a family of PIHM-related codes with different simulation capabilities , including landscape evolution (Zhang et al., 2016), ecosystem biogeochemistry (Shi et al., 2018), and catchment-scale reactive transport (Bao et al., 2017;Li, 2019;Zhi et al., 2019). The code discretizes the land surface into triangular elements and rivers into rectangular segments that are projected vertically down to the bedrock to generate 10.1029/2018WR023736

Water Resources Research
prismatic volumes. Simulated hydrological processes include precipitation, canopy interception, evapotranspiration, channel flow, overland flow, infiltration, recharge from the unsaturated to saturated zone, lateral flow connecting hillslope to the stream, and snow melt. The model outputs include spatial and temporal distribution of water content, from which water fluxes and budgets can be calculated.
Flux-PIHM assumes a no-flow boundary corresponding to the soil-weathered rock interface (Qu & Duffy, 2007;Shi et al., 2013). It therefore does not take into account recharge vertically from soil into regional groundwater in deep aquifers (Brantley et al., 2013). Such vertical flow is typically small given the sharp, often orders of magnitude, permeability contrast at the soil-weathered rock interface (Kuntz et al., 2011;Welch & Allen, 2014) and is typically ignored when solving water balance for a catchment (Kirchner, 2009). At GR, the valley is filled with high-permeability colluvial materials up to 9 m thick such that water can be lost to the deeper aquifer (Schaller & Fan, 2009). Water is also lost in the subsurface of SH, but this may be smaller in SH, as it does not have such high-permeability fluvial materials. Given that Flux-PIHM does not include this potential water loss into the deep aquifer, the code essentially lumps this water loss into evapotranspiration (ET) to conserve water balance when stream discharge and soil moisture are used for calibration. Here we note the calculated ET as ET*, which includes potential recharge into the deep aquifer.

Model Setup
Catchment characteristics include topography (e.g., surface elevation), vegetation properties (e.g., land cover type, rooting depth, LAI, and stomatal conductance), and soil properties. All vegetation parameters were from previous work . The LAI values from MODIS at a frequency of 8 days were used as external forcing. A uniform LAI value was used for the whole watershed at the spatial resolution of MODIS (1 km 2 ). Soil properties include matrix properties such as depth, hydraulic conductivity, porosity, and van Genuchten parameters (α, n, θ s , and θ r ) and macropore properties such as depth and conductivities (horizontal and vertical) that reflect boulder characteristics. The code read elevation maps from the USGS NED and land cover maps from the NLCD. These data set up the domain of virtual catchments such that they represented the topography and land cover of the real catchments. Initial and boundary conditions included subsurface characteristics (e.g., soil depth), initial water distribution, water table, snow cover, canopy storage, and watershed boundary fluxes. Soil matrix properties were from the SSURGO data initially, and macropore properties were based on soil survey data, as will be discussed later.

Calibration
Like any spatially explicit model, Flux-PIHM is parameter intensive. There are 8 soil types in GR from the SSURGO database and 13 distributed parameters for each soil type, including 3 macropore parameters. If each parameter was calibrated independently in Flux-PIHM, a total of 104 soil parameters is needed. This overparameterization was circumvented by using parameter regularization that utilized spatially distributed a priori soil parameters and a single global calibration coefficient (GCC) for each parameter, a common practice in physically based hydrologic modeling (Smith et al., 2004).

Storage-Discharge (S-Q) Relationship
Streamflow generation is known to depend largely on S-Q relationships. Based on the rainfall-runoff equation, Kirchner (2009) developed the following linking streamflow to storage: Under the condition b < 2, Q ref is a reference discharge based on the best fit to the S-Q relationship; S is the water storage calculated from model output averaged across the whole catchment; S d (or originally k 1 in Kirchner, 2009) represents a scaling constant that has the dimension of storage; S min (or originally S 0 in Kirchner, 2009) is the storage when discharge drops to zero; and b is the exponential term in dQ dt ¼ −aQ b (Brutsaert & Nieber, 1977), representing the S-Q nonlinearity. Water storage is often considered as 10.1029/2018WR023736 Water Resources Research composed of "dynamic" storage that responds rapidly to hydrological events and "passive" storage that is often deeper and takes longer time for water parcels to route through (Hrachowitz et al., 2016). Summation of the two is the total storage. In this work, we consider S d in equation (1) as the dynamic storage. It was calculated as the difference in the modeled water storage at the maximum and minimum discharges, as will be shown later in the S-Q figures in section 6. The total storage S t was calculated as the total pore volume per land surface area (soil porosity × depth/land surface area). By fixing S min and S d , values of Q ref and b were obtained by fitting the calculated Q and S relationship from the calibrated Flux-PIHM with equation (1) using the trust-region algorithm and "Curve Fitting" tool in MATLAB 2016a.

Connectivity
Hydrologic connectivity at the catchment scale is transient and depends on the spatial distribution of water content. Here we calculated hydrologic connectivity based on the model output of saturated water storage and flow connecting the uphill to the stream. A MATLAB code was developed based on approaches in literature (Allard, 1994;Western et al., 2001). The connectivity function τ(h) is defined as the probability (P) of two grid blocks being connected at a separation distance of h (in Euclidean distance): where x and x + h represent the locations of two grid blocks; G is the set of grid blocks in the domain; A is the subset of G with saturated water storage higher than a threshold indicator value (TIV), defined as the 75th percentile of saturated storage over the whole year and the whole catchment (James & Roulet, 2007;Western et al., 2001); and the "↔" sign indicates two grid blocks (at x in set A and at x + h in set G) are connected if there is a continuous path of neighboring grid blocks between them with indicator values larger than the TIV.
The approach involves three main steps: (1) Identify grid blocks with saturated water storage higher than TIV; (2) use a recursive algorithm to identify and label continuous paths that consist of grid blocks with saturated water storage higher than TIV and are connected to the stream; and (3) calculate the integral connectivity scale (I cs ) by looping through all grid blocks as follows: The I cs has the units of length and can be interpreted as the average length of flow paths from the uphill saturated zone to the stream. The relative connectivity I cs /W, where W is the average width of the catchment in the direction perpendicular to the stream, quantifies the proportional flow path length connected to the stream.

Data Model Analysis
Three lines of analysis were carried out and are described in subsequent sections. The first was the direct parameter transferability test from the calibrated modeled SH to GR. This assessed the degree to which discharge and soil moisture data can be reproduced in GR directly using the SH calibration information. The second was the calibration and sensitivity analysis that identified key parameters to reproduce data at GR. The third was the swap experiments that swapped characteristics (soil properties, topography, and size) of SH and GR one at a time, in an effort to quantify the relative significance of catchment characteristics on water storage, connectivity, and S-Q relationships.

Parameter Transferability Test
Flux-PIHM has a large number of parameters and therefore requires time-and labor-intensive model calibration. Here we transferred model calibration coefficients from the calibrated SH model to GR without any calibration. We considered two soil data sets for SH: the data from the SSURGO database and the data from the soil survey in the field (Lin et al., 2006). The requisite parameters for the GR soil matrix (e.g., van Genuchten parameters) were only available from the SSURGO soil database. To take into account the differences in the soil data, three transferability tests were carried out to assess different approaches of parameter transfer. In the "SSURGO" case, both catchments were parameterized using the SSURGO soil database. The GCC values for SH, calibrated using SSURGO soil parameters and discharge data (Shi et al., 2015), were directly transferred to the GR model that used SSURGO soil parameters. In the "Scaled" case, the SH model parameters were calibrated by discharge and COSMOS data using the a priori field soil survey. For GR, we rescaled the SH GCC values to reconcile the average parameter differences between the soil survey and SSURGO soil parameters by using the scaling equation χ GR;scaled ¼ χ SH; survey χ SH; SSURGO ×χ GR; SSURGO , where χ SH; survey χ SH; SSURGO is the factor that scales a priori SSURGO soil parameters to corresponding soil survey parameters. Here χ SH, survey and χ SH, SSURGO are the area-weighted average parameters from the soil survey and from SSURGO, respectively. The assumption was that the scaling factor between SSURGO and field survey data was the same for the two catchments. However, we did observe a high land surface area covered by boulders at GR. Therefore, a third "Scaled + Boulder Map" case was set up utilizing the GCC values in the Scaled case but additionally included the surveyed boulder map for GR . The SSURGO and boulder maps ( Figure 2) shared similar characteristics and distribution patterns. The boulder map however was more detailed with higher resolution. The boulders included clasts with large grain size that were characterized by cracks between rock and soil and macropores. In Flux-PIHM, the boulder content was represented by setting a distribution of macropore volume fraction and depth based on the measured boulder map. The boulder survey was attained from intensive field work and was thus an unusual data set that would generally not be available without intensive mapping.

Uncertainty-Based Calibration
Unsaturated water dynamics were described by the van Genuchten equation ; |h| is the water pressure (L or water head); θ s and θ r are saturated and residual water content ([L 3 L −3 ]), respectively, with θ s the same as porosity; and α and n are van Genuchten parameters describing the shape of the water retention curve. Previous studies in SH Yu et al., 2013;Yu et al., 2014) identified six parameters that were most sensitive: porosity, van Genuchten parameters (α and n), macropore depth (D mac , [L]), and conductivities (horizontal K macH and vertical K macV , [L T −1 ]). Here we calibrated these parameters using stream discharge and COSMOS data. The model was calibrated under two scenarios: one based directly on the SSURGO soil map (GR-without boulder) and the other based on SSURGO combined with the measured boulder map (GR-with boulder). The comparison of the two cases assesses the importance of including the boulder map in calibration.

Water Resources Research
The Hornberger-Spear-Young approach (Hornberger & Spear, 1981) was used to calibrate the model. GCC values were sampled using the Latin hypercube sampling method (McKay et al., 1979) for 500 simulations for the two scenarios (GR-with boulder and GR-without boulder) for a total of 1,000 simulations. The daily Nash-Sutcliffe efficiency (NSE; Nash & Sutcliffe, 1970), percent bias (PBIAS; Gupta et al., 1999), and RMSEobservations standard deviation ratio (RSR; Singh et al., 2005) were used for model performance evaluation using the satisfactory range of NSE > 0.5, RSR ≤ 0.7, and |PBIAS| < 25 % (Moriasi et al., 2007). The mean and uncertainty of each parameter were calculated using cases within "acceptable" statistical criteria and were normalized for comparison using ¼ χ−χ min χ max −χ min , where χ represents the value of each parameter and χ max and χ min are the maximum and minimum values of model parameters.

Swap Experiments
The model used the digital elevation data from the USGS NED to define the size and topography. The same homogeneous vegetation type (deciduous) and LAI values (time dependent from MODIS) were used for SH and GR to eliminate the influence of vegetation. The soil property and land cover parameters were then distributed for each grid. In this way, the code built digital catchments that represented the spatial structure of real catchments.
SH and GR differ in three variables: soil properties, topography, and size. Although vegetation types, total biomass, and LAI also differ in these two catchments (Brubaker et al., 2018), the magnitude of these differences are generally smaller compared to the three variables (Table 1). As discussed earlier, their interdependent effects present a major barrier for quantifying the relative importance of individual variables. Although parameter sensitivity analysis is often used to assess soil property parameters, it cannot evaluate effects of topography and catchment size as they are typically held constant. The swap experiments here aim to circumvent such limitations. For each swap experiment (Table 2), we first picked the topography of one catchments to set up the simulation domain. The size can be reduced or expanded by scaling the elevation and area of each grid proportionally so that topography features such as slope gradients, proportional slope length, and area of riparian zone were preserved. The averaged soil property parameters, including porosity, van Genutchen parameters, and macropore parameters, were then assigned uniformly to each grid block based on calibrated parameter sets from the base case.
As an example, in setting up the simulation labeled Soil SH Tp GR Size SH , we took the GR digital elevation map (Tp GR ) but reduced the elevation and length of each grid proportionally to the size of SH to maintain the GR topography. The averaged soil properties of SH were then assigned uniformly across the domain (Soil SH ). Similarly, Soil GR Tp GR Size SH had the topography of GR and the size of SH but averaged GR soil properties across the domain. The comparison between these two cases quantified the effects of soil properties because this was the only catchment variable that differed between the two cases. Similarly, the difference between Soil SH Tp GR Size SH and Soil SH Tp SH Size SH quantified the topography influence. Note that the Soil SH Tp SH Size SH case used averaged, uniformly assigned soil properties and therefore differed from the calibrated SH real case (SH) where soil properties (and therefore parameters) were spatially heterogeneous across the catchment. It also differed from the hypothetical GR-Size SH that preserved the heterogeneous soil distribution and topography of the real GR case (the calibrated model) but reduced to the size of SH. Note. Area-averaged soil depths are 1.7 and 1.8 m for SH and GR, respectively. The soil parameters are porosity, van Genuchten parameters (α and n), which describe the shape of the water retention curve, Macropore parameters: include macropore depth (D mac ), and conductivities (horizontal K macH and vertical K macV ).

Transferability Test and the Calibrated Model
Figures 3a and 3b compare the modeled and measured discharge and COSMOS soil moisture in the transferability tests (gray) and from the model after calibration (blue) at GR. GCC values calibrated using discharge data (Shi et al., 2015) were directly transferred from SH to GR for the SSURGO case (where SSURGO soil parameters were used). The model reproduced the temporal trend of monthly average topsoil moisture but only marginally reflected discharge peaks responding to large rainfall events. The model generally overestimated discharge peaks and topsoil moisture and underestimated low flow. The "Scaled" case underestimated discharge and shallow soil moisture to an even larger extent, especially under dry conditions. In the "Scaled + Boulder Map" case where the macropore fraction was spatially distributed based on the boulder map, the model output came closer to measured soil moisture and daily discharge, especially under dry conditions. Model prediction of monthly discharge an NSE value of 0.78 (>0.5); the topsoil moisture prediction also came closer to COSMOS measurements compared to the cases without boulders. That is, when we incorporated the measured boulder map, the model can capture monthly dynamics although not daily dynamics. Comparison of model output between the "Scaled" and "Scaled + Boulder Map" cases showed that the addition of boulder map increased infiltration rates and lateral flow by about 6% of precipitation (data not shown). However did not change much of the already-negligible surface runoff (<1%) at GR. The model with the boulder map produced more flashy peaks under wet conditions. Figure 3 also compares observations and calibrated output for acceptable runs that satisfied all three performance criteria with and without boulder information. At GR, frequent large rainfalls in July and August led to large discharge peaks and high soil moisture conditions. After August, the catchment became increasingly Comparison is for daily discharge (a and c) and areal averaged topsoil moisture from COSMOS within~10 cm depth (b and d). The inserts compare corresponding monthly discharge (e) and areal averaged topsoil moisture from COSMOS (f). Without any calibration, the "Scaled + Boulder Map" case can reproduce the monthly dynamics (i.e., within the standard of NSE > 0.5 and R 2 > 0.5) but not daily dynamics. The other two cases (SSURGO and Scaled) without boulder information cannot reproduce monthly and daily water dynamics adequately, indicating the importance of measured boulder distribution. After calibration, acceptable runs without the boulder data (GR-without boulder) and with boulder data (GR-with boulder) similarly reproduce the daily and monthly dynamics.

Water Resources Research
drier with infrequent rainfall. The model output reproduced monthly and daily response to rainfall events, overcoming the systematic underestimation of discharge and soil moisture observed in the transferability test. A total of 23 (without boulder map) and 10 (with boulder map) acceptable runs out of 500 runs reproduced the data, manifesting model equifinality (Beven & Freer, 2001). The variation in parameters across different runs was used to estimate the uncertainty range. Table 2 and Figure 4a compared the calibrated soil parameters between SH (red) and GR (blue). Each parameter for GR was an area-weighted average of different soil types across acceptable runs. The most notable difference was the higher porosity, smaller α, larger n, and lower D mac at GR. Small α values arose from large pore sizes and large n values described more drainable soils with lower water retention.

Comparing SH and GR 6.2.1. Differences in Soil Parameters
The parameters (Figure 4a) also showed different standard deviations (error bars), where smaller deviation indicates higher sensitivity. Porosity (θ s ) and n were the most sensitive with the smallest acceptable range, indicating these two parameters had to be in a narrow range in order to reproduce daily discharge and soil moisture. Water retention curves depended on α, n, θ r (residual water content), and θ s (saturated water content or porosity). The value of α controlled the position whereas n determined the steepness of the water retention curve (Figure 4b). Flatter curves (larger n) represented highly drainable and sandy soils, and steeper curves represented clayey soils where water does not drain as easily. The macropore depth in SH almost doubled the average of GR, and macropore hydraulic conductivities were surprisingly similar, indicating potentially different causes of macropores. We hypothesize that at SH, vegetation roots, which could go deep, might generate macropores whereas at GR, boulders distributed on the ground surface and at shallow soils might be the predominant contributor to macropore generation. Including the boulder map narrowed the range of macropore parameters and therefore reduced uncertainty but did not change the average soil parameters. Comparing the cases with and without boulders, the explicit expression of the greater macropore fraction in the with boulder case led to a lower D mac . In any case, the no boulder cases reproduced very similar discharge and soil moisture data as the with boulder case, indicating its model parameters effectively represented both soil matrix and macropore characteristics without explicitly including boulder information.  Table 2). The curved dashed lines demonstrate the effects at GR of increasing n from 1.6 to 2.0 and decreasing α from 5.0 to 3.0 m −1 . The range of water content (θ s -θ r ) is a measure of the potentially mobile water storage capacity per unit soil depth.

10.1029/2018WR023736
Water Resources Research Figure 5 shows spatial patterns of saturated water storage at different times. GR has a relatively large riparian zone and shows a persistent water presence under both dry and wet conditions (34.7% of total area). SH has a narrow riparian zone (10.6% of total area), mostly consisting of swales with convergent flow. Under dry conditions, only the riparian area was connected to the stream, whereas under wet conditions some uphill area was also connected to the stream. The riparian zone therefore set the baseline connectivity such that GR rarely fell below 0.2 unless under very dry conditions. Under dry conditions, although the riparian zone at GR had relatively higher water content, these waters were not sufficient to form flow connecting to the stream. This may explain similar minimum connectivities at GR and SH. The riparian zone retained 48% of total water storage at GR annually, compared to 20% at SH. During rainfall events, surface runoff occurred as short-lived pulses followed by subsurface lateral flows that dominated and sustained for longer periods (Figure 5b). The discharge at SH fluctuated more than GR, consistent with lower porosity as indicated by

10.1029/2018WR023736
Water Resources Research the water retention curve (Figure 5b). In both catchments, surface runoff Q S was small but higher in SH than in GR (8.9% vs. 1.2%). This was expected as SH soils are clay rich, and water cannot infiltrate as easily as in the sandstone-derived GR with macropores. Figure 5c showed significantly different connectivities, that is, flashy connectivity in SH compared to GR. GR had a larger normalized I cs /W because of the larger soil water storage capacity and larger riparian zone.
In the models, the precipitation (P) is partitioned into evapotranspiration (ET*) and discharge (Q), and Q is further split into surface runoff (Q S ) and subsurface lateral flow (Q L ) that lumps shallow soil water and some groundwater that contributes to the stream. Further partitioning of ET* (from model output) into E (evaporation) and T (transpiration) indicated that T dominated the ET* term in both catchments at~75% of the ET*. Water balance calculations from model output after 3-year spin-up indicated relatively similar ET*/P (~65%) in GR and SH. This is somewhat surprising as GR is characterized by a lower LAI and vegetation density. The LAI values from MODIS (~1-km spatial resolution and 8-day temporal resolution) in 2015 indicated that the SH LAI on average is~5.4% larger than the GR LAI, although the MODIS resolution was low and may not accurately represent the LAI differences at the two sites. As noted earlier, ET* represents the total water outflow other than discharge (P = Q + ET*), which can include water loss into the regional aquifer that was not counted in the model.

Swap Experiments: Comparing the Effects of Soil Property, Topography, and Size 6.3.1. Soil Saturation and Runoff Ratio
With the same topography and size (paired comparison of Figures 6a and 6b, and 6c and 6d), SH soil held more water in the hillslope, and the water content was higher across a wider area compared to the GR soil. Changing from SH soil to GR soil, the runoff ratio (Q/P) increased, again because of the lower water holding capacity of the GR soil. With the same soil properties, changing from SH to GR topography and size decreased Q/P and increased ET*/P. These trends were generally true for all cases in Table 3. This highlighted that the SH clay-rich soil held more water, reducing Q/P, whereas the steep SH topography and small size increased discharge and Q/P. These two compensating effects (soil vs. topography and size) led to very similar Q/P values between the two real cases SH and GR and between the two hypothetical cases Soil SH TP SH Size SH and Soil GR TP GR Size GR , therefore masking the effects of individual variables.

Storage-Discharge Relationship
The S-Q relationship (equation (1)) and parameters (Table 3) quantified the response of discharge to changing water storage (Kirby et al., 1991;Kirchner, 2009). Figure 7 shows the model output of discharge versus averaged storage for hypothetical catchments. The catchments generally did not produce discharge until a minimum water storage (S min ) was reached. Beyond that, the discharge increased exponentially with storage, until reaching a maximum water storage (S max ). The difference between these storage values quantified the dynamic storage S d . The S d in SH varied from about 0.20 to 0.31 m, corresponding to discharge varying by more than 2 orders of magnitude (~10 −5 − 8.2 × 10 −3 m/day). At GR, water storage changed from 0.23 to 0.39 m corresponding to Q from 1.7 × 10 −4 to 4.4 × 10 −3 m/day. These corresponded to 0.11 and 0.16 m of streamflow-generating S d at SH and GR, respectively. These dynamic storages were much smaller than the S t value of 0.49 m in SH and 0.63 m at GR. Note that S max values in all cases never reached S t of 0.49 m at SH and 0.63 m at GR, indicating the catchments never reached the total storage capacity and never were fully saturated. Comparing across different cases, catchment size had significant effects on S d . The S d in the real SH case had the lowest value of about 0.11 m. Despite the differences in topography and soil properties, all cases with SH size had about 0.11 to 0.13 m, whereas all cases with GR size had S d at about 0.16 to 0.17 m. The results suggested that as catchment size increased, S d also increased due to the expanded area that was connected to the stream (larger I cs ), even though the relative I cs /W decreased with catchment size (Figure 8).
The sensitivity of discharge to storage was prescribed by the b values, or the nonlinearity of the S-Q curve. The red cases with SH soil in Figure 7a had b values at 1.73-1.86 (Table 3), compared to values of 1.34-1.59 in blue cases with GR soil. Catchment size had some impacts on b values but was not as influential as soil properties. Increasing catchment size from SH to GR without changing soil properties and topography decreased b values slightly from 1.79 to 1.73 for SH (from SH to SH-Size GR ), and from 1.50 to 1.34 for GR (from GR to GR-Size SH ). Combining all cases, the soil properties had a predominant control on the nonlinearity of S-Q relationships.

10.1029/2018WR023736
Water Resources Research Figure 6. Water balance (Q/P) and saturated storage (S) distribution in four hypothetical cases where soil properties, topography, and vegetation are uniformly assigned with area-weighted averages. The "Soil SH " and "Soil GR " refer to cases using SH and GR soil properties, respectively; "Tp SH " and "Tp GR " refer to cases using SH and GR topography, respectively; and "Size SH " and "Size GR " refer to cases using their respective sizes. With everything else being the same, the GR soil generates more runoff (Q/P) than SH. The GR topography and size lead to lower runoff due to its large riparian zone and water storage capacity. Note. GR = Garner Run; SH, Shale Hills. The bold rows are for the real SH and GR cases a SH and GR are the calibrated cases using measurements. All other cases are hypothetical. SH-Size GR and GR-Size SH are based on the real SH and GR with heterogeneous distribution of soil but shrunk or expanded to the size of the other catchment. All other cases have uniformly assigned vegetation cover and soil properties based on area-averaged values. b The total water storage S t was estimated as soil depth × porosity θ s . With the area-averaged soil depths of 1.7 and 1.8 m for SH and GR, respectively, the St for GR = 1.8 m × 0.35 = 0.63 m, and 1.7 m × 0.29 = 0.49 m for SH. Note that this differs from the potentially mobile soil water storage directly from the soil depth and mobile porosity (θ s -θ r ), which should be 1.8 m × (0.35-0.04) = 0.56 m for GR, and 1.7 m × (0.29-0.05) = 0.39 m for SH. c The dynamic water storage S d = S max − S min is the same as k 1 in (Kirchner, 2009), corresponding to differences in storage values that produce Q max and Q min . maximum and minimum discharge. The Q ref represents a reference Q between minimum and maximum Q values. d S c is the critical storage value at their corresponding critical connectivity values (I cs /W c in the parenthesis) beyond which discharge increases significantly. For some hypothetical catchments, the critical connectivity cannot be easily observed so there are no values. (Figure 9) 10.1029/2018WR023736 Water Resources Research 6.3.3. Connectivity  Figures 9a and 9b showed significantly contrasting connectivity-storage relationships. SH soil, the connectivity exhibited a pronounced threshold behavior: The connectivity remained low at low water storage until reaching a "critical" storage (S c ) beyond which it increased suddenly and dramatically. All cases with SH soil had S c values between 0.27 and 0.32 m and at corresponding connectivity values between 0.19 and 0.27 (last column in Table 3). Cases with GR soil also showed critical storage values; however, the connectivity tended to "jump" from values at low storage (S < S c ) to values at S > S c , but the connectivity values increased linearly with small variations instead of exponentially (lower σ Ics=W in Table 3). In some cases, no critical storage and connectivity values could be derived. For cases that did exhibit threshold behavior, the S c values were surprisingly similar to those with SH soil cases. However, the corresponding critical connectivity was much higher. Table 3 also indicated that connectivity remained similar or increased when comparing heterogeneously distributed soils to uniform soils with area-averaged parameters. Figures 9c and 9d showed that discharge had a strong dependence on connectivity at low connectivity but tended to scatter at high connectivity beyond critical connectivity values.

Discussion
Topography and soil properties emerge as a result of the lithological starting material and the climate and tectonic forcings (Jenny, 1941). Results from three analyses underscore the predominant control of soil properties, instead of topography, in regulating the nonlinearity of the S-Q relationship. This result contrasts with the original hypothesis that topography plays a more important role. The transferability test cannot reproduce

Water Resources Research
daily or monthly discharge until the boulder distribution is incorporated in the model. The calibration and sensitivity analysis underscore the importance of the parameters describing soil properties. These parameters reflect both soil matrix and macropore/boulder characteristics (Figure 4). The swap experiments further revealed two pronounced observations ( Figure 10 and Table 3). First, differences in soil matrix and macropore properties arising from lithology (shale vs. sandstone) exert a predominant control: GR bouldery soils and large macropores consistently have lower b values, higher Q/P, and higher connectivity, as illustrated in the larger differences in symbol size in cases with different soil properties (vertical direction) (Figures 10a to 10c). Second, catchment size strongly influences S d , changing catchment size from SH to GR size increases S d from~0.12 to~0.17 m, illustrated with solid-line and dashed-line circles in Figure 10d. The Q/P ratios are higher in all GR soil cases compared to their corresponding SH soil cases, indicating that although SH soils generate high flow in large storms (higher flooding tendency), they generate less discharge in the long term (at the annual time scale) because of a larger water holding capacity.

Soil Properties and S-Q Relationship
Clayey and sandy soils release water very differently: Sandy soils with large pores and low clay content release water rapidly, whereas clayey soils with small pores tend to hold water. Soils derived from different lithologies differ not only in clay content, soil aggregation, and spatial arrangement (Du et al., 2016) but also Figure 9. Relationship of normalized integral connectivity scale (I cs /W) with water storage (S) (a and b) and dischargeconnectivity relationship (c and d) for swap experiment cases. All cases with Shale Hills soil have a pronounced threshold behavior where connectivity does not change much until water storage reaches a critical value, beyond which the connectivity increase significantly. The cases with Garner Run soil generally have much more gradual increase in connectivity and do not exhibit threshold behavior, except the case with Shale Hills topography and size.

Water Resources Research
in the amount of colluvial materials and boulders that generate macropores . In both catchments, surface runoff is negligible. The SH soil has been shown to form preferential flow paths at the interfaces of soil horizons and soil-rock contrasts (Jin et al., 2011). These distinctive characteristics possibly lead to different flow generation mechanisms. In GR, abundant colluvial materials and bouldery soil lead to more infiltration and fast flow activation via macropores. In SH, the strong water holding capacity may prevent the formation of preferential flow until after reaching threshold values, beyond which fast flow triggers highly nonlinear, threshold behavior. This finding agrees with threshold behaviors observed when dry-soil barriers are breached (McNamara et al., 2005), when subsurface saturated areas are connected to the trench (Tromp-van Meerveld & McDonnell, 2006), and when threshold soil moisture values are reached (Seeger & Weiler, 2014). Rapid flow activation beyond thresholds has been observed with intensive soil moisture measurements at SH (Lin et al., 2006). These observations related to soil properties have strong implications on how to represent the first-order control of stream flow generation in parsimonious models (Fenicia et al., 2014).
The importance of soil properties emphasized here corroborates findings of related work. Spatial analysis of nested watersheds comparing topography and soil property indicates that transit time characteristics Figure 10. Comparing (a) S-Q nonlinearity b, (b) relative connectivity I cs /W, calculated as averaged normalized connectivity, (c) runoff ratio Q/P, and (d) dynamic water storage S d (or k 1 ) for the 10 simulated cases. The symbols size indicates the magnitude of each quantity. The warm and cold colors represent cases with SH and GR soils, respectively, the same as in previous figures. The dashed line circles filled with lined texture represent cases where only catchment size is changed whereas soil and topography characteristics are the same as the solid line circles with same colors. Symbols size comparison between horizontal cases shows the effects of topography; comparison between vertical cases indicates effects of soil properties (boulder fraction and van Genuchten parameter n). Larger symbol size differences mean more pronounced effects. The figure indicates that steep hills with small riparian zones and clayey soils of high water retention capability (i.e., SH) tend to have small dynamic water storage (S d and smaller and more variable connectivity (b) with higher nonlinearity of S-Q relationships (a). The catchment size influences S d but not as much on b values and relative connectivity.

10.1029/2018WR023736
Water Resources Research correlate better with soil maps than with topographic indices in northern, wet catchments with wetland riparian areas (Geris, Tetzlaff, McDonnell, et al., 2015;Soulsby et al., 2006;Tetzlaff et al., 2009). In another study of 24 mesoscale catchments in Switzerland, only weak correlations were observed between transit time measures and topographic indices (Seeger & Weiler, 2014). The catchment storage derived from mean transit times and mean discharge did not show a clear relation to any catchment properties. Zimmer and Gannon (2018) examined 20 years of daily runoff from 73 regional watershed scale USGS stream gaging sites across North Carolina, United States. Their work suggests that differences in soil and bedrock properties outweigh topographic and climatic differences, because soil-bedrock interfaces and low-hydraulic conductivity layers strongly influence the partitioning of infiltrated water.
Based on an analysis of streamflow generation, Kirchner (2009) proposed that the steepness of the S-Q slope (dQ/dS) or the sensitivity of the S-Q relationship (g(Q)) positively relate to catchment slope (m, topography descriptor) and soil hydraulic conductivity (κ, soil property indicator) and negatively relate to soil porosity (θ, soil property indicator) in the form of g Q ð Þe mκ θ (Kirchner, 2009). From this equation one might infer that soil permeability is the predominant control on streamflow generation because the permeability of sandy soil is typically orders of magnitude higher than for clayey soil, whereas m and θ have a narrow range and do not change as much. In contrast, our results indicate that the van Genutchen parameters (α, n, θ r , and θ s ) and macropore properties may be better measures of water release and flow generation. In addition, θ here may be replaced by the hydrologically responsive S d , instead of total porosity, which includes both dynamic and passive storage. In other words, the correlation equation may need to be modified to take into account of soil and macropore properties topographic features such as the size of the riparian zones.

Dynamic Water Storage, Connectivity, and Catchment Size
Dynamic storage is defined as the portion of storage "that is hydrologically active and directly contributes to streamflow" (Hrachowitz et al., 2016;McNamara et al., 2011) or controls streamflow dynamics (Buttle, 2016). In this work, values of S d (0.11-0.17 m) are much smaller than S t calculated from soil depth and porosity (0.49 m for SH and 0.63 m for GR). This indicates that small changes in soil structure and porosity can have dramatic impacts on S d . The difference of 0.05 m (5 cm) in S d may appear small. It is however 42% and 30% of S d in SH and GR, respectively. The largest storms in Pennsylvania occur at the rate of~5-10 mm/hr. A 5 cm difference in S d can well make the difference between flooding or not.
Water storage estimation remains one of the largest uncertainties in predicting streamflow generation. Dynamic storage calculated from different approaches often yields estimates that differ by up to an order of magnitude (Staudinger et al., 2017). Buttle (2016) used the Kirchner (2009) framework to calculate S d for five drainage basins (~11 to 270 km 2 ). the thick sand and gravel deposits of the Oak Ridges Moraine in southern Ontario, Canada. They did not observe a clear correlation between S d and catchment size but found that S d inversely correlated to the ratio of stream water deuterium variability relative to that of precipitation, suggesting larger S d associated with shorter water mean transit time (Buttle, 2016). Birkel et al. (2011) estimated S d based on water isotope data and modeling for two nested Scottish catchments (3.6 and 30.4 km 2 ). They estimated smaller S d (40 mm out of 500-mm total storage) for the larger catchment and larger S d (55 mm out of 900-mm total storage) for the smaller catchment, respectively. Although carefully chosen for their similarities, the catchments in these studies still vary in soil properties, topography, and land cover, any of which could mask effects of catchment size.
Our results indicate that S d correlates with hydrological connectivity and increases with catchment size (Figure 8). This correlation is hardly surprising: Streamflow is generated when stored water in different parts of catchments connects and forms flow. In fact, hydrologic connectivity between upland and stream has long been considered essential for transmitting water to streams (Bracken et al., 2013;Spence, 2010). The internal catchment structure, that is, the spatial arrangement and size of hillslope and riparian zones, and their connectivity have been shown to exert a first-order control on the activation of fast flow paths and release of old water (Stockinger et al., 2014) and the magnitude and timing of water and solute release (Jencso et al., 2009;Jencso et al., 2010;Nippgen et al., 2015). In the swap experiments where catchment size changes whereas the topography and soil properties remain constant, the runoff generating area that connects to the stream inevitably changes. Larger catchments offer a larger zone of source water that dynamically connects to the 10.1029/2018WR023736 Water Resources Research stream, as indicated by the increasing I cs in Figure 8 and Table 3 when the modeled catchments increase in size from that of SH to GR.

Lack of Model Transferability Suggests Challenges in a PUB Framework
Model transferability from SH to GR did not work well even though the two catchments are both first order and lie within a few kilometers of one another. In fact, the models required detailed information about the boulder distribution to reproduce monthly discharge because the boulder significantly affects macropore distribution. The importance of soil properties has also been emphasized in other parameter sensitivity studies using the Observing System Simulation Experiments . Other published transferability studies have shown that availability of a soil map can guide a priori parameterization and improve model performance (Smith et al., 2004). They also showed that parameter transfer between neighboring catchments is the most unsatisfactory for catchments with differing lithology; transfers between nonneighboring catchments with analogous soils and with different vegetation were observed to be more successful (Heuvelmans et al., 2004). Hydrologic models using geology-based HRUs have been found to capture the spatial variability of streamflow time series better than those using topography-based HRUs (Fenicia et al., 2016). Together, these studies suggest potentially higher parameter transferability between catchments with relatively similar soil properties.
Insights gained from these studies suggest that measurements of soil properties are crucial for successful model transfer. Among the many variables that govern streamflow generation, information on climate, topography (relief, riparian zone, and hillslopes), and land cover (organisms) has become increasingly available from satellites and other observation instruments even in remote areas (McCabe et al., 2017). Below ground characteristics such as soil depth, boulder spatial distribution, and water retention properties, however, cannot be observed directly using similar large instruments. In situ conductivities and their spatial variation are challenging to obtain and often need to be inferred from tracer tests Kuntz et al., 2011;Li et al., 2010). Their measurements are often local, at scales much smaller than the relevant scale for streamflow prediction. The boulder information in this work is an idiosyncrasy of the site and is available for this work only because of another study. Such information however is not available for most watersheds. Current practices in assigning soil property parameters typically use pedotransfer functions in combination with soil properties obtained from databases such as SSURGO . In fact, generating the boulder map involves intensive field mapping and thus is about the furthest from the ungauged basin that one could possibly envision. It suggests that unless we have a good grasp of the subsurface properties, we may have little chance in making reasonable predictions in ungauged basins.
Our work therefore highlights a general need to understand lithology and parent materials in determining soil properties. It is impossible to measure soil properties everywhere. We need methods to predict subsurface properties based on a small number of measurements that are easy to assess. In fact, it may be possible that soil formation models developed from simulating weathering processes can eventually allow meaningful prediction of soil properties that can be used for predicting streamflow dynamics (Heidari et al., 2017;Lebedeva et al., 2010). With such efforts, the need to measure soil properties everywhere might be avoided.

Model Limitations and Strengths
McDonnell (2003) argued that it may be more fruitful to examine the common characteristic forms of nonlinearity and feedbacks than characterizing the peculiarity of specific catchments. Generally speaking, however, field studies of catchment-scale emergent dynamics and catchment classification are challenging because of the confounding nature of multiple factors, and the large data sets required for parameterization. This is one reason for the popularity of experiments that incorporate paired catchments where some variables are held the same. Physically based models and carefully thought virtual experiments may help circumvent such limitations. Spatially explicit hydrology models typically require a large parameter set with large uncertainty, leading to issues with respect to equifinality (Beven & Freer, 2001). A good example of this is the many cases that can reproduce data shown in this work. Despite these limitations, spatially explicit, process-based models have the advantage of generating "digital" catchments in an era with exponentially growing Earth surface data and investigating the convoluted mechanisms of multiple processes using numerical experiments. The thought (swap) experiments here, by systematically controlling variables via a hypothesis testing approach (Clark et al., 2011), offer mechanistic insights and quantify the relative effects 10.1029/2018WR023736 Water Resources Research of interdependent variables. In particular, the swap experiments enable the assessment of topography and catchment size-two variables that are often held constant in typical sensitivity analyses.
Given spatially and temporally sparse field measurements, these models can also guide when and where critical measurements are needed. Data-informed numerical experiments could provide an alternate, cost-effective approach to test hypotheses against field data (Fatichi et al., 2016;Li et al., 2017;Weiler & McDonnell, 2004), especially where field data are collected systematically for catchments chosen on the basis of important variables such as climate, biota, relief, and parent material. They can also be used to "discover" general principles across space and time, therefore complementing a growing body of literature using statistical approaches and cross-site synthesis (Jasechko et al., 2016;Jasechko et al., 2017;Kuentz et al., 2017;Wagener et al., 2010). This will ultimately facilitate catchment classification that will lead to better hydrologic theory for PUB (Hrachowitz et al., 2013;McDonnell & Woods, 2004;Wagener et al., 2007).

Conclusions
This study examined the emergent dynamics of streamflow generation including the nonlinearity of S-Q relationships, connectivity, and dynamic water storage in two first-order monolithological catchments: the shale-underlain SH (0.08 km 2 ) and the sandstone-underlain GR (1.34 km 2 ). Both catchments are located in the temperate climate in central Pennsylvania, USA. SH has steeper slopes and a narrow riparian zone (~10% of catchment area) whereas GR is not as steep and has a relatively large riparian zone (~34%). We tested the hypothesis that the influence of topographic characteristics (a flatter slope, longer slope length, larger riparian zone) is more significant than that of soil properties and catchment size, leading to a dampened streamflow response and a more linear S-Q relationship at GR compared to SH. The hypothesis was tested combining streamflow, soil moisture data, and a spatially explicit, process-based model Flux-PIHM. Three lines of analyses, including model transferability tests, calibration and parameter sensitivity analyses, and swap experiments, all led to the rejection of the original hypothesis. These results underscore the predominant role of soil properties.
The transferability test showed that direct transfer of calibration information from SH to GR in models with catchment topography (digital elevation) data reproduced the monthly trend but overestimated daily discharge peaks and underestimated monthly discharge and topsoil moisture. Only when the boulder map from the field survey was incorporated did the SH calibration information used for GR predict monthly discharge at GR because the boulders enhanced infiltration and accelerated water release to the stream ( Figure 3). Thus, the role of topography was small compared to that of soil properties. If the topography was the primary control, the model calibration information would have been directly transferable from SH to GR because topography was explicitly represented in each of the model domains. Model calibration and sensitivity analysis for GR (500 simulation runs) identified important soil parameters including soil matrix and macropore properties that reflected boulder characteristics (Figure 4). The analysis indicated that the GR soil has higher porosity, lower water retention (van Genuchten n), higher water storage capacity (θ s ), and shallower macropore depths compared to the SH soil. Comparison between the two catchments indicated more flashy streamflow response and hydrological connectivity variation in SH than GR ( Figure 5).
Swap experiments comparing cases disentangling the effects of interdependent variables (topography, soil properties, and size) accentuated the significance of clayey soils in holding water (Figures 6 and 7). In addition, they revealed two overarching observations ( Figure 10 and Table 3). First, soil properties exerted a predominant control: cases with GR bouldery soil consistently had lower b values, higher Q/P, and higher connectivity. Second, dynamic water storage S d increased with catchment size. For example, an increase in size from that of SH to GR resulted in an increase in S d from about~0.12 to~0.17 m, largely because of the higher connectivity related to a larger area connected to the stream in the larger catchments ( Figure 8). Cases with GR soil had more linear S-Q relationships without pronounced threshold behavior: in some cases, critical connectivity could not be identified (Figure 9). The Q/P ratios were higher in all GR soil cases compared to their corresponding SH soil cases, indicating that although SH soil tended to have high flow in large storms (more flooding tendency), it generated less discharge in the longer term (i.e., years) due to its large water holding capacity.

Water Resources Research
These results underscore the importance of soil properties as compared to topography. This predominant role of soil properties presents a grand challenge for PUB, as soil properties are more challenging to measure and are not as readily available as earth surface characteristics such as topography and land cover. This work also illustrates that controlled virtual experiments (comparisons between model outputs for different catchment characteristics) can systematically disentangle convoluted effects of interdependent variables and can offer mechanistic understanding of catchment-scale emergent dynamics.