Defining Robustness, Vulnerabilities, and Consequential Scenarios for Diverse Stakeholder Interests in Institutionally Complex River Basins

The Upper Basin of the Colorado River in the southwestern United States supports municipal, industrial, agricultural, and recreational activities worth an estimated $300 billion/year within the state of Colorado alone. The allocation of water to these activities is fundamentally shaped by water rights that in turn distribute risks among a diverse suite of sectors and stakeholders. In this study, we assess the vulnerabilities faced by the hierarchy of hundreds of water users in the portion of the Upper Basin within the state of Colorado, as a result of changes in hydrologic extremes, demand growth, and institutional and physical infrastructure in the basin. We also determine how robust the different users are to these potential changes, examining how their sensitivity to these factors depends on the magnitude and frequency of shortage they are unwilling to exceed. This advances previous robustness evaluation methods by formalizing an a posteriori exploration of alternative definitions of robustness tailored to each user's unique context. Our analysis reveals that robustness varies significantly not only across users but also across different definitions of acceptable performance for each user. We further show the importance of using scenario discovery to evaluate how the factors that drive consequential scenarios differ depending on the definition of robustness. Our results highlight the need for robustness and vulnerability frameworks to avoid broad categorical aggregations of stakeholder groups, to carefully capture complex institutional dependencies (e.g., water rights), and to facilitate a more transparent illustration of the implications of alternative definitions of robustness for specific stakeholders.


Introduction
Water resources systems are inherently complex human-natural systems, characterized by multisector interactions between the environment, physical infrastructure, and societal needs (Clarke et al., 2018;Moss et al., 2016). The availability and provision of water to its multiple uses is shaped by the human systems (water institutions, societal values, and infrastructure requirements) as well as by their natural contexts (hydrology, climate, and ecosystems). In institutionally complex river basins, the institutions governing water (e.g., systems of allocation and legal rights) manage its dynamics and, by extension, distribute the risks and vulnerabilities innate to increasingly variable extremes (Hall et al., 2014). This is especially true for the American West, where the doctrine of "prior appropriation" fundamentally shapes water resources allocations, vulnerabilities, and dynamic responses to hydrologic extremes. Prior appropriation distributes resources across water users in order of priority ("first-in-time, first-in-right") and limits the decreed volume of water of each right to the amount that could be put to a "beneficial use" (Kenney, 2005). The seniority of water rights hierarchically defines the relationships between stakeholders in these systems, where the senior water rights holders fundamentally shape the dynamics of water entitlements for more junior water rights holders.
In Colorado, rapid population growth and increasing demands are stressing the state's water supplies (State of Colorado, 2015). Climate change is expected to increase the strain on supply with rising temperatures, earlier peak runoff, and reduced water availability (Chavarria & Gutzler, 2018;Fyfe et al., 2017;Solander et al., 2017). Yet, while the direction of these temperature-induced impacts is understood, the complex feedbacks of uncertain global carbon emissions on precipitation make the magnitude of hydrologic impacts from climate change far more difficult to project (Elias et al., 2015;Lehner et al., 2019;Vano et al., 2014). These drivers are examples of what some have referred to as "deep uncertainties" (Knight, 1921;Polasky et al., 2011;Walker et al., 2003). More formally, deep uncertainties can be defined as component factors within a system for which expert opinion cannot know or cannot agree on the full set of outcomes and their associated likelihoods (Kwakkel et al., 2010;Lempert, 2002;Lempert et al., 2003). The challenges and potential high consequences of deep uncertainties have motivated a rapidly growing body of literature focused on bottom-up approaches, which employ exploratory modeling (Bankes, 1993) to simulate large ensembles of plausible scenarios and thereupon identify those with consequential effects on the system (see Dittrich et al., 2016;Herman et al., 2015;Kwakkel & Haasnoot, 2019;Moallemi et al., 2020, for recent reviews of these approaches). In this manner, exploratory modeling shifts the focus from predicting future conditions to discovering which future conditions lead to (un)desirable consequences. The relative likelihoods or decision relevance of the scenarios that lead to these outcomes can then be evaluated by the relevant decision makers a posteriori in the richer context of knowing their potential consequences (Dessai et al., 2009).
In bottom-up methods, the identification of consequential scenarios (also known as scenario discovery) presupposes a metric-based classification of the acceptability of system performance, most commonly by exploiting one or more candidate measures of robustness. Common families of robustness measures include expected value metrics (Wald, 1950), regret-based metrics (Savage, 1951), or satisficing metrics (Simon, 1956). Recent literature has highlighted that there can be substantial implications and insights when choosing among different candidate robustness metrics (Herman et al., 2015;Maier et al., 2016;McPhail et al., 2018), with the shared conclusion across these works that each different metric choice reflects potentially very different aspects of what makes the system robust. These studies, however, do not address the underlying complexities associated with better resolving highly diverse multistakeholder perspectives. In multistakeholder systems, the choice of robustness measures bears the additional burden of promoting agency in individuals' choices while also clearly mapping what makes a system robust for whom. Highly aggregated system-wide robustness measures potentially ignore asymmetries in values and adaptive capacities across individual actors in the system (Franssen, 2005). These concerns are especially relevant for water resources systems supporting demands across multiple sectors, such as agriculture, manufacturing, power generation, and recreation (Kasprzyk et al., 2016). Relatively few recent studies have focused on clarifying the implications of multistakeholder trade-offs in robustness (Gold et al., 2019;Herman et al., 2014;Trindade et al., 2017) to actively reduce inadvertent consequences for particular actors. Moreover, these studies are limited in their focus on a single sector (i.e., water supply) and only four significant actors.
In more institutionally complex contexts, such as the Upper Basin of the Colorado River, bottom-up vulnerability assessments should resolve water rights at a high resolution, especially when assessing how particular users are affected. The Colorado Water Conservation Board (CWCB) and the Division of Water Resources (DWR) have jointly developed Colorado's Decision Support System (CDSS), a collection of databases, data viewing and management tools, and models to support water resources planning in Colorado's major water basins (Malers et al., 2001). The CDSS is made up of a central database with water resources data (HydroBase), a public website where the data can be accessed, a Geographic Information System (GIS) for viewing and analyzing the data, and a consumptive use model (StateCU) that estimates consumptive use by each irrigation unit in a basin. The outputs from StateCU are then input to the State of Colorado's Stream Simulation Model (StateMod), a generic network-based water system model for water accounting and allocation, and the final component of CDSS. StateMod was developed to support comprehensive assessments of water demand, allocation, and use, as well as reservoir operations. It represents all of the major subbasins within the state of Colorado (i.e., Parsons & Bennett, 2006;White, Yampa, Upper Colorado, Gunnison, Dolores, San Juan, and San Miguel CWCB, 2012). StateMod replicates each basin's unique application and enforcement of the prior appropriation doctrine and accounts for all of the consumptive use within the basins. To do so, it relies on the detailed historic demand and operation records contained in HydroBase that include individual water right information for all consumptive use, data on water structures (wells, ditches, reservoirs, and tunnels), and streamflow data. Further, StateMod uses irrigation consumptive use data output from StateCU, which calculates water consumption based on soil moisture, crop type, irrigated acreage, and conveyance and application efficiencies for each individual irrigation unit in the region. These detailed, fine-scale inputs allow StateMod to resolve the effect of all users and water rights on water availability, which can in turn be used to simulate hypothetical scenarios to assess the impacts of changes in hydrology, water rights, or infrastructure on any and all represented water rights.
Both the DWR and the CWCB, as well as several other water resources entities, have been actively using StateMod to model and assess impacts to water users by changing water demand, allocation, and operations since the late 1980s (Parsons & Bennett, 2006). StateMod has been used to support informed planning for the basin including studies on annual consumptive water use, as well as the possible system impacts of having to meet interstate river compact obligations under potential changes in stream flows, reservoir storage capacity, and reservoir operations. StateMod has also been instrumental in the development of the Colorado River Water Availability Study (CWCB, 2012), investigating how climate change may affect water availability at the water user and water rights level, as well as reservoir use and operations in the four major tributary river basins of the Colorado River in the state of Colorado. The study was considered at the time the "most detailed look at how specific water users may be impacted by climate change performed anywhere in the world" (CWCB, 2012).
The Upper Colorado basin within the state of Colorado (henceforth abbreviated to UCRB) is the largest of the five basins, containing the headwaters of the Colorado River, and is the basin of interest in this study. The outflow of this subbasin of the Upper Basin of the Colorado River flows into Utah and continues to deliver water downstream to Lake Powell. The Colorado River Compact allocates 9.3 km 3 (7.5 million acre feet) per year to the Upper Basin states (Colorado, Utah, Wyoming, New Mexico, and Arizona), and the state of Colorado is allotted 51.75% of that amount. The UCRB contains several thousand water rights for agriculture, municipal water supply, and recreational uses, power generation, and fisheries. The basin lies west of the Continental Divide so major diversions of water are necessary to support urban centers on the east slope, where most of Colorado's population lives. Amid a historic 20-year drought (Rhee et al., 2018;Schwartz, 2019;Udall & Overpeck, 2017), state-level planning efforts must carefully balance the increasing potential for resource competition across users while accounting for potential vulnerabilities to climate change, demand growth, and infrastructure changes. In this paper, we perform a bottom-up water-supply vulnerability assessment for hundreds of stakeholders in the UCRB, seeking to identify the factors that most significantly shape their ability to meet their water demands. To do so, we pair StateMod with an exploratory modeling approach, through which an ensemble of scenarios reflecting potential changes in hydrologic conditions, evaporation, water demand, infrastructure, and water rights are simulated. We investigate how these changes manifest through the system to impact its users and how these impacts are shaped by the presence of strong institutional governance and intricate allocation-priority relationships.
The objectives of this paper are therefore the following: (1) perform a bottom-up vulnerability assessment of the diverse water users in UCRB, resolving their water shortage vulnerabilities resulting from an ensemble of deeply uncertain hydrologic, social, and institutional changes; and (2) evaluate their robustness for a spectrum of potential definitions of vulnerability to inform scenario discovery that identifies the most consequential scenarios shaping users' water shortages. This study advances previous applications of the Robust Decision Making (RDM) framework (Lempert et al., 2006(Lempert et al., , 2010 by formalizing an a posteriori exploration of alternative history-informed performance preferences for large multiactor systems that clarifies the implications for defining robustness and the identification of key factors that control vulnerabilities.

Study Area and Model
The UCRB spans 25,682 km 2 (9,915 mi 2 ) in western Colorado, from its headwaters at the Continental Divide to its outflows located at the Colorado-Utah state line (Figure 1). The following description of the study area will be centered around the main uses of water in the basin, as well as several specific elements that were deemed to be of particular interest by the CWCB and other stakeholders. The area supports a moderate but growing population of approximately 300,000 inhabitants, with the major water use in the basin being irrigation. Several thousand irrigation ditches divert water from the main stream and its tributaries to irrigate approximately 1,012 km 2 (391 mi 2 ) in the basin. Eighty percent of Colorado's population resides on the east slope, which necessitates major transbasin (and transmountain) diversions from the UCRB. These diversions amount to approximately 567,400,000 m 3 (460,000 acre feet) a year and export water for irrigation, industrial, and municipal use in northern and eastern Colorado, including the cities of Denver and Colorado Springs (State of Colorado, 2015). The major tunnels serving these diversions are indicated in Figure 1.
The other major water use in the basin is for power generation, with the Shoshone Power Plant, the Grand Valley Power Plant, and the Molina Power Plant being the primary power producers. Water for power generation is generally nonconsumptive and does not significantly deplete the water resources in the basin. Of note, the Shoshone Power Plant (indicated in Figure 1) owns one of the most important water rights in the basin, in terms of size and seniority. The oldest right owned by the plant dates back to 1902 with a large decree of 39.40 m 3 /s (1,391 cf/s), representing approximately 68% of the median flow at that location (USGS, 2019;Yates et al., 2015). Generally, when the Shoshone Power Plant calls their decreed water right, upstream junior reservoirs, transbasin diversions, and irrigation must either cease or offset their diversion and storage to satisfy the call. Given that the Shoshone Power Plant use is largely nonconsumptive, a majority of the water is immediately returned to the stream, sustaining an important part of the downstream recreation economy (rafting, kayaking, and fishing). A key aspect of the Shoshone Power Plant call is that the decreed water needs to be put into beneficial use. The 100-year-old plant had to shut down for repairs and maintenance in 2007 and in 2019 (Gardner-Smith, 2019;Proctor, 2008), and when not operating, its right does not have to be honored. This can significantly affect downstream users that benefit from the water called by the plant, as well as the upstream projects that could potentially use the water if their water right priority allows. For these reasons, the potential effects of its nonoperation were included in our study, since it could impact local stakeholders.
Another notable feature is a 24-kilometer river segment, referred to locally as "the 15-mile reach," extending from Palisade to the confluence of the Colorado River with the Gunnison River in Grand Junction ( Figure  1). This reach is considered critical for the recovery of four endangered fish species: the razorback sucker (Xyrauchen texanus), the Colorado pikeminnow (Ptychocheilus lucius), humpback chub (Gila cypha), and bonytail (Gila elegans) (IUCN, 2012a(IUCN, , 2012b(IUCN, , 2012a(IUCN, , 2012bUSFWS, 2012b). As a result, the U.S. Fish and Wildlife Service has made recommendations with regard to necessary flow in this reach, depending on annual hydrologic conditions (USFWS, 1999; a more detailed discussion of these recommended flow rates is presented in section 4). Following these recommendations, state agencies and other stakeholders have been particularly interested in the effects of the seniority of the flow right at this location.
All of the critical river segments as well as the consumptive and nonconsumptive water users are represented within the UCRB StateMod platform. The model simulates stream flow, stream diversions, instream flow demands, and reservoir operations as determined by the basin's hydrology, the decreed water rights in the geographical area of the basin, and their associated structures and operating rights. Each water right is defined by an administration number representing its seniority, a decreed water flow, and a stream location. The model then resolves river flow and operations for each (monthly) time step in order of water right priority. All consumptive use of water that takes place in the basin is accounted for and represented. For reasons of practicality, the model does not represent the several thousand diversions that take place in the basin explicitly but does represent 75% of all use in the basin in strictly correct locations. The remaining structures are grouped into aggregates based on use location, tributary boundaries, and instream flow reaches. As a result, the UCRB model includes over 400 explicit structures and 65 aggregate structures, illustrated as points in Figure 1. Likewise, the UCRB model does not explicitly represent every single reservoir and stock pond existent in the basin; only those with decreed capacities of above 4,934,000 m 3 (4,000 acre feet) are represented. The reservoirs explicitly represented at strict locations (18 in total) make up 94% of the total storage capacity in the basin. The remaining storage capacity is represented using ten aggregate reservoirs and one aggregate stock pond. Additional information on how the aggregation of structures was performed during model development can be found in the manual for the UCRB model (CWCB & CDWR, 2016).
Demands at these structures are defined by best-estimate levels of irrigation, population, and reservoir capacity up to 2010. Irrigation structure water demands are not limited by water right but are set to the irrigation water requirement for the specific time step divided by the historical efficiency for that month of the year. Historical efficiencies reflect irrigation practices and application methods, conveyance losses, and timing of the diversion (e.g., early versus late season). Municipal structure demands and transbasin diversion demands are based on average diversions per month of the year, during the period of 1998-2005. The 15-mile reach is represented by a minimum streamflow demand that varies monthly and annually based on hydrologic conditions. Reservoir demands are modeled using minimum and maximum reservoir target storage limits. Maximum targets for reservoirs that serve primarily agricultural and municipal diversions are set to capacity. Maximum targets for reservoirs that include operations for hydropower or flood control purposes are set to operational targets according to rule curves provided by the reservoir operating agencies. Additional information on how historical water demands were estimated and represented for all consumptive use in the basin can be found in the model's manual (CWCB & CDWR, 2016).
When simulating water allocations and reservoir operations, the model needs to first represent natural flows-the flow that would have been in the stream had none of the diversions and reservoir operations taken place. Natural flows are estimated by superimposing historical diversion data, end-of-month contents of modeled reservoirs, and estimated consumption and return flows on historical streamflow data from U.S. Geological Survey (USGS) gauges. For the ungauged locations represented in the model, flow is distributed using proration factors that account for the fraction of the drainage area contributing to each gauged natural flow. The record represented in the model spans water years 1909-2013, which includes extended dry and wet periods, as well as years with high runoff and extreme drought. The term water year refers to the 12-month period starting on October 1st and ending on 30 September the following year.
Water year 2002 (spanning October 2001 to September 2002) is a recent severe drought period with exceptional water shortages across the state of Colorado (Pielke et al., 2005). Figure  , and (f) 24.95 m 3 /s (64.67 × 10 6 m 3 /month). User (a) is located upstream in the basin, diverting water from Blue River. User (b) is located downstream in the basin, diverting water from Roan Creek in Garfield County, and users (c, d), also located downstream, are diverting from Lower Colorado River. The transbasin diversion (e) is located midstream, diverting from Eagle River. Finally, the 15-mile reach (f) is located in the Lower Colorado River (also indicated in Figure 1). The experienced shortages throughout the record and particularly during 2002 indicate large differences in how shortages are experienced by users in the basin with regard to their frequency and magnitude. Specifically, panels (a)-(c) show three irrigation users at three different levels of right seniority, with the most senior user (a) experiencing shortages at most three months out of a year and the most junior user (c) experiencing shortages for up to 11 months (indicated by the percentile of magnitude per year). The shortages experienced during the 2002 water year appear to be some of the most volumetrically severe in the record for the median-and junior-right irrigation users (b and c), the large-decree irrigation user (d), and the 15-mile reach (f). However, the senior-right irrigation user (a) experienced severe shortages for only two months, and they were an order of magnitude lower than the user's worst recorded shortages. The transbasin diversion (e) did not experience shortages during that year. These data suggest that these users are vulnerable to different drought characteristics. Resolving the relative severity of a particular drought for each user requires a careful accounting for how their demands and place in the basin's rights hierarchy amplify or modulate shortages of varying timing, duration, frequency, and magnitude.

Experimental Design
This study contributes a bottom up exploratory assessment for how vulnerable UCRB users are to changes in droughts, demands, institutional rights, and physical storage. Individually and collectively the changes in these factors are considered deeply uncertain. This experiment identifies which of these factors are most critical for different users through a three-step process: (1) generating alternative "states of the world" (SOWs) represented by different combinations of these uncertain factors and 10 streamflow realizations for each SOW, (2) simulating StateMod over these alternative realizations to compute different users' shortage distributions in each world, and (3) using alternative definitions of robustness in combination with scenario discovery to determine which combinations of changes in the uncertain factors lead to poor performance for different users. Our experimental design is presented in Figure 3, using notation consistent with McPhail et al. (2018). Panel I illustrates Step (1): We generate 1,000 samples of uncertain factors ( ), representing the set of SOWs, each producing ten streamflow realizations (s).
Step (2) is illustrated in Panels II and III. The performance of the system is evaluated for each realization using StateMod (Panel II) producing shortages for all users (U) in the basin (f(U, s)). Panel III shows how f(U, s) is a set of performances for each user (u) across all realizations (f(u, S)).
Step (3) is illustrated in Panels IV and V. The performance for each user Figure 3. Experimental design of this study. Panel I shows how the samples of uncertain factors ( ) relate to the generation of streamflow realizations (s), that is, 10 streamflows for each sample . The performance of the system is evaluated for each realization using StateMod (Panel II) producing shortages for all users (U) in the basin (f(U, s)). The performance for each user (u) across all realizations (S) is then represented by f(u, S) (Panel III). Panel IV shows how that performance is classified using alternative satisficing thresholds t, resulting in an individual robustness value R for each user and satisficing threshold (Panel V). This figure follows the same mathematical notation for robustness analyses as defined by (McPhail et al., 2018).
is classified using alternative satisficing thresholds t (Panel IV), each resulting in an individual robustness value R for each user and satisficing threshold (Panel V). Table 1 lists the 14 deeply uncertain factors that are sampled in this experiment: eight hydrologic parameters, three demand parameters, and three parameters related to environmental, structural, or institutional changes. To capture changes in the hydrologic factors, a two-state Gaussian Hidden Markov Model (HMM) is used, as detailed in section 3.1.1. A Latin Hypercube sample (McKay et al., 1979) of 1,000 parameter combinations (SOWs) is generated across the ranges given in Table 1 assuming uniformity and independence. This is represented by set in Figure 3. While these parameters are likely correlated and not uniformly distributed in reality, this bottom-up analysis is simply intended as a screening analysis that identifies consequential parameter combinations. As such, the ranges assumed for these factors are intentionally wide for exploratory purposes. Given the complex and high-dimensional nature of the vulnerability assessment, our experiment shifts focus from predicting the likelihood of a given SOW to instead characterizing which of those SOWs are consequential to users. Users can then use the dimensionally reduced consequential scenarios in an a posteriori assessment of which changes are of most concern. We have grounded our experimental bottom-up sampling design in detailed recent assessments and in collaboration with basin management experts. We present the details of our sampling and analysis steps below.

Alternative Hydrologic Conditions
Managing climate variability and its socioenvironmental impacts is a familiar challenge to water users in the UCRB. The region is currently in the midst a nearly 20-year drought in which streamflows have decreased despite increased precipitation, as increasing temperatures have resulted in greater evapotranspiration and earlier snowmelt, reducing summertime flows (Xiao et al., 2018). Stakeholders in the basin are beginning to wonder if this is the "new normal," or worse, only the beginning of a downhill trajectory. These concerns are unfortunately consistent with many regional climate projections that on average predict decreased deliveries to Lake Powell at the outlet of the entire Upper Colorado River Basin (Christensen & Lettenmaier, 2006;Christensen et al., 2004;Rasmussen et al., 2014). However, there is great variability across these projections, making the mean an inappropriate projection that masks the true deep uncertainty (Harding et al., 2012). This variability is as great from decade to decade within individual general circulation model (GCM) runs as across them (Harding et al., 2012). Such variability may be the result of significant climatic persistence in the basin leading to long wet periods and long dry periods, such as the current bidecadal drought. Paleoreconstructions in the basin suggest these decadal, or even multidecadal droughts, are not uncommon in the UCRB (Ault et al., 2013(Ault et al., , 2014Woodhouse et al., 2006).
Given that extreme hydrologic variability and persistence has been observed historically and is projected in the future, our exploratory analysis exploits a synthetic streamflow generator that can explicitly model the effects of potential changes in drought frequency, severity, and persistence. Following Bracken et al. (2014), we capture these characteristics with a two-state Gaussian HMM fit to the log-space natural annual flows at the last node in StateMod, the CO-UT state line. This model consists of two "hidden" climate states representing wet and dry conditions, respectively. HMMs can have more states, but Bracken et al. (2014) found two to be optimal in this system. Within each of these states, log-space flows are generated from Gaussian distributions with different parameters. We observe the flows generated from these distributions, but not the states themselves, hence the term "hidden." Fitting the Gaussian HMM requires estimation of the mean and standard deviation of each of these Gaussian distributions, as well as the probabilities of transitioning between wet states and dry states. Changes in these transition probabilities can capture changes in drought persistence, changes in the mean and standard deviation of the distributions can capture changes in severity, and changes in both sets of parameters can collectively capture changes in drought frequency. Bracken et al. (2014) show that these parameters, in particular the transition probabilities, are nonstationary and can be estimated as a function of large-scale climate phenomena such as the Pacific Decadal Oscillations (PDO) and El Niño-Southern Oscillations (ENSO). Thus, by changing these parameters in our scenario discovery experiment, we can determine how sensitive users are to changes in these large scale climate phenomena.
Given known nonstationarity in hydrologic conditions in the UCRB, we fit the Gaussian HMM to only the last two thirds of the 105-year record (70 years) to capture recent conditions. This is also consistent with the 64-year records that have been used for future climate simulations in the basin (CWCB, 2012). The parameter estimation for the Gaussian HMM was performed using Expectation-Maximization with Python's hmmlearn package (Lebedev, 2015). The parameter estimates are reported in supporting information (SI) Table S1 and the Gaussian fits are shown in SI Figure S1. Note that the parameters reported in Table S1 10.1029/2020EF001503 are based on fitting the Gaussian HMM to the log of annual flows in m 3 but that the multipliers in Table 1 are applied to parameters estimated from fitting the Gaussian HMM to the log of annual flows in acre feet. Across the Latin Hypercube sample, this results in real-space multipliers of about 0.73-1.37 for the dry-state mean, 0.64-1.54 for the dry-state standard deviation, 0.72-1.39 for the wet-state mean, and 0.63-1.58 for the wet-state standard deviation. A visual depiction of how these streamflow parameter ranges map to changes in the real-space annual flow distribution is provided in SI Figure S3.
State identification of wet and dry years, shown in Figure S2, was performed using the Viterbi algorithm with the same Python library. As can be seen from this figure and the transition probabilities in Table S1, there is considerable persistence in the system under recent historical conditions, with the probability of wet-to-wet and dry-to-dry transitions ranging from 65-68%. More details on the fitting method and validation of the generator are provided in the SI.
For our exploratory modeling experiment, we synthetically generate log-space annual flows at the last node from a two-state Gaussian HMM for parameterized changes from the historical record by modifying the historical transition probabilities with delta shifts and the historical means and standard deviations with multipliers over the ranges in Table 1. The log-space annual flows are converted to real space and temporally downscaled to monthly flows using a modification of the proportional scaling method used by Nowak et al. (2010). First, a historical year is probabilistically selected based on its "nearness" to the synthetic flow at the last node in terms of annual total. The daily hydrograph observed at that site in the selected year is then altered to model earlier, dissipated peaks, controlled by a peak shift parameter listed in Table 1. This change is intended to capture the effects of increasing temperatures, exacerbated by dust on snow, on reducing snow cover duration by several weeks and bringing peak runoff earlier in the year (Livneh et al., 2015;Painter et al., 2012;Skiles & Painter, 2019). The shifted monthly flows at the last node are then downscaled to all other nodes using the same ratios of monthly flows at the upstream nodes to the last node as in the historically selected year. A detailed explanation of the methods used for the seasonal shift, as well as validation of the generator's ability to capture spatial correlation with this approach, is provided in the SI, with links to the code for the synthetic streamflow generator.
Combining changes sampled in our 1,000 SOWs across all of these seven streamflow parameters, we observe a wide range of streamflows beyond what has been observed historically. For every 1 of the 1,000 parameter samples ( ), 10 realizations of synthetic 105-year streamflow records are generated (s) using that SOW's streamflow parameters, resulting in ten thousand 105-year sequences making up the entire ensemble S (Figure 3). Figure 4 shows the ranges of (a) monthly and (b) annual flows at the last simulated node in the basin across these 10,000 simulations compared with other observed and synthetic series. The range of historical flows is shown in light red, with the extreme drought year of 2002 indicated in dark red. Shown in dark cyan are synthetically generated flows for 10 sequences of 105 years using the stationary parameterization of the HMM, fit to the last 70 years of the historical record. Shown in green and yellow are 112 and 97 projected streamflows associated with downscaled and bias-corrected precipitation and temperature projections from the Coupled Model Intercomparison Project 3 (CMIP3) and 5 (CMIP5), respectively. These scenarios were developed through the Colorado River Water Availability Study (CWCB, 2012). In that study, the downscaled projections were used as forcing to the Variable Infiltration Capacity (VIC) model to quantify the effect of climate changes on naturalized streamflows in the basin. Differences between VIC streamflow projections under historical and projected climates were applied to natural historical flows to create the scenarios run through StateMod, which are what is plotted in Figure 4. Finally, the flows shown in gray in Figure 4 are the synthetically generated streamflows across all SOWs used in this study. The ranges of the parameters sampled were chosen so that the resulting flows would span beyond both the (a) monthly and (b) total annual flows of historical, stationary synthetic, and CMIP-projected flows, both for wetter and drier conditions.
The final climate-related parameter sampled in our scenario discovery experiment is the reservoir evaporation rate, which is expected to increase with rising temperatures. StateMod represents reservoir evaporation using net average monthly evaporation rates (feet/month) assigned to each modeled reservoir. Historical net annual evaporation rates were estimated by subtracting weighted average effective precipitation from estimated gross water surface evaporation. The annual estimates of evaporation were distributed to monthly values based on the elevation of each reservoir. More details on how the values were estimated and their distributions can be found in the manual for the UCRB model (CWCB & CDWR, 2016). To represent changes in evaporation within the experiment, we sample additive changes to the monthly evaporation rates and apply them to all modeled reservoirs.

Scenarios for Changing Demands
Colorado's population has seen a fivefold increase in the past century, a growth trend that is expected to continue (DOLA, 2015). Maximum projection levels anticipate population to double by the mid-21st century, further increasing municipal and industrial water demands. Even though population growth is inevitable, state and local governments can influence how and where the population grows (State of Colorado, 2015), as well as how much water is actually needed to support this growth. Since 2000, conservation and efficiency programs implemented throughout the state have reduced per capita water needs by as much as 30% in some communities (CWCB, 2010). Additional conservation measures planned throughout the state are expected to further reduce industrial and municipal per capita water demand (State of Colorado, 2015).
Throughout Colorado and in the UCRB particularly, agriculture represents the largest sectoral water demand. Increasing temperatures are expected to increase the number of days in the growing season for perennial crops (most predominant in the UCRB), as well as the crop demand for irrigation water. As a result, crop irrigation requirements may increase by up to 30% by 2040 (CWCB, 2012). In contrast, the CWCB expects total irrigated acreage to decline, as many municipalities meet their increasing demands by acquiring agricultural water rights. Real estate developers have also been purchasing farmland to expand urban areas. Reduced irrigated land might result in crop irrigation requirements reducing by up to 20% by 2050 (State of Colorado, 2015).
Given these conflicting estimates, it is unclear exactly how water demands across these sectors might change in the future. In this study, we explore the effects of both growth and decline in each type of demand in the UCRB (municipal and industrial, transbasin, and irrigation) of up to 50%. For transbasin diversions, we apply these rates of growth or decline to their maximum historical monthly demands observed in the historical record, considering this the ideal amount of supply. It should be noted that any increase in demand is limited by the flow and storage rights associated with each diversion. As we are not considering the purchase of additional rights by the users in this study, the shortages reported in section 4 are in part due to the fact that the demands might be exceeding the right associated with each user, not necessarily due to water being unavailable. In reality, users, and especially the large municipal users diverting water to the eastern plains, could consider the purchase of additional water rights, among other measures, to meet their growing demand.
However, within this exploratory experiment, streamflows and irrigation demands should be anticorrelated, as evapotranspirative demands increase in drought years when flows are low. To ensure consistent irrigation and streamflow time series within each SOW, we compute the synthetic annual flow anomaly at the 10.1029/2020EF001503 last node in every simulated year. We then use a regression of historical annual total irrigation anomalies versus historical annual flow anomalies at the last node to determine an appropriate annual total irrigation anomaly for each year of that simulation (with added noise to preserve variance). We then add this time series of irrigation demand anomalies to the mean irrigation demand for that SOW. These total irrigation demands are then distributed to all of the irrigation structures using their average historical proportions of the total demand. Additional details on the regression used to anticorrelate streamflow anomalies and irrigation anomalies are provided in the SI.

Environmental and Institutional Changes
The last three factors sampled in our exploratory modeling experiment represent environmental and institutional changes in the basin with potentially consequential impacts to some users. The first is an environmental factor related to erosion and sedimentation of reservoirs in the UCRB, resulting in reduced storage. As shown in Table 1, for each SOW we sample potential losses of up to 20% of capacity, a number informed by Graf et al. (2010).
We sample the institutional effect of whether the Shoshone Power Plant continues its operation using a binary variable indicating whether or not it maintains its "beneficial use" of water. If the Shoshone Power Plant shuts down (Gardner-Smith, 2019; Proctor, 2008), its decreed volume of water can no longer be put into beneficial use, allowing otherwise out-of-priority users to take advantage of the available water instead. However, junior users downstream of the Shoshone Power Plant might be negatively affected, as water that would have otherwise been delivered by the Shoshone call, can now be diverted to more senior upstream users.
The final institutional change we sample is the effect of changing the flow right at the 15-mile reach to be a senior right. This is again a binary factor where a sampled value of 1 enforces the potential legal change of assigning the 15-mile reach a first priority senior right with a decree of 22.94 m 3 /s (810 cf/s, the minimum flow rate recommended during dry conditions USFWS, 1999). Including this binary variable in our experiment allows us to investigate the effects of legally prioritizing the system's ability to meet the minimum flow requirement.

Robustness Analysis and Scenario Discovery
To assess robustness, one needs to define the outcome of interest, measured by some performance metric (e.g., volume of delivered water and cost), and aggregate the value of that performance metric across the scenarios being considered (McPhail et al., 2018). This paper explores how different definitions of acceptable water shortages affect relative robustness values across users (Panels IV and V in Figure 3). In this case, we focus on satisficing as the primary measure.
The domain criterion satisficing metric (Starr, 1962) has been widely used in the robustness literature and quantifies the fraction of scenarios in which a desired performance is met. The desired performance is typically defined by some performance threshold (e.g., a maximum budget if performance is concerned with cost); the greater the number of scenarios in which this threshold is met, the more robust the system. Satisficing metrics are particularly applicable to water supply problems, as water supply management tends to be concerned with assuring that the necessary quantities of water are provided (e.g., to sustain agricultural production), rather than maximizing water consumption or minimizing deficit (Beh et al., 2017;Hall & Borgomeo, 2013). Using this robustness metric, we investigate how different performance thresholds affect both the interpretation of robustness and the scenarios under which the system performance is considered acceptable. Since the UCRB is a system of diverse stakeholders, we perform this analysis for all users represented in StateMod.
The different performance thresholds used in this study have been defined using combinations of magnitude and frequency of shortage. Magnitude of shortage is defined as the percentage of user demand that is not met, while frequency is defined as the percentage of time (in months) that the user experiences such shortage. Both measures of shortage (magnitude and frequency) are binned by percentiles with a width of ten (e.g., one bin might include 0th to 10th percentile of shortage magnitude and 10th to 20th percentile of shortage frequency), resulting in 100 combinations of performance thresholds for each user (t in Figure 3). Applying these performance thresholds for each user (indicated as f ′ (u, t) in Figure 3) results in a satisficing 10.1029/2020EF001503 surface for each user, indicating their robustness value as determined by all threshold combinations (indicated as R(u, t) in Figure 3). Scenario discovery is then performed for select performance thresholds for each user, informed by the frequency with which they historically experienced shortage of that magnitude.
Scenario discovery is most typically performed using the Patient Rule Induction Method (PRIM) (Friedman & Fisher, 1999) or Classification and Regression Trees (CART) (Breiman, 1984). One of the main criticisms of these methods is that they rely on orthogonal subspaces to classify the parameter space into success or failure, making their application to nonlinear systems inadequate (Kwakkel, 2019). Logistic regression is a common technique in binary classification problems that overcomes this limitation by using a nonlinear function to describe the probability that a SOW belongs to the scenarios that lead to failure (Lamontagne et al., 2019;Trindade et al., 2019). This also allows users to define regions of success based on the estimated probabilities of success in those worlds, a more intuitive way of capturing different levels of risk aversion than coverage and density used in PRIM. As the factors considered in this study are assumed to be deeply uncertain, the probability that a parameter combination (or SOW) will occur cannot be known. However, with logistic regression we can predict the probability that, should a SOW occur, the performance threshold will be met. The resulting probability estimates can then be used to discover parametric combinations (scenarios) that lead to failure. The scenario discovery procedure is performed independently for each user and is detailed below.
For each performance threshold (combination of shortage magnitude and frequency), several logistic regression models are built using a different covariate (uncertain factor) each time. Every model estimates the probability that a SOW will be classified as a success (1) or a failure (0) in meeting the performance threshold, given the value of the uncertain factor value in that SOW. Each logistic regression model is defined by where p i is the probability the performance in the ith SOW will be classified as a success, X i is the vector of covariates describing the ith SOW, and is the vector of coefficients describing the relationship between the covariates and the response, estimated using maximum likelihood estimation. Iterating through all possible covariates, one at a time, we identify those with the highest McFadden's pseudo-R 2 . McFadden's pseudo-R 2 is given by where lnL (M ull ) is the log likelihood of the full model (model with predictor) and lnL(M intercept ) is the log likelihood of the intercept model (model with no predictor besides the intercept). The intercept model predicts the mean probability of success across all SOWs, and McFadden's pseudo-R 2 is a measure of improvement of the model with the predictor variable over the intercept model. Higher McFadden's pseudo-R 2 values can therefore be used to rank all candidate covariates in terms of significance in predicting success or failure (Quinn et al., 2018). After identifying the top two individual predictor variables for the performance threshold being modeled, we then build a new logistic regression model with both as covariates.
While this approach could be problematic with historic observations, multicollinearity is not a concern in this experiment since all uncertain factors being used as covariates were sampled independently in the Latin hypercube design. From our new bivariate logistic regression model, a probability plane can be plotted to create a "factor map" illustrating the probability of success given the values of the two variables. An interaction term was also added to the logistic regression model, to capture potential interactions between the top two factors. The factor maps show how users' sensitivities to the deeply uncertain factors differ depending on the satisficing threshold (i.e., shortage magnitude and frequency) used in the robustness definition.

Impacts of Changing Conditions on UCRB Users
Across the synthetic ensemble of 10,000 realizations (each one encompassing 105 years of streamflow), water users in the UCRB are affected in very different ways by deeply uncertain changes in streamflows, demands, and institutions, experiencing broadly different shortages (unmet demands) over time and across simulations. Figure 5 shows annual shortages experienced historically and across the ensemble for the same users presented in Figure 2 (i.e., f(u, S) for each user in Figure 3). The black line in every panel indicates magnitude of annual shortage (y axis) and the percent of time it was experienced historically (x axis) for each user. The shaded areas represent the frequency with which shortages of different magnitudes were experienced at each percentile across the ensemble. For example, looking at panel (Figure 5e), the annual shortage at the 20th percentile historically was approximately 50M m 3 (40,536 acre feet). Within each realization of the ensemble experiment, annual shortages at the 20th percentile ranged between near 0 and 120M m 3 (9,729 acre feet). Lighter shades within these areas indicate increased cumulative frequency within the ensemble.
Looking across the panels in this figure, it is apparent that users in the UCRB experience diverse impacts as a result of these changing conditions. For the senior-and median-right irrigation users (Figures 5a and 5b), historical experiences appear to lie at the lower bands of cumulative frequency in the ensemble (0-10% and 10-20%), meaning that in the majority of realizations simulated (80-90% of simulated SOWs) they experience higher and more frequent shortage magnitudes across all percentiles of shortage. The historic shortages for the large-decree user (irrigation right with the largest decreed flow of water) and the transbasin diversion (Figures 5d and 5e) lie in the 50-60% and 40-50% bands of cumulative frequency for all shortage magnitude percentiles, respectively. This means that in approximately half the SOWs, shortage magnitudes and frequencies increase, whereas in the rest they are reduced, in some cases down to 0. For the 15-mile reach (Figure 5f), the historic shortages lie near the 60% band of cumulative frequency, which means that in most SOWs the shortages for this diversion are reduced in both magnitude and frequency (to no shortages at all for the entire 105-year sequence). One also notes that deducing relative sensitivities of the six users to changing conditions based on historical observations, reported in Figure 2 in monthly magnitudes (also in black lines in Figure 5), would have been misleading. For example, in Figure 2a, the senior-right irrigation user appears to have historically had less frequent shortages (maximum three months in a year) of smaller magnitudes-less than 100,000 m 3 (or 81 acre feet) per month-with the 2002 drought year only having a moderate impact on their water yield. The large-decree irrigation user, on the other hand (Figure 2d), had experienced shortages for up to six months in a year, with magnitudes over 10,000,000 m 3 (8,107 acre feet) per month. Yet the historically more vulnerable user (Figure 5d) appears less sensitive to changing conditions in Figure 5 than the historically less vulnerable user (Figure 5a). This highlights that if the state is concerned with addressing the vulnerabilities of various users in the basin, their past performance is not necessarily a direct indicator of their vulnerability under changing conditions.
With the exception of the large-decree irrigation user and the transbasin diversion (Figures 5d and 5e), it is also apparent that the magnitude impacts are not uniformly distributed across shortage percentiles. Looking at panel (Figure 5c), the variance in absolute magnitudes of shortage across the ensemble appears to be much larger for higher percentiles of shortage than for lower percentiles. The inverse appears to be true for the user in panel (Figure 5b). There is also significant variability in how the frequencies of shortage are affected. For example, for the senior-right irrigation user (Figure 5a), no shortages were experienced 70% of the time historically but are experienced up to 90% of the time in the worst-case ensemble member, making shortages up to three times more likely in that scenario. On the other hand, their worst historical shortage at the 99th percentile has moved to around the 90th percentile in the worst case scenario, making it up to 10 times more likely to be exceeded. The large-decree irrigation user (Figure 5d) historically experienced a shortage less than 40% of the time, but under the ensemble they experience some level of shortage at all times in the more challenging realizations. It is apparent from these comparisons that general inferences by users individually or aggregated by type would not sufficiently capture the nonlinear differences in how shortage magnitude and frequency may change for UCRB stakeholders. Water right seniority and level of extraction are also not indicative a priori of how the various users might be affected, as can be deduced by comparing across panels (Figures 5a-5c), and ( Figure 5d) and (Figure 5e), respectively.

Translating User Impacts Into Robustness
To understand how robust different UCRB users are to the changing hydrologic conditions, human systems demands, and environmental institutional factors sampled in this study, the impacts in Figure 5 need to be transformed into a robustness metric, in this case, the domain criterion satisficing metric (Starr, 1962). As discussed in McPhail et al. (2018), the calculation of a robustness metric involves three steps: (i) a transformation of the shortages experienced by each user to an aggregate measure of performance over time in each simulation (e.g., number of periods with shortage, maximum shortage); (ii) a scenario selection over which robustness is to be assessed; and (iii) an aggregation of the metric in (i) across the scenarios in (ii).
In this study, we use a binary transformation in Step (i) to define realizations as "successes" or "failures" if a performance threshold is met (i.e., f ′ (u, t) for each user and threshold). For Step (ii), we use the full set of realizations to assess robustness (10,000), and for Step (iii), we calculate robustness by counting the number of realizations that meet the performance threshold (i.e., the number of successes, R(u, t) for each user and threshold). For each user a matrix of satisficing performance thresholds is used that encompasses 100 combinations of shortage magnitude and frequency. Each combination of magnitude and frequency of occurrence is a potential performance threshold over which robustness is evaluated. Figure 6 presents these matrices of performance thresholds and the resulting robustness values for the same six users presented in Figures 2 and 5. Robustness for a shortage magnitude and frequency is calculated as the percentage of SOWs where shortages of that magnitude do not occur more often than the specified frequency. The magnitude and frequency of shortage are classified on their annual scales. For example, looking at Figure 6a (the upstream senior-right irrigation user) and taking the third square from the left in the top row: This performance threshold is defined by shortages of magnitude at least 30% of the annual demand occurring at a frequency of 10% of the years or fewer in a given realization. Realizations that meet this criterion are considered successes, while those that do not are considered failures. The color of the square indicates the percentage of realizations that are considered successes (in this case, approximately 30%). In each matrix, moving from the left to the right indicates performance concern with shortages of higher magnitude, and moving from the top to the bottom indicates performance concern with shortages of higher frequency. Consequently, moving from the top left to the bottom right represents a shift in concern from low-magnitude, low-frequency shortages to high-magnitude, high-frequency shortages. This translates to moving from a focus on less severe droughts to far more extreme events (in terms of shortage magnitude and frequency).
Each user's potential willingness to accept a magnitude and frequency of shortage can be contextualized in each Figure 6 panel by the squares with black borders. Black borders indicate the historical frequency with which each magnitude of shortage was experienced by each user. To facilitate presentation, frequencies have been binned in brackets of tens: moving from the top to the bottom, the top row indicates frequencies of 1-10% of the time, 11-20% of the time, and so on. For example, looking at Figure 6a again, this user has historically experienced shortages at a magnitude of at least 10% of demand at a frequency of 11-20% of the time. They also experienced a shortage magnitude of at least 20% of demand at a frequency of 1-10% of the time, and a shortage magnitude of at least 30% of demand (one square to the right) at the same frequency interval. Shortages of magnitudes higher than 40% of demand were never experienced.
The median-right irrigation user (Figure 6b) experienced shortage magnitudes of up to 100% historically, albeit infrequently. A shortage magnitude of at least 10% of their demand took place 70-80% of the time. The same comparison for the large-decree irrigation user (Figure 6d), suggests that even though some shortage took place 60% of the time (Figure 5d), all of those shortages were of magnitudes below 10% of demand. The transbasin diversion (Figure 6e) appears to have experienced shortages throughout the historical record (matching the corresponding Figure 5e), which were at magnitudes of up to 50% of demand in magnitude. Shortages of higher magnitudes were also experienced at reduced frequencies. Finally, results for the 15-mile 10.1029/2020EF001503 reach are presented slightly differently (Figure 6f). Moving from left to right, each column indicates decreasing flow rates to be met in the river, and moving from top to bottom, each row indicates the percent of time that flow rate is met. These discrete flow rate levels were informed by the U.S. Fish and Wildlife Service recommendations for 46, 35, and 23 m 3 per second (or 1,630, 1,240, and 810 cubic feet per second) under wet, average, and dry hydrologic conditions, respectively (USFWS, 1999). The minimum official requirement of 23 m 3 per second was met 90-100% of the time historically. This flow rate was only met at the same frequency in approximately 40% of the sampled realizations (as indicated by the square's color). If alternatively, one chooses to have the same flow rate but only 60-70% of the time (i.e., move three squares down), the robustness value changes to 100% of the realizations being able to meet this performance criterion (square in dark blue).
Looking at the color gradients across all of the panels in Figure 6, robustness values vary significantly subject to the magnitude-frequency combination of threshold selected, as expected. Even though generally minimizing shortages is clearly a concern for all stakeholders in this system, this figure highlights the complexity of choosing a single magnitude-frequency combination threshold (e.g., to not have shortages of magnitude of 50% at a frequency of more than 20% of the time) that would be appropriate and relevant across the diverse stakeholders in the UCRB. Additionally, using a single threshold for individual stakeholders or stakeholder groups is inappropriate as the relationship between the satisficing thresholds and robustness values is neither simple nor consistent across users. Across the panels in Figure 6, the slope of the transition plane from thresholds resulting in 0% to 100% robustness values (from deep red to deep blue) appears to be equally dominated by both the magnitude and frequency dimensions in Figures 6b and 6f but more strongly affected by its magnitude dimension in Figures 6c and 6e. This slope also appears to vary in steepness, for example, comparing the transition in Figure 6b with that in Figure 6f.
This complex relationship between performance thresholds and robustness would not be known a priori: It is highly dependent on the interactions between all the considered uncertainties (hydrologic, demands, and infrastructure), the institutional water rights allocation hierarchy, and physical constraints on water availability in different regions of the basin. For example, senior users located on small upstream tributaries may be more vulnerable than junior users located downstream on the mainstem, as water right seniority cannot protect against physical supply limitations. Even though history can provide some information on how these interactions impact different users (the black lines in Figure 5 and the black borders in Figure 6), our exploratory modeling illustrates the complex relationship between alternative magnitude-frequency performance thresholds and robustness values for each user, as shaped by the combination of their water right seniority, extraction levels, and location in the basin. For example, robustness values for the large-decree irrigation user (Figure 6d) appear to be insensitive to the choice of performance threshold, in contrast to all the other users in this figure. Differences are also observed in the range of robustness values experienced by different users in meeting their history-informed magnitude-frequency combinations (squares with black borders). These combinations represent users' previous experiences and could potentially be decision-relevant candidates as robustness performance metrics: In Figure 6a the robustness values at the historical thresholds range from 7% to 24% robustness, in Figure 6b they range from 3% to 99% robustness, in Figure 6c from 39% to 99%, in Figure 6e from 13% to 63%, and in Figure 6f from 35% to 88%. For the single history-informed threshold in Figure 6d, the robustness value is 62% (indicated by the light blue color).
Most importantly, the relative robustness ranking of these six users is not always stable across performance thresholds. For the sake of illustration, if one were to gauge the robustness of this system using "Robust if shortages of 10% of demand do not occur more frequently than 80% of the time" as the metric definition, the relative robustness of the four irrigation and the transbasin diversion would be the following (from most robust to least robust): d > a > c > b > e. If, instead, the robustness metric was selected to be "Robust if shortages of 30% of demand do not occur more frequently than 30% of the time," their relative ranking would become: d > c > a > b = e (notice relative rankings for (a) and (c) switching). Lastly, if robustness was assessed using "Robust if shortages of 80% of demand do not occur more frequently than 10% of the time," their relative ranking would again switch to: d = a > c > e > b (notice relative rankings of (b) and (e) changing). This finding further highlights how different robustness metrics, applied for all users in the basin, can yield different and unpredictable multiactor robustness inferences. Note. Two performance thresholds for the median-right, junior-right, and transbasin diversion user were selected to perform scenario discovery. The respective factor maps are presented in Figure 7.
Alternative robustness metrics might also yield different conclusions about which deeply uncertain factors most influence success and failure at those satisficing thresholds. We investigate these implications by applying scenario discovery to six of the history-informed robustness performance metrics, highlighted with a yellow border in Figure 6. There are two performance thresholds each for the median-right, junior-right, and transbasin diversion users, and they are listed in Table 2.
Exploiting logistic regression, each panel in Figure 7 plots the top two most important factors controlling robustness performance for a given magnitude-frequency threshold, along with their interaction (most important factor × second most important factor). In our factor mapping, the values of the most important factor are always plotted on the x axis and the values of the second most important factor are always plotted on the y axis in each panel. The most important factors were determined based on their predictive ability in the regression model. Their predictive ability is indicated by their McFadden's pseudo-R 2 values, which are provided in the SI for each panel in Figure 7. Points in each panel represent the sampled SOWs, in blue if at least 50% of the realizations for that SOW met the respective criterion in each panel (successes) and in red if less than 50% of the realizations met the criterion (failures). Using this method, we can map parametric regions of the uncertain factors to conditional probabilities of failure and success and use the model to predict the likelihood that a criterion will be met given a change in that factor. As changes in mean wet and dry flows are sampled independently, in approximately 6% of the sampled SOWs the values for the mean dry flows exceed those for the sampled mean wet flows. These samples have their parameter labels swapped during the scenario discovery part of the analysis, to avoid misclassifying the SOWs. The parametric area corresponding to those samples is not assigned conditional probabilities of success or failure (blank area in Figure 6c). Note that the mean wet and dry flow multipliers have also been converted to real space in the figure to aid interpretability.
The two alternative robustness performance metrics for the median-right irrigation user (Figures 7a and 7b) result in different robustness values ( Figure 6b) and factor maps. Using the "10% demand-80% of the time" performance metric (Figure 7a), robustness for this user is 16% of the sampled SOWs (derived from square color in Figure 6b). Success and failure are first controlled by the shift in the timing of snowmelt and then by changes in the mean wet flow. The failure region for this metric is a large portion of the sampled space, with failures occurring along the entire range of snowmelt shifts (0-60 days earlier) and successes only occurring when combined with wetter worlds (higher mean wet flows). Using the "80% demand-20% of the time" performance metric ( Figure 7b) the most important factor change is to the mean dry flow, followed by the shift in snowmelt. Higher magnitude shortages typically occur in drier years, the flow of which is in part controlled by changes in the distribution of mean dry flows. Use of this metric results in a 52% robustness value (derived from square color in Figure 6b), reflected in the now smaller failure region (red areas), characterized by drier low flows and snowmelt shift. Both these panels also show that interactions between shifts in snowmelt timing and changes in the mean wet and dry flows appear to further affect success in Figures 6a and 6b, respectively.  Table 2. Points in each panel represent the sampled SOWs, in blue if at least 50% of the realizations for that SOW met the criterion (successes) and in red if less than 50% met the criterion (failures). For each factor map the parameter with the highest predictive ability is plotted on the x axis, and the parameter with the second highest predictive ability is plotted on the y axis. In brackets next to each parameter name, we indicate whether the parameter had an additive or a multiplicative effect on the system. Note that the blank area in the top left corner of panel (c) is not assigned conditional probabilities of success as it represents sampled SOWs in which the mean dry flow exceeds the mean wet flow. Mean wet and dry flow parameters have been converted to real space to aid interpretability.
Looking at the two alternative performance metrics for the junior-right irrigation user (Figures 7c and 7d), similar changes can be seen. Using the "10% demand-80% of the time" performance metric (Figure 7c), robustness for this user is 60% of the realizations (derived from square color in Figure 6c). Success and failure are determined primarily by the distributions of wet and dry years. This means that applying the exact same robustness metric to two different irrigation users in this basin results not only in different robustness values (16% for the median-and 60% for the junior-right user) but also different drivers of their failures. Even though prior studies have highlighted that robustness definitions can yield unintended multiactor consequences (Herman et al., 2014;Trindade et al., 2017), we believe this to be the first study to evaluate and quantify how the highly diverse users in an institutionally complex river basin are faced not only with different levels of vulnerabilities but also and how changes in their preferred measures of robustness can fundamentally reshape which deeply uncertain factors control their performance.
The two alternative performance metrics for the transbasin diversion (Figures 7e and 7f) also result in different robustness values ( Figure 6d) and factor maps. In both instances the factor most significantly controlling success in the performance metrics is transbasin diversion demands, followed by the shift in snowmelt timing for the "70% demand-30% of the time" metric, and by changes in the dry flow distribution for the "80% demand-10% of the time" metric. The "70% demand-30% of the time" metric is met by 34% of the simulated realizations (derived from square color in Figure 6e) and can almost never be met in realizations with transbasin diversion demands increasing by more than 20%. For this metric to always be met, transbasin diversion 10.1029/2020EF001503 demands would actually need to decrease, either by conservation measures or through other means. This is also true for the "80% demand-10% of the time" metric.
Finally, the failures highlighted in the panels of Figure 7 (the red points) show failure realizations have streamflows across the sampled range, both in drier and wetter conditions. This indicates that failure across users and preference thresholds is determined by factor interactions beyond those determining streamflow, as even in the most favorable SOWs (with wetter conditions) there are multiple realizations failing to meet the preference thresholds. These findings underline the importance of exploring the impacts of a wide number of uncertain factors on a variety of performance metrics of interest, as (i) there might exist several failure modes resulting from the interactions between the uncertain factors and (ii) the relative effects of these factors and interactions might vary for different impacts of concern. This appears to be the case when comparing across users and across types of impacts for a single user. The scenario discovery also suggests that the probabilities of failure and success in all performance metrics can be variably sensitive to the identified factors. In some instances, minor shifts in a factor's value may result in movement to the failure region (for example, changes of 10% in the mean dry and wet flows). Moving across different combinations of magnitude and frequency of shortage, the relative effect of different factors varies (even when only looking at the effects on a single user).

Conclusions and Future Work
The UCRB is an institutionally complex basin supporting a multitude of diverse stakeholders and agricultural, municipal, and industrial activities. Water allocation in the basin happens through an intricate network of hierarchical water rights shaping both supply and shortage dynamics for all users. In this study, we use StateMod, a model of the basin's water flow, accounting, and allocation, to resolve the dynamics of supply for hundreds of the basin's stakeholders, and investigate their changing vulnerabilities under an ensemble of potential (deeply uncertain) futures. We pair StateMod with a bottom-up vulnerability assessment framework, advancing previous work in this area by formalizing an a posteriori exploration of a spectrum of performance preference thresholds, defined by combinations of shortage magnitude and frequency. We investigate the consequences of alternative preference thresholds for robustness conclusions, as well as for the identification of consequential scenarios for the different users.
Our analysis demonstrates large differences in how various users in the basin are affected by the same ensemble of uncertain futures, shaping both the magnitude and frequency of shortages in diverse ways across users. Furthermore, their relative vulnerabilities to changing hydrologic and socioeconomic conditions differ from their historical experience, stressing that water resources management concerned with addressing shortage vulnerabilities for individual users cannot only depend on their past performance to infer future impacts. Our findings also show a complex mapping between performance thresholds and robustness values that is shaped by the interactions between the considered uncertain factors and the water allocation hierarchy. This highlights the complexity of selecting a single performance threshold a priori, that is equally relevant across the diverse stakeholders in the basin, even within a single sector (say, agriculture). Lastly, when discovering consequential scenarios for UCRB users, we also show for the first time that the most important factors driving shortages can change significantly across users and preferred performance thresholds (i.e., inferences of what matters and for whom can change significantly).
These results imply that, for systems of this type, predicting impacts on, say, agriculture using a single functional relationship or metric would dramatically misrepresent the impacts for the different users. Management policies designed based on these simplified estimations and key factors could potentially be blindsided by uncertainties affecting ignored types of shortage or users. While it is beyond the scope of this paper to quantify the sources of heterogeneity in effects across the users, it is apparent that the human elements of this system, the water rights in particular, are fundamental components shaping the experienced impacts, by translating hydrologic and demand changes to supply and allocation. StateMod allowed us to resolve these effects at a high resolution and demonstrate their role in distributing vulnerabilities for the users. We believe this finding to be strong evidence that the modeling representation of such complex systems affects the conclusions drawn, and consequently, it is important to use highly resolved models of multiuser systems when assessing impacts specific to their stakeholders. Simply aggregating either shortage quantification or robustness valuation per sector would collapse their diversity to a single, poorly representative, value.
Future work will explore the mechanisms by which these rights influence the sensitivities of different metrics for different users through cluster analysis. We would like to discover if sensitivities are more informed by right seniority, magnitude of decree, basin geography, sectoral use, or other factors, as well as the extent of their interactions. Moreover, we plan to investigate the stability of robustness values achieved using different robustness metrics (e.g., regret-based metrics as opposed to satisficing) or using alternative performance threshold definitions. These definitions could, for example, take into account shortage duration. This is a relevant shortage characteristic that has not been included in these results but that could be influenced by changing hydrologic persistence, an especially pertinent concern given the current bidecadal drought of the 21st century.

Data Availability Statement
StateMod is available at GitHub (https://github.com/OpenCDSS). The input files to run StateMod for the UCRB can be found at the website (https://www.colorado.gov/pacific/cdss/surface-water-statemod). All the scripts to replicate the analysis performed in this paper and regenerateft Hadjimichael and Quinn (2020).