The Value of Initial Condition Large Ensembles to Robust Adaptation Decision‐Making

The origins of uncertainty in climate projections have major consequences for the scientific and policy decisions made in response to climate change. Internal climate variability, for example, is an inherent uncertainty in the climate system that is undersampled by the multimodel ensembles used in most climate impacts research. Because of this, decision makers are left with the question of whether the range of climate projections across models is due to structural model choices, thus requiring more scientific investment to constrain, or instead is a set of equally plausible outcomes consistent with the same warming world. Similarly, many questions faced by scientists require a clear separation of model uncertainty and that arising from internal variability. With this as motivation and the renewed attention to large ensembles given planning for Phase 7 of the Coupled Model Intercomparison Project (CMIP7), we illustrate the scientific and policy value of the attribution and quantification of uncertainty from initial condition large ensembles, particularly when analyzed in conjunction with multimodel ensembles. We focus on how large ensembles can support regional‐scale robust adaptation decision‐making in ways multimodel ensembles alone cannot. We also acknowledge several recently identified problems associated with large ensembles, namely, that they are (1) resource intensive, (2) redundant, and (3) biased. Despite these challenges, we show, using examples from hydroclimate, how large ensembles provide unique information for the scientific and policy communities and can be analyzed appropriately for regional‐scale climate impacts research to help inform risk management in a warming world.


Introduction
Borges said that the future is inevitable and precise, but it may not happen (Borges, 1941). For the science of climate change projections, the implications of his insight are clear: Across multimodel projections of climate change (like those from Phase 6 of the Coupled Model Intercomparison Project, or CMIP6; Eyring et al., 2016), warming is inevitable. And while each model's simulation is a precise rendering of the future climate, no one model projection will happen.
The inevitability of warming creates a desire for scientists to provide accurate estimates of the future to best inform adaptation decisions. Given scant resources, where do decision makers put their efforts to greatest effect (Paté-Cornell, 1996)? The deep uncertainty (Lempert et al., 2003) in regional-scale projections of climate change naturally leads to questions about whether climate science can inform such decisions. The models give different answers to the same sets of assumptions, so which model do we choose to trust? And what is the basis for that trust? Because of these difficulties, decision makers are left with questions about whether the range of climate projections across models is due to scientific choices about how best to simulate processes like convection, clouds, or tree mortality, or whether it is instead due to natural internal variability of the climate system, which should be interpreted as a set of possible real-world outcomes consistent with the same global warming. As a result, climate uncertainty, such as different answers across models, can be cast as a justification for indecision, suggesting that the science needs to mature before it can be helpful to decision-making (e.g., Koonin, 2014; Palutikof et al., 2019; Weaver et al., 2013). In a recent survey of local-scale adaptation to sea level rise in coastal Australia, for example, respondents found scientific uncertainty to be the second most important barrier to the development of effective responses, after lack of leadership (Palutikof et al., 2019).
Initial condition large ensembles of fully coupled Earth System Model (ESM) simulations have, at first glance, further complicated this climate-science-for-decisions gap (Deser, Knutti, et al., 2012; Hawkins et al., 2016). Such ensembles are experimentally similar to those used in weather forecasts: Run one model with one set of boundary conditions many times with different initial conditions to generate a distribution of outcomes consistent with the same assumptions. The key differences from weather forecasting are threefold: (1) The spatiotemporal scale of the integration is larger in initial condition large ensembles, which are typically global in scale and centennial in length; (2) there are different methods for initializing the model, such as introducing round-off errors into atmospheric fields (e.g., Kay et al., 2015; Kirchmeier-Young et al., 2017; Selten et al., 2004; Sterl et al., 2008) versus sampling the ocean-atmosphere and/or land states (e.g., Maher et al., 2019; Rodgers et al., 2015), a distinction sometimes called micro- versus macroinitialization (Hawkins et al., 2016); and (3) each model realization is fully coupled, simulating the interactions of the ocean, atmosphere, cryosphere, and land surface, often with active biogeochemistry under evolving greenhouse gas emissions. This experimental design has only recently become computationally feasible and is not one of the handful of common "Diagnostic, Evaluation and Characterization of Klima" (DECK) experiments mandated by CMIP6. Large ensembles have revealed an underappreciated source of uncertainty in regional climate projections: that of internal climate variability (Deser, Knutti, et al., 2012; Deser et al., 2020; Kay et al., 2015). Therefore, not only do different models give different answers about future climate (as in the CMIP5), but so does the same model when run many times (as in a large ensemble).
In the former situation, there remains hope that we can constrain uncertainty by improving models. In the latter situation, the uncertainty emerges within the same model, and so it cannot necessarily be tied back to how we build models. Instead, this uncertainty is irreducible as it is intrinsic to the climate system-both simulated and real.
The existence of irreducible uncertainty in model predictions has been known since Lorenz (1963). What large ensembles have illustrated, however, is how large and persistent irreducible uncertainty is in centennial-scale climate simulations from state-of-the-art ESMs. In fact, at a recent community workshop on large ensembles hosted by CLIVAR (https://usclivar.org/meetings/large-ensembles-workshop), there was an explicit discussion among attendees about moving away from using the word "uncertainty" when talking about the range of outcomes from large ensembles for fear of reinforcing mistrust in model projections. This is because the large magnitude of "irreducible uncertainty" in climate projections can confound people's expectations of what climate change looks like at regional scales (Deser, Knutti, et al., 2012; Deser et al., 2020), perhaps reinforcing the sentiment that climate science is not positioned to inform adaptation decisions.
Here we illustrate why this sentiment is not correct. Characterizing the magnitude and sources of uncertainty, even when it is large, irreducible, and poses communication challenges, is not a barrier to effective decision-making; it is the means to effective decision-making. Risk assessment tools require the full range of probabilities of the most damaging risks, and there are rightful calls for climate science to take risk assessment much more seriously (Sutton, 2019). As we discuss in the sections below, large ensembles are a crucial tool for climate change risk assessment and adaptation decision-making because they allow us to (1) better attribute uncertainty and estimate its irreducible magnitude, something that cannot be done well in multimodel ensembles like the CMIP, and (2) present decision makers with the full range of outcomes consistent with the same forcing, so they can be robustly prepared for the future.
Several arguments about the importance of large ensembles to decision-making have already been made in the literature (e.g., Deser, Knutti, et al., 2012; Deser et al., 2020). The most common is that large ensembles help robustly identify the emergence of signals in climate projections, particularly in noisy systems like hydroclimate. A second important argument emerging from large ensembles concerns how uncertainty in climate projections should best be communicated to policymakers. Because the climate system is inherently noisy, there are appropriate arguments that scientists need to manage expectations about the limits of regional climate prediction (Deser, Knutti, et al., 2012; Deser et al., 2020). Here we extend these important points, arguing that the noise is not only a means to an end, such as signal identification or better climate communication. Instead, we point out that the noise itself provides information that is as valuable to planning as the signal. One may ask, if more than 40 ensemble members are needed to detect a significant change, is the change even relevant? Such a question comes from a signal-dominated perspective. The noise is not simply noise: its constituent realizations are outcomes, even if rare, entirely consistent with the same change in the mean state, and decision makers can only be prepared for those outcomes if they are communicated as possibilities. For example, even if all models agreed that the next two decades should on average see more rainfall because of climate change, that would not preclude the possibility of experiencing a drought. How could a water manager know to prepare for the possibility of drought if it is simply dismissed as noise (Lawrence et al., 2020)?
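A back-of-envelope calculation makes this point concrete. The numbers below (a forced change in decadal-mean precipitation and a standard deviation from internal variability) are invented for illustration, and decadal anomalies are assumed to be approximately normal about the forced change:

```python
from statistics import NormalDist

# Hypothetical, illustrative numbers: a forced increase in decadal-mean
# precipitation of +3% and internal variability (standard deviation of
# decadal anomalies) of 5%.
forced_change = 3.0  # % change in decadal-mean precipitation
internal_sd = 5.0    # % standard deviation from internal variability

# Probability that an individual decade still comes in drier than baseline.
p_dry_decade = NormalDist(mu=forced_change, sigma=internal_sd).cdf(0.0)
print(f"P(drier-than-baseline decade) = {p_dry_decade:.2f}")  # ~0.27
```

Even with a wetter forced signal, roughly a quarter of decades under these assumed numbers would still be drier than baseline, which is exactly the kind of outcome a planner dismissing "noise" would miss.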
Nevertheless, it is important to highlight that large ensembles also have considerable drawbacks. Some common critiques are that they are (1) resource intensive, (2) redundant, and (3) biased, potentially misrepresenting the true magnitude of internal variability. Focusing on hydroclimate, we engage each of these major drawbacks in the sections below as well, illustrating how they inform responsible use of initial condition large ensembles to aid regional-scale adaptation decisions.

The Importance of Sourcing Uncertainty for Decision-Making
Part of the power of large ensembles lies in their ability to rigorously quantify and attribute uncertainty in climate projections in ways multimodel ensembles alone cannot. This insight has recently been concretized by comparing several large ensembles to the multimodel CMIP5 ensemble. Such a comparison shows that internal variability can be underestimated and overestimated at the same time, depending on the variable, region, or temporal frequency being analyzed (Lehner et al., 2020). This implies that one cannot distinguish true model-to-model differences from noise in a multimodel ensemble without large errors. To understand why this is, it helps to recall the sources of climate uncertainty, why they are important to quantify, how they have been quantified to date, and why large ensembles better articulate the range of real-world outcomes a decision maker will be faced with.
Canonically, uncertainty in multimodel projections of climate change is partitioned across three sources (Hawkins & Sutton, 2009): (1) forcing (or scenario) uncertainty, which is a function of ambiguity about people's greenhouse gas emissions over the coming century; this is fundamentally a boundary condition problem (Kirtman et al., 2012; Meehl et al., 2009). (2) Structural (or model) uncertainty, which arises from models giving different answers in response to the same forcing and initialization; this uncertainty is a function of both computational limits and scientists' imperfect knowledge of the physics of the climate system, and of how those factors are reflected in modeling choices. (3) Internal variability, which refers to climate variations that occur in the absence of any changes in boundary conditions. It implies that the climate we experience is but one realization of many plausible futures, given the chaos innate to the climate system (Lorenz, 1963). This uncertainty in model projections is sometimes called irreducible (e.g., Deser, Knutti, et al., 2012; Hawkins et al., 2016) because the variability is intrinsic to model representations of climate. Because we cannot characterize the state of the climate everywhere simultaneously to perfectly initialize a model simulation (there will always be uncertainties in observations), estimates of internal variability are framed as an initial condition problem (Kirtman et al., 2012; Lorenz, 1963; Meehl et al., 2009).
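A Hawkins and Sutton (2009)-style partitioning can be sketched on synthetic data. Everything below is invented for illustration (the "models," their trends, and the noise amplitude); the forced response of each realization is estimated as a fourth-order polynomial fit, internal variability as the residuals, and model uncertainty as the spread across models' forced responses:

```python
import numpy as np

rng = np.random.default_rng(0)
years = np.arange(2006, 2081)
t = years - years[0]

# Synthetic stand-in for a multimodel archive: each "model" gets its own
# forced trend plus white noise standing in for internal variability.
n_models, n_members = 5, 3
trends = rng.normal(0.02, 0.005, n_models)  # per-model forced trend
sims = np.array([[m * t + rng.normal(0, 0.15, t.size)
                  for _ in range(n_members)] for m in trends])

# Forced response of each realization: a fourth-order polynomial fit.
fits = np.array([[np.polyval(np.polyfit(t, s, 4), t) for s in model]
                 for model in sims])
internal = sims - fits                            # internal variability
var_internal = internal.var()
var_model = fits.mean(axis=1).var(axis=0).mean()  # spread of forced responses
print(var_model, var_internal)
```

In a real multimodel archive with only one run per model, the residual and the across-model spread are entangled, which is precisely the conflation the next paragraphs discuss.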
Partitioning uncertainty in projections of climate change is not a pedantic exercise. The origins of such uncertainty have important consequences for the scientific and policy decisions made in response to climate change. For example, where do we focus our scientific efforts to best project climate change? And given these ongoing efforts, should a policymaker wait for more information before making a decision? Of these three sources of uncertainty, only two exist in the real world: scenario uncertainty and internal variability. In contrast, model uncertainty is a function of both our lack of knowledge about some climate processes and the computational challenges inherent in modeling a multiscale, complex system. This uncertainty does not exist in the real world and, unlike the other two uncertainty sources, will not shape the real-world climate people experience. Therefore, model uncertainty provides a veil of plausible deniability for making decisions-it is the uncertainty that suggests that climate science needs to mature.
While the sourcing of uncertainty in projections of climate change is important to decision-making, it is not clear that multimodel ensembles like the CMIP5 provide a rigorous quantification of these uncertainties. Large ensembles have revealed that the CMIP5 conflates model uncertainty with that from internal variability (Kay et al., 2015; Lehner et al., 2020; Mankin et al., 2015, 2017). One can see this clearly by comparing the total uncertainty in a climate projection from a multimodel ensemble to that in a single-model large ensemble. We make this comparison in Figure 1, using the large ensemble from the National Center for Atmospheric Research's Community Earth System Model (NCAR CESM1 Large Ensemble, or CESM1-LE) and the CMIP5 multimodel ensemble, both forced with a high-emissions pathway, called Representative Concentration Pathway 8.5, or RCP8.5 (Riahi et al., 2011). We show the relative magnitudes of uncertainty (measured as the full range of the CESM1-LE as a percentage of that from the CMIP5), focusing on two variables important to water management: basin-scale 75-yr trends in spring and summer runoff from snowmelt (Figure 1a) versus that from rainfall (Figure 1b), both estimated over 2006-2080. For many midlatitude basins, the uncertainty in the CESM1-LE, which is the range of simulated outcomes from internal variability alone, spans the majority of (or exceeds) the uncertainty in the CMIP5, which includes both model and internal variability uncertainties. These basins have large populations and varied demands on water resources that require confident estimates of future climate. For regional hydroclimate, the expectation based on multimodel ensembles is that model uncertainty should dominate that from internal variability on multidecadal time scales (e.g., Giuntoli et al., 2018; Hawkins & Sutton, 2011; Zhang & Soden, 2019).
But Figure 1 shows this is not always the case: for many basins, the CMIP5 ensemble is undersampling the uncertainty from internal variability-at least as estimated using the CESM1-LE-because there are too few simulations with each model to quantify their representation of internal climate variability. Large ensembles, therefore, demonstrate that uncertainty quantification in multimodel ensembles can be misleading for some regions and variables. An implication of this for decision-making is that the fractional uncertainty from internal variability is larger and more persistent than assumed and, more importantly, that its true extent (the form of uncertainty that exists in the real world) may not be accurately reflected in projected climate impacts from the CMIP5 multimodel ensemble. From a decision-making perspective, this is crucial, as decision makers are left with the question of whether the range of climate projections across models is due to model uncertainty or instead is entirely consistent with the same warming world. Consider a water manager in the western United States deliberating long-range risk management options: Figure 1 shows (via hatching) that snowmelt and rainfall runoff can exhibit both 75-yr increases and decreases in both ensembles, despite the fact that the ensembles represent different sources of uncertainty. Looking at such results from the CMIP5 alone, that water manager might argue that the models do not agree on anything, leading to a "wait and see" approach in decision-making. That same manager, looking at the results from the CESM1-LE, may recognize that there are many basin-scale climate futures consistent with the same warming, and that we need to be prepared for all of them.
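The Figure 1 diagnostic, expressing a large ensemble's range of 75-yr trends as a percentage of a multimodel range, can be sketched on synthetic series. The trend magnitudes and the random-walk "variability" below are invented for illustration, with model-to-model trend differences added only to the multimodel sample:

```python
import numpy as np

rng = np.random.default_rng(1)
n_yrs = 75

def trend(series):
    """Least-squares linear trend (units per year)."""
    return np.polyfit(np.arange(series.size), series, 1)[0]

def simulate(forced_trend, sigma=1.0):
    """One realization: a forced trend plus a random walk as 'variability'."""
    noise = np.cumsum(rng.normal(0, sigma, n_yrs))
    return forced_trend * np.arange(n_yrs) + noise

# Single-model large ensemble: one shared forced trend.
le_trends = np.array([trend(simulate(0.5)) for _ in range(40)])
# Multimodel ensemble: model-to-model differences in the forced trend.
mm_trends = np.array([trend(simulate(rng.normal(0.5, 0.1)))
                      for _ in range(40)])

# Express the large ensemble's trend range as a percentage of the
# multimodel range, as in Figure 1.
ratio = 100 * np.ptp(le_trends) / np.ptp(mm_trends)
print(f"LE trend range is {ratio:.0f}% of the multimodel range")
```

Under these assumed amplitudes, internal variability alone spans a large fraction of the multimodel range, mirroring the behavior described for many midlatitude basins.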

The Value of Large Ensembles to Robust Decision-Making
Adaptations to climate change will require regional responses encompassing institutional, social, and economic changes, as well as significant time horizons for planning. Because adaptations are best made proactively, their implementation will necessarily occur when internal climate variability still has the potential to amplify, mask, or reverse expected climate changes, even in the face of a clearly detected and attributed anthropogenic signal. Evaluating tradeoffs associated with adaptation options is difficult in a world where internal variability can create the appearance of maladaptation, or render an adaptation truly maladaptive over the short term, diverting resources to activities that may increase people's vulnerability to climate risks. Furthermore, adaptations require benchmarks for evaluation. A farmer may ask, for example, "Did changing my sowing date based on my expectations of climate change actually improve my yields?" Such evaluations are based on actual yields, meaning they are, in part, a function of the actual climate state and thus subject to irreducible uncertainty from internal variability. But humans, including water managers and farmers, have a long history of optimizing decisions under deep uncertainty, and scholars in decision sciences, engineering, and epidemiology have worked hard to develop tools to ensure that the decisions people make in response to things like climate change are robust despite such uncertainty.
Robust adaptation decision-making is the process of choosing a strategy or set of strategies that have the greatest benefit across the broadest range of potential real-world outcomes (Dessai & Hulme, 2004;Lempert & Groves, 2010;Lempert & Schlesinger, 2001;Lorenz et al., 2015), such as the outcomes consistent with internal variability. Consider the set of water management adaptations that would be evaluated under two different distributions of decadal precipitation changes over the western United States, inspired by Dessai and Hulme (2004) (Figure 2). The distribution in blue is an estimate of irreducible uncertainty following Hawkins & Sutton (2009) from 40 models in the CMIP5 multimodel ensemble, totaling 40 simulations. The distribution in red is the same but from seven different large ensembles, totaling 286 simulations. Both distributions are shifted to match the (potentially biased) CMIP5 ensemble mean, or "best estimate" of climate change. As such, it isolates only modeled internal variability. Across the bottom are possible adaptations in response to the different magnitudes of decadal-scale precipitation change, from sourcing new supply if the climate shifts to that exemplified by the decade centered on the 2012-2015 California drought to increasing dam heights for high precipitation decades like the decade centered on 1982-1983, when Lake Powell overtopped (Figure 2). The large ensembles, which have 33 fewer models than the CMIP5, encompass a larger range of variability in projected western U.S. rainfall. This larger variability occurs because the additional realizations in the large ensembles provide a more complete sampling of internal variability than one can get from the smaller number of simulations from each model in the CMIP5. This presents a challenge for robust decision-making.
Robust decision-making is a computational approach to stress-testing decisions for vulnerabilities against a set of possible scenarios. To carry out a formal decision analysis, one needs a full characterization of the possible outcomes, including the range related to the sampling of internal variability. In the case of the western U.S. water manager presented in Figure 2, decisions around long-term, capital-intensive hydrologic infrastructure and planning can be evaluated for vulnerabilities against the set of real-world outcomes consistent with internal variability. From a decision standpoint, the potential that internal variability's full extent is not being quantified or attributed in our multimodel projections is critical, because the set of strategies a decision maker adopts may differ if it is based on a truncated estimate of internal variability, as under the CMIP5 distribution, rather than a fuller estimate, as provided by the large ensembles. For example, if a water manager is only presented with the CMIP5 distribution, then the possibility of needing to seek new water supplies would not be part of the formal decision-making process. The consequences of that could include a misallocation of resources and a lack of preparedness for the water availability risks facing the region.
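A minimal version of such a stress test can be sketched as follows. The strategies, cost functions, and outcome ranges are all hypothetical, chosen only to show that the strategy minimizing worst-case regret can change when the sampled outcome range widens from a truncated (CMIP5-like) to a fuller (large-ensemble-like) distribution:

```python
import numpy as np

# Hypothetical strategies and costs for a stylized stress test: pick the
# strategy with the smallest worst-case regret across a sampled range of
# decadal precipitation changes, dp (%).
def cost(strategy, dp):
    if strategy == "wait_and_see":
        return 2.0 * abs(dp)         # no fixed cost; unprepared either way
    if strategy == "new_supply":
        return 6.0 + max(0.0, dp)    # fixed cost; pays off in dry decades
    if strategy == "raise_dams":
        return 6.5 + max(0.0, -dp)   # fixed cost; pays off in wet decades
    if strategy == "portfolio":
        return 10.0 + 0.3 * abs(dp)  # costliest, but robust in either sign
    raise ValueError(strategy)

strategies = ["wait_and_see", "new_supply", "raise_dams", "portfolio"]

def minimax_regret(outcomes):
    costs = np.array([[cost(s, dp) for dp in outcomes] for s in strategies])
    regret = costs - costs.min(axis=0)  # regret vs. best choice per outcome
    return strategies[int(regret.max(axis=1).argmin())]

narrow = np.linspace(-7, 7, 29)   # truncated, CMIP5-like outcome range
wide = np.linspace(-21, 21, 29)   # fuller, large-ensemble-like range

print(minimax_regret(narrow), minimax_regret(wide))
```

With the narrow sample, a single targeted strategy minimizes worst-case regret; with the fuller sample, the robust portfolio does, illustrating how a truncated estimate of internal variability can remove options from the formal analysis entirely.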
The impulse to give decision makers the most likely outcome as the basis for sound policy still pervades discussions of how climate science can best inform climate policy (Hausfather & Peters, 2020; Lawrence et al., 2020). But such a focus on the most likely set of outcomes (rather than all possible outcomes) can wrongly give decision makers a "false sense of certainty", increasing vulnerability and exposure to climate damages and locking in costly decisions that will need to be revisited later on (Lawrence et al., 2020). Decision makers can only make flexible and robust decisions if they evaluate the set of responses that are beneficial across the broadest range of outcomes; this inherently requires that we scientists deemphasize our focus on the most likely outcome (the signal) and consider those outcomes that are typically dismissed as noise. The variability from large ensembles, in conjunction with the signal from multimodel ensembles, as we show above, could be a key tool for that work.

Addressing the Problems With Large Ensembles
While we have illustrated the value that large ensembles have to robust decision-making, they are not without their problems. Rightful and common critiques are (1) that they are expensive to run and archive, (2) that one can estimate the same range of irreducible uncertainty via other means, and (3) that they are biased, improperly estimating real-world internal variability. In this section, we examine each critique and consider how it should inform our evaluation of the continued scientific and policy value of large ensembles.

Figure 2 caption: The distribution in blue is the range due to internal variability across the CMIP5 (model uncertainty is removed following Hawkins and Sutton (2009) by estimating the residuals off of a fourth-order polynomial fit to each model realization). The distribution in red is the same but from seven initial condition large ensembles archived by NCAR, totaling 286 simulations. Both distributions are shifted to match a specified forced response, here taken as the CMIP5 ensemble mean for RCP8.5 (gray). To put these distributions in perspective, we use Global Precipitation Climatology Centre (GPCC) rainfall observations to show the western U.S. average decadal precipitation anomalies centered on the 1982-1983 Lake Powell overtopping (21.3%) and centered on the height of the 2012-2015 California drought (−7.5%). Under the figure are potential adaptation decisions that could be considered in response to particular magnitudes of change (dotted lines). The adaptations highlighted in green, for example, would not be considered without full estimates of internal variability; as such, water managers would not be able to "robustly adapt" to rainfall changes. Inspired by Dessai and Hulme (2004).

They Are Expensive
The CESM1-LE cost NCAR approximately 17 million core hours on the Yellowstone supercomputer, producing over 200 TB of archived model output (Kay et al., 2015). This is a massive resource investment of computing, data storage, and labor. NCAR could have used these resources elsewhere, and thus it is appropriate to ask whether it makes sense for a modeling center to invest in generating a large ensemble, particularly given the constraints and production schedules associated with CMIP. Certainly, modeling centers have to reconcile the generation of a large ensemble against competing scientific priorities such as higher resolutions or additional physical processes, both of which also improve our understanding of the real world. High-resolution simulations are also valuable to adaptation decision-making because they can better capture fine-scale processes and extremes (e.g., Diffenbaugh et al., 2005;Hall, 2014) at the scales most useful for decision-making. So why invest resources in a large ensemble? On this question we note three points: First, if one is in the model improvement business, as all modeling centers are, then knowing the model's irreducible uncertainty can be valuable as a benchmark. For instance, analysis of extreme climate outcomes in a large ensemble can help to elucidate the physical mechanisms that drive these outcomes, providing grounds to improve simulations, prediction, and model intercomparison. Consider the experience of model developers at NCAR, who discovered that their model, CESM2, had multiple equilibria in sea ice concentrations over the Labrador Sea that only emerged sporadically in their preindustrial control simulations, wreaking havoc on how best to initialize the model for other production runs associated with CMIP6 (Danabasoglu et al., 2020). 
While the underlying physics of overly extensive sea ice has not been identified, a solution to the challenge it presented to other production runs only emerged through the use of initial condition large ensembles of preindustrial simulations (Danabasoglu et al., 2020). Second, while higher resolutions are crucial to representing fine-scale processes that regulate the response of climate extremes to forcings (Diffenbaugh et al., 2005), there is mixed evidence as to whether higher resolutions necessarily increase a model's actual predictive skill (Scaife et al., 2019), which is what would be most valuable from a decision maker's perspective. Finally, there is the important question of the resolution-realization computational trade-off. Both higher resolutions and additional realizations are crucial, and each modeling center must determine its scientific priorities: The pursuit of higher-resolution simulations is motivated by different scientific aims than additional realizations, but given finite computing, one comes at the expense of the other. For example, if we take the simplistic question of computational resources for a standard CMIP-style component set (compset) of the CESM1, a factor N increase in resolution is far more expensive than N additional realizations (of order N^4 versus N, though the factor could be as large as N^5 due to the input/output of data). For example, assuming a base cost of a centennial-scale simulation of ~300,000 core hours, 10 model realizations are approximately 1,000 times cheaper than a factor 10 increase in resolution. This kind of comparison is inherently reductionist, and as such it does not compare the actual payoffs of investing in resolution versus realizations. Resolution increases could pay real scientific and decision-making dividends that vastly outweigh their costs (e.g., Schneider et al., 2017). But additional realizations are, as we note above, also prized.
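The trade-off arithmetic in the text reduces to a few lines, using the scalings and the assumed base cost stated above:

```python
# Back-of-envelope comparison: N additional realizations scale the cost
# by ~N, while a factor-N resolution increase scales it by ~N**4 (up to
# ~N**5 once input/output is included).
base_core_hours = 300_000  # assumed cost of one centennial-scale simulation
N = 10

cost_realizations = N * base_core_hours    # 10 ensemble members
cost_resolution = N**4 * base_core_hours   # 10x finer resolution
print(cost_resolution // cost_realizations)  # -> 1000
```

The factor of 1,000 is exactly N**3 for N = 10, independent of the assumed base cost.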
How might we overcome this resolution-realization trade-off?
Scientists are undertaking this challenge, synergistically combining large ensembles with high-resolution regional climate simulations to efficiently provide information for decision-making. For instance, what if you want to inform a decision maker of the dynamics and impacts of a 1-in-1,000-yr event? A single centennial-scale realization at high resolution has a less than 10% chance of including that event as part of its simulation. A 40-member large ensemble of centennial-scale realizations, like the CESM1-LE, has over a 98% chance of including that event. A modeler can then use that large ensemble realization to drive a Regional Climate Model (RCM) to produce the fine-scale features and impacts associated with such an anomalous event (see, e.g., https://usclivar.org/sites/default/files/meetings/2019/presentations/SWAIN-DANIEL-LE19.pdf). This also points to the importance of archiving (at a minimum) the variables needed to force an RCM as part of any large ensemble project.
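The event-sampling probabilities cited above follow from treating years as independent draws with a fixed annual exceedance probability, a simplifying assumption that ignores persistence and forced trends:

```python
# Chance of sampling a 1-in-1,000-yr event, assuming independent years
# with a fixed annual exceedance probability p.
p = 1.0 / 1000.0

single_run = 1 - (1 - p) ** 100        # one centennial realization
ensemble = 1 - (1 - p) ** (100 * 40)   # 40 centennial members (4,000 yrs)
print(f"single run: {single_run:.1%}, 40-member ensemble: {ensemble:.1%}")
```

This recovers the numbers in the text: just under a 10% chance for a single centennial run versus over 98% for the 40-member ensemble.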
More widely, new methods are being developed to answer the crucial question of the "necessary" ensemble size (Beusch et al., 2020; Coats & Mankin, 2016; Link et al., 2019; Milinski et al., 2019), and there are active discussions in the large ensemble community (US CLIVAR, 2020) about how best to balance the length of simulation, the number of ensemble members, and the degree of model complexity. Ultimately, it is for each modeling center to decide on the worth of a particular resource investment, as the number of required simulations, as others have noted, is a function of the question you are seeking to answer.

They Are Redundant
A second critique is that the range of irreducible uncertainty in future climate change can be estimated via other, less resource-intensive means. For example, recent research has shown that for certain variables one can recover nearly the same variability estimated from a large ensemble with a simple red noise model trained on a preindustrial control simulation or even observations (Thompson et al., 2015). Such an insight is powerful, but it does not dismiss the value of the large ensemble to science and decision-making for at least four reasons: First, statistical models do not allow for the same biogeophysical diagnosis and evaluation that process-based ones do. Indeed, a fundamental goal of process-based modeling is to produce a physically consistent system against which one can test hypotheses. While empirical models are useful for rigorous hypothesis testing, they are less useful for tracing the physical processes underpinning climate variability. Second, large ensembles are an ideal experimental testbed for statistical methods and approaches. Third, a large ensemble provides a robust estimate of the climate response to changes in boundary conditions like anthropogenic emissions (Deser et al., 2016; Selten et al., 2004). Finally, statistical modeling approaches tend to assume that internal variability is not itself altered by anthropogenic forcing (Coats & Mankin, 2016; Pendergrass et al., 2017). We examine these last three reasons more closely below.
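The red noise idea can be sketched as follows. Here both the "control simulation" and the emulator are synthetic AR(1) processes with invented parameters, so the example only illustrates the workflow of fitting a red noise model to a control run and emulating the spread of decadal means that a large ensemble would otherwise sample directly:

```python
import numpy as np

rng = np.random.default_rng(7)

def ar1(n, phi, sigma):
    """Generate an AR(1) ('red noise') series of length n."""
    x = np.zeros(n)
    for i in range(1, n):
        x[i] = phi * x[i - 1] + rng.normal(0, sigma)
    return x

# Stand-in "preindustrial control" run: synthetic AR(1), invented parameters.
control = ar1(2000, phi=0.6, sigma=1.0)

# Fit the red noise model to the control run...
phi_hat = np.corrcoef(control[:-1], control[1:])[0, 1]
sigma_hat = np.std(control[1:] - phi_hat * control[:-1])

# ...then emulate many realizations and collect decadal (10-sample) means.
emulated = np.array([ar1(110, phi_hat, sigma_hat)[-10:].mean()
                     for _ in range(500)])
print("emulated decadal-mean spread:", round(float(emulated.std()), 2))
```

The emulator reproduces the statistics of decadal variability cheaply, but, as noted above, it offers no biogeophysical mechanism to interrogate and assumes the variability is unchanged by forcing.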
Large ensembles provide a perfect model framework for testing statistical and other ensemble analysis approaches that can then be applied to multimodel ensembles and/or the real-world climate system-even for those approaches that suggest they supersede a large ensemble's value in estimating internal variability (e.g., Thompson et al., 2015). Using models as a testbed for cheaper analytical approaches has a long history in the sciences when evaluating, for example, the skill of a coarse numerical implementation in capturing the "perfect model" provided by an analytical solution (Machete & Smith, 2016). It has also long been applied by the weather and climate communities for everything from evaluating filtering schemes for the Navier-Stokes equations (e.g., De Stefano & Vasilyev, 2002), to data assimilation (e.g., Anderson et al., 2009), to numerical weather prediction and operational forecasting (see, e.g., the Developmental Testbed Center's Ensemble Testbed (DET): https://dtcenter.org/det/index.php) (Anderson, 1996;Gallo et al., 2017;Schwartz et al., 2019). Such a philosophy has naturally extended to large ensembles of climate simulations as they have become more widely available. One example of this is the testing of statistical methods to decompose signal and noise in the climate system, such as dynamical adjustment procedures to isolate thermodynamic from dynamic responses (e.g., Lehner et al., 2017;Saffioti et al., 2015;Sippel et al., 2019;Wallace et al., 2012). In the absence of large ensembles, we would not be able to vet the robustness of such approaches for assessing the climate response to greenhouse gas forcing.
Estimates of the climate system response to greenhouse gas emissions are critical from an adaptation decision standpoint, and large ensembles allow for a robust estimate of each model's forced response. This is crucial because even if one can estimate the distribution of internal variability through statistical methods, one still needs an estimate of the forced response to shift the center of mass of that distribution to its new future position. Typically, to assess the forced response, one averages across ensemble members and uses the resulting ensemble mean or fits a trend to that ensemble mean. Importantly, the ensemble mean is itself a function of the ensemble size, n: A larger ensemble will produce a more stable estimate of the forced response. This raises a question that climate scientists have long pondered: How many ensemble members does one need to robustly estimate the forced response (e.g., Daron & Stainforth, 2013; Drótos et al., 2015; Hawkins & Sutton, 2012; Hawkins et al., 2016; Maher et al., 2019; Milinski et al., 2019)?
We can combine the perfect model framework philosophy discussed above with a large ensemble to address this question. Here the framework involves applying standard techniques to randomly chosen subsets of a large ensemble (like calculating a mean centennial-scale change) to assess how well such subsets capture a model's "known" characteristics, where the "known" characteristics are those from the full ensemble. So, for example, we can assess the informational value of an additional ensemble member for estimating a known quantity, like the full 40-member ensemble mean, which is often taken as the forced signal of climate change.
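The subsampling exercise just described can be sketched as follows. The ensemble here is synthetic (a prescribed linear trend plus white noise standing in for CESM1-LE output), so the magnitudes and correlations are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n_members, n_years = 40, 41  # e.g., annual values for 2020-2060

# Synthetic ensemble: a common forced trend plus member-specific noise
years = np.arange(n_years)
ensemble = 0.5 * years + 5.0 * rng.standard_normal((n_members, n_years))

# Full-ensemble mean: the "known" forced response in this perfect model setup
truth = ensemble.mean(axis=0)

# Correlate subset ensemble means with the full-ensemble mean
order = rng.permutation(n_members)
results = {}
for n in (2, 10, 30):
    subset_mean = ensemble[order[:n]].mean(axis=0)
    results[n] = np.corrcoef(subset_mean, truth)[0, 1]
    print(f"n={n:2d}  r={results[n]:.3f}")
```

Repeating the loop for many random orderings of the members would yield the kind of convergence curves shown in Figures 3a and 3b, with the correlation to the full-ensemble mean rising toward 1 as n grows.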

10.1029/2020EF001610
We show such an analysis for a 40-member ensemble projection of snowpack (snow water equivalent, mm) just before springtime melt in the headwaters of the Ganges and Brahmaputra Rivers from the CESM1-LE (Figure 3). If one estimates the forced response as the time-evolving mean (Figure 3a), there is a high degree of variation in that estimate as a function of both the total number of ensemble members and the unique ensemble members used in the calculation. For example, in Figure 3a, one can see that it takes 10 or more ensemble members before there is reasonable convergence on the ensemble's "true" forced response, which here is the mean across all 40 members. Taking the Pearson correlation between each estimate of the forced response and the "true" forced response, some 30 ensemble members from the example in Figure 3a are needed to bring the correlation above 0.9 (Figure 3b). If instead one estimates the forced response as the linear trend on the ensemble mean, that measure too is dependent on the unique ensemble members and ensemble size used in its calculation (Figure 3c), though less so than for the time-evolving ensemble mean.
A common benchmark for assessing the ensemble size required to capture the forced response is the signal-to-noise ratio (S/N). Here we estimate the S/N in Himalayan snowpack as the ratio of the magnitude of the ensemble mean linear trend to the standard deviation across the ensemble (Figure 3d). We perform the exercise of estimating the S/N as a function of ensemble size two times (Examples 1 and 2), illustrating that the S/N, particularly for small ensemble sizes (<10), can vary drastically. Even if the S/N is valuable for detection and attribution studies despite these issues, it is less valuable from a decision-making standpoint. A decision maker in this snow-dependent region is likely to be less concerned with precisely when a significant signal in snow declines emerges than with the actual range of outcomes for which to be prepared.

Figure 3. Uncertainty in the forced response as a function of the ensemble size for Himalayan snowpack. In (a), we show an example of the time-evolving ensemble mean of Himalayan (map inset) snow water equivalent (mm) between 2020 and 2060 as estimated from different numbers of ensemble members (colors, see color bar in (b)) from the 40-member CESM1-LE. Note that the order in which ensemble members are added changes the evolution of the ensemble mean because the average is taken over a different subset of ensemble members. Based on the example in (a), we show in (b) that it can take over 30 ensemble members to have a Pearson correlation coefficient (r) with the "true" 40-member ensemble mean that exceeds 0.9. In (c), we show that a large number of ensemble members is also needed to estimate the "true" (full ensemble) forced response in the case where it is estimated as a linear trend of the ensemble mean. A typical metric of the number of ensemble members needed is the signal-to-noise ratio (S/N) (d). We show two examples of how the S/N in snow water equivalent can vary as a function of ensemble size, from only needing 2 ensemble members (Example 1, black line, gray bar) to achieve statistical significance ((S/N) × √n > 2), to 9 ensemble members (Example 2, blue line, blue bar). Yet in (e) it is clear that it can take more than 15 ensemble members in both cases to estimate the forced response (β̂) to within 5% of the forced response from all 40 ensemble members (β).
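The significance criterion used in Figure 3d, (S/N) × √n > 2, translates directly into a minimum ensemble size, n > (2 / (S/N))². A minimal sketch of that arithmetic (the two S/N values below are illustrative choices, not values read off the figure):

```python
import math

def members_needed(snr, z=2.0):
    """Smallest n satisfying snr * sqrt(n) > z, i.e., n > (z / snr)**2."""
    return math.floor((z / snr) ** 2) + 1

# Illustrative S/N values; a larger S/N requires fewer members
for snr in (1.5, 0.7):
    print(f"S/N = {snr}: {members_needed(snr)} members")
```

Note how a modest drop in S/N moves the required ensemble size from 2 to 9 members, mirroring the spread between Examples 1 and 2 above, and why small-sample S/N estimates are such a fragile basis for sizing an ensemble.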
We illustrate this point most clearly in Figure 3e, where we show how the estimate of the forced response, β̂, converges on the "true" forced response, β. In Example 1, it can take up to 15 or more ensemble members to get β̂ to fall within 5% of the linear trend calculated on the full 40-member ensemble mean (Figure 3e). From a robust decision-making standpoint, where the distribution of future outcomes is centered matters, as it helps determine the magnitude of changes in the extremes. In nonlinear systems like water management, this can matter greatly: modest linear increases in rainfall extremes, for example, could be sufficient to overtop dams or levees, causing nonlinear impacts, like floods.
While there are clearly diminishing marginal returns with each additional ensemble member (Figure 3e), they still contain important information about the forced response. This is also true of more sophisticated statistical approaches that could theoretically yield a faster convergence on the true forced response (such as nonlinear fits, time series decomposition, or dynamical adjustment). But it is rarely clear a priori how many ensemble members are needed to achieve convergence on the true forced response-one inevitably needs large ensembles to develop and test such approaches. Again, such exercises are not pedantic, because the information about the true forced response is essential from an impacts perspective: The full distribution of outcomes that must be adapted to is shaped not simply by its irreducible uncertainty but also by where you place its center of mass (i.e., the forced response). Such a lesson holds for the real world as well: We do not know the true forced response to climate change, which necessarily increases our reliance on robust model estimates. In particular, one's estimate of the forced response has a crucial influence over what constitutes changes in the most damaging tail risks, as we illustrate below.
Finally, estimating irreducible uncertainty from a control simulation of a model or from observations assumes that anthropogenic forcing does not project onto internal variability to change its characteristics. As models become more complex (Marvel et al., 2015), stationarity assumptions, particularly for hydroclimate, begin to fall apart, implying that there is little basis for assuming the irreducible uncertainty of the future will look like the irreducible uncertainty of the past. Even the CMIP5 archive (which, as we discuss, undersamples internal variability) projects that the dominant mode of ocean-atmosphere variability, El Niño-Southern Oscillation (ENSO), may change in the future, bringing both more El Niños and more La Niñas (Cai et al., 2018). This assumption of stationary variability also breaks down in the CESM1-LE, as we show at the grid point scale for multidecadal variability in persistent (35-yr) aridity using a common aridity metric, the Palmer Drought Severity Index, or PDSI (Figure 4) (Coats & Mankin, 2016; Cook et al., 2018). Over at least 12% of land area, it is nearly guaranteed that this assumption is untrue in the CESM1-LE (red colors). This can immensely impact our characterization of tail risks, such as multidecadal droughts, which are of utmost importance for robust decision-making (Figure 2) (Sutton, 2019).
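One simple way to probe the stationarity assumption in a large ensemble is to compare the across-member spread in an early window against a late window of the forced simulations. The sketch below does this on synthetic data in which the nonstationarity is imposed by construction, purely for illustration; window lengths and magnitudes are arbitrary choices of ours:

```python
import numpy as np

rng = np.random.default_rng(2)
n_members, n_years = 40, 180  # e.g., annual values for 1920-2100

# Synthetic ensemble whose internal variability grows under forcing
sigma = np.linspace(1.0, 1.6, n_years)  # imposed nonstationary spread
ensemble = sigma * rng.standard_normal((n_members, n_years))

early = ensemble[:, :35].std()   # pooled spread, early 35-yr window
late = ensemble[:, -35:].std()   # pooled spread, late 35-yr window

# Variance ratio well above 1 indicates variability amplified by forcing
ratio = late**2 / early**2
print(f"variance ratio (late/early): {ratio:.2f}")
```

Applied grid point by grid point to CESM1-LE PDSI, this kind of early-versus-late comparison is what reveals the regions in Figure 4 where stationarity fails.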

They Are Biased
A final critique points to the fact that the irreducible uncertainty from a large ensemble is itself a function of the model and, as such, is shaped by model uncertainty; that is, any estimate of internal variability from a model is biased. This is no doubt true, as these models are imperfect reductions of the real world. For some regions and quantities, climate models tend to overestimate variability, thereby leading to underestimates of both the S/N (Scaife & Smith, 2018) and the predictability of the real world (Eade et al., 2014). At the same time, climate models can underestimate variability on multidecadal and longer time scales (Laepple & Huybers, 2014). Large ensembles do not solve the problem of model biases. And yet, despite this issue, large ensembles are important in part because of what they are telling us about sources of real-world uncertainty. We do not have a perfect sense of the dividing line between model uncertainty and that from internal variability: Some of a large ensemble's internal variability is polluted by model uncertainty, making it overdispersive or underdispersive. But so too in the CMIP5, some of what is being called model uncertainty is actually attributable to internal variability. Without large ensembles, we would not know this. Large ensembles can help diagnose these biases and help scientists draw the decision-critical dividing line between model and irreducible uncertainty. They also enable the search for robust emergent constraints that allow scientists to reduce the spread in future projections (Hall et al.). Such constraints are critical when it comes to making costly and long-term adaptation decisions, like hydrologic infrastructure, which can be too costly to build for all trajectories emerging from climate models (Vano et al., 2018).
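The dividing line discussed here can be made concrete: given several models that each provide a large ensemble, one can partition the spread in a projected quantity into the variance across models' ensemble means (model uncertainty) and the mean within-model, across-member variance (internal variability). The sketch below uses invented numbers of models, members, and magnitudes purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic end-of-century trends: 5 "models" with 20 members each
n_models, n_members = 5, 20
model_forced = rng.normal(1.0, 0.3, n_models)  # differing forced responses
trends = model_forced[:, None] + rng.normal(0.0, 0.2, (n_models, n_members))

model_unc = trends.mean(axis=1).var()  # spread of the ensemble means
internal = trends.var(axis=1).mean()   # mean within-model spread

total = trends.var()
print(model_unc, internal, total)
```

With equal ensemble sizes and population variances, the identity model_unc + internal = total holds exactly (the law of total variance), which is what lets large ensembles attribute how much of a multimodel archive's spread is actually internal variability rather than model disagreement.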

Moving Forward
How do we, therefore, as Bertrand Russell asked, "live without certainty, and yet without being paralyzed by hesitation?" (Russell, B.: History of Western Philosophy, 2nd ed., London: Allen & Unwin, 1961, p. 14). While the observation that model bias projects onto a model's representation of internal variability poses a new challenge, it is a challenge for which large ensembles are likely needed. Tools under active development, like the Observational Large Ensemble (McKinnon et al., 2017; McKinnon & Deser, 2018), enable scientists to assess a large ensemble's internal variability vis-à-vis a comparable estimate of observed variability. While an observational large ensemble cannot by itself evaluate a large ensemble's future variability, it could do so in combination with estimates of nonstationary variability from the large ensemble itself (e.g., Poppick et al., 2016). Increasingly, climate scientists are combining analyses of large and multimodel ensembles in an effort to better attribute uncertainty in impacts analyses (see, e.g., https://usclivar.org/workinggroups/large-ensemble-working-group). As such, large ensembles are not a perfect remedy unto themselves but instead are part of a wider constellation of tools to best position decision-making. In places where they show skill, initialized ensembles for decadal prediction will likely be part of such a toolbox. They enable reductions in the otherwise irreducible uncertainty from internal variability for lead times of several years (e.g., Hawkins & Sutton, 2009; Simpson et al., 2019; Yeager et al., 2018) and can provide value for decision makers. Existing uninitialized ensembles can in turn be used to search for analogs that complement initialized prediction efforts (Ding et al., 2018), ultimately leading to a more effective use of limited computing resources. A philosophy of "model for purpose" has prevailed in climate model experimental designs; perhaps this maxim should be extended to include "ensemble for purpose".
What is clear is that given society's commitment to warming, it is crucial to inform decision makers regarding the irreducible uncertainty in the climate system (Pielke, 2003), with an emphasis on how this can shape effective and sustainable climate risk management (Kunreuther et al., 2013; Stern et al., 2013). People consistently make decisions under conditions of deep uncertainty, and climate adaptation should be no different. In fact, decision makers increasingly have tools to optimize decisions under uncertainty, such as robust decision-making (Lempert & Collins, 2007; Lempert & Groves, 2010; Lempert et al., 2013). There is a danger in claiming certainty about inherently uncertain things, as it removes human agency to mitigate risks.

Uncertainty, therefore, is not a dirty word: it simply means that many outcomes are consistent with expectations (Dessai et al., 2009). Climate science works to update our expectations about the varied forms regional climates can take, all of which are consistent with the same warming world. Until recently, however, it was hard to appreciate the potential magnitude of this uncertainty. This is in part because the different answers provided by multimodel ensembles could be dismissed as due to model choices made by scientists, rather than the irreducible uncertainty that exists in the real world. But the scientific community, largely through computational advances, is now able to better characterize real-world uncertainty in regional climate change through experiments like initial condition large ensembles. From a decision standpoint, we know it is a problem if science is not getting the full range of outcomes correct. Initial condition large ensembles are thus a crucial tool to inform necessary decisions as the world warms.