Comparison of Equilibrium Climate Sensitivity Estimates From Slab Ocean, 150‐Year, and Longer Simulations

We compare equilibrium climate sensitivity (ECS) estimates from pairs of long (≥800‐year) control and abruptly quadrupled CO2 simulations with shorter (150‐ and 300‐year) coupled atmosphere‐ocean simulations and slab ocean models (SOMs). Consistent with previous work, ECS estimates from shorter coupled simulations based on annual averages for years 1–150 underestimate those from SOM (−8% ± 13%) and long (−14% ± 8%) simulations. Analysis of only years 21–150 improved agreement with SOM (−2% ± 14%) and long (−8% ± 10%) estimates. Use of pentadal averages for years 51–150 results in improved agreement with long simulations (−4% ± 11%). While ECS estimates from current generation U.S. models based on SOM and coupled annual averages of years 1–150 range from 2.6°C to 5.3°C, estimates based longer simulations of the same models range from 3.2°C to 7.0°C. Such variations between methods argues for caution in comparison and interpretation of ECS estimates across models.


Introduction
Two primary metrics of idealized global climate model 2-m air temperature response of CO 2 greenhouse radiative forcing are the Transient Climate Response (TCR) to 1%CO 2 year −1 increase at doubling, and the equilibrium climate sensitivity (ECS) to CO 2 increase to a long term equilibrium doubling. ECS was originally estimated at 3°C ± 1.5°C (Charney et al., 1979) and has continued to serve as a fundamental metric of climate model behavior over the last four decades as estimated by the Intergovernmental Panel on Climate Change (IPCC) Fifth Assessment with "high confidence that ECS is extremely unlikely less than 1°C and medium confidence that the ECS is likely between 1.5°C and 4.5°C and very unlikely greater than 6°C." (Bindoff et al., 2013). While both TCR and ECS are defined as a temperature change from CO 2 doubling, the TCR is easily calculable in an idealized model framework as the global warming at the time of doubling (average of years 61-80), while the ECS of a given model is only fully known after that model has simulated control and doubled or quadrupled CO 2 over the long timescales of ocean heat uptake and after the sea surface temperature (SST) response has come to equilibrium-usually after several millennia of simulation (Krasting et al., 2018;Paynter et al., 2018;Rugenstein et al., 2019Rugenstein et al., , 2020. While TCR is generally considered to more closely resemble the incremental (rather than abrupt) historical and projected CO 2 increase, ECS has been found to display a more robust representation of regional temperature change patterns than TCR (Grose et al., 2018) and is also highly useful both as a fundamental metric of model response and for the calibration of integrated assessment models (Calel & Stainforth, 2017). The relationship between TCR and ECS depends on several factors including the rate and pattern of both surface and interior ocean warming. Further, ECS has a long history of use, through all the IPCC reports and back to the late 1960s (e.g., Möller, 1963) and serves as an integrated high-level metric of the climate system that spans multiple generations of climate model development. A full contextual analysis of the value of the TCR and ECS concepts both historically and in the ongoing IPCC Sixth Assessment is provided in Meehl et al. (2020).
To estimate ECS without the computational expenditure of multimillennial simulations, several strategies have been used. The first was the "slab" ocean model (SOM) or "mixed layer" approach (Manabe & Stouffer, 1979, 1980 where the atmospheric model is coupled to a simple mixed layer ocean-sea ice model. Early SOM heat flux patterns were derived from observational climatologies while later SOMs where constructed from equilibrated coupled atmosphere-ocean simulations to more adequately reflect coupled model behavior (Bitz et al., 2012). As the SOM with specified lateral and deep ocean heat flux pattern comes rapidly to equilibrium, this approach requires far shorter simulations but has the uncertainty of estimating with a fundamentally different ocean component (Danabasoglu & Gent, 2009;Hansen et al., 1985Hansen et al., , 1997 and it assumes no changes in heat transport by the world oceans. An alternative approach that uses shorter runs and extrapolates to equilibrium was put forth by Gregory et al. (2004) and applied to Coupled Model Intercomparison Project Phase 5 (CMIP5) models by Andrews et al. (2012) and is alternatively referred to as the "Effective Climate Sensitivity". In this approach, one conducts two simulations of at least 150 years -a control run and an abrupt quadrupling of CO 2 -and regresses the difference in net radiative flux at the top of the atmosphere (ΔF) versus the change in global surface air temperature (ΔT) to extrapolate to the hypothetical radiative balance at equilibrium. Danabasoglu and Gent (2009) estimated the one sigma uncertainty in ECS estimates of approximately 0.18°C (8%) for CCSM3. Several studies have demonstrated the limitations of this approach highlighting the multiple timescales of ocean adjustment (Frölicher et al., 2014;Paynter et al., 2018) and the need to run models out longer than 150 years to achieve a robust estimate of ECS (Gregory et al., 2004). Nonlinearity of the relationship between ΔT and temporal and spatial variation in ocean heat uptake causes extrapolation methods to underestimate the ECS but with decreasing error as the integration lengthens (Armour, 2017;Armour et al., 2013;Ceppi & Gregory, 2017;Senior & Mitchell, 2000;Winton et al., 2010). Ocean heat uptake influences the pattern of surface temperature (Haugstad et al., 2017), which in turn determines the strength of climate feedback due to the spatially heterogeneous nature of these feedbacks (Armour et al., 2013). Specifically, the increase in feedback with time appears to be in large part due to the movement of the pattern of warming away from regions of tropical convection, regions that tends to induce particularly negative climate feedbacks Dong et al., 2019;Zhou et al., 2017). Feedback temperature dependence, as mentioned above, can also change the slope of ΔF against ΔT. Geoffroy et al. (2013) emulate the CMIP5 model nonlinearity of global temperature/heat uptake response to step forcing with a two box (two timescale) model. The kink in this adjustment occurs after the fast timescale adjustment with an e-folding time of about 4 years. Including the initial fast timescale adjustment with its steeper slope in the regression biases the ECS estimate low. Geoffroy et al. (2013) find an average long timescale e-folding time of 290 years for the CMIP5 models but is limited by the analysis having been based on only the first 150 years. Andrews et al. (2015) demonstrated that linearly fitting only years 21-150 increased the ECS estimate. Alternative methods include fitting functions with two or three exponentials (Proistosescu & Huybers, 2017), specific simulation setups (Saint-Martin et al., 2019), and the local tangent approach (Rugenstein et al., 2016). Recently, Rugenstein et al. (2020) showed through a 15-model intercomparison that the Gregory et al. (2004) method underestimated the long-term estimate by a median 17%.
Another weakness of the Gregory et al. (2004) and Andrews et al. (2012) methods relates to enhanced regression uncertainty in CMIP6 models as they increasingly capture climate modes of variability and their teleconnections. While only a few CMIP5 models were capable of accurately representing the role of ElNiño-Southern Oscillation (ENSO) on global temperature variability, modeling centers have since successfully represented not only ENSO but other modes of variability including Madden-Julian Oscillation and Pacific Decadal Oscillation (Eyring et al., 2019), and multidecadal to centennial modes (e.g., Zhang et al., 2019). Preprocessing the data by taking long averages before performing the regression filters out some of this low-frequency variability. For example, the method of Winton et al. (2020) uses 50-year-binned averages of ΔF and ΔT before the regression is applied to better capture the forced response and avoid biasing the result with the different relationships between ΔF, Ocean heat uptake and ΔT relationship from natural internal variability such as ENSO. The first heat uptake/temperature pair is discarded and the remaining five that are available in the 300-year simulation are used in the regression.
The first 50 years are discarded to remove a period of SST adjustment during which a pattern of relatively reduced warming emerges in the subpolar North Atlantic and Southern Ocean . Although the fast mode of global surface temperature adjustment takes place with an e-folding timescale of about 4 years (Geoffroy et al., 2013), high-latitude adjustments-including changes in deep water circulation-are multidecadal or longer as the evolving SST response pattern changes the relationship between surface warming and high-latitude ocean heat uptake .
One of the central experiments for the sixth phase of the CMIP6 Diagnostic, Evaluation and characterization of Klima (DECK)  experiments is an abrupt quadrupling of atmospheric CO 2 run out for 150 years to estimate ECS, precluding the Winton et al. (2020) approach. The current approach used in ESMvalTool  to estimate ECS is that of Gregory et al. (2004) and Andrews et al. (2012) in which least squares regression is conducted on the full 150 years using annual values of ΔF and ΔT. Making use of several previous generation models that have been run out to equilibrium and more recent ones run out 300 years, we are able to provide both a quantitative multimodel assessment of the Gregory et al. (2004) and Andrews et al. (2012) methods and provide an alternative approach for an improved estimate of the derived ECS among current generation U.S. models. However, we also note that the Andrews et al. (2012) method remains superior compared to SOM estimates in this analysis.
Building on previous work (Krasting et al., 2018;Paynter et al., 2018;Rugenstein et al., 2019Rugenstein et al., , 2020Winton, Adcroft, et al., 2013), the central factors of concern in the present study with respect to abrupt 4xCO 2 changes in radiative forcing are as follows: (1) Do models return to radiative balance and, if so, how long does it take?
(2) Over what timescale (if ever) does the approach to equilibrium to become quasilinear? And (3) how long of a temporal average is required to remove the confounding role of model internal variability. While the first question on decadal scales is largely answered in Winton, Adcroft, et al. (2013) and on millennial scales in Paynter et al. (2018) and Krasting et al. (2018), the present study takes the practical step of translating that understanding into improved analysis of the current short (150-year) CMIP6 DECK simulations to estimate the ECS achieved from multiple century to multimillennial simulations.

Methods
We take advantage of the combination of previous generation models that were contributed to LongRunMIP (Rugenstein et al., 2019 along with the University of Arizona implementation of the Manabe Climate Model (MCM_UA) based on the GFDL R30c and MOM1 component models described in Delworth et al. (2002) with air-sea flux adjustment to reproduce SSTs and a fixed 40-m mixed layer, GFDL's second generation of climate model GFDL ESM2G (Dunne et al., 2012) based on GFDL's CM2.1 . We also take advantage of GFDL's fourth-generation model development products, GFDL CM4 (Held et al., 2019) and GFDL ESM4.1 (Dunne, Horowitz, et al., 2020). The climate sensitivity in CM4 has been previously documented in Winton et al. (2020). Additionally, long simulations with Goddard Institute for Space Studies GISS E2.1-G (2091 years) (Kelley et al., 2020), the National Center for Atmospheric Research NCAR CESM2(CAM6) (898 years) Gettelman et al., 2019), and the Department of Energy DOE E3SMv1 (300 years) (Golaz et al., 2019) are used.

Data
Data from CMIP6 models can be found on the Earth System Grid Federation (https://esgf-node.llnl.gov/projects/esgf-llnl/). All global annual values for temperature and net radiation at the top of the atmosphere for control and 4xCO 2 runs used in this study as well as Matlab™ scripts to analyze the data are supplied as supporting information. LongRunMIP data are from Rugenstein et al. (2020).

Results
A comparison of different methods of estimating ECS for a suite of climate models is provided in Table 1. Among models for which both slab ocean model (SOM; Column 2) and long (>800-year regressing 50year-binned averages; hereafter long; Column 3) coupled atmosphere-ocean estimates are available, SOM estimates tend to be 6% ± 7% lower than long estimates (Figure 2, upper left). One example of significant disagreement is NCAR CESM2(CAM6) for which the long estimate is 1.4°C higher than the SOM based estimate. Early analysis suggests this is a result of the cloud response to the warming surface in NCAR CESM2(CAM6) Gettelman et al., 2019). Several studies have demonstrated a strong increase in ECS estimates with warming climate with 4xCO 2 perturbations often giving a higher ECS than 2xCO 2 experiments (e.g., Bloch-Johnson et al., 2015;Meraner et al., 2013;Rohrschneider et al., 2019). As many of the SOM estimates come from 2xCO 2 experiments whereas all of the fully coupled simulations come from 4xCO 2 experiments, this nonlinearity could explain this result.  (Manabe & Stouffer, 1979) >800 years (Senior & Mitchell, 2000) 50yr 51-300  1yr 1-150 (Gregory et al., 2004) 1yr 21-150  5yr   Note. The 1σ uncertainties associated with the regression are provided for models not already contributed to LongRunMIP. Columns that represent the long term (≥800 years regressing 50-year-binned averages; hereafter long) with values from LongRunMIP provided in parentheses are provided along with 1-year averages over years 1-150 (1yr 1-150 ) (Gregory et al., 2004), 1-year averages over years 21-150 (1yr 21-150 ) , regressing 50-year-binned averages over years 51-300 (50yr 51-300 ) ,

Geophysical Research Letters
When we compute 50yr 51-300 (Column 4 in Table 1) following Winton et al. (2020) from 300-year simulations in which the first 50 years is ignored and the slope/intercept is calculated by regressing 50-year-binned averages (50yr 51-300 , fourth column), we find good correspondence with long ECS with 50yr 51-300 tending to underestimate long ECS by approximately 5% ± 5% with the exception of FAMOUS, which gave a 21% lower 50yr 51-300 estimate than its long estimate. The lack of convergence of ECS in FAMOUS is discussed in Rugenstein et al. (2020) but could not be explained. As this model displays a fundamentally different and unexplained behavior than the other models, it is excluded from subsequent analysis in this study. Similar to NCAR CESM2(CAM6) , GFDL CM4 also exhibits a much lower ECS (0.8°C) based on SOM than 50yr 51-300 . We also note that this analysis-using a linear regression of the entire 1-to 300-year period to reference the control-gets a slightly lower estimate for GFDL CM4 (4.93°C) than the value of 5.0°C provided in Winton et al. (2020). This is due to their having referenced individual 50-year time differences from the control and ignoring data corresponding to the first 50 years of perturbation. Overall, we found that differences in treatment of the control drift resulted in relatively small ECS estimates (<0.1°C) using year 51-300 analyses.
Because the standard simulation time of the abrupt 4xCO 2 simulations in the CMIP6 experimental design is only 150 years, we next turn our attention to comparability of alternative methods for performing such calculations from the CMIP6 multimodel ensemble. For all estimates of ECS based on the first 150 years alone, we applied a single linear estimate of the 0-300 year control drift for all ECS calculations as we found the uncertainty control drift to inflate markedly when restricted to year 1-150 and 51-150 analysis. We find the annual-150 year method (1 yr 1-150 ; Column 5 in Table 1) (Andrews et al., 2012;Gregory et al., 2004) slightly overestimates long ECS for the first generation MCM_UA but strongly underestimates by approximately 0.4-1.7°C, or −18% ± 5% of 50 yr 51-300 among the more recent models analyzed here. Important to note is that MCM_UA differs from all of the other models considered here in not having an explicit mixed layer but rather a fixed 40 m mixed layer depth. As such, it is unable to represent the immediate surfacewarming based shoaling and reduction of ventilation of the upper ocean that, in all the other models, leads to a latitudinal shift in sea surface warming away from the tropics toward higher latitudes and results in a strong initial ΔF:ΔT slope that subsides as warming propagates into the ocean interior and surface stratification subsides (Cubasch et al., 1992). Held et al. (2010) found that surface ocean warming initially was focused in the tropics while higher-latitude warming occurred later except for the North Atlantic subpolar region which cooling.
Efforts to remove the role of the initial response of the 150-year runs have been proposed. We find that the revised 150 year method using annual averages but ignoring the first 20 years proposed by Andrews et al. (2015;Column 6) increases the ECS estimate on average by 8% ± 6% and removing the first 50 years in the annual analysis (1yr 51-150 ; Column 6 in Table 1) leads to further increase of 3% ± 5%. We found that ignoring more than 50 years led to considerable degradation in the reliability of the regression. Ignoring initial slopes associated with ocean equilibration timescales is not the only challenge, however, as current generation climate models include representation of a complex combination of interannual, decadal and centennial scale modes of variability. We find that the ECS underestimate can be further reduced when both the first 50 years are ignored and the annual data is averaged into pentads. An example visual comparison of these methods is provided in Figure 1 for the case of GFDL CM4 .
Excluding FAMOUS, we find that the original Gregory et al. (2004) method (1yr 1-150 ) underestimates the long estimate by an average of −14% ± 8% (Figure 2, middle left panel), while the Andrews et al. (2012) method (1yr 21-150 ) is less biased relative to the long estimate but with a larger uncertainty (−8% ± 10%; Figure 2, middle right panel). We find an improved correspondence to the long estimate when 5-year averages are first calculated before linear regression is performed (5yr 51-150 ; Table 1, Column 7; Figure 2,  Table 1 for GFDL CM4 model from the regression of the difference in net radiation at the top of the atmosphere in the 4xCO 2 simulation from the control (ΔF; W m −2 ) versus the difference in 2-m air temperature as annual values from those simulations for the first 150 years (black x), and next 150 years (blue +) along with 50-year averages (red o) and regression using 1yr 1-150 (black solid), 50yr 51-300 (red solid), 1yr 21-150 (black dot), and 5yr 51-150 (black dash) methods.

Geophysical Research Letters
lower left panel) with underestimation of the long estimate but slightly larger uncertainty of −4% ± 11%. The overall results are shown in Figure 2 that illustrates that the 5yr 51-150 tends to follow the 1:1 line when compared to long simulations (Figure 2, lower left panel), whereas the 1yr 1-150 approach follows more closely to the 0.85:1 line (middle left panel). It is important to note that this analysis suggests that some of the higher ECS estimated from 300-year simulations is due to processes that begin to manifest within the first 150 years but are potentially masked by the early response and that the value of running the simulations out to 300 years is not to uncover a differing response from the 51-to 150-year period but rather to boost the signal to noise in the result. However, it is also clear that some long term processes are also at work in some models such as FAMOUS that serve to further elevate ECS and may result in a biased underestimate in the final equilibrium value (e.g., Bloch-Johnson et al., 2015;Meraner et al., 2013;Rohrschneider et al., 2019).
Alternatively, when the SOM estimate is used as the true value ( Figure 2, lower right panel), it is the 1yr 21-150 method that follows more closely the 1:1 line with a low bias of −2% ± 14%, whereas the 5yr 51-150 method gives a high bias of 4% ± 15%. While the NCAR CESM2(CAM6) and DOE E3SMv1 models give similar ECS of 5.3°C with the 1yr 1-150 method, NCAR CESM2(CAM6) gives a significantly higher ECS with both 5yr 51-150 and 50yr 51-300 methods (6.66°C and 6.53°C, respectively) while DOE E3SMv1 gives a much higher ECS with 50yr 51-300 (7.02°C) than 5yr 51-150 (5.96°C) methods, highlighting the potential for significant differences in results between methods across models. Further, while the 5yr 51-150 method appears superior to the 1yr 1-150 approach in 16 of 21 cases, the 5yr 51-150 approach strongly overestimates the long estimate in the case of MCM_UA, ECHAM5 MPIOM , and HAD GEM2 . As such, the above diversity in model behavior should serve as caution in interpreting the uncertainty associated with each method and the potential role of a suite of factors including the nonlinearity of CO 2 response, lack of equilibration of initial and final states, and long-term feedbacks associated with adjustments in ocean circulation.
We also conducted a suite of sensitivity studies varying both the averaging windows from 1 to 50 years, and length of initial simulation ignored from 0 to 50 years. As discussed in the supporting information, we found little advantage to increasing the averaging window without ignoring an initial segment. In contrast, we found that including an averaging window of 5 years filtered out most of the interannual variability when the first 50 years of simulation was excluded. Further, we found that fidelity declined slightly as the averaging window increased beyond 5 years. We attribute this slight loss in fidelity as an effective decrease in the span along the x axis from 95 years with the 5-year window to 50 years with the 50-year window. Overall, we found that analysis of 150-year simulation results largely converged with analysis of 300-year simulation when the first 50 years were excluded both with an averaging window of 1 year, and slightly more so with an averaging window of 5 years.

Conclusions
We find that much of the character of the long-term behavior of the ECS estimates can be captured with exclusion of the first third of a 150-year simulation and calculating 5-year averages before least squares regression for an ECS estimate with only slight underestimation (−4% ± 11%). With the original method of Gregory et al. (2004), however, we find an underestimation of −14% ± 8% of ECS compared to those estimated from long (>800 year) runs. Using the modified method of Andrews et al. (2012), we find a smaller underestimation of −8% ± 10% compared to those estimated from long runs but good agreement (−2% ± 14%) with SOM-based estimates. One interpretation of these results is that the CMIP6 experimental design significantly underestimates the long ECS with CMIP6 class models, but that this deficiency can be largely addressed using the modified 5yr 51-150 method excluding the initial part of the simulation and taking pentadal averages of years 51-150 to calculate the temperature at radiative balance. As such, we find a large range of ECS estimates among current generation U.S. models from 2.6°C using the 1yr 21-150 method for GFDL ESM4.1 to 7.0°C using the 50yr 51-300 method for DOE E3SM1 -well outside the IPCC assessment that ECS is very unlikely greater than 6°C (Bindoff et al., 2013). However, we also find evidence that estimates from long abrupt 4xCO 2 simulations are significantly higher than SOM estimates as well as considerable divergence in the relationships between different methods for different models. There might also be an additional discrepancy due to the typical use of 2xCO 2 simulations for SOM estimates but 4xCO 2 simulations for coupled estimates. Under the assumption that the global surface air temperature responds linearly to an increase in atmospheric CO 2 , the 4xCO 2 and the 2xCO 2 should give the same climate sensitivity. In practice the linear assumption is not strictly satisfied (Jonko et al., 2013) as forcing has been shown to be supralogarithmically dependent on the CO 2 concentration (Byrne & Goldblatt, 2014;Etminan et al., 2016;Gregory et al., 2015) along with other arguments for nonlinearity in the temperature dependence of radiative feedbacks (e.g., Bloch-Johnson et al., 2015;Meraner et al., 2013;Rohrschneider et al., 2019). As such, we argue that more research should be done to standardize methods to estimate ECS through a more comprehensive comparison of ECS through both multimillennial climate perturbation simulations such as conducted in LongRunMIP (Rugenstein et al., 2019) and slab ocean model comparisons to better understand the causes of these differences and derive a more robust estimate of climate sensitivity from current generation models.

Geophysical Research Letters
Note also that the concept of ECS is predicated on the initial and final states both being in equilibrium. With computationally intensive models such as those used in CMIP6, models have typically not spun up the ocean to equilibrium for practical considerations. As such, while the surface temperature was stable, the deep ocean thermal status was continuing to evolve. When such a system is perturbed, it may respond differently, at least in transient (e.g., He et al., 2017), to the equilibrium of a slab ocean model.
Whereas ECS began as a convenient idealized model construct (e.g., Charney et al., 1979), it has emerged as a routine test of models as if ECS could be measured and interpreted precisely and accurately. We argue that the concept of ECS should be considered more notional than absolute and useful more in idealized studies of relative sensitivity with the understanding absolute value of this metric will depend on the state of the system and the nature of the imposed forcing with different methods of estimating ECS accessing different feedbacks.

Data Availability Statement
Data used in this analysis as well as analysis scripts are provided in the supporting information and is available at Dunne, Winton, et al. (2020; http://doi.org/10.5281/zenodo.3932257).