Diagnosing Transient Response to CO2 Forcing in Coupled Atmosphere‐Ocean Model Experiments Using a Climate Model Emulator

A climate model emulator that mimics an ensemble of state‐of‐the‐art coupled climate models has been used for probabilistic climate projections. To emulate and compare the latest and previous multimodel ensembles, this study establishes a new method to diagnose a set of parameters of effective radiative forcing, feedback, and impulse response functions by fitting a minimal emulator to time series of individual models in response to step‐ and ramp‐shaped CO2 forcing up to a quadrupling concentration level. The diagnosed CO2 forcing is scaled down to a doubling level, leading to an unbiased estimate of equilibrium climate sensitivity. The average climate sensitivity of the latest ensemble is 18% and 13% greater than that of the previous ensemble for equilibrium and transient states. Although these increases are subject to data availability, the latter smaller rate is significant and is consistent with the relationship between feedback strength and response timescales.


Introduction
One of the main concerns in climate mitigation is whether different emission pathways meet, with some probability, a given temperature goal for global warming relative to preindustrial levels (Clarke et al., 2014;Rogelj et al., 2018). Such an assessment is derived from probabilistic climate projections using a climate model emulator (e.g., Meinshausen et al., 2011;Smith et al., 2018) that mimics the ensemble behavior of coupled atmosphere-ocean general circulation models (AOGCMs).
Although there are various types of emulators that can deal with Earth system components and climate forcing agents like greenhouse gases and aerosols with different complexities, they are, in many cases, based on the linearized global energy balance where t is time, N is the top-of-atmosphere energy imbalance (downward positive), assumed to be equivalent to ocean heat uptake, F is effective radiative forcing (ERF), λ is the climate feedback parameter, and T S is the surface temperature anomaly. ERF is defined such that relatively rapid responses in the troposphere and land surface are treated as forcing rather than feedback, and that temperature response is well represented by the above equation (Sherwood et al., 2015).
ERF and λ in an AOGCM experiment with CO 2 forcing can be estimated from the model's transient response to instantaneous CO 2 increase, typically a quadrupling of the CO 2 concentration in the atmosphere. A common approach to distinguish between forcing and response is to linearly regress N onto T S to determine the two parameters (Gregory et al., 2004). The ratio of ERF to λ is a measure of equilibrium climate sensitivity (ECS), which is usually scaled to a doubling of the CO 2 concentration. In this regression method, ECS is given as half of the ERF-to-λ ratio at the quadrupling level (Andrews et al., 2012), assuming the logarithmic nature of CO 2 forcing (Myhre et al., 1998). The feedback parameter estimated in this way can be used to estimate ERF in more general scenario experiments, including non-CO 2 forcing agents (Forster et al., 2013).
The present study proposes another diagnostic method in the forcing-response framework using a minimal emulator. This is an extension of previous work by the author (Tsutsui, 2017) and serves as an alternative to the regression method. Both approaches are based on equation (1), and the resulting estimates of ERF and λ are not significantly different. However, the emulator method deals with a more general parameter set that encompasses transient responses. It is directly linked to applied studies using the emulator (e.g., Smith et al., 2015;van Vuuren et al., 2016) and has an advantage in ensuring methodological consistency.
ERF can also be estimated from a perturbed experiment that keeps sea surface temperature (SST) fixed under a given forcing change (Hansen et al., 2005). Whereas this fixed SST method is simple and useful for diagnosing forcing (Forster et al., 2016), the regression method, which requires a longer time integration with a full ocean model, enables to separate forcing and response. Similarly, whereas the latter has such usefulness, the emulator method, which is a little more elaborate, enables explicitly dealing with ocean heat uptake. The choice of methodology is guided by the purpose and experimental design. In addition, λ can be estimated from individual feedback terms by using radiative kernels (Chung & Soden, 2015), which is outside the scope of this study.
In order to demonstrate the emulator method, this study diagnoses and compares two multimodel ensembles (MMEs) from the latest (Phase 6) and the previous (Phase 5) Coupled Model Intercomparison Program (CMIP6, Eyring et al., 2016;CMIP5, Taylor et al., 2012), focusing on climate sensitivity in a transient as well as equilibrium state. The transient climate response (TCR) is used as a measure of transient sensitivity, defined as a realized warming at the time of doubling in an idealized CO 2 scenario increasing at 1% per year. The number of models is 25 for the CMIP5 and, at the time of writing, 22 for the CMIP6 (supporting information Table S1), both of which are subject to data availability for the CMIP core experiments, the instantaneous quadrupling, 1%-per-year increase, and preindustrial control, named abrupt-4xCO2, 1pctCO2, and piControl (Eyring et al., 2016).
The rest of this article is as follows. Section 2 describes the emulator used in this study. Section 3 presents the performance of the emulator method and compares diagnostic results between the two MMEs. Section 4 discusses the significance of this method and new insights into climate sensitivity. Finally, section 5 makes concluding remarks.

Methods
This study uses the following impulse response model (IRM) as a solution of a three-layer box model where T k is the layer-k temperature anomaly; F is ERF; τ i is the ith element of three separate time constants, typically, 1, 10, and 200-300 years; and A ki is the fraction of the ith contribution to the layer-k anomaly (∑ i A ki ¼ 1). Layer S represents the surface, and layers 1 and 2 are equivalent to the intermediate and deep ocean. Three-layer implementation has an advantage over widely used two layers when considering short-term (up to about 15 years) response to instantaneous forcing changes of abrupt-4xCO2 and others including volcanic eruptions and geoengineering mitigation. The sum of the three exponentials divided by λ corresponds to the impulse response function of this forcing-response system. Supporting information Text S1 describes the relationship between the IRM parameters and the box model parameters with several useful equations for understanding the system's response. The impulse response is the derivative of the step response, and the above equation is essentially the same as another integral using a step response kernel (Good et al., 2011;Larson & Portmann, 2015).
The system has six independent parameters: three separate time constants (τ i ) and their corresponding scaled fractions (A Si /λ) of the surface anomaly. The fractions of the subsurface anomalies (A 1i ,A 2i ) as well as the physical parameters of the box model are derived from the six parameters.
Equations (1) and (2) allow emulation of N and T S time series to arbitrary forcing changes in an AOGCM experiment using a set of forcing and response parameters tuned to the specific AOGCM. Tuning, here, is achieved by curve fitting that minimizes the squared sum of differences between the AOGCM and the IRM time series.
Here the curve fitting was simultaneously conducted for the N and T S series from the two CO 2 experiments (abrupt-4xCO2 and 1pctCO2), where the AOGCM-IRM differences were scaled by the standard deviation of the control (piControl) series in terms of annual variations. The abrupt-4xCO2 experiment produces a step response for 150 years or more to an instantaneous quadrupling, which is useful for separating time constants. The 1pctCO2 experiment produces a ramp response for 140 years or more to a transient 1%-per-year increase, which is crucial for quantifying forcing amplification. For the MME comparison, the length of time series was limited to 150 and 140 years, respectively.
The output from the AOGCMs was gathered for the top-of-atmosphere energy imbalance and the surface air temperature and preprocessed for global averaging, yearly aggregation, and anomaly computation. The last process was done by subtracting the corresponding piControl time series linearly regressed on the time, which is also intended to remove unexpected drifting errors and offset values.
CO 2 forcing is usually approximated by the logarithm of a concentration level where x denotes the ratio of CO 2 concentration to its preindustrial value, and α is a scaling parameter. Here this form was used for levels up to doubling, and the following quadratic form was used for greater levels: This equation is derived to be continuous at x = 2 (doubling point) with equation (3) including the first derivative in terms of F x , and β is defined as F 4 ∼ ¼ β × F 4 , simply representing model-dependent forcing amplification from the first to the second doubling. The parameters α and β are also determined through the curve fitting.
The method of this study improves and generalizes on the method described in Tsutsui (2017). The time series fitting in that study was applied to T S only and constrained with equilibrium response estimated by the regression method, as in other approaches (e.g., . Adding N series makes the fitting process robust and free from the regression.
The present method also improves the differentiation between forcing and feedback. In the previous method, α/λ was treated as a combined parameter. Although the distinction between the two parameters is not necessarily required for emulating the surface temperature, it enables the emulation of subsurface temperatures and ocean heat uptake.

Outcome of the Emulator Method
The IRM emulator can sufficiently trace the step and ramp series of N and T S from the AOGCM experiments. Figure 1 illustrates the results for one of the CMIP6 models as an example. Supporting information shows the results for all the models (Figures S3-S49) with their parameters (Tables S2 and S3). The emulator has no significant biases in the long-term tendencies, and the resulting N-T S relationship is reasonable throughout the periods examined here.
The N-T S diagram ( Figure 1c) illustrates how the ECS and TCR are diagnosed in the emulator. The system response to constant forcing is shown by a straight line on this diagram, where the x and y intercepts indicate the equilibrium temperature increase and the ERF, respectively. The "abrupt-4xCO2" line, a response to a quadrupling CO 2 level, is scaled down to a doubling level by a factor of 1/(2β), which indicates an ECS estimate at its x intercept and a TCR estimate at its crossing point with the 1%-per-year trajectory. The latter corresponds to the point at t = ln(2)/ln(1.01) ≈ 70 (in year) in the temperature time series (Figure 1b).
In this forcing-response framework for the CO 2 experiments, the behavior of a specific AOGCM is characterized by forcing scale (α), forcing amplification from doubling to quadrupling (β), feedback strength (1/λ), and a realized warming fraction. The last one is associated with ocean mixing and expressed in the emulator by where F 1 is an annual forcing increment in the 1pctCO2 experiment and equals to αln(1.01). This equation is valid when β = 1 or up to a doubling CO 2 level. The realized warming fraction at the doubling level is the ratio of TCR to ECS.
The straight line given by the emulator for abrupt-4xCO2 is similar to the one by the conventional regression; they are almost identical in the case shown in Figure 1c. However, the regression method exclusively uses a factor of 0.5 for scaling (Andrews et al., 2012;Flato et al., 2013;Forster et al., 2013). This value is generally greater than 1/(2β) (Tables S2 and S3), which means that the regression method tends to slightly overestimate ECS for its original definition. Figure 2 illustrates the relationship between the TCR and the ECS estimates, comparing the regression and emulator methods for ECS, and the two MMEs as well. In both methods, the CMIP6 models spread toward higher sensitivities than the CMIP5 models. The differences between the methodologies are visible in that The model output of the instantaneous quadrupling and 1%-per-year CO 2 increase experiments are labeled "GCM 4x" and "GCM 1%"; corresponding IRM emulations are labeled "IRM 4x" and "IRM 1%." The "IRM 2x" line is a reduced one from "IRM 4x" by a factor of 1/(2β). The line labeled "regress" is given by the conventional regression method.

Statistics of Climate Sensitivity
the TCR is more proportional to the ECS for its estimates by the emulator than the regression. This result reflects that β is greater than one in many models with significant intermodel variability. There are no systematic differences for the TCR between the emulator and direct computing from the AOGCM output (not shown).
On the average of each MME, the ECS is 3.10°C (3.29°C) and 3.65°C (3.99°C) for the emulator (regression) method whereas the TCR is 1.85°C and 2.09°C for the CMIP5 and CMIP6. The ranges of the two MMEs are comparable (see the box plots in Figure 2), suggesting large intermodel differences in the latest MME. Individual models are not necessarily uniformly distributed, and the CMIP6 ensemble appears to have a trimodal distribution, corresponding to three groups of TCRs: above−2.5°C, 1.8-2.5°C, and below−1.8°C.
In this study, the temperature refers to the surface air temperature named "tas" from the AOGCM output variables. There is a similar variable named "ts" for the surface skin temperature, which generally shows smaller anomalies than "tas" due to slow response of SSTs compared with surface air temperatures over the ocean. The average ECS and TCR is reduced by about 0.06°C and 0.05°C when the "ts" variable is used.
One characteristic is that the ratio of the TCR to the ECS is smaller in high-sensitivity models. Previous studies have identified such nonproportional distributions for the CMIP5 models (Millar et al., 2015;Tsutsui, 2017) and for a large ensemble of simulations with a reduced-complexity climate model (Knutti et al., 2005). The scatter diagrams in Figure 2, showing a deviation from an assumed proportional line with a ratio of 0.6, confirm that this feature is robust for the CMIP6 as well as the CMIP5. Deviation is evident for the regression estimates but still significant for the emulator estimates.
The distribution of the TCR-to-ECS ratio leads to a significant result that the increase in the TCR from the CMIP5 to the CMIP6 is relatively smaller than that in the ECS. On average, the increase rate is 18% (21%) for the ECS by the emulator (regression), and it is reduced to 13% for the TCR.
The variation of the TCR is decomposed into the following three fractional differences: where RWF 2x is the realized warming fraction at the time of doubling CO 2 , x and δx denote a multimodel mean as a reference value and a deviation from the reference for variable x. This equation is derived from equation (5). The three terms on the right side represent the contributions of the forcing scale (α), feedback strength (1/λ), and ocean mixing (RWF 2x ), respectively. As to fractional differences in the ECS, the last term is dropped, and the forcing amplification (β) term is added for the regression method. Figure 3 illustrates the distributions of these terms. As shown in Figure 3a, the feedback strength has the most dominant contribution to intermodel variability and is greater in the CMIP6 than in the CMIP5. The forcing scale and ocean mixing show small but significant variability. Inter-MME differences for the forcing scale and ocean mixing are opposite to that of the feedback strength, meaning that the increase in the TCR associated with the feedback is lessened by the other two terms (the box plots in Figure 3a). The forcing amplification is greater in the CMIP6, which is consistent with the slightly greater MME-mean difference for the regression estimates.
Cross correlations between the four terms are weak but consistent with those inter-MME relationships (Figures 3b-3g). A negative correlation between the feedback strength and ocean mixing is most noticeable for both of the MMEs. A weakly negative correlation between the forcing scale and feedback strength is also identified, as shown for the CMIP5 (Forster et al., 2013).
The present method provides insights into the intermodel variability of ocean mixing. For the 1%-per-year increase trajectory, the second term on the left-hand side of equation (5) represents a yet to be realized warming fraction, which consists of three components (A Si κ i ) in terms of time constants (supporting information Figure S2). Figure Figure 3i further compares the amplitude (A S2 ) and the base fraction (κ 2 ) for the longest response and clarifies that the former is responsible for the dependency on the feedback strength.
The correlation between the feedback strength and the amplitude of the longest time component can be explained from a perspective of the Earth's heat capacity, which is represented by the emulator as λ∑ i A Si τ i (supporting information Text S1). This formula implies that a representative system response becomes longer with the feedback strength (1/λ), as described for a one-box model (Goosse, 2015). From another aspect, greater feedback strength, eventually leads to a more warmed interior ocean, requires an increased contribution of longer components, because ∑ i A Si τ i represents equilibrium ocean heat uptake to unit forcing. The result that the amplitude A S2 selectively appears to correlate with the feedback strength can be explained by the tendency of κ 2 at the time of doubling, which is not sensitive to changes in τ 2 , in a range of 200-300 years.

Discussion
The emulator method relies on time series fitting to determine a response function, which was understood to be poorly constrained (Gregory et al., 2004). Fitting a sum of exponentials to a temperature time series is unstable in optimizing the function parameters and requires proper constraints to ensure the linear N-T S relationship, as has been done in the past studies Tsutsui, 2017). The time series fitting in the present approach optimizes the emulator's parameters to approximate both N and T S series, which indirectly involves the linear relationship, leading to a robust and internally consistent result. Residual differences are dealt with equally for N and T S series. This has an advantage over the regression method, which arbitrarily selects an independent variable that presumably has no errors (Gregory et al., 2004).
The fitting results imply that the fixed factor of 2 for scaling the first to the second doubling CO 2 forcing overestimates the ECS. The response of AOGCMs is known to be amplified beyond logarithmic proportionality in terms of the CO 2 concentration (Meraner et al., 2013). For fidelity to the ECS definition, incorporating a variant factor is preferable. The present scheme uses a quadratic formula with β (equation (4)), connected to the conventional formula (equation (3)) at a doubling point. This is a practical choice for diagnosing the sensitivity metrics defined for doubling CO 2 from the CMIP basic experiments with quadrupling CO 2 . Although this scheme performs well, forcing characteristics do not necessarily change greatly at the specific concentration level, leaving room for further consideration. Intermodel variability of β should also be explored in relation to physical processes in the AOGCMs.
Variant characteristics have also been identified for the feedback parameter, reflecting diverse physical processes with different timescales (Knutti & Rugenstein, 2015). The assumption of a constant λ is not necessarily valid for multicentennial timescales, and past studies have attempted state-dependent adjustment Larson & Portmann, 2015). The present approach assumes the constancy of λ, and thus, its applicability is limited to some extent-a few centuries-where the assumption is acceptable. Besides the core CMIP6 experiments used in this study, associated model intercomparison projects have been conducting a variety of long-term experiments forced with idealized CO 2 changes (Jones et al., 2016;Keller et al., 2018;Webb et al., 2017). Emulating those results and comparing different methods (Nicholls et al., 2020) will clarify the range of acceptable use.
A number of AOGCMs were collected in this study, owing to the CMIP6 design that ensures continuity across the past and future CMIP phases (Eyring et al., 2016). However, the collected AOGCMs depend solely on the data availability and modeling groups' opportunity, and the statistics presented in the previous section cannot represent best estimates or unbiased uncertainty ranges. There remain issues to be considered separately as to methodology for weighting to calculate proper statistics and for constraining uncertainty ranges (Eyring et al., 2019).
Common characteristics found in the previous and new MMEs are notable, and among them the decreasing tendency of the TCR-to-ECS ratio with the ECS is noteworthy, which has an impact on probabilistic climate projections during this century. This tendency reflects a negative correlation between the feedback strength and ocean mixing in terms of the TCR difference. Tsutsui (2017) has described typical intermodel variations

Geophysical Research Letters
TSUTSUI based on principal component analysis of response function parameters. The present approach, which distinguishes the forcing and feedback properly, allows to diagnose ocean heat uptake that controls the realized warming fraction and the TCR-to-ECS ratio.
The heat capacity of the emulator should be considered as an "effective" heat capacity diagnosed from at most 150-year time series of N and T S under the assumption of a linear response. In fact, the diagnosed heat capacities are smaller than the real ocean capacity and substantially differ from model to model. It is nevertheless informative that the characteristics of the realized warming fraction suggested from a simplified forcing-response framework can be identified from the ensemble experiments with more complex and sophisticated AOGCMs.

Conclusions
This study establishes a new method to diagnose the linearized global energy balance using a minimal emulator to mimic transient response of AOGCMs to changes in CO 2 forcing. The performance of this emulator method is demonstrated for the latest CMIP6 and the previous CMIP5 MMEs, and estimated forcing and response parameters are compared to provide insights into climate sensitivity. The main conclusions are as follows: 1. The emulator method provides a robust and consistent way to estimate a set of parameters representing the ERF, feedback, and transient response of individual AOGCMs from their core CMIP experiments, which can be directly applied to probabilistic climate projections. 2. The ERF of a quadrupling CO 2 level used in the core experiments is properly scaled down to a doubling level, leading to unbiased ECS estimates. 3. The realized warming fraction corresponding to the ratio of TCR-to-ECS tends to decrease with the ECS, which is consistent with the relationship between feedback strength and response timescales implied from the emulator framework.
The average climate sensitivity of the CMIP6 is 18% greater than that of the CMIP5 for the ECS, and the change rate is reduced to 13% for the TCR. Although these numbers are subject to data availability and should be examined with advanced weighting and constraint approaches, the relatively smaller difference of the TCR, resulted from the tendency of the TCR-to-ECS ratio, has a sound basis.  Table S1 cites description papers for the CMIP5 models Dix et al.,