Understanding the Extreme Spread in Climate Sensitivity within the Radiative‐Convective Equilibrium Model Intercomparison Project

The Radiative‐Convective Equilibrium Model Intercomparison Project (RCEMIP) consists of simulations at three fixed sea‐surface temperatures (SSTs: 295, 300, and 305 K) and thus allows for a calculation of the climate feedback parameter based on the change of the top‐of‐atmosphere radiation imbalance. Climate feedback parameters range widely across RCEMIP, roughly from −6 to 3 W m−2 K−1, particularly across general‐circulation models (GCMs) as well as global and large‐domain cloud‐resolving models (CRMs). Small‐domain CRMs and large‐eddy simulations have a smaller range of climate feedback parameters due to the absence of convective self‐aggregation. More than 70–80% of the intermodel spread in the climate feedback parameter can be explained by the combined temperature dependencies of convective aggregation and shallow cloud fraction. Low climate sensitivities are associated with an increase of shallow cloud fraction (increasing the planetary albedo) and/or an increase in convective aggregation with warming. An increase in aggregation is associated with an increase in outgoing longwave radiation, caused primarily by mid‐tropospheric drying, and secondarily by an expansion of subsidence regions. Climate sensitivity is neither dependent on the average amount of aggregation nor on changes in deep/anvil cloud fraction. GCMs have a lower overall climate sensitivity than CRMs because in most GCMs convective aggregation increases with warming, whereas in CRMs, convective aggregation shows no consistent temperature trend.


Introduction
Climate sensitivity describes the amount of global warming that occurs in response to an external radiative forcing. Therefore, climate sensitivity controls many aspects of the climate response to global warming, both globally and locally, like the probability of droughts and flooding (IPCC, 2018). Thus, the value of halving the uncertainty in climate sensitivity may be in the trillions of dollars (Hope, 2015). The earliest estimates of equilibrium climate sensitivity (ECS), defined as the global-mean equilibrium surface air temperature response to a doubling of CO 2 concentration in the atmosphere, were made by Arrhenius (1896). His first ECS estimate of 5-6 K, which he later corrected to 4 K, is remarkably close to Generally, a large fraction of the spread in model-based climate sensitivity estimates can be attributed to the response of clouds, and particularly low clouds, to warming (e.g., Bony & Dufresne, 2005;Sherwood et al., 2014;Vial et al., 2013;Webb et al., 2013). In global climate models, clouds as well as the majority of processes that lead to their formation, such as moist convection, which is a main source of precipitation, are not resolved by the coarse grid and thus need to be parameterized. In a recent study, Fiedler et al. (2020) showed that the representation of tropical precipitation has not substantially improved over three generations of global climate models participating in the Coupled Model Intercomparison Project (CMIP; Meehl et al., 2007;Taylor et al., 2012;Eyring et al., 2016). To escape this deadlock, Stevens et al. (2019) proposed to replace general-circulation models (GCMs), which parameterize deep convection, with models that explicitly resolve deep convection, for example, a global cloud-resolving model (GCRM; Satoh et al., 2019). However, due to computational constraints, GCRM simulations are limited to short time scales and to kilometer-scale resolution, with the consequence that clouds, whose correct representation is crucial for climate sensitivity, are still poorly represented. Thus, to develop a deeper understanding of complex multiscale systems, it is beneficial to use a hierarchy of models (Bony, Stevens, et al., 2013;Held, 2005;Jeevanjee et al., 2017;Medeiros et al., 2008), for example, by linking GCRMs to coarse-resolution GCMs, to limited-area cloud-resolving models (CRMs) and to hectometer-scale large-eddy simulations (LES). RCE, a balance between convective heating and radiative cooling, is a surprisingly accurate simplification of the tropical troposphere (e.g., Dines, 1917;Manabe & Strickler, 1964). The RCE framework is accessible to many types of atmospheric models, and there is probably no other framework that is so relevant for understanding climate change and so accessible to a hierarchy of models (e.g., Jeevanjee et al., 2017). Early RCE studies were conducted with single-column models (SCMs; e.g., Ramanathan & Coakley, 1978, and references therein), and later with CRMs (e.g., Held et al., 1993;Nakajima & Matsuno, 1988;Tompkins & Craig, 1998) and GCMs (e.g., Popke et al., 2013), for example, to improve the understanding of how clouds and convection couple to circulations and to contribute to the development and evaluation of atmospheric models and their parameterizations (e.g., Becker et al., 2018;Held et al., 2007;. A weakness of past RCE studies is that the analysis is usually limited to a single model and that different boundary conditions are used, which means that results have been difficult to compare across studies. These limitations are overcome by the Radiative-Convective Equilibrium Model Intercomparison Project (RCEMIP; Wing et al., 2018), which defines a standardized RCE protocol for many types of atmospheric models: GCMs, SCMs, GCRMs, CRMs, and LES. RCEMIP was motivated by three goals: (1) to investigate the robustness of the RCE state, (2) to analyze the response of clouds to warming and the impact on climate sensitivity, and (3) to understand how convective self-aggregation depends on temperature. One intriguing initial result of RCEMIP is that in those simulations that permit spatial organization of convection, there is a very wide range of climate sensitivities (Wing et al., 2020). This motivates our present study, which seeks to develop an understanding of the extreme spread in climate sensitivity in the RCEMIP ensemble. The focus of our investigation is on changes in cloud amount and spatial organization, which are expected to affect climate sensitivity (e.g., Bony et al., 2015;Cronin & Wing, 2017).
In addition to the ability to compare RCE across model types, the absence of large-scale heterogeneities in boundary conditions or forcing allows for an undistracted analysis of convection and its spatial organization in a setup that, due to the lack of external constraints, reveals differences between atmospheric models particularly well. Indeed, Wing et al. (2020) found that there was a wide range in the representation of mean profiles of temperature, humidity, and cloudiness in the RCEMIP ensemble. Convective self-aggregation, which is the spontaneous organization of convection despite homogeneous boundary conditions and forcings, was found to occur in nearly all large-domain RCEMIP simulations, though the spatial structure and degree of aggregation varied across models (Wing et al., 2020). Consistent with prior studies (e.g., Bretherton et al., 2005), there is widespread agreement across the RCEMIP ensemble that self-aggregation affects the mean simulated climate, by drying the non-convective regions and enhancing outgoing longwave radiation (OLR) (Wing et al., 2020). Similar relationships between convective aggregation (of any type) and its large-scale environment are found in observations (Lebsock et al., 2017;Stein et al., 2017;Tobin et al., 2012Tobin et al., , 2013. In addition, Bony et al. (2020) showed that the observed interannual variance of the tropical top-of-atmosphere (TOA) radiative imbalance depends on the degree of convective aggregation. Because of the impact of self-aggregation on the mean climate and radiative imbalance, self-aggregation may modulate climate sensitivity (e.g., Becker et al., 2017;Becker & Stevens, 2014;Cronin & Wing, 2017). However, there is no scientific consensus on how the propensity of convection to self-aggregate depends on temperature (reviewed by Wing, 2019). Thus, this study quantifies which intermodel differences contribute to the extreme spread in climate sensitivity across the different types of RCEMIP models, with a focus on the temperature dependencies of cloud fraction and convective self-aggregation.

What Is the Climate Sensitivity Across the RCEMIP Simulations?
Estimating equilibrium climate sensitivities from climate models that include a dynamical ocean component takes several thousand modeling years (e.g., Li et al., 2013;Rugenstein et al., 2019;Senior & Mitchell, 2000). Thus, an effective climate sensitivity is often used, which extrapolates from the transient temperature increase and TOA imbalance assuming certain behaviors of radiative feedbacks (e.g., Gregory et al., 2004;Murphy, 1995). To estimate climate sensitivity in the RCEMIP simulations, a third established method must be used, the Cess sensitivity method (Cess & Potter, 1988). In this method, the climate feedback parameter ( ) is estimated, which can be transformed to ECS if the CO 2 forcing (F 2xCO 2 ) is known: where is calculated from the rate of change of the TOA radiative imbalance (R) across simulations at different fixed surface temperatures (T), R is the sum of net incoming shortwave (R SW ) and longwave (R LW ) radiation, so a positive value indicates an energy flux into the atmosphere. In RCEMIP, can be calculated in each model for a higher and a lower temperature range because each model provides three simulations at different fixed sea-surface temperatures (SSTs), at 295, 300, and 305 K. These are atmosphere-only aquaplanet simulations with no planetary rotation. In addition to homogeneous surface conditions, the insolation is everywhere equal to the tropical annual mean (409.6 W m −2 ), and all trace-gas concentrations are fixed and spatially uniform. More details on the boundary conditions and RCEMIP model setup are described in Wing et al. (2018Wing et al. ( , 2020. RCEMIP simulations were performed with large-and small-domain configurations. Depending on model type, the large domain is either global (GCMs), global with reduced Earth radius (GCRMs), or an elongated channel of ∼6,000 × ∼400 km 2 (CRMs and one GCM), while the small domain is either the SCM version of a GCM or a square of ∼100 × ∼100 km 2 (CRMs). GCMs have horizontal and vertical grids similar to their CMIP6 configuration (Eyring et al., 2016), GCRMs and large-domain CRMs have a horizontal grid spacing of ∼3 km, and small-domain CRMs have a horizontal grid spacing of 1 km, but there are also two additional RCEMIP model types on the small square domain, with approximately doubled vertical (hereafter VERT) and additionally quintupled horizontal (hereafter LES) resolution. This information is summarized in Table  1, and the snapshots of OLR in Figure S1 in the supporting information visualize differences across the different RCEMIP model types. For further visualizations, in form of snapshots (of OLR and precipitable water) and movies for all RCEMIP models, the reader is referred to Wing et al. (2020). In Figure 1, we compare climate feedback parameters across 11 global-domain GCMs, six versions of WRF-GCM on the elongated channel, three GCRMs with reduced Earth radius (R E /8 for MPAS; R E /4 for NICAM and SAM-GCRM) merged with 14 CRMs on the elongated channel, 16 CRMs on the small square domain, six VERT models, and six LES models. All models except the GCMs resolve convection explicitly. We compute based on averages of R over the entire simulation, excluding only the first 75 days because within this period of time, an equilibrium state is reached in the RCEMIP simulations (see Figure 1 in Wing et al., 2020). The exception is the LES simulations, in which we only exclude the first 25 days because the LES simulations are only run for 50 days. CRMs and GCRMs are run for 100 days and GCMs are run for ∼1,000 days. A comparison of the average climate feedback parameters (in Figure 1 and Table 1) across the different RCEMIP model types shows that is on average more negative in GCMs than in GCRMs and CRMs. Interestingly, the average climate feedback parameter is the same in the small-domain CRMs and in the large-domain CRMs (including the three GCRMs). Note that Cronin and Wing (2017) found a more negative climate feedback parameter (lower climate sensitivity) in a large domain than in a small domain. This contrast can be reproduced by about 50% of the large/small pairs of CRM simulations in RCEMIP (Wing et al., 2020), but the other 50% of CRMs behave differently and compensate in such a way that the average climate feedback parameter is the same in the large and small domain. This is also subtly different than in Wing et al. (2020), who found that in models with explicit convection, the average climate feedback parameter was slightly more negative in large-/global-domain simulations than in small-domain simulations (see Figure 17 in Wing et al., 2020). This is because the six VERT and LES models, which on average have the least negative feedback parameter (highest climate sensitivity), were included in the explicit group of small-domain simulations, whereas here, we consider them separately. These six models also have a similar climate feedback parameter in the default small-domain CRM setup, which means that climate sensitivity is almost independent of vertical and horizontal resolution across the analyzed resolutions in these small-domain simulations. In most models, becomes less negative with increasing temperature, in particular in the setups with explicit convection. In these setups, is on average 0.5 W m −2 K −1 less negative between 305 and 300 K than between 300 and 295 K, and when is based on clear-sky radiative fluxes, the change with warming is even slightly larger (0.6 W m −2 K −1 , Figure 1b). This indicates that the temperature dependency can be linked to clear-sky mechanisms like an intensification of the water vapor feedback with increasing temperature, as explained by Held and Soden (2000). As an aside, Wing et al. (2020) found that was much more negative in GCMs than in SCMs (comparing the large and small versions of models with parameterized convection), but we don't consider SCMs here because our focus is on three-dimensional models that can represent circulations.
In the RCEMIP simulations, only a rough estimate of ECS can be made with Equation 1 because the CO 2 concentration is not varied, and thus, the CO 2 forcing is unknown and needs to be estimated. Here, we assume F 2xCO 2 = 3.5 W m −2 , which is the average CO 2 forcing in CMIP6 (see Figure 1b in Zelinka et al., 2020). Across RCEMIP, this leads to an average ECS of ∼1.3 K in GCMs and ∼2.9 K in large-and small-domain CRMs. Compared to previous RCE studies (e.g., Cronin & Wing, 2017;Popke et al., 2013;Ramanathan & Coakley, 1978), these two values are at the lower and upper limit of the expected range. The average ECS in CMIP6, however, is higher, 3.9 K (Zelinka et al., 2020). This is expected because RCE climate feedbacks should be compared to local climate feedbacks over the tropical oceans in more comprehensive models; amplified warming over the poles and over land is absent in RCE (Popke et al., 2013). Nonetheless, some RCEMIP models have a positive climate feedback parameter, which would be associated with an unstable climate state and an infinite ECS. In the remainder of the paper, we refrain from making an assumption on the CO 2 forcing and only discuss rather than ECS. A comparison of the intermodel spread in climate feedback parameters across the different RCEMIP model types shows that intermodel spread is largest in the GCMs, where ranges from −6 to 2 W m −2 K −1 (Figure 1a), which translates to a standard deviation of 1.8 W m −2 K −1 (Table 1). Intermodel spread is second largest in the different versions of WRF-GCM and third largest in the large-/global-domain CRMs/GCRMs. All small-domain setups, which are too small for convection to self-aggregate (Wing et al., 2020), have a comparably small intermodel spread in , with a standard deviation of 0.6 W m −2 K −1 ( Table 1). The large spread in WRF-GCM is remarkable, given that the only difference is which convective parameterization is used. Nonetheless, the remainder of our analysis focuses on the other model types because the main purpose of WRF-GCM is to bridge the gap between global GCMs and limited-domain CRMs. WRF-GCM was therefore configured with a very small number of model grid points, which means that robust results are more difficult to find in this setup. Note that ECHAM6 has both the highest climate feedback parameter of all GCMs (between 300 and 295 K), and the most negative climate feedback parameter (between 305 and 300 K). This reflects the nonlinear temperature dependency of the governing processes within individual models. In total, eight RCEMIP models have positive climate feedback parameters for at least one of the two 5 K SST ranges.
To get a general idea of the processes that are responsible for the extreme intermodel spread, Figure 2 shows the domain-mean vertical profiles of relative humidity and cloud fraction for the three CRMs with the most positive (high climate sensitivity) and the three GCMs with the most negative (low climate sensitivity) climate feedback parameters. In the high climate sensitivity cases, the relative humidity in the middle troposphere and clouds in the upper troposphere increase with warming while clouds in the lower troposphere decrease with warming. The opposite tendencies can be found in the cases with low climate sensitivity. This is in line with expectations from previous work (e.g., Held & Soden, 2000;Manabe & Wetherald, 1967). In these previous studies, an increase in relative humidity with warming is associated with a strong water vapor feedback, which results in a TOA longwave radiation imbalance (R LW ) that is less negative. An increase in high cloud fraction also means that R LW is less negative, while an increase of shallow clouds increases the atmospheric albedo and thus results in a TOA shortwave radiation imbalance (R SW ) that is less positive. If the climate feedback parameter is based on the clear-sky TOA radiative fluxes ( CLR ), the intermodel spread is 45% smaller in GCMs and 41% smaller in large-/global-domain CRMs/GCRMs (Table 1 and Figure 1); this is another indication that cloud changes have a significant contribution to the intermodel spread in climate sensitivity across the large-/global-domain RCEMIP models, as further discussed in the next section. However, note that for all small-domain simulations, the intermodel spread in CLR and is similar (Table 1), indicating that, on the small domain, clear-sky processes are responsible for intermodel differences in . On the large/global domain, about 60% of the average difference in between the GCMs and CRMs/GCRMs can be attributed to differences in CLR (Table 1), emphasizing the important contribution of clear-sky processes to general differences between these two model types.

What Explains the Extreme Spread in Climate Sensitivity?
This section investigates the mechanisms that are responsible for the extreme spread in climate sensitivity across RCEMIP. The analysis will focus on the GCMs and the large-/global-domain CRMs/GCRMs (hereafter collectively referred to as CRMs for simplicity) because the intermodel spread in the climate feedback parameter is much larger in these simulations than in the small-domain simulations ( Figure 1a). As described by Wing et al. (2020), convective self-aggregation only occurs in the global and large-domain simulations. Given that the degree of aggregation and its changes with warming vary across models (Wing et al., 2020), we investigate differences in convective aggregation as a possible explanation for the extreme spread in climate sensitivity (Section 3.1). In addition, changes in cloud fraction are known to affect climate sensitivity (see Section 2), which we investigate further in Section 3.2. The combined influence of convective aggregation and changes in cloud fraction is discussed in Section 3.3.
Because of missing model output, DAM will not be considered when indices are dependent on 500 hPa vertical velocity, WRF-CRM will not be considered for clear-sky radiative fluxes, and FV3 will not be considered at all due to a general shortage of two-dimensional model output. If not stated otherwise, hourly averages will be analyzed, excluding the first 75 days, except in IPSL-CM6 where only daily output is available.

Convective Aggregation
In this study, we quantify convective aggregation primarily with the organization index, I org , which is a clustering metric in which nearest neighbor distances of convective entities are compared to that expected from a random distribution. I org has been used to characterize the spatial organization of deep convection across a variety of both modeling (e.g., Cronin & Wing, 2017;Tompkins & Semie, 2017;Wing et al., 2020) and observational (Bony et al., 2020) studies. I org has the advantage that its computation is temperature independent and easy to interpret; values greater than 0.5 represent aggregated convection. However, I org is integrated over multiple spatial scales and thus covers both large-scale organization related to moisture gradients and smaller-scale clustering within a region of convective activity. Our calculation of I org is identical to that in the RCEMIP overview paper (Wing et al., 2020) and uses a threshold value of OLR (R LW,up < 173 W m −2 ) to define a convective column, following the RCEMIP protocol (Wing et al., 2018). To identify convective columns that are part of the same convective entity, 4-point connectivity (considering straight neighbors) is employed. Our results are insensitive to instead using 8-point connectivity (also considering diagonal neighbors). With respect to GCMs, I org has some weaknesses and limitations. For example, to avoid problems with varying grid spacing in spherical models, we only consider grid points between −30 • and 30 • latitude. In the NICAM simulation at 305 K, convection occurs almost entirely outside of this latitude band between simulation day 75 and 80, so this time interval is excluded when calculating I org . In addition, due to the coarse resolution of GCMs, it may be just as justified to consider each convective column as its own entity (zero connectivity), rather than identifying connected columns. Therefore, we test the sensitivity of our results for GCMs to using 4-point or zero connectivity. We also compare to two other metrics of aggregation: subsidence fraction and the spatial variance of precipitable water scaled by its mean value. The latter is similar to spatial variance of column relative humidity, which was used in Wing et al. (2020) together with I org and subsidence fraction to quantify convective self-aggregation.
Convective aggregation can affect climate sensitivity due to its influence on the mean state or by changing with warming (Becker et al., 2017;Cronin & Wing, 2017). Both the average degree of convective aggregation and its temperature dependence may vary across models. In both cases, our initial expectation is that, if convective aggregation is high or increases with warming, climate sensitivity will be small because of a weak water vapor feedback due to drying or a dry mean state. However, we note that the overall cloud feedback is less certain because while it is agreed upon that high clouds decrease with aggregation , changes in low clouds are ambiguous. To evaluate the dependence of climate sensitivity on the average degree of convective aggregation, we calculate the correlation between (based on the entire 10 K SST range) and I org (averaged over the three different SSTs) for the CRMs and GCMs (Figure 3). To quantify the robustness of the results, we use two complementary approaches, a t test and an outlier exclusion method.
A t test assumes that data must be randomly sampled from the population of interest and that the data variables follow a normal distribution, which is not necessarily fulfilled for the RCEMIP models, in particular within one family of models. Keeping this caveat in mind, the sign of any correlation discussed in the following paragraphs is correct with 99% confidence when correlation coefficients are higher than approximately 0.5 to 0.6 in most cases. Generally, confidence is a bit higher for the CRMs than for the GCMs because the Horizontal and vertical lines illustrate the ±1 standard deviation in each dimension for each model type, and their intersection is the mean value. Correlation coefficients (r) are given for CRMs and GCMs, using both I org approaches for the latter. To test robustness, the largest as well as the two largest outliers (product across both dimensions) are excluded in r −1 and r −2 .
correlations are based on a larger number of models. Because of the aforementioned caveats with the t test and because the t test only estimates uncertainty in the sign of the correlation, but not in the strength of the correlation, our second approach is to estimate uncertainty by recalculating the correlation twice, once after removing the largest and once after removing the two largest outliers. We define the outliers based on the product of deviations from the model-category mean across both dimensions, in this case and I org .
Based on these two approaches, does not show a robust dependency on the average degree of convective aggregation ( Figure 3). In the CRMs, is on average slightly more negative when I org is large, in line with our initial expectation, but the correlation is weak, so the t test fails. The t test also fails for the GCMs, and the sign of the correlation is even opposite depending on whether the 4-point or zero connectivity approach is used in the I org calculation. The results remain similarly inconclusive for the correlation of I org with based only on longwave fluxes ( LW ; Figure S2) and when using different metrics to quantify convective aggregation (spatial variance of scaled precipitable water and subsidence fraction; Figure S3). The only correlation we find that is significant with 95% confidence is that between and subsidence fraction in CRMs. The climate feedback parameter also varies across the small-domain CRM simulations, for reasons that are unrelated to aggregation. Therefore, in an attempt to remove this baseline spread in across CRMs and to investigate the effects of convective aggregation in isolation, we scale the large-domain climate feedback parameter by the feedback parameter in the small, non-aggregated domain. However, the correlation of this scaled feedback parameter with I org is even weaker ( Figure S4). In sum, despite the initial expectation of a smaller climate sensitivity in simulations with on average more aggregated convection, climate sensitivity does not show a clear dependency on I org . In line with this, the small-and large-domain CRMs have on average the same climate feedback parameter (Figure 1a and Table 1). The expectation that simulations with aggregated convection have lower climate sensitivities than simulations with non-aggregated convection is only fulfilled if is based on the clear-sky TOA radiative fluxes: CLR is more negative in the large-domain CRMs than in the small-domain CRMs (Figure 1b and Table 1), but the imprint on the all-sky climate feedback parameter is compensated by effects attributable to clouds.
Though does not correlate with the average amount of convective aggregation, it does seem to depend on the change of convective aggregation with warming, as Figure 4 shows. Climate feedback parameters are more negative in models in which convective aggregation increases with warming and climate feedback parameters are less negative or even positive in models in which convective aggregation decreases with warming. When calculated with all-sky fluxes, shows a robust dependence on dI org /dT across the entire ensemble of models, as well as when considering CRMs alone. These results are insensitive to the choice of aggregation metric ( Figure S5) and the choice of connectivity in the I org calculation in GCMs ( Figure S6). Additionally, Figure 4 shows that dI org /dT correlates most strongly with calculated using longwave ( LW ) or clear-sky ( CLR ) fluxes. The latter indicates that cloud-radiative effects (CREs) are at most of secondary importance in how climate sensitivity is affected by convective aggregation, with the primary importance given to clear-sky mechanisms like enhanced cooling from dry regions. Even though dI org /dT has a similar correlation with LW in CRMs and GCMs, the correlation with is significantly smaller and not robust in GCMs, indicating that another process that affects shortwave radiation is masking the overall dependency on aggregation in the GCMs. Interestingly, averaged over all GCMs, convective aggregation increases with warming, while averaged over all CRMs there is no temperature dependency. This result is also backed up by the other two aggregation metrics in Figure S5, spatial variance of scaled precipitable water and subsidence fraction, which both indicate an increase of convective aggregation with warming in the GCMs, while the CRMs have an opposite trend. The different dependency of convective aggregation on temperature might be the main reason why within RCEMIP, the GCMs generally have more negative climate feedback parameters than the CRMs. This also explains why about 60% of the average difference in between GCMs and CRMs is apparent in CLR (Table 1), and why almost 100% of the average difference in is apparent in LW , as a comparison of the horizontal lines (indicating the mean value for each model type) across the different panels in Figure 4 shows.
Three GCMs are noticable outliers in Figure 4: ECHAM6, GEOS, and IPSL-CM6. ECHAM6, which has the largest difference in between its 305-300 and 300-295 K estimate of all the RCEMIP models, has a decrease in I org with warming in the case of the positive value, and an increase in I org with warming in the case of the extremely negative value, so this is in line with the correlation we find. However, the magnitude of the change in I org is relatively low, indicating that a second process must be responsible for the extreme differences in within that model. GEOS (between 305 and 300 K) and IPSL-CM6 (between 300 and 295 K) are the two models with the strongest increase in I org with warming of all RCEMIP models. In GEOS, this causes OLR to increase much more strongly with warming than in any of the other models, leading to an extremely negative LW and CLR , but also overall to a strongly negative . Even though LW is also more negative in IPSL-CM6 (between 300 and 295 K) than in most other models, is overall close to zero, causing a decrease in the correlation between dI org /dT and for the GCMs. being close to zero in IPSL-CM6 (between 300 and 295 K) is related to a strong decrease in reflected shortwave radiation with warming, a tendency that is also apparent in GEOS (between 305 and 300 K) and in ECHAM6 (between 300 and 295 K). This tendency can be explained by a decrease in cloud fraction, as the next section will show.

Cloud Fraction
We define the area covered by shallow and deep clouds based on threshold values of the TOA radiative flux. This definition, which reflects the effect of clouds on radiation, is better suited for our interest in climate sensitivity than defining cloud fraction based on a threshold value of cloud condensate that captures optically thin clouds, as was used in Wing et al. (2020). Our conditions for shallow clouds are (1) R SW,up > 100 W m −2 , thus ensuring the presence of clouds that reflect a significant amount of shortwave radiation, and (2) R LW,up > 250 W m −2 , thus excluding areas of deep convection. Note that a fixed threshold for R SW,up is only justified in simulations with uniform insolation and surface albedo in space and time and depends on R SW,down , which is set to 409.6 W m −2 in the RCEMIP simulations. The threshold for R LW,up is in line with the low-level cloud threshold used in Fiedler et al. (2020), and tests confirmed that the presented results are robust across a range of different threshold values for R LW,up . The shallow cloud definition neglects shallow clouds that are concealed by high clouds, but these clouds are also much less relevant for the radiation imbalance. Our definition of deep clouds is the same as what we use to define grid points with deep convective activity for I org , R LW,up < 173 W m −2 . Deep clouds are thus considered as the sum of deep convective clouds and high stratiform clouds like anvil clouds. According to these definitions, deep cloud fraction ranges from 0.01 to 0.18 across the GCMs and CRMs, while shallow cloud fraction ranges from 0.00 to 0.22 ( Figure S7). These values, which for each model are averaged over the three different SSTs, do not show any robust correlation with the climate feedback parameter.
Instead, changes in shallow cloud fraction with warming correlate robustly with and with the climate feedback parameter based on shortwave fluxes ( SW ), while there is no correlation with LW ( Figure 5). There is no significant correlation between changes in deep cloud fraction with warming and . However, there is a weak tendency of to be less negative when deep cloud fraction increases with warming. This is particularly true when considering LW , which indicates that high clouds reduce OLR. Averaged over the respective model types, deep cloud fraction decreases with warming in GCMs and slightly decreases with warming in CRMs, in line with the stability iris effect . Shallow cloud fraction, on the other hand, slightly increases with warming both in GCMs and CRMs, but the increase is almost zero compared to the intermodel spread. Though these results are not directly comparable to those presented in the RCEMIP overview paper due to the different definition of clouds (Wing et al., 2020), the general tendency for a reduction in deep cloud fraction with warming is consistent. Overall, comparing the relative contribution of changes in shallow clouds and deep clouds to climate sensitivity, changes in shallow cloud fraction correlate more strongly with and the intermodel spread of changes in shallow cloud fraction is also larger than for deep clouds, particularly so in one CRM, UCLA-CRM, and in the GCMs. In ECHAM6, the extremely high climate sensitivity between 295 and 300 K as well as the extremely low climate sensitivity between 300 and 305 K is related to a shallow cloud fraction that is much smaller at 300 K than at the other two SSTs ( Figure 5).
The aforementioned correlations, of course, do not guarantee causality. For example, it is plausible that changes in deep cloud fraction only correlate with LW because changes in deep cloud fraction correlate with changes in convective aggregation. Indeed, this correlation is significant for GCMs (CRMs: r = −0.2; GCMs: r = −0.6), in line with expectations that the coverage of deep clouds is reduced when convection is more  aggregated (e.g., Bony et al., 2016). The increase of I org with warming in GCMs might also explain why the tendency of deep cloud fraction to decrease with warming is stronger in GCMs than in CRMs. To get a better understanding of causality, we also examine the effect of clouds on the TOA radiation imbalance more directly by investigating the CRE (CRE = R − R CLR ) of shallow and deep clouds. Figure 6d shows that the correlation of changes in CRE with changes in shallow cloud fraction is strongly negative, while the correlation with changes in deep cloud fraction is weakly positive (Figure 6e). The latter is surprising given that Figure 6. Cloud-radiative effect averaged over all grid columns covered by deep versus shallow clouds for (a) all-sky, (b) shortwave, and (c) longwave radiative fluxes, and change of cloud-radiative effect with warming versus change of (d) shallow and (e) deep cloud fraction with warming, for GCMs (red) and CRMs (blue), numbered according to Figure 3. The three top panels show the average over the three different SSTs, and the two bottom panels show the change between the two simulations at colder SST (300-295 K) in the darker color, and at warmer SST (305-300 K) in the lighter color. Horizontal and vertical lines illustrate the ±1 standard deviation in each dimension for each model type, and their intersection is the mean value. Correlation coefficients (r) are defined like in Figure 4.
the CRE of deep clouds is negative in most CRMs, and in all GCMs (Figure 6a), so the direct effect of increasing the deep cloud fraction with warming should be a decrease in climate sensitivity. This indicates that, in particular in the GCMs, climate sensitivity depends only indirectly on deep cloud fraction itself. The dominant factor is instead that a large deep cloud fraction is associated with weak convective aggregation. Figure  6a also shows that the CRE is two to three times more negative for shallow than for deep clouds. This is because in most models, the longwave CRE is much less positive for shallow than for deep clouds (Figure 6c), while the shortwave CRE is only moderately less negative for shallow than for deep clouds (Figure 6b). Note that a general caveat of looking at the CRE is that it does not provide the full picture of how clouds and cloud changes affect climate sensitivity because the overall cloud feedback also depends on and interacts with changes in clear-sky radiation. For an accurate analysis, radiative kernel calculations should be used, which can generally differ from the CRE by about 0.3-0.4 W m −2 K −1 (Soden et al., 2004). This might be a focus of future analysis of the RCEMIP simulations.
Overall, shallow clouds affect climate sensitivity more strongly than deep clouds, both because their CRE is larger (more negative), and because the temperature dependency of shallow cloud fraction is more inconsistent across models, thereby introducing more spread in climate sensitivity. This is why we concentrate on shallow clouds in the next section, where we investigate the combined effect of changes in convective aggregation and shallow cloud fraction on climate sensitivity.

Combining Convective Aggregation and Shallow Cloud Fraction
In the previous two subsections, we identified a significant correlation both between and changes in convective aggregation, where changes in the longwave fluxes are the main contributor, and between and changes in shallow cloud fraction, where changes in the shortwave fluxes are the main contributor. This indicates that both of these metrics are relatively independent of each other, which implies that more intermodel variance in can be explained when combining both metrics.
In principle, changes in shallow cloud fraction could be related to changes in convective aggregation via their respective relationships with lower-tropospheric stability (LTS): high LTS is associated with both higher shallow cloud fraction (e.g., Bony et al., 2020;Drótos et al., 2020) and increased convective aggregation (Wing et al., 2020). The explanation is that aggregated convection is triggered at a higher boundary layer moist static energy, leading temperatures to follow a warmer moist adiabat and thus be more stable. Across the RCEMIP models, changes in LTS, defined as the potential temperature difference between 700 and 1,000 hPa, correlate with ( Figure 7). In GCMs, this correlation is more robust with respect to shortwave fluxes, in particular after excluding the model GEOS (Figure 7), and, in line with this, changes in LTS correlate more robustly with changes in shallow cloud fraction than in CRMs (Figure 8). In CRMs, in contrast, the correlation of changes in LTS with is mainly with respect to longwave fluxes (Figure 7), and consistent with this, changes in LTS correlate more strongly with changes in I org than in GCMs (Figure 8). However, the correlation between changes in LTS and changes in I org is weak in GCMs, and the correlation between changes in LTS and changes in shallow cloud fraction is weak in CRMs. This explains why, despite the relationship between each of I org and shallow cloud fraction with LTS, there is almost no correlation between changes in I org and changes in shallow cloud fraction ( Figure 8).
As changes in I org are not correlated with changes in shallow cloud fraction ( f sc ), we may be able to explain much more variance in the climate feedback parameter by combining the two metrics. We use a multivariate linear regression to find the partial regression coefficients b 1 and b 2 of the combined index: separately for the GCMs and for the CRMs. Figure 9 shows that the correlation of the combined index with is indeed very high, r = 0.91 for GCMs and r = 0.85 for CRMs. Even when only half of the GCMs and CRMs are used to calculate b 1 and b 2 , and the correlation between and is calculated for the other half of models, the correlation still remains larger than 0.8, both for GCMs and CRMs (not shown). When calculating the correlation between and for the sum of all GCMs and CRMs (but still using separate coefficients for the GCMs and CRMs), the correlation is r = 0.87 (Figure 9). However, if an additional parameter b 0 for the vertical offset is allowed in the multivariate linear regression, then r = 0.91 for the sum of GCMs and CRMs (not shown).
When considered separately, changes in shallow cloud fraction explain 43% of the variance in across CRMs, and 70% across GCMs, while changes in I org explain 31% of the variance in across CRMs but only 8% across GCMs. Thus, it is not surprising that the partial regression coefficients are different as well: b 1 = −63.38 and b 2 = −170.52 for the CRMs, and b 1 = −54.11 and b 2 = −134.05 for the GCMs. The standardized regression coefficients B 1 and B 2 in units of standard deviation follow after multiplying by the standard deviation of each index and dividing by the standard deviation in . For the CRMs, B 1 = −0.54 and B 2 = −0.65, and for the GCMs, B 1 = −0.36 and B 2 = −0.87. This means that when shallow cloud fraction, for example, increases by 1 standard deviation in GCMs, decreases by 0.87 standard deviations. Therefore, since changes in shallow cloud fraction explain more variance in than changes in I org , they get more weight in the regression. However, since changes in I org explain relatively more variance across the CRMs than across the GCMs, I org gets relatively more weight in the regression for CRMs. As an aside, results using the I org calculated with zero connectivity for GCMs are qualitatively similar ( Figure S6), but with a smaller relative contribution of I org to the regression.
In summary, the intermodel spread in climate feedback parameter in the large-domain RCEMIP simulations may be explained by the combination of changes in convective aggregation and changes in shallow cloud fraction with warming. The combined index , calculated separately for GCMs and CRMs, explains 83% of the variance (r 2 ) of across GCMs and 73% across CRMs. This impressively high amount of explained variance underlines the utility of idealized frameworks like RCE, where it is easier to explain the behavior than in some other studies that analyzed variance in the radiative budget. For example, combined variations in I org and LTS explain ∼60% of the observed interannual variance of the tropical radiation budget (Bony et al., 2020) and combined variations in small-and large-scale lower-mid-tropospheric mixing explain almost 50% of the variance in climate sensitivity across 43 climate models (Sherwood et al., 2014).

How Does Convective Aggregation Affect Longwave Radiation?
The previous section has shown that an increase of convective aggregation with warming leads to a low climate sensitivity, mostly because of an increase of OLR. However, the question of what causes the negative correlation between changes in convective aggregation and changes in TOA longwave radiation ( LW ) remains unanswered. We have shown that this correlation is mostly captured by clear-sky processes and therefore hypothesize that enhanced OLR under conditions of enhanced convective aggregation is due to (a) larger regions covered by subsidence or (b) drier subsidence regions in the more aggregated state.
To investigate the contributions of subsidence region size and dryness, we examine subsidence fraction and mid-tropospheric humidity (MTH). Subsidence fraction, which was introduced as an alternative metric for convective aggregation in Section 3.1, is the daily-mean area fraction of the domain covered by large-scale (CRMs averaged in space over 100 × 100 km 2 blocks, GCRMs over 1 • × 1 • blocks, GCMs on their native grid) subsidence at 500 hPa, and MTH is defined as the domain-mean mass-weighted vertical average of relative humidity between 300 and 700 hPa. Figure S8 shows that changes in domain-mean MTH with temperature are similar to changes in MTH over the subsidence region. We thus use the domain-mean MTH as a proxy for MTH in the subsidence region because the latter requires calculation from 3D output, which is available only for 60% of the models and for only the last 25 days of each simulation, whereas the domain-mean MTH is available for the entire simulation period for all models.
We find that both the expansion of subsidence regions and mid-tropospheric drying show a significant correlation with LW , but MTH dominates, with correlation coefficients of at least r = 0.85 ( Figures  10 and 11). Changes in subsidence fraction and MTH are also strongly correlated with each other, which means that the correlation with LW does not further increase when combining both indices in a multivariate linear regression ( Figure S9). Changes in both metrics do not only correlate with LW but also to a slightly smaller degree with changes in I org , confirming that both metrics are indeed responding to changes in convective aggregation (Figures 11 and  S10). When considering the average values of the metrics, rather than their change, subsidence fraction significantly correlates both with the average value and with the temperature dependency of longwave radiation in CRMs, while in GCMs, MTH correlates very strongly with the average value of longwave radiation (r = 0.89; Figure S11).
In summary, the drying of subsidence regions has the strongest contribution to the increase in OLR when convective aggregation increases, while the expansion of subsidence regions is of secondary importance. We hypothesize that models with enhanced OLR have drier subsidence regions (compared to the other models) either in response to (b 1 ) higher subsidence velocities or (b 2 ) larger individual patches of subsidence. In contrast to the above discussion, we do not investigate subsidence fraction here, but rather the relation between the size of individual patches of subsidence and dryness.
To address this hypothesis, we introduce two new metrics. Subsidence velocity, sub , is defined as the daily-mean large-scale vertical velocity at 500 hPa (downward defined as positive), using the same averaging as in the calculation of subsidence fraction. The second metric is the non-convective area patch length, L p . To diagnose L p , the distance between every non-convective grid point in space and time and the closest grid point that is part of a convective region is calculated, where convective regions are defined according to the threshold R LW,up < 173 W m −2 . The mean distance to the closest convective grid point then serves as an estimate of the size of individual subsidence regions. The analysis is limited to one hour per day (every 24th output time step) for computational efficiency. When looking at the temperature dependency of these two metrics, and scaling the change in L p by its value at the lower SST, an increase of OLR with warming is significantly correlated with a widening of individual non-convective regions, but is not significantly correlated with increasing subsidence velocities, in particular not in GCMs (Figures 10 and 11).
The schematic in Figure 11 provides a summary of the correlations between all discussed metrics and illustrates that many of the metrics co-vary. Overall, changes in sub with warming show a weaker correlation with the other metrics (changes in I org and changes in MTH) than changes in L p with warming, particularly for GCMs. It is not surprising that the correlation between changes in L p and changes in I org is significant because L p could also be used as a metric to measure convective organization. Note that unlike I org , L p focuses on the non-convective region and is mainly affected by large-scale organization, while I org focuses on convecting areas and integrates over all length scales. However, the most important conclusion from Figure  11 is that LW primarily depends on changes in MTH, in response to changes in convective aggregation. Changes in MTH significantly correlate with changes in L p (and also with changes in subsidence fraction). MTH depends on the patch size because when L p is larger, a larger part of the subsidence region is isolated from convective moistening through detrainment, thus reducing MTH in the subsidence region. However, changes in MTH are only weakly correlated with changes in subsidence velocity in CRMs, and uncorrelated in GCMs. The latter is presumably because the exchange of air between convective and non-convective regions happens more slowly in RCE simulations with GCMs than with CRMs, which gives even weak subsidence enough time to dry the non-convective regions. Thus, in RCE simulations with GCMs, subsidence velocity is not a limiting factor of the dryness of the non-convective region. As an aside, the role of vertical velocity is different when using the tropical circulation index I, defined as the mean upward velocity minus the mean downward velocity at 500 hPa (Bony, Bellon, et al., 2013), as a measure of vertical velocity. Changes in I correlate well with LW ( Figure S12), but this is misleading because, across the RCEMIP simulations, differences in the tropical circulation index are dominated by differences in the convective region (i.e., by changes in the mean upward velocity).

Journal of Advances in Modeling Earth Systems
The average values (rather than tendencies) of non-convective patch size and subsidence velocity have only weak correlations with the average value of longwave radiation and its tendency (Figures S13 and S14). One reason why temperature dependencies, as in Figure 11, are more informative than average values is Figure 11. Schematic of how the climate feedback parameter based on the TOA longwave fluxes correlates with the change of different metrics with warming, and how these metrics correlate with each other. The metrics are organization index, subsidence fraction, mid-tropospheric humidity, subsidence velocity, and scaled non-convective area patch length. The correlation coefficients, which are based on Figures 10 and S10, are given in blue for the CRMs and in red for the GCMs. that base state differences between the different RCEMIP simulations, which often are linked to specific problems in parameterizations, play less of a role when only the change in a metric is considered. However, unrelated to these issues, the average values of MTH stand out as most important contributors to intermodel spread in longwave radiation ( Figure S14), and changes in MTH stand out as most important contributors to intermodel spread of changes in longwave radiation ( LW , Figure 11).

Summary and Conclusion
In this study, we analyzed the variability of climate sensitivities across 31 models, run at three fixed SSTs (295, 300, and 305 K) that participated in the RCEMIP (Wing et al., 2018(Wing et al., , 2020. Estimates of the climate feedback parameter ( ) have an extremely large range across the RCEMIP models, from −6 to 3 W m −2 K −1 . The intermodel spread in is particularly large in simulations configured on a domain that is large enough to allow convection to self-aggregate; that is, CRMs run on an elongated channel and GCMs/GCRMs run on a global domain. High climate sensitivities are primarily associated with a decrease of shallow cloud fraction with increasing temperature, and secondarily with a decrease of convective aggregation with increasing temperature, and vice versa. The average amount of convective aggregation, on the other hand, plays no significant role in modulating climate sensitivity. Because changes in shallow cloud fraction and in convective aggregation are almost independent from each other, they can together explain, when combined with a multivariate linear regression approach, more than 80% of the total variance in climate sensitivity across the GCMs and more than 70% across the CRMs. An increase in shallow cloud fraction enhances the planetary albedo and thus affects the shortwave radiation imbalance, while an increase in convective aggregation primarily enhances clear-sky OLR. This finding has similarities to the iris effect, but with the difference that in most other studies changes in high-level cloud fraction play an important role in the iris effect (e.g., Mauritsen & Stevens, 2015), while in RCEMIP, changes in high-level cloud fraction are of minor importance for the intermodel spread in climate sensitivity. Instead, OLR increases when convection aggregates primarily because of mid-tropospheric drying and secondarily because the subsidence regions expand, both individually and in the aggregate. The mid-tropospheric drying is only weakly related to higher subsidence velocities in CRMs and not at all in GCMs. This indicates that in RCE, the exchange of air between convective and non-convective regions happens slowly enough to even give weak subsidence enough time to dry the subsidence region. Another possible mechanism that can cause drier subsidence regions, which was not analyzed in this study, is an increase of precipitation efficiency when convection aggregates, as discussed by (Lindzen et al., 2001).
The comparison of the mean characteristics of GCMs and CRMs with respect to climate sensitivity also offer some interesting conclusions. In GCMs, is significantly more negative than in large-domain CRMs, which is almost entirely related to differences in the longwave radiation imbalance. About 60% of these differences can be associated with changes in the clear-sky radiative fluxes. In agreement with this, shallow cloud fraction has no overall temperature trend when averaged over the respective sets of GCM and CRM simulations. Deep cloud fraction decreases slightly more strongly with warming in GCMs than in CRMs, consistent with the stability iris effect , which relates the decrease of anvil cloud fraction with warming to increasing stability and a weakening overturning circulation. However, in our study this effect cannot explain the significantly more negative climate feedback parameter in GCMs than in CRMs because there is no indication that this mechanism works substantially differently in GCMs than in CRMs, and the net CRE of deep clouds is relatively small. Instead, is significantly more negative in GCMs than in CRMs mainly because, averaged over all CRMs, the temperature dependency of convective self-aggregation is close to zero, while convective self-aggregation increases with warming in most GCMs, causing a mid-tropospheric drying and enhanced OLR in subsidence regions. Interestingly, Retsch et al. (2019) also found that climate sensitivity is strongly linked to the temperature dependency of mid-tropospheric humidity in subsidence regions (using ICON aquaplanet simulations). However, in their simulations, mid-tropospheric humidity increases with warming when convection is parameterized, causing higher climate sensitivities compared to simulations with explicit convection.
Given that the temperature dependency of convective self-aggregation shows a large spread across the CRMs, it would be plausible that the offset between the average temperature dependency of convective self-aggregation in GCMs and in CRMs is due to shortcomings in some of the CRMs, for example, because shallow clouds in CRMs are poorly represented at kilometer-scale resolution. However, the climate feedback parameters in CRMs are supported by very similar results in the higher-resolution LES experiments. Hence, it is more likely that caveats in GCMs explain the GCM-CRM difference in the temperature dependency of convective self-aggregation. We speculate that the moisture-memory feedback, in which convection favors locations that have been pre-moistened by convection (Tompkins, 2001), explains some of the discrepancy between GCMs and CRMs. The moisture-memory feedback may become more important with warming because of an increase in the saturation deficit, which increases the efficiency with which updraft buoyancy is reduced through entrainment. In this regard, the important difference between GCMs and CRMs is that GCMs usually entrain the grid-cell mean air, from an area that spans ∼100 × 100 km 2 , while CRMs feel less of this effect because they entrain air much more locally, from within their moist shell (Becker et al., 2018). Even though there is also some evidence that convective parameterizations are not sensitive enough to environmental humidity due to too-low entrainment (e.g., Mapes & Neale, 2011;Sherwood et al., 2013), our results suggest that the increase in convective aggregation with warming in many GCMs may be an artifact of convective parameterization. If true, this implies that many GCMs underestimate climate sensitivity, not only in the RCE setup but also in less idealized model configurations.
This study is an important step toward using idealized frameworks like RCE in a hierarchy of models to constrain climate sensitivity. We have identified the temperature tendency of shallow cloud fraction and convective aggregation as key contributors to the intermodel spread in climate sensitivity across RCE simulations with GCMs, GCRMs, and large-domain CRMs. An important target for future work is the question of what processes and details in parameterizations are responsible for the different temperature dependencies of these key factors across the different models. Further work could focus on targeted parameter perturbation simulations, as planned for the possible second phase of RCEMIP. As an example, the suite of WRF-GCM simulations, while not included in our main analysis here, shows that the change in shallow cloud fraction with temperature explains most of the spread in across different convective parameterizations ( Figure S15). This type of targeted physics comparison, in which it may be tractable to determine why a particular parameterization has a different change in clouds, ideally in combination with radiative kernel calculations of climate feedbacks, could be extended to other model setups. When comparing the different RCEMIP model types, temperature dependencies seem to be weaker in GCRMs than in CRMs, both with respect to climate sensitivity and to the discussed metrics, indicating that some of the extreme behavior that we find is setup dependent. Thus, it will be interesting to investigate whether the key contributors to intermodel spread that we identified are also key contributors to intermodel spread across other, less idealized intercomparison projects, such as CMIP6 (Eyring et al., 2016) or the DYAMOND project .