Pervasive Warming Bias in CMIP6 Tropospheric Layers

The tendency of climate models to overstate warming in the tropical troposphere has long been noted. Here we examine individual runs from 38 newly released Coupled Model Intercomparison Project Version 6 (CMIP6) models and show that the warm bias is now observable globally as well. We compare CMIP6 runs against observational series drawn from satellites, weather balloons, and reanalysis products. We focus on the 1979–2014 interval, the maximum span for which all observational products are available and for which models were run using historically observed forcings. For lower‐troposphere and midtroposphere layers both globally and in the tropics, all 38 models overpredict warming in every target observational analog, in most cases significantly so, and the average differences between models and observations are statistically significant. We present evidence that consistency with observed warming would require lower model Equilibrium Climate Sensitivity (ECS) values.


Introduction
Numerous studies have pointed to a tendency across climate models to project too much contemporary warming in the tropical troposphere (Bengtsson & Hodges, 2009; Douglass et al., 2007; Fu et al., 2011; Karl et al., 2006; McKitrick et al., 2010; McKitrick & Vogelsang, 2014; Po-Chedley & Fu, 2012; Thorne et al., 2011), with additional evidence pointing to a global tropospheric bias as well (Christy & McNider, 2017). Here we present an updated comparison using the first 38 models made available in the newly released sixth-generation Coupled Model Intercomparison Project (CMIP6) archive, comparing model reconstructions of historical layer-averaged lower-troposphere (LT) and midtroposphere (MT) temperature series against observational analogs from satellites, balloon-borne radiosondes, and reanalysis products. We compare trends over 1979-2014, the longest interval for which all three observational systems are available and for which models were run with historically observed forcings. None of our conclusions would be different if we extended the end date to 2018. We examine four atmospheric regions: the global LT and MT and the tropical LT and MT layers.
In previous studies, although a warm bias was typically present, over large atmospheric regions the model spread at least partly overlapped the observational analogs, especially at the global level. This is no longer the case. Every model overpredicts warming in both the LT and MT layers, in the tropics, and globally. On average the discrepancies are statistically very significant, and the majority of individual model discrepancies are statistically significant as well.
©2020. The Authors. This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.

Data

Observations
We use temperature data from three general categories of observing systems.
1. Radiosonde (or sonde) data are measured by thermistors carried aloft by balloons at stations around the world, which radio the information down to a ground station. Sondes report temperatures at many levels, and we use here annual averages at the standard pressure levels: 1,000 (if above the launch site), 850, 700, 500, 400, 300, 200, 150, 100, 70, 50, 30, and 20 hPa. As noted in Table 1 there are four data sets available: one from NOAA (RATPAC, Free et al., 2005), two from the University of Vienna (U WIEN), Austria (RAOBCORE and RICH, Haimberger et al., 2012), and one from the University of New South Wales, Australia (UNSW, Sherwood & Nishant, 2015). Note that the commercial software used to process sonde data was revised in 2011, with the result that inferred humidity levels increased after 2009 by several percent (Jauhiainen et al., 2011). This induced a slight warming step which is not observed in other systems and may be an artifact (Christy et al., 2018).
2. Since late 1978, several polar-orbiting satellites have carried some form of microwave sensor to monitor atmospheric temperatures. These spacecraft circle the globe roughly pole to pole, making a complete orbit in about 100 min. They were (and are) Sun-synchronous: the Earth rotates underneath as the spacecraft orbits from pole to pole, so that essentially the entire planet is observed in a single Earth rotation (one day). The intensity of microwave emissions from atmospheric oxygen is directly proportional to temperature, thus allowing a conversion of these measurements to temperature. Since the emissions come from most of the atmosphere, they represent a deep-layer-average temperature. For our purposes we shall focus on two deep layers, the LT (surface to ~9 km) and the MT (surface to ~15 km). The University of Alabama in Huntsville (UAH) and Remote Sensing Systems (RSS) produce monthly averages of both products (Mears & Wentz, 2016; Spencer et al., 2017). NOAA provides values for MT globally, and the University of Washington (UW) produces tropical values of MT (Po-Chedley et al., 2015). There are differences in all of the products discussed here, and the reader may want to consult the listed publications for more information.
3. The third category consists of reanalyses. In this category, a global weather model with many atmospheric layers ingests as much data as possible, from surface observations, sondes, and satellites, to generate a depiction of the surface and atmosphere that is made globally consistent through the model equations. We access the temperature data from these data sets at 17 pressure levels from the surface to 10 hPa and calculate deep-layer averages that match those of the satellite measurements. Four such data sets are available to us: two from the European Centre for Medium-Range Weather Forecasts (ERA-I and ERA5; Dee et al., 2011; Hersbach et al., 2018) and one each from the Japan Meteorological Agency (JRA55, Kobayashi et al., 2015) and NASA (MERRA2, Gelaro et al., 2017).
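The conversion of pressure-level temperatures into a deep-layer average, as described for the sonde and reanalysis data, can be sketched as a weighted mean over levels. This is a minimal illustration only: the function name `layer_average` and the weights in `example_weights` are hypothetical placeholders and are not the published LT or MT weighting functions, which should be taken from the satellite data set documentation.

```python
import numpy as np

# Standard pressure levels (hPa) listed in the text for the sonde data.
LEVELS_HPA = np.array([1000, 850, 700, 500, 400, 300, 200, 150, 100, 70, 50, 30, 20])

def layer_average(temps, weights):
    """Weighted mean of level temperatures.

    Levels with missing data (NaN), for example 1,000 hPa at a station whose
    launch site lies above that level, are skipped and the remaining weights
    are renormalized.
    """
    temps = np.asarray(temps, dtype=float)
    weights = np.asarray(weights, dtype=float)
    valid = ~np.isnan(temps)
    if not valid.any():
        return float("nan")
    w = weights[valid]
    return float(np.sum(w * temps[valid]) / np.sum(w))

# HYPOTHETICAL weights that peak in the lower troposphere (illustration only,
# NOT the real satellite weighting function).
example_weights = np.array(
    [0.10, 0.20, 0.25, 0.20, 0.10, 0.08, 0.04, 0.02, 0.01, 0.0, 0.0, 0.0, 0.0]
)
```

Renormalizing over the available levels is one simple way to handle a missing surface level; other treatments are possible and may be what the individual data sets actually use.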

Climate Models
The climate model simulations utilized here are those accepted for analysis in CMIP6, for which the models are executed in standardized simulations so that they may be intercompared properly. We obtained the model runs from the Lawrence Livermore National Laboratory archive (https://pcmdi.llnl.gov/CMIP6/). For this study we used the period 1979-2014 from the simulation set that represents 1850-2014, in which the models were provided with "historical" forcings. These time-varying forcings are estimates of the energy perturbations that occurred in the real world and are applied to the models through time. They include variations in factors such as volcanic aerosols; solar input; dust and other aerosols; important gases like carbon dioxide, ozone, and methane; and land surface brightness. With all models applying the same forcing as is believed to have occurred for the actual Earth, a direct comparison between models and observations is appropriate. The models and runs are identified in Table 2. We also list the estimated Equilibrium Climate Sensitivity (ECS) values for the 31 models for which we were able to find values, usually through unpublished online documentation (sources available on request).
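Extracting the 1979-2014 analysis window from a historical run covering 1850-2014 might look like the sketch below. It assumes the series is stored as monthly values beginning in January of the first year, which is an assumption about the archive format rather than something stated above; the helper names `annual_means` and `select_period` are hypothetical.

```python
import numpy as np

def annual_means(monthly, first_year):
    """Average a monthly series (whole years, starting in January) into annual values."""
    monthly = np.asarray(monthly, dtype=float)
    if monthly.size % 12 != 0:
        raise ValueError("expected a whole number of years of monthly data")
    values = monthly.reshape(-1, 12).mean(axis=1)
    years = first_year + np.arange(values.size)
    return years, values

def select_period(years, values, start=1979, end=2014):
    """Keep only the years in [start, end], inclusive."""
    mask = (years >= start) & (years <= end)
    return years[mask], values[mask]
```

For a run spanning 1850-2014 this yields 36 annual values, the sample size used for the trend estimates.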

Methods
Linear trends were estimated on annual observations over the 1979-2014 interval, which is the maximum-length interval for which all observational series are available and for which the models were run using observed forcings. We pretest the temperature series for unit roots, which if present imply nonstationarity of a form that makes conventional trend regressions invalid (Wooldridge, 2020). We use the form of the test derived in Elliott et al. (1996), allowing for a trend stationary alternative and an autoregressive lag. The null hypothesis of the test is that the series contains a unit root. Such tests can exhibit a tendency to underreject in the presence of autocorrelation due to low power, so we expanded the time interval to 1959-2014, which means the sonde record, specifically the mean of the RAOBCORE, RICH, RATPAC, and UNSW products, serves as the observational series. We reject the null hypothesis for all individual model runs and the sonde mean series, thus indicating that the data can be treated as trend stationary. An appropriate method in this case for constructing confidence intervals (CIs) and hypothesis tests of trend equivalence is the autocorrelation-robust method of Vogelsang and Franses (2005).
Figure 3 shows the trends and 95% CIs in °C per decade in the 38 individual climate models (red), the climate model ensemble mean (thick red), and the three mean observational series (respectively, radiosondes, reanalysis, and satellites, thick blue). The dashed blue line shows the satellite trend level. Differing data availability leads to somewhat different observational series combinations. For the sonde data, the average includes RAOBCORE, RICH, and RATPAC in all specifications and additionally includes UNSW in the MT layers (global and tropics). The mean of the reanalysis data uses ERA-I, ERA5, JRA55, and MERRA2 for the global LT and the tropical LT and MT layers and uses ERA5, JRA55, and MERRA2 for the global MT layer. The mean of the satellite data uses UAH and RSS for global LT and MT and for tropical LT and additionally uses NOAA and UW for tropical MT.
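The unit-root pretest described in this section can be illustrated with a compact version of the GLS-detrended Dickey-Fuller statistic of Elliott et al. (1996) for the trend case, with one autoregressive lag. This is a sketch under stated assumptions, not the code used for the study: the function name `dfgls_trend_stat` is hypothetical, the lag length is fixed rather than selected by a criterion, and no critical values are hard-coded, so the returned statistic should be compared with the values tabulated in Elliott et al. (1996).

```python
import numpy as np

def dfgls_trend_stat(y, lags=1, c_bar=-13.5):
    """DF-GLS t-statistic with a linear trend (c_bar = -13.5 is the trend-case constant)."""
    y = np.asarray(y, dtype=float)
    n = y.size
    a = 1.0 + c_bar / n
    z = np.column_stack([np.ones(n), np.arange(1.0, n + 1.0)])

    # Quasi-difference the series and the deterministic terms, keeping the
    # first observation unchanged, then estimate the trend by least squares.
    yq = np.concatenate([y[:1], y[1:] - a * y[:-1]])
    zq = np.vstack([z[:1], z[1:] - a * z[:-1]])
    delta, *_ = np.linalg.lstsq(zq, yq, rcond=None)
    yd = y - z @ delta  # GLS-detrended series

    # Augmented Dickey-Fuller regression without deterministic terms:
    # d(yd)_t on yd_{t-1} and `lags` lagged differences.
    dy = np.diff(yd)
    target = dy[lags:]
    cols = [yd[lags:-1]]
    for j in range(1, lags + 1):
        cols.append(dy[lags - j : dy.size - j])
    x = np.column_stack(cols)

    beta, *_ = np.linalg.lstsq(x, target, rcond=None)
    resid = target - x @ beta
    sigma2 = resid @ resid / (target.size - x.shape[1])
    cov = sigma2 * np.linalg.inv(x.T @ x)
    return float(beta[0] / np.sqrt(cov[0, 0]))
```

A strongly negative statistic is evidence against the unit-root null, which is the outcome reported above for the model runs and the sonde mean series.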

Results
The top row of Figure 3 shows the MT layer results for the global (left) and tropical (right) samples. The bottom row shows the same for the LT layer. It is immediately apparent that every model run in every regional and layer average has a mean trend that exceeds the corresponding observed trends regardless of how they are measured.
Tables 3 and 4 show the trend coefficients and symmetric 95% CI widths (in °C/decade) for all individual models, for the average of all models, and for the three observational system averages. For example, the global LT trend in the ACCESS model (top row of Table 3) is 0.250 ± 0.103 °C/decade. Table 5 shows the Vogelsang-Franses test scores on the null hypothesis of trend equivalence for each test region. A value greater than 41.53 is significant at the 5% level. The first row shows the results of testing whether the average model trend exceeds the average sonde trend. The second row shows the corresponding result for reanalysis data, and the third row shows the results for satellite data. The fourth row shows the number of individual model runs in which the trend significantly exceeds the satellite average. In the first three rows we see that all 12 tests reject, meaning the average model significantly exceeds the average observed series regardless of region or atmospheric layer, and regardless of observational measurement system. The final row shows that a majority of models also reject individually against the satellite data except in the global LT case, in which 18 of 38 models reject. If we were to extend the data sample to a 2018 end date, the sum would still be 24 and 26, respectively, for the global LT and MT layers, and would increase to 22 and 23 in the tropical LT and MT layers.
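The idea of testing whether a model trend exceeds an observed trend can be sketched by fitting a linear trend to the model-minus-observation difference series and using an autocorrelation-robust standard error. The sketch below deliberately swaps in a Newey-West (Bartlett kernel) estimator because it is short and standard; it is not the Vogelsang-Franses procedure used in the paper, and the 41.53 critical value quoted above does not apply to its output. The function name `trend_with_hac` and the default `max_lag=3` are illustrative assumptions.

```python
import numpy as np

def trend_with_hac(series, max_lag=3):
    """OLS trend of an annual series with a Newey-West standard error.

    Returns (trend per decade, standard error per decade, t-ratio).
    """
    y = np.asarray(series, dtype=float)
    n = y.size
    x = np.column_stack([np.ones(n), np.arange(n, dtype=float)])
    xtx_inv = np.linalg.inv(x.T @ x)
    beta = xtx_inv @ x.T @ y
    u = y - x @ beta

    # Long-run covariance of x_t * u_t with Bartlett weights.
    xu = x * u[:, None]
    s = xu.T @ xu
    for lag in range(1, max_lag + 1):
        w = 1.0 - lag / (max_lag + 1.0)
        gamma = xu[lag:].T @ xu[:-lag]
        s += w * (gamma + gamma.T)

    cov = xtx_inv @ s @ xtx_inv
    se = np.sqrt(cov[1, 1])
    # Annual data: multiply the per-year slope by 10 to get per-decade units.
    return float(10.0 * beta[1]), float(10.0 * se), float(beta[1] / se)
```

Applied to a difference series (model minus observations), a large positive t-ratio indicates that the model trend exceeds the observed trend; with only 36 annual values, normal critical values are at best a rough guide.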
An increasingly common form of model diagnostic involves examining what are called "emergent constraints" (Caldwell et al., 2018). ECS values across models vary widely but the correct value cannot be directly determined by measurement. The emergent constraint concept involves looking for observable features of the climate that have measurable counterparts in models that are correlated with the model ECS.
The observed measurement of the correlate will then indicate which model ECS values are more likely to be true. Various metrics have been proposed, such as the difference between tropical and Southern Hemisphere midlatitude total cloud fraction, tropical zonal-average LT relative humidity in the moist-convective region, model error in total cloud amount between 60°N/S, and the fraction of tropical clouds with tops below 850 mbar whose tops are also below 950 mbar (see list in Caldwell et al., 2018, Table 1). The correlations between the proposed metrics and ECS vary widely, and as noted in Caldwell et al. (2018), many do not have a valid physical underpinning. Since we are here analyzing model warming rates, which are directly connected to ECS, it is worthwhile examining if an emergent constraint interpretation can be applied to our results.

Earth and Space Science, doi:10.1029/2020EA001281
The correlations between ECS and trend terms are as follows: LT-global 0.67, MT-global 0.60, LT-tropics 0.50, and MT-tropics 0.50. Hence, the models with low ECS values tend to have lower tropospheric trends, thus closer to observed values, and therefore are more likely to be realistic. Figure 4 provides more insight into the data. The models cluster into two distinct groups based on whether the ECS is above (red squares) or below (blue circles) 3.4 K. A solid square or circle indicates the trend is from the LT, and an open shape indicates MT. The mean values in each cluster for both the LT and MT layers are indicated by + signs, and the layer averages are joined by the gray lines (dashed-MT, solid-LT) which represent the emergent constraint.
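The correlation and clustering described in this paragraph can be reproduced in outline as follows. The function name `cluster_summary` is hypothetical, and the ECS and trend arrays in the usage lines are made-up placeholders chosen only to exercise the code; the actual model values are those listed in the paper's tables.

```python
import numpy as np

def cluster_summary(ecs, trend, threshold=3.4):
    """Pearson correlation of ECS with trend, plus mean (ECS, trend) in each ECS cluster."""
    ecs = np.asarray(ecs, dtype=float)
    trend = np.asarray(trend, dtype=float)
    high = ecs > threshold
    return {
        "correlation": float(np.corrcoef(ecs, trend)[0, 1]),
        "low_mean": (float(ecs[~high].mean()), float(trend[~high].mean())),
        "high_mean": (float(ecs[high].mean()), float(trend[high].mean())),
    }

# PLACEHOLDER values for illustration only, not data from the study.
placeholder_ecs = [2.0, 3.0, 4.0, 5.0]        # K
placeholder_trend = [0.20, 0.22, 0.27, 0.29]  # degrees C per decade
summary = cluster_summary(placeholder_ecs, placeholder_trend)
```

Joining `low_mean` and `high_mean` with a straight line gives the two-point relationship that the text refers to as the emergent constraint.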

Within clusters, ECS and warming trend values are not correlated, but as is indicated by the gray lines the correlation emerges when comparing between low and high clusters. In the high group the overall mean trend is 0.28°C/decade and the mean ECS is 4.67 K. In the low group the overall mean trend is 0.21°C/decade and the mean ECS is 2.76 K. The mean observed trends in the LT and MT layers across all measurement types are indicated by the arrows along the horizontal axis (LT solid 0.15°C/decade, MT open 0.09°C/decade). Since the mean trends even in the low-ECS model group are still too high compared to the observed trends, the emergent constraint implies a need to extrapolate into even lower ECS levels to approximately match observations. Examining where the dotted lines cross the arrows informally indicates how far such extrapolation would need to go; however, as drawn this would imply ECS values well below 1.0 K. Since a curve of any shape can be fitted between two points, one could equally use concave lines which would still imply ECS values below 2.0 K in order to have associated warming trends consistent with observations.
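The linear extrapolation discussed above amounts to drawing a straight line through the two cluster means and reading off the ECS at an observed trend. The sketch below uses only the overall cluster means quoted in the text; the figure draws separate lines for the LT and MT layers using layer-specific cluster means that are not given in the text, so the numbers this produces are not expected to match the figure. The function name `ecs_on_line` is hypothetical.

```python
# (mean ECS in K, mean trend in degrees C per decade), as quoted in the text.
LOW_CLUSTER = (2.76, 0.21)
HIGH_CLUSTER = (4.67, 0.28)

def ecs_on_line(trend, low=LOW_CLUSTER, high=HIGH_CLUSTER):
    """ECS on the straight line through the two cluster means at a given trend."""
    slope = (high[0] - low[0]) / (high[1] - low[1])  # K per (degrees C/decade)
    return low[0] + slope * (trend - low[1])

# Mean observed trends quoted in the text for the LT and MT layers.
for label, observed in (("LT", 0.15), ("MT", 0.09)):
    print(label, round(ecs_on_line(observed), 2))
```

Because only two points define the line, the result is highly sensitive to the cluster means and to the assumed linear shape, which is the caveat the text raises about fitting curves between two points.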

Conclusions
The literature drawing attention to an upward bias in climate model warming responses in the tropical troposphere extends back at least 15 years now (Karl et al., 2006). Rather than being resolved, the problem has become worse, since now every member of the CMIP6 generation of climate models exhibits an upward bias in the entire global troposphere as well as in the tropics. The models with lower ECS values have warming rates somewhat closer to observed but are still significantly biased upward and do not overlap observations. Models with higher ECS values also have higher tropospheric warming rates, and applying the emergent constraint concept implies that an ensemble of models with warming rates consistent with observations would likely have to have ECS values at or below the bottom of the CMIP6 range. Our findings mirror recent evidence from inspection of CMIP6 ECSs (Voosen, 2019) and paleoclimate simulations (Zhu et al., 2020), which also reveal a systematic warm bias in the latest generation of climate models.

Data Availability Statement
The data used in this study are available at https://data.mendeley.com/datasets/sd97vh79v8/1