Crossbreeding CMIP6 Earth System Models With an Emulator for Regionally Optimized Land Temperature Projections

The newest generation of the Coupled Model Intercomparison Project (CMIP6) exhibits a larger spread in temperature projections at the end of the 21st century than the previous generation. Here, a modular Earth System Model emulator is used to evaluate the realism of the warming signal in CMIP6 models on both global and regional scales, by comparing their global trends and regional response parameters to observations. Subsequently, the emulator is employed to derive large “crossbred” multimodel initial‐condition ensembles of regionally optimized land temperature projections by combining observationally constrained global mean temperature trend trajectories with observationally constrained local parameters. In the optimized ensembles, the warmest temperature projections are generally reduced and for the coolest projections both higher and lower values are found, depending on the region. The median shows less changes in large parts of the globe. These regional differences highlight the importance of a geographically explicit evaluation of Earth System Model projections.


Introduction
The multimodel Earth System Model (ESM) ensembles of the Coupled Model Intercomparison Project (CMIP Eyring et al., 2016;Meehl et al., 2007;Taylor et al., 2012) play a crucial role in informing global and regional assessments of future climate change. Currently, ESM runs of the sixth CMIP phase (CMIP6 Eyring et al., 2016) are being produced and will serve as basis for the sixth Assessment Report (AR6) of the Intergovernmental Panel on Climate Change (IPCC). The runs available so far exhibit a large spread in land temperature projections at the end of the 21st century for a high-end emission scenario (SSP5-8.5 Riahi et al., 2017) both at global and regional scales (Figure 1). In fact, their spread even exceeds the full multimodel ensemble spread of the experiments of the fifth CMIP phase (CMIP5 Taylor et al., 2012) under a similar emission scenario (RCP8.5 Riahi et al., 2011).
In this study, we evaluate the realism of the past warming signal in individual CMIP6 models by comparing them to observations on global and regional scales. We subsequently derive large ensembles of regionally optimized and observationally constrained land temperature projections using a modular ESM emulator.

Data
Runs from 27 CMIP6 ESMs that span the historical time period  and the SSP5-8.5 emission scenario (2015-2100) are considered (listed in supporting information Table S1 and processed according to Brunner et al., 2020). To account for the largest ESM interdependencies (Abramowitz et al., 2019;Knutti, 2010) in terms of underlying modeling core, ESMs are grouped into "ESM families" according to their name. The 27 CMIP6 models considered in this study can be assigned to 19 ESM families (Table S1). Additionally, we use CMIP5 projections from 40 climate models over the same 1870-2100 period based on simulations for the historical period  and the RCP8.5 emission scenario   (Table S2). Furthermore, two spatially infilled, gridded observational products are employed to evaluate the realism of the ESM simulations while accounting for observational uncertainty (Cowtan and Way (COWTAN) and Berkeley Earth Surface Temperature (BEST) Cowtan & Way, 2014;Rohde et al., 2013). Observations are used from 1950-2018. Both observations and ESM runs were interpolated to a common 2.5 × 2.5 • grid. Yearly temperature anomalies with respect to a baseline of 1951-1980 are worked with and referred to as "temperature" for simplicity.
BEST and COWTAN consist of blended temperatures, meaning they report 2-m air temperature anomalies above land and sea ice and sea surface temperature (SST) anomalies above open ocean. SSTs warm systematically more slowly than air temperatures (Cowtan et al., 2015), and thus, blended temperatures are also employed for the ESMs to allow for a like-with-like comparison. To obtain blended temperatures from ESMs, 2-m air temperature anomalies, SST anomalies, sea ice concentrations, and land-sea mask fields are required. Note that we follow the BEST approach for blending ESM temperatures, which differs slightly from the COWTAN approach in the treatment of sea ice concentrations (Cowtan et al., 2015). In BEST, sea ice concentrations vary over time, whereas in COWTAN, they are fixed at climatological values.
For regional analyses, 26 SREX land regions (Seneviratne et al., 2012) and the Global Land region, which consist of all land grid points excluding Antarctica, are employed. A map with the SREX regions and their abbreviations is provided in the supporting information ( Figure S1). To account for the fact that grid cell area depends on latitude and that coastal grid cells contain only a fraction of land cover, regional averages consist of area-and land-cover-weighted means.
Throughout the manuscript, a variety of variables, subscripts, and superscripts are employed. An overview is provided in supporting information Table S3.

ESM Emulator
MESMER, the Modular Earth System Model Emulator with spatially Resolved output employed in this study (Beusch et al., 2020), produces large, ESM-specific ensembles of yearly land temperature field time series which closely resemble initial-condition ensembles (e.g., Deser et al., 2012;Maher et al., 2019) at grid point to regional scales. Individual realizations of MESMER mimic single members of a true ESM initial-condition ensemble at a negligible computational cost and are referred to as emulations (Beusch et al., 2020 , accounts for temporally correlated, internally generated variability and is described as an autoregressive (AR) process. var m,s is used to regress this global variability to the local scale and int m,s denotes the intercept term. Finally, m,s,t is represented by an AR process with spatially correlated innovations. A detailed description of the individual emulator components can be found in Beusch et al. (2020).
MESMER is trained for each of the 27 CMIP6 ESMs separately using data from 1870-2100 and employing all available initial-condition ensemble members. With the thereby derived ESM-specific emulator parameters, 1,000 emulations are generated for each ESM according to whereby the subscript e m refers to the ESM-specific emulations. The resulting "superensemble" contains 27,000 emulations and approximates a multimodel initial-condition ensemble. In the following, we refer to this ensemble as the "raw superensemble".
Please note that in Beusch et al. (2020), T m,s,t referred to 2-m air temperature, whereas in this study, it refers to blended temperature. This does not affect land temperatures but leads to different T glob m,t and coastal grid cell values, as SSTs warm more slowly than 2-m air temperatures. Since the two T glob m,t quantities are directly linearly related (see supporting information Figure S2), MESMER can be trained with either one, leading to slightly different trend m,s values without impacting the results otherwise.

Emulator-Based ESM Evaluation
In general, local climate change can be characterized by global mean temperature increase (IPCC, 2013) alongside a linear scaling of local temperatures with global mean temperature (Seneviratne et al., 2016). Hence, we evaluate ESM performance by comparing the ESM runs to observations for these two features. Because natural climate variability affects both features, a direct comparison of observations, that is, the single realization of the real-world climate, and individual ESM runs, that is, single realizations of the ESM climate, is not possible. However, the ESM-specific raw emulations can be used to enable the comparison between ESMs and observations, since the emulations aim to approximate the true distribution of these two features the ESM would generate if it were run many times (Beusch et al., 2020). If the raw emulations perfectly approximated their ESM, the features of the ESM runs would fall within the 2.5th to 97.5th percentiles of the features of the raw emulations in 95% of all cases. In this study, the ESM is regarded as consistent with its raw emulations for a given feature, if the feature of at least one ESM run falls within these percentiles. If the feature of at least one observational product falls within these percentiles too, the ESM is additionally considered consistent with observations for this feature.

Global Mean Temperature Trend
During the last decades, observed global mean temperature T glob o,t has been linearly increasing with time (Cowtan & Way, 2014;Rohde et al., 2013). Here, we evaluate whether individual ESMs are able to reproduce the observed trend. For this purpose, we compute linear OLS trends in time for each observational product (T glob o,t ), for each ESM run (T glob m,t ), and for each raw emulation (T glob e m ,t ) individually. We consider trends of varying length starting in 1975, 1980, 1985, 1990, 1995, and 2000 with the final year always being 2018.
Summarized results are provided in Figure 2, and the results for each time period are shown in supporting information Figure S3. Five ESMs from five ESM families are not consistent with the observed trend in any time period. All of them have been warming more rapidly than observations and continue to warm rapidly in the future. This finding is consistent with recent papers on individual high-end warming models (Sellar et al., 2019;Swart et al., 2019) and the CMIP6 ensemble (Forster et al., 2019;Tokarska et al., 2020). Eight models from six ESM families are consistent with observations in each time period (Figure 2). Nevertheless substantial intermodel uncertainty remains within these best performing models with their T glob,trend m,t diverging by 2 • C at the end of the 21st century ( Figure 2a). Interestingly, their trajectories cluster into two distinct groups. We hypothesize that the models within each group share similar global feedbacks and heat uptake efficiencies. This is also consistent with results of Tokarska et al. (2020) who show that the same subsets of models additionally share similarities in terms of transient climate response and equilibirum climate sensitivity.

Regionally Averaged Local Response
To evaluate the local grid point-level response to global mean temperature trend in ESMs, the emulator framework is applied again but this time different data are used to calibrate MESMER. Specifically, while MESMER has so far been trained for each ESM individually on the full ESM time period (1870-2100) by employing all initial-condition ensemble members (section 3), it is now trained on a single run and a set of shorter time periods instead. This setup allows for MESMER to also be calibrated with observations. The employed periods are 1950-2018, 1960-2018, and 1970-2018. The chosen periods result from a trade-off between maximizing the number of years available for training and maximizing the quality and quantity of the raw data in the observational products. For each period, MESMER is trained on each observational product, ESM run, and raw emulation individually according to equation (1) The local response performance is evaluated on a regional level by averaging the grid point-level parameters ( trend x st ,s ) to regional parameters (  (Figure 2). This test had to be carried out at least once in 41 of the total 729 model-region combinations, and thus, the result only marginally affects the outcome of the overall analysis.
In most regions, many of the 27 ESMs are consistent with observations in all evaluation time periods (Figure 2). Only in sixth out of 27 regions less than 50 % of the ESMs are consistent with observations in all time periods. In these regions, the number of ESMs that are not consistent with observations during any time period lies between five from five different ESM families (Sahara, SAH) and twelve from ten different ESM families (North Asia, NAS). Overall, the performance of the ESMs appears higher on regional scales than on the global scale. Note that this could reflect the fact that the larger interannual variability of the climate signal at regional scales leads to a lower signal-to-noise ratio and thus a wider range of values, which are compatible with observations.

Crossbreeding to Avoid Loss of Information in Observationally Constrained ESM Ensembles
Since ESM biases remain largely stationary across different climate states (Krinner & Flanner, 2018), ESM evaluation results from the observational time period can be used to select a set of well-performing ESMs for future projections. However, our analysis revealed that there is no direct relation between ESM performance for global trend (here, T glob,trend m,t ) and regional response (here, reg,trend m,s ) within individual ESMs, although the very best ESMs in terms of global trend frequently also perform well at regional scales ( Figure 2).
The most naive approach to obtain projections that are consistent with observations would be to only consider models, which perform well with respect to all metrics. In that case, future projections would be based solely on MIROC6 and MPI-ESM1-2-LR, the only ESMs passing each test employed in this study ( Figure 2). Naturally, this would constitute a large loss of information and a severe underestimation of intermodel uncertainty since many other ESMs also perform well in global trend and/or regionally averaged local response.
Alternatively, region-specific projections could be carried out by considering all ESMs which perform well in terms of both T glob,trend m,t and reg,trend m,s in the region at hand (Figure 2). This would nevertheless disregard valuable ESM information, since ESMs which do not perform well in T glob,trend m,t could not be employed for any regional projections.
Here, we propose a novel "crossbreeding" emulation approach to make full use of all ESMs that are consistent with observations at different scales. For this purpose, the equation to obtain raw emulations is reconsidered (equation (2)). To generate "crossbred" emulations, the global mean temperature trend of one ESM j, T glob,trend m ,t , is combined with the local features of another ESM i such that In each region, index i runs across all ESMs whose local features are selected in this region and index j across all ESMs whose global mean trend is selected. New crossbred emulations are generated by repeating each combination of i and j an equal number of times until the total number of crossbred emulations is as close to the total number of raw emulations (27,000) as achievable without exceeding it.
In the following, the results for two selection criteria are presented. The first criterion is referred to as "conservative removal" and excludes as few ESMs as possible by selecting each ESM which is consistent with observations in at least one of the considered time periods (Figure 2, gray ESMs are excluded). The second criterion is referred to as "best estimate" and solely selects ESMs, which are consistent with their emulations and observations in each of the considered time periods and which are additionally part of different ESM families (Figure 2, dark blue ESMs are selected). Thereby, the ESM providing the most initial-condition ensemble members of each family is selected such that future projections are based on the most robustly calibrated emulators. The resulting regionally optimized, observationally constrained, emulated ensembles are henceforth referred to as "crossbred superensembles".

Temperature Projections: From the Original CMIP6 Multimodel Ensemble to Crossbred Superensembles
Regionally averaged time series highlight the added value of the raw superensemble compared to the actual CMIP6 ESM projections since the emulated ensemble samples the temperature projections phase space more thoroughly with 1,000 computationally cheap emulations available for each ESM (Figure 3). A caveat of this raw superensemble, however, is that its projections are based on all 27 available CMIP6 ESMs without taking ESM consistency with observations into account.
To improve the consistency of emulated projections with observations, crossbred superensembles, such as the conservative removal and the best estimate ensemble, can be used (Figure 3). Comparing the superensembles highlights that the impact of the observationally guided crossbreeding differs between regions, quantiles, and ESM selection criteria. Only the 95th percentile is reduced in each region and for both selection criteria.
Because the superensembles contain several thousand emulations, as opposed to the actual CMIP6 ensemble which contains only a few runs per ESM, quantiles can be reliably estimated at each grid point for any given year (Figure 4). Overall, temperature projection uncertainty is large in both the raw and the observationally constrained crossbred superensembles in the year 2100. The conservative removal superensemble shows a consistently reduced 95th and 50th percentiles compared to the raw superensemble, with the magnitude of the reduction being largest in the 95th percentile, especially in the Amazon region and eastern Asia. The best estimate superensemble's 95th percentile, on the other hand, shows a weaker reduction in many regions and even an increase in temperature projections in parts of the world. Its 50th percentile is generally similar to the raw superensemble. The spatial patterns of the changes in the 5th percentile of the two crossbred superensembles tend to be coherent but the best estimate superensemble is warmer.

Discussion and Conclusions
This study proposes a novel emulation approach to derive large ensembles of regionally optimized land temperature projections by combining global-and local-scale features of different ESMs through "crossbreeding". For this purpose, global-and local-scale features of individual ESMs are evaluated with respect to observations to select ESM features, which are consistent with observations. Since local-scale ESM performance depends on the region at hand, the evaluation is carried out for different regions individually.  Maps of different temperature projection quantiles (5%, 50%, and 95%) in the raw, the conservative removal, and the best estimate superensembles for the year 2100. Additionally, the differences between the crossbred and the raw superensembles' quantiles are given ( (Crossbred T − Raw T) Raw T · 100). Note that the crossbred superensembles are optimized for each SREX region individually, which leads to discontinuities in the projections at the regions' borders if different models are selected for the local response in adjacent regions. For land areas not covered by any SREX region, the Global Land optimization results are used.
In each region, crossbred emulations are created by drawing multiple realizations from each combination of the selected global mean temperature trends and the local features. Both a stringent and a soft ESM selection criterion are applied.
The advantages of the resulting crossbred superensembles are manifold. First, by retaining features from the full suite of ESMs which are consistent with observations at global and regional scales, they waste no valuable ESM information. Second, they provide a more complete sampling of the observationally constrained inter-ESM temperature projection uncertainty phase space as they couple global and local features of different ESMs. Third, due to their computational efficiency, it is feasible to generate superensembles containing thousands of emulations, and thus, quantiles can be reliably estimated at the grid cell level for any given year.
Nevertheless, some important caveats remain. The MESMER emulator is a simple statistical tool, which is subject to strong assumptions such as the linearity of the dependence of the local response on global mean temperature, the additivity of the local response and the variability around it, and the stationarity in time of the emulator parameters. Additionally, the only external forcing for local temperature in MESMER is global mean temperature. Other factors, which could be locally relevant too, such as human-induced aerosol emissions and land use forcing, are not considered. For a more complete discussion of the underlying assumptions of MESMER and their justification the reader is referred to Beusch et al. (2020). The caveats of the employed ESM selection approach consist of multiple testing, limited emulations' sample size to estimate parameter distributions, and favoring ESMs with larger variability since they are more likely to reproduce the observed warming signal by chance. Lastly, also the treatment of inter-ESM dependencies (Abramowitz et al., 2019;Knutti, 2010) could be improved. Thus, while this study conceptually illustrates the possibility of crossbreeding ESM features with an emulator, further research is needed to fully exploit the capabilities of the resulting crossbred superensembles, for example, by improving the capacities of the considered emulator or by advancing model selection methodologies.
When applying the crossbreeding approach to CMIP6, the impacts are spatially diverse. Generally, the warmest temperature projections at the end of the 21st century are reduced whereas the median is less affected. For the coolest projections, complex patterns are observed with both increased and decreased projections depending on the region. Naturally, also the ESM selection criterion affects the projection patterns. Additionally, it should be noted that with CMIP6 still growing, these results cannot be regarded as final. Instead, they represent a snapshot based on the currently available CMIP6 climate model projections (status of April 2020).
In conclusion, this study introduces a novel emulation approach to recalibrate raw climate model projections based on historical performance in a modular way that retains the full suite of ESM features which are consistent with observations at global and regional scales. The thereby resulting superensembles of regionally optimized land temperature projections contain several thousand members and approximate recombined multimodel initial-condition ensembles at a negligible computational cost.