The Max Planck Institute Grand Ensemble: Enabling the Exploration of Climate System Variability
Abstract
The Max Planck Institute Grand Ensemble (MPI-GE) is the largest ensemble of a single comprehensive climate model currently available, with 100 members for the historical simulations (1850–2005) and four forcing scenarios. It is currently the only large ensemble available that includes scenario representative concentration pathway (RCP) 2.6 and a 1% CO2 scenario. These advantages make MPI-GE a powerful tool. We present an overview of MPI-GE, its components, and detail the experiments completed. We demonstrate how to separate the forced response from internal variability in a large ensemble. This separation allows the quantification of both the forced signal under climate change and the internal variability to unprecedented precision. We then demonstrate multiple ways to evaluate MPI-GE and put observations in the context of a large ensemble, including a novel approach for comparing model internal variability with estimated observed variability. Finally, we present four novel analyses, which can only be completed using a large ensemble. First, we address whether temperature and precipitation have a pathway dependence using the forcing scenarios. Second, the forced signal of the highly noisy atmospheric circulation is computed, and different drivers are identified to be important for the North Pacific and North Atlantic regions. Third, we use the ensemble dimension to investigate the time dependency of Atlantic Meridional Overturning Circulation variability changes under global warming. Last, sea level pressure is used as an example to demonstrate how MPI-GE can be utilized to estimate the ensemble size needed for a given scientific problem and provide insights for future ensemble projects.
Key Points
- The 100-member MPI-GE is currently the largest publicly available ensemble of a comprehensive climate model
- MPI-GE currently has the most forcing scenarios of all large ensemble projects: RCP2.6, RCP4.5, RCP8.5, and 1% CO2
- The power of MPI-GE is to estimate the forced response and internal variability, including changing variability, to unprecedented precision
1 Introduction
Internal variability and uncertainties in model physics and the future forcing all contribute to uncertainties in climate projections (Hawkins & Sutton, 2009). While multimodel ensembles such as the Coupled Model Intercomparison Project (CMIP; Taylor et al., 2012) can be used to effectively investigate the combined effect of all three in climate projections, it is difficult to separate internal variability from the forced response with a limited number of ensemble members of each single model. For single realizations, linear detrending is often used with the intention of removing the forced response and isolating the internal variability (e.g., Frankcombe et al., 2018). However, this then introduces biases in the amplitude and phase of internal variability, with more complicated scaling methods needed to better separate the two quantities (Frankcombe et al., 2015, 2018). Internal variability can be quantified using a long control simulation in the absence of external forcing (e.g., Thompson et al., 2015; Wittenberg et al., 2014). However, internal variability may itself be influenced by external forcing (e.g., Maher et al., 2015) in ways that are difficult to account for a priori. This means that a long control run cannot be used to address projections where the variability itself might change. A large ensemble of a single model can be used to estimate changes in variability in this model, uncertainties due to future forcing, and together with other model ensembles can be used to address uncertainties in model physics. The Max Planck Institute Grand Ensemble (MPI-GE) is currently the largest such ensemble and will be introduced in this paper.
Large-ensemble projects of comprehensive coupled climate models are gaining traction as methods to robustly estimate internal variability in transient simulations and to quantify the forced signal (e.g., Kay et al., 2015). The first large ensemble project was a 62-member simulation of Community Climate System Model 1.4 run for the period 1940–2080 (e.g., Branstator & Selten, 2009; Zelle et al., 2005). Three other large ensembles are currently publicly available. One is the Community Earth System Modelling Large Ensemble Project (LENS), which was run by National Center for Atmospheric Research (NCAR) for the period 1920–2100 and has 42 members of the historical simulation and representative concentration pathway (RCP) 8.5 scenario (CESM-LE) and 15 members of the RCP4.5 scenario (CESM-ME) (Kay et al., 2015; Sanderson et al., 2018). Another is the Geophysical Fluid Dynamics Laboratory large ensemble, which consists of 30 members from the RCP8.5 scenario (2006–2100; e.g., von Känel et al., 2017). The third is the Canadian Earth System Model Large Ensembles, which has 50 members of three single-forcing experiments run from 1950–2020 and 50 members of the historical simulation run from 1950–2005 and continued using the RCP8.5 scenario run from 2006–2100 (Kirchmeier-Young et al., 2017). Other modeling groups have also recently completed large ensembles; however, they are not yet publicly available (e.g., Frankignoul et al., 2017; Stolpe et al., 2018).
Large ensembles have been used extensively to investigate the internal variability of the climate system (e.g., Dai & Bloecker, 2019; Fasullo & Nerem, 2016; Frankignoul et al., 2017; Smith & Jahn, 2019) and extreme events (e.g., Diffenbaugh et al., 2015; Gibson et al., 2017; Kirchmeier-Young et al., 2017; Tebaldi & Wehner, 2018; Wang et al., 2018). They have also been used as a test bed for new methodologies such as creating an observational large ensemble (McKinnon et al., 2017; McKinnon & Deser, 2018), built by combining the forced response from CESM-LE and the estimated internal variability from observations. These ensembles have also been used as a test bed for dynamical adjustment, which can be used to remove the internal dynamical signal and consequently bring observations closer to the forced response (Deser et al., 2016; Lehner et al., 2017). Additionally, large ensembles have been used to inform and optimize observing systems, for example, for marine ecosystem drivers such as ocean acidification (Rodgers et al., 2015). The previously available ensembles and the work associated with them have provided a treasure trove of information, but more large ensembles are still useful, particularly to investigate the robustness of simulated internal variability and the forced response. A very large ensemble offers additional benefits, such as the ability to investigate extreme events and to compute the forced signal for highly variable quantities.
MPI-GE is the largest ensemble of simulations for any given scenario currently available, with 100 members for the historical simulation and each of four forcing scenarios. It is currently the only large ensemble available where three future scenarios (RCP2.6, RCP4.5, and RCP8.5), each consisting of 100 members, can be compared. It additionally enables studies of the targets set by the Paris Agreement using the RCP2.6 scenario. MPI-GE has the advantage that it is initialized by sampling the preindustrial control state, which effectively samples the full phase space of both the ocean and atmosphere states. This method has been shown to produce a larger spread than the atmospheric perturbation method and hence a better sample of the full range of variability from the beginning of the simulation (Hawkins et al., 2016). Because the ensemble is initialized in 1850, it allows investigation of the late nineteenth century and the early twentieth century warming, something that is not possible with the other existing ensembles due to their later start dates. Being able to contrast different warming states provides useful constraints on the behavior of the climate system and the magnitude of different forcings (Stevens, 2015). The initialization method is particularly important for investigating the variability of quantities that have been shown to have longer times of divergence when initialized with atmospheric perturbations, such as regional temperature and precipitation trends (Hawkins et al., 2016), the Atlantic Meridional Overturning Circulation (AMOC), and 2,000-m ocean heat content (Marotzke, 2019). Overall, these advantages make MPI-GE very powerful.
The utility of MPI-GE itself has also been demonstrated in previous studies (Bittner et al., 2016; Bengtsson & Hodges, 2018; Dessler et al., 2018; Hedemann et al., 2017; Li & Ilyina, 2018; Manzini et al., 2018; Maher et al., 2018; Marotzke, 2019; Niederdrenk & Notz, 2018; Plesca et al., 2018; Rädel et al., 2016; Stevens, 2015; Suárez-Gutiérrez et al., 2017, 2018; Zhang et al., 2018). Some high-profile examples include investigations of the 1998–2012 surface warming hiatus and of extreme events. Hedemann et al. (2017) used MPI-GE to investigate the hiatus. Whereas most studies suggested that ocean heat uptake caused the hiatus, Hedemann et al. (2017) found that energy radiated upward from the surface could have caused the hiatus as well and that observational uncertainty is too large for us to know which explanation is correct. Studies investigating extreme events also benefit from the large ensemble size of MPI-GE. Suárez-Gutiérrez et al. (2018) investigated European summer temperature extremes using MPI-GE. They found that in a climate that warms globally by 2 °C, the European summer extremes are 1 °C warmer than in a climate that warms by 1.5 °C. They also found that the 2003 heatwave has a 1 in 2,000 chance of happening under preindustrial conditions, while it occurs every other year under 1.5 °C warming. With 100 members, a 1-in-100-year extreme event is simulated somewhere in the ensemble every year on average, and the ensemble further allows the simulation and characterization of large samples of extreme events with return periods of hundreds of years. The use of MPI-GE allowed Suárez-Gutiérrez et al. (2018) to investigate events with return periods of up to 500 years, without explicitly parametrizing the tails of the distributions using extreme value statistics.
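The return-period argument above can be illustrated with a minimal sketch. It is not the procedure of the cited studies: it simply pools all member-years of a window in which the climate is assumed to be stationary and counts exceedances. The function name, the synthetic Gaussian data, and the 20-year window are assumptions made for illustration.

```python
import numpy as np

def empirical_return_period(samples, threshold):
    """Return period (in years) of exceeding `threshold`, pooling all member-years.

    samples: array of shape (n_members, n_years), one value per member and year,
             assumed to come from a stationary climate within the window.
    """
    samples = np.asarray(samples, dtype=float)
    n_exceed = np.count_nonzero(samples > threshold)
    if n_exceed == 0:
        # Event never sampled: the return period exceeds the pooled sample size.
        return np.inf
    return samples.size / n_exceed

# Synthetic stand-in for model output (assumption): 100 members x 20 years.
rng = np.random.default_rng(0)
demo = rng.normal(size=(100, 20))
# Threshold set to the empirical 99.8th percentile of the 2,000 pooled values.
threshold = np.quantile(demo, 0.998)
period = empirical_return_period(demo, threshold)
```

With 100 members and a 20-year window, the pooled sample holds 2,000 values, so events with return periods of several hundred years are still represented by a handful of occurrences rather than by an extrapolated tail.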
Internal variability and the forced response can be separated with high precision using MPI-GE. MPI-GE has previously been used to show that the observed negative decadal trend in the ocean carbon sink in the 1990s can be attributed to internal variability (Li & Ilyina, 2018). Li and Ilyina (2018) also indicate that, in the presence of large internal variability, the emergence time of the forced response in the ocean carbon sink is beyond a decade. Forced temperature trends in the upper tropical troposphere are larger in most models than in observations. Suárez-Gutiérrez et al. (2017) used MPI-GE to show that most of these differences can be explained by internal variability alone. This indicates that differences between models and observations may also be misinterpreted in the absence of large ensembles, such as MPI-GE.
Given its large size, MPI-GE can also be used to address the question of how many ensemble members of a single model are needed to address a given problem. While Daron and Stainforth (2013) suggested that ensembles of several hundred members may be required to characterize a model's climate, Drótos et al. (2017) suggest that 100 members may be sufficient for analyzing the forced response. Maher et al. (2018) found that 30–40 members are needed to robustly estimate ENSO variability in MPI-GE. Olonscheck and Notz (2017) used CMIP5 and MPI-GE to suggest that when investigating sea ice variability, multiple small ensembles of coupled climate models are of more use than either a large ensemble of a single model or a multimodel ensemble of single realizations. Other studies have looked at how many members are needed to detect forced changes in specific variables. Li and Ilyina (2018) found that up to 79 ensemble members are needed to detect the forced decadal trends in the carbon sink under the RCP4.5 forcing regime, with the largest number of members needed in the Southern Ocean. Bittner et al. (2016) investigated how many ensemble members are needed to identify a forced change in the northern hemisphere polar vortex in the winter after the Pinatubo eruption. They found that 7 to 40 members are needed, depending on the latitude considered. Another example based on sea level pressure (SLP) trends is used in this paper to demonstrate how MPI-GE can be used to determine the ensemble size needed.
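One generic way to ask how many members are needed is to resample the full ensemble. The sketch below is an illustration only and is not the method used in the studies cited above or later in this paper: it draws random subsets of members, compares the subset mean with the full-ensemble mean, and reports the smallest subset size whose root-mean-square error falls below a chosen tolerance. The function names, the tolerance, and the synthetic data are assumptions.

```python
import numpy as np

def subsample_error(values, size, n_draws=2000, seed=0):
    """RMS error of the subsample mean relative to the full-ensemble mean.

    values: 1-D array with one value per ensemble member (e.g., a regional trend).
    size:   number of members drawn without replacement for each subsample.
    """
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    full_mean = values.mean()
    errors = np.empty(n_draws)
    for k in range(n_draws):
        subset = rng.choice(values, size=size, replace=False)
        errors[k] = subset.mean() - full_mean
    return np.sqrt(np.mean(errors ** 2))

def members_needed(values, tolerance, **kwargs):
    """Smallest subsample size whose RMS error is at most `tolerance`."""
    for size in range(1, len(values) + 1):
        if subsample_error(values, size, **kwargs) <= tolerance:
            return size
    return None  # only reached if the tolerance is below rounding error

# Synthetic stand-in for 100 member trends (assumption, not MPI-GE data).
trends = np.random.default_rng(1).normal(size=100)
```

Because the reference is the full-ensemble mean, this estimate can only be trusted for subset sizes well below the full ensemble size; the error shrinks to zero by construction as the subset approaches all members.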
The purpose of this paper is twofold. The first is to present MPI-GE to the wider community, the second to further demonstrate the usefulness of this 100-member ensemble by presenting a variety of examples and some novel analyses. In section 2, MPI-GE is presented, and the ensemble simulations are described. In section 3, we use MPI-GE to investigate specific quantities and how they evolve in time in different scenarios. In section 4, we demonstrate how to compare MPI-GE to observations and show a novel approach for evaluating the model internal variability. In section 5, we show four examples of scientific problems that can be best investigated with a large ensemble of a single climate model. The first investigates whether temperature and precipitation behave similarly at the same warming levels in the different forcing scenarios. The second demonstrates the quantification of the forced signals in the northern hemisphere atmospheric circulation when there is strong variability. The third investigates changes in variability itself, using the AMOC as an example. The fourth demonstrates how MPI-GE can be used to determine the ensemble size needed for a specific quantity, in this case for projected trends in SLP. Finally, we discuss the use of the MPI-GE in the context of current climate modeling in the scientific community.
2 MPI-GE
2.1 The Model
MPI-ESM is described by Giorgetta et al. (2013). MPI-GE uses MPI-ESM1.1 (version MPI-ESM 1.1.00p2), is run in low-resolution configuration, and consists of the following components. The ocean component is MPIOM (version mpiom-1.6.1p1; Marsland et al., 2003), run on the GR15L40 grid. The ocean biogeochemistry model is HAMOCC5.2 and is run as described by Ilyina et al. (2013). ECHAM (version echam-6.3.01p3; Stevens et al., 2013) provides the atmosphere component, run in a T63L47 configuration. The land component is the JSBACH model (version jsbach-3.00) including dynamic vegetation and land use transitions with the standard fire module (Reick et al., 2013). This model configuration has an atmosphere of approximately 1.8° and an ocean of approximately 1.5° resolution, although the ocean resolution increases closer to the poles in the grid.
MPI-ESM1.1 has some similarities to MPI-ESM used in CMIP5 (Giorgetta et al., 2013), but overall behaves closer to MPI-ESM1.2 (Mauritsen et al., 2019), which is used in CMIP6. The ocean component is very similar to the CMIP5 version, with some minor differences. The atmosphere is based on ECHAM6.3, rather than ECHAM6.1 as was used in CMIP5. MPI-ESM1.1 and MPI-ESM1.2 have specifically tuned cloud feedbacks to better match the historical warming. The equilibrium climate sensitivity is hence lowered from 3.4 K in CMIP5 to 2.8 K in MPI-GE, as calculated using linear extrapolation from 150 years of an abrupt 4×CO2 experiment (Andrews et al., 2012). HAMOCC is run in the same configuration as CMIP5, and JSBACH is the CMIP5 version of the model component, but now includes the soil carbon model YASSO (Goll et al., 2015) and a five-layer soil hydrology scheme (Hagemann & Stacke, 2014).
2.2 Initialization and Forcing
MPI-GE follows the protocol of the CMIP5 simulations (Taylor et al., 2012). The historical and idealized forcing simulations are branched from different years of the preindustrial control simulation after it has reached a state of quasi-stationarity. Both historical and 1% CO2 simulations are initialized from the state on the first of January in different years of the control simulation (Table 1) and thus sample differences in the possible state of the atmosphere, land, and ocean assuming a stationary and volcano-free 1850 climate.
Ensemble member | Branch time | Ensemble member | Branch time |
---|---|---|---|
1 | 1898 | 51 | 3164 |
2 | 1946 | 52 | 3188 |
3 | 1994 | 53 | 3212 |
4 | 2042 | 54 | 3236 |
5 | 2090 | 55 | 3260 |
6 | 2138 | 56 | 3284 |
7 | 2186 | 57 | 3308 |
8 | 2234 | 58 | 3332 |
9 | 2282 | 59 | 3356 |
10 | 2330 | 60 | 3380 |
11 | 2378 | 61 | 3404 |
12 | 2426 | 62 | 3428 |
13 | 2474 | 63 | 3452 |
14 | 2522 | 64 | 3476 |
15 | 2570 | 65 | 3500 |
16 | 2618 | 66 | 3524 |
17 | 2666 | 67 | 2906 |
18 | 2714 | 68 | 2930 |
19 | 2762 | 69 | 2954 |
20 | 2810 | 70 | 2978 |
21 | 1874 | 71 | 2822 |
22 | 1922 | 72 | 2846 |
23 | 1970 | 73 | 2870 |
24 | 2018 | 74 | 2894 |
25 | 2066 | 75 | 2918 |
26 | 2114 | 76 | 2942 |
27 | 2162 | 77 | 2966 |
28 | 2210 | 78 | 2990 |
29 | 2258 | 79 | 3014 |
30 | 2306 | 80 | 3038 |
31 | 2354 | 81 | 3062 |
32 | 2402 | 82 | 3086 |
33 | 2450 | 83 | 3110 |
34 | 2498 | 84 | 3134 |
35 | 2546 | 85 | 3158 |
36 | 2594 | 86 | 3182 |
37 | 2642 | 87 | 3206 |
38 | 2690 | 88 | 3230 |
39 | 2738 | 89 | 3254 |
40 | 2786 | 90 | 3278 |
41 | 2834 | 91 | 3302 |
42 | 2882 | 92 | 3326 |
43 | 2858 | 93 | 3350 |
44 | 3006 | 94 | 3374 |
45 | 3020 | 95 | 3398 |
46 | 3044 | 96 | 3422 |
47 | 3068 | 97 | 3446 |
48 | 3092 | 98 | 3470 |
49 | 3116 | 99 | 3494 |
50 | 3140 | 100 | 3518 |
2.3 Simulations and Data Availability
Monthly mean data are available for all components except the ocean biogeochemistry. Ocean biogeochemical data are available as monthly mean surface variables and annual mean three-dimensional variables, except for the RCP8.5 scenario where the three-dimensional variables are also available as monthly means. The deep ocean biogeochemistry variables in the first 500 years of the preindustrial control simulation are prone to model drift. Hence, in ensemble members branched from an early state of the control simulation (Table 1), this model drift could introduce spurious trends on top of the internal variability and forced response. However, the magnitude of the model drift is much smaller than both the internal variability and forced response, at least for the CO2 flux and subsurface (upper few hundred meters) dissolved oxygen concentrations. Additionally, we note that the carbon cycle in the land component is not fully equilibrated early in the preindustrial control. As such, for analysis of the carbon cycle on land and any variables affected by it, ensemble members 1–18 and 21–39 in both the historical and 1% CO2 scenario should not be used without drift removal. As with many other climate simulations, deep ocean temperature drift also occurs. Details on drift removal and the best current methods to perform these calculations can be found in Sen Gupta et al. (2013). We emphasize that for drift removal calculations using this method, smoothing must be applied to the control simulation. This is because subtracting the control directly from the forced simulations would initially artificially dampen the anomalies due to internal variability, making all members more similar and deflating the variability across the ensemble dimension. Later in the simulation, a lack of coherence between the preindustrial control and the ensemble members would artificially inflate the variability.
- Preindustrial control simulation (2,000 years);
- Historical (1850–2005);
- RCP2.6 (2006–2099);
- RCP4.5 (2006–2099);
- RCP8.5 (2006–2099); and
- 1% CO2 (150 years).
Details of how to download the data can be found on the website (https://www.mpimet.mpg.de/en/grand-ensemble).
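The drift removal discussed earlier in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact procedure of the cited reference: the control segment concurrent with a member (starting at its branch time) is smoothed with a low-order polynomial, and only that smooth drift, measured relative to its value at the branch time, is subtracted. The function name, the polynomial degree, and the choice of a polynomial as the smoother are assumptions.

```python
import numpy as np

def remove_drift(member, control, branch_index, degree=2):
    """Subtract a smoothed control-run drift from one forced ensemble member.

    member:       1-D annual time series of the forced simulation.
    control:      1-D annual time series of the preindustrial control run.
    branch_index: index in `control` at which the member was branched.
    degree:       degree of the smoothing polynomial (an assumption; any
                  low-pass filter could be used instead).
    """
    member = np.asarray(member, dtype=float)
    control = np.asarray(control, dtype=float)
    segment = control[branch_index:branch_index + member.size]
    if segment.size != member.size:
        raise ValueError("control run does not cover the full member period")
    t = np.arange(member.size)
    # Fit a smooth curve so that only the slow drift is removed, not the
    # control run's own year-to-year internal variability.
    coeffs = np.polynomial.polynomial.polyfit(t, segment, degree)
    drift = np.polynomial.polynomial.polyval(t, coeffs)
    # Remove the drift relative to its value at the branch time, which keeps
    # the member's initial level unchanged.
    return member - (drift - drift[0])
```

Subtracting the unsmoothed control segment instead would also remove the control's internal variability, which initially matches that of the member and would therefore distort the ensemble spread in the way described above.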
3 Quantifying the Transient Forced Response and Evolving Internal Variability



This method allows us to estimate transient internal variability. The spread of the realizations shown in Figure 1 demonstrates the variability of the ensemble around the forced response (Ft). While the parametric method of using the ensemble standard deviation is used in this paper, a non-parametric method or alternate parametric method to estimate variability can also be used in place of the standard deviation.
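As a minimal sketch of this separation, the code below assumes that the forced response at each time step is estimated by the ensemble mean and that internal variability is estimated by the standard deviation across members, consistent with the parametric method named above. The function name, the array layout, and the use of the sample standard deviation (ddof=1) are assumptions.

```python
import numpy as np

def forced_response_and_variability(ensemble):
    """Split an ensemble into a forced response and internal variability.

    ensemble: array of shape (n_members, n_times).
    Returns (forced, anomalies, spread):
      forced    - ensemble mean at each time step, shape (n_times,)
      anomalies - each member's deviation from the ensemble mean
      spread    - ensemble standard deviation at each time step
    """
    ensemble = np.asarray(ensemble, dtype=float)
    forced = ensemble.mean(axis=0)
    anomalies = ensemble - forced
    spread = ensemble.std(axis=0, ddof=1)  # sample standard deviation (assumption)
    return forced, anomalies, spread
```

Because the spread is computed separately at every time step, it can change over the course of a simulation, which is what allows changes in variability itself to be examined.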
The superposition of internal variability and externally forced changes in the climate due to external drivers such as anthropogenic emissions and volcanic eruptions is illustrated in Figure 1. In the historical simulation, the long-term trend of each quantity over time can be attributed to anthropogenic forcing (Bindoff et al., 2013), with an increase in global mean surface temperature (GMST) and net carbon dioxide (CO2) flux into the ocean, a small decrease in primary production in the ocean, and little change in precipitation. The forced response to external volcanic forcing is also clear, with decreases in GMST and precipitation, an increase in the CO2 flux, and little response in the ocean primary production seen just after large tropical eruptions (e.g., Segschneider et al., 2013).
In all four quantities, the 1% CO2 scenario quickly distinguishes itself from the historical simulation, with the strong forcing causing large changes in all quantities. Different quantities evolve in different ways. Precipitation demonstrates an almost linear increase over time and primary production an almost linear decrease, while GMST shows a slightly stronger increase at the end of the time series compared to the beginning. The ocean CO2 flux, however, shows the strongest increase right at the beginning of the 1% CO2 forcing scenario and begins to plateau near the end of the scenario as a consequence of ocean warming, increased thermal stratification, and a slower AMOC. There is also more internal variability shown in the CO2 flux at the end (ensemble mean variability ≈0.21 PgC/year in the last year), compared to the beginning of the 1% CO2 scenario (ensemble mean variability ≈0.15 PgC/year in the first year). When considering the strongest forcing, from the 1% CO2 scenario, we also find a decrease of variability of the primary production (from ≈1.35 to ≈1.1 PgC/year) and an increase in global mean precipitation variability (from ≈0.0085 to ≈0.0095 mm/day).
The role of the forced response and internal variability in determining how the future may look is demonstrated in the scenarios. For GMST, all forcing scenarios are distinct at the end of the century, with the highest emission scenario (RCP8.5) showing the most warming. This distinct response between the three scenarios is also the case for the CO2 flux into the ocean. However, we also see that a plateau in the increase in CO2 flux occurs in the strongest scenario (RCP8.5), and a decrease in the flux compared to the beginning of the forcing scenario occurs in both lower emission scenarios (RCP2.6 and RCP4.5).
Global mean precipitation increases with warming, but the spread of realizations overlaps in the two weaker scenarios until the end of the 21st century. Here the difference between the forced responses in RCP2.6 and RCP4.5 is smaller than the internal variability. This means that single realizations from both scenarios could show similar precipitation in any given year. This overlap is even larger in the ocean primary production, where single realizations from all three scenarios could exhibit the same primary production at the end of the century. We note that time averages may also be used to distinguish between the scenarios, but this strong overlap tells us that the primary production will take longer to distinguish itself between scenarios than other variables such as GMST. It is important for future projections to quantify both the forced response and the role of internal variability, because even if mitigation happens, the future (for example, under RCP2.6) could still look similar to a high-warming future, depending on the strength of the internal variability (Marotzke, 2019; Suárez-Gutiérrez et al., 2018).
We note that for all quantities presented in this paper, we rely on the ability of MPI-GE to adequately represent the real world, but multiple models are needed to assess uncertainties due to model differences. The superposition of internal variability and the forced response in a single climate realization can cause confusion as to whether trends seen in one realization or an observational change are due to internal variability or can be characterized as a forced response (Hasselmann, 1976; Hawkins & Sutton, 2009). The use of MPI-GE allows us to disentangle these quantities in the model. In the following section, we demonstrate how to evaluate the performance of MPI-GE with observations for the current climate.
4 Comparison to Observations
In this section we use observations and MPI-GE to both evaluate the model and to interpret observations in the context of the model's internal variability. It is not appropriate to expect observations to match either a single realization or the ensemble mean; however, observations can be put in the context of the model's ensemble mean and variability.
Three methods of evaluating the ensemble are demonstrated. The first method is ideal for quantities that have good observational coverage over a long time span. This method puts the observations in the context of the transient model spread in time (e.g., Bengtsson & Hodges, 2018; Marotzke & Forster, 2015; Risbey et al., 2014) and additionally uses a rank histogram to evaluate the spread in the ensemble dimension (e.g., Marotzke & Forster, 2015). The second method can be used for observations where there is high quality data available but only for a short period of time. Here we compare a single observational estimate with a histogram of the ensemble spread (e.g., Bengtsson & Hodges, 2018; Flato et al., 2013). The third method provides a novel way to assess the agreement of the model's internal variability and forced response with observations on a global map.
To demonstrate the first method, we use GMST. Observed GMST largely falls within the MPI-GE spread, with some observational values sitting on the edges of the model spread (Figure 2a). Due to internal variability, we expect observational GMST to occur everywhere across the ensemble with uniform frequency. We also expect observational GMST to occasionally sit outside of the ensemble by chance. To test this, we compute a rank histogram to evaluate the model (Figure 2b). The rank histogram indicates how frequently observations occur across the ensemble. For each year, the rank represents the place that the observations would take in a list of ensemble members ordered by ascending GMST values. If the observed value is smaller than all ensemble members, the rank is 1. If the observed value is higher than all ensemble members, the rank is n+1, where n is the number of members (here 100). For a large enough record, one would expect that, if variability is perfectly simulated, observations take all ranks with no preferred frequency. That would lead to a “flat” rank histogram. The relatively flat rank histogram (Figure 2b) again demonstrates that MPI-GE is performing well in simulating GMST variability. We find some bias toward higher ranks, suggesting that the observations occur more often in the upper part of the model spread. Additionally, we would expect observations to occasionally sit at ranks 1 and 101 by chance. For a model that underestimates variability, there will be many occurrences at these ranks. For MPI-GE GMST, we find no occurrences at rank 1 but a few occurrences at rank 101, which might be related to a too-strong cooling associated with volcanic eruptions. Overall, MPI-GE performs well for GMST.
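The rank definition above translates directly into a short computation. The sketch below assumes annual values on a common time axis and ignores ties between the observation and a member; the function name and array layout are assumptions.

```python
import numpy as np

def rank_histogram(ensemble, observations):
    """Rank of each observation within the ensemble, and the rank counts.

    ensemble:     array of shape (n_members, n_times).
    observations: array of shape (n_times,).
    Returns (ranks, counts): ranks take values 1..n_members+1, and counts[k]
    is the number of time steps with rank k+1.
    """
    ensemble = np.asarray(ensemble, dtype=float)
    observations = np.asarray(observations, dtype=float)
    n_members = ensemble.shape[0]
    # Rank 1: observation below all members; rank n+1: above all members.
    ranks = 1 + np.sum(ensemble < observations[np.newaxis, :], axis=0)
    counts = np.bincount(ranks, minlength=n_members + 2)[1:]
    return ranks, counts
```

For a 100-member ensemble the counts array has 101 entries; a flat histogram corresponds to roughly equal counts, while many occurrences in the first and last entries would point to underestimated variability.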

To demonstrate the second method, we investigate the monthly variability of the net outgoing longwave, absorbed shortwave, and net top of the atmosphere irradiances, which have far fewer observed data points than GMST. Here the most robust estimate is a single observational estimate of the variability, based on the Clouds and the Earth's Radiant Energy System Energy Balanced and Filled (Loeb et al., 2018) top of the atmosphere irradiance product, which only covers the period 2000–2015. While a previous study claims that climate models do not represent the variability of these fluxes well (Stephens et al., 2015), Figure 3 shows that the observations are within the model spread, and hence, the model is consistent with observations.
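A minimal sketch of this second kind of comparison is given below. It assumes that variability is measured as the standard deviation of monthly anomalies after the mean seasonal cycle is removed, computed for each member over the same period as the observations. That definition, the function names, and the array layout are assumptions and may differ from the calculation behind Figure 3.

```python
import numpy as np

def monthly_variability(series):
    """Standard deviation of deseasonalized monthly anomalies.

    series: array of shape (..., n_months), with n_months a multiple of 12.
    """
    series = np.asarray(series, dtype=float)
    by_year = series.reshape(series.shape[:-1] + (-1, 12))
    # Remove the mean seasonal cycle (the average of each calendar month).
    anomalies = by_year - by_year.mean(axis=-2, keepdims=True)
    return anomalies.reshape(series.shape).std(axis=-1, ddof=1)

def position_in_ensemble(member_values, observed_value):
    """Fraction of members below the observed value, and whether it lies in the spread."""
    member_values = np.asarray(member_values, dtype=float)
    fraction_below = float(np.mean(member_values < observed_value))
    inside = bool(member_values.min() <= observed_value <= member_values.max())
    return fraction_below, inside
```

Applying `monthly_variability` to every member gives the histogram of the ensemble spread, and `position_in_ensemble` then places the single observational estimate within it.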

We demonstrate the third method of comparison in Figure 4. This is a novel model evaluation method, where we transfer the methodology used on European temperature (Suárez-Gutiérrez et al., 2018, Supplementary Figure 3) to the globe. While previous methods to investigate this have used standard deviations and often detrended quantities (Bengtsson & Hodges, 2018; Lehner et al., 2017; McKinnon et al., 2017), this new method allows quantification of whether the whole distribution, including the extremes, agrees well with observations. Additionally, by combining the map in Figure 4 with the method from Figure 2, we can investigate exactly why the model does not agree with observations in specific regions identified on the map and can differentiate between discrepancies in internal variability and the forced response.

We use surface temperature for this evaluation due to its long observational record and near-global coverage (Figure 4). We first identify where observations lie outside the ensemble, using blue shading to show where observations lie below the ensemble minimum and red where they lie above the ensemble maximum. We then determine where the observations crowd too much in the center of the ensemble by using hatching to show where the observations sit inside the central 75% range of the ensemble (12.5th to 87.5th percentiles) more than 75% of the time. Crowding too much in the center of the ensemble indicates that internal variability is overestimated in the model. White regions with no hatching or shading are regions where the variability is similar in the observations and the model. The observed variability over the Northern Hemisphere oceans and the European continent is well represented in the model. However, we find that in general, the Southern Ocean has too low variability in the model, and parts of the land surface and the ice edge show too high variability.
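The quantities behind such a map can be sketched per grid point as follows. This is an illustration under assumptions, not the code used for Figure 4: the function name, the array layout, and the returned dictionary keys are hypothetical, and the fractions of time spent outside the ensemble are returned without any threshold, because the text does not state how the shading is scaled.

```python
import numpy as np

def evaluate_against_ensemble(ensemble, observations, central=75.0, max_fraction=0.75):
    """Compare observations with the ensemble distribution at each grid point.

    ensemble:     array of shape (n_members, n_times, ...); trailing axes are
                  optional grid dimensions.
    observations: array of shape (n_times, ...).
    """
    ensemble = np.asarray(ensemble, dtype=float)
    observations = np.asarray(observations, dtype=float)
    half = (100.0 - central) / 2.0
    # Bounds of the central range (12.5th and 87.5th percentiles by default).
    lower, upper = np.percentile(ensemble, [half, 100.0 - half], axis=0)
    below = observations < ensemble.min(axis=0)
    above = observations > ensemble.max(axis=0)
    in_center = (observations >= lower) & (observations <= upper)
    frac_center = in_center.mean(axis=0)
    return {
        "frac_below": below.mean(axis=0),    # time spent below the ensemble minimum
        "frac_above": above.mean(axis=0),    # time spent above the ensemble maximum
        "frac_center": frac_center,          # time spent inside the central range
        "variability_overestimated": frac_center > max_fraction,
    }
```

Missing observations are not handled here; in practice they would need to be masked before the fractions are computed.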
To delve into specific regions identified as having biases, we can use time series plots similar to Figure 2. Six gridpoints are highlighted. We find that the Arctic point that shows observations lying above but not below the ensemble spread does so because MPI-GE captures cold extremes but not warm extremes accurately (Figure 4a). The East African land point also exhibits a similar bias, but does so because the observed trend over the period at the end of the time series is not completely captured by MPI-GE (Figure 4b). The South American land point shows too high variability in MPI-GE, presenting a distribution with too many warm extremes (Figure 4f), whereas the North American land point shows too strong variability in both warm and cold extremes (Figure 4e). The ocean off the west coast of Africa exhibits too small variability in MPI-GE (Figure 4c). The Southern Ocean also shows too small variability (Figure 4d). This is likely due to the low model resolution and the lack of eddies (e.g., Screen et al., 2009). Poor observational coverage may also contribute to differences between the observed surface temperature and the ensemble in this region. Overall, this method gives us a means to evaluate model variability and consequently hypothesize why it is overestimated or underestimated in various regions.
As well as using observations to evaluate the model's performance, the combination of a large ensemble and observations can be used to put observations into the context of the model's internal variability. As previously mentioned, decadal trends can be misinterpreted due to a lack of understanding of internal variability (Marotzke & Forster, 2015). Large-scale sea ice loss, a prominent indication of climate change (e.g., Notz & Marotzke, 2012), is strongly influenced by internal variability on short time scales (see Figure 5a), with a large effect even on decadal trends (Notz, 2015; Swart et al., 2015). To predict sea ice loss on interannual to decadal time scales, both internal variability and the forced response of sea ice to greenhouse gas increases must be understood.

To exemplify this point, we examine decadal trends in September Arctic sea-ice area (Figure 5b) as computed from the annual data (Figure 5a). The decadal trend of the ensemble mean is to a substantial degree a direct reflection of changes in the external forcing, primarily increasing greenhouse gas concentration and a few volcanic eruptions. The response of individual ensemble members, in contrast, shows a very clear impact of internal variability, which overshadows the impact of the external forcing on these short time scales. The same is true for the observational record, whose decadal trends fluctuate initially around the ensemble mean, but then deviate because internal variability causes rapid ice loss primarily in the summer of 2007 and the summer of 2012. More recently, the observed trend again recovered to the ensemble mean trend. For more information, see Notz (2017) and Olonscheck et al. (2019).
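The computation of decadal trends from annual data can be sketched as follows. This is a minimal illustration rather than the code used for Figure 5: the 10-year window, the least-squares slope, and the synthetic "sea-ice area" series (a linear decline plus Gaussian noise) are all assumptions made for the example.

```python
import random

def linear_trend(values):
    """Least-squares slope of values against their index (units per time step)."""
    n = len(values)
    t_mean = (n - 1) / 2
    v_mean = sum(values) / n
    cov = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    var = sum((t - t_mean) ** 2 for t in range(n))
    return cov / var

def running_trends(series, window=10):
    """Slope of every consecutive `window`-year segment of an annual series."""
    return [linear_trend(series[i:i + window])
            for i in range(len(series) - window + 1)]

# Synthetic stand-in for September sea-ice area (assumed values, not MPI-GE data):
# a forced decline of 0.05 units per year plus year-to-year noise.
rng = random.Random(0)
n_members, n_years = 100, 40
forced = [7.0 - 0.05 * year for year in range(n_years)]
members = [[f + rng.gauss(0.0, 0.5) for f in forced] for _ in range(n_members)]

# Trends of the ensemble mean approximate the forced trend;
# trends of individual members additionally contain internal variability.
ensemble_mean = [sum(m[y] for m in members) / n_members for y in range(n_years)]
mean_trends = running_trends(ensemble_mean)
member_trends = [running_trends(m) for m in members]

first_window = [trends[0] for trends in member_trends]
print(f"ensemble-mean trend: {mean_trends[0]:+.3f} per year")
print(f"member trends range: {min(first_window):+.3f} to {max(first_window):+.3f}")
```

In this toy setting the member trends in a single decade span a range several times larger than the ensemble-mean trend, which is the behavior described above for individual realizations and for the observational record.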
5 The Power of MPI-GE
In this section we present four novel analyses, which utilize the power of MPI-GE. The first application demonstrates the pathway dependence of future changes, a question that can only currently be answered using MPI-GE. The second analysis shows how a very large ensemble can be used to identify forced changes in the atmospheric circulation that are difficult to observe due to high internal variability. The third analysis utilizes the ensemble dimension to determine whether projected changes in AMOC variability are linear in time. Finally, because MPI-GE is currently the largest ensemble available, we demonstrate how it can be used to determine the ensemble size needed when investigating SLP trends.
5.1 The Value of Multiple Scenarios
MPI-GE is a unique large ensemble in that it can be used to compare multiple scenarios. Previously, Giorgetta et al. (2013) showed how to compare projected warming under different scenarios in MPI-ESM (low-resolution configuration), by scaling the warming (taken in comparison to the preindustrial control) by the global mean warming value. By doing this, Giorgetta et al. (2013) concluded that the warming pattern was generally consistent between the historical, RCP2.6, RCP4.5, and RCP8.5 scenarios, with the absolute magnitude of the warming dependent on the scenario. While they were able to conclude that the patterns were similar, they were unable to quantify the role of internal variability in the comparison. We extend this analysis to precipitation as well as temperature and use MPI-GE to both quantify the differences between the scenarios and compare this to the magnitude of the internal variability.
We plot the global mean precipitation change versus the global mean temperature change for the end (last 50 years) of the historical simulation and three future scenarios for different global annual mean surface temperature anomalies, compared to the change for every year of the 1% CO2 simulation (Figure 6). We find that the relationship between temperature and precipitation is not completely consistent between scenarios, indicating a pathway dependence in the precipitation response.

Additionally, we correlate the scaled ensemble mean temperature and precipitation change patterns from the last 10 years of each scenario with the temperature and precipitation change patterns from the last 10 years of each other scenario (Figure 7). There is a high correlation between the temperature patterns for all simulations, demonstrating that similar information can be found for temperature without running all scenarios, adding strength to the qualitative description by Giorgetta et al. (2013). There is lower correlation between the simulations for precipitation, showing that in this case we get different information by running different scenarios, again pointing to a pathway dependence of the precipitation response. This is likely due to the differing aerosol forcing between scenarios, as projected precipitation changes have previously been linked to aerosol forcing (e.g., Lin et al., 2016, 2018; Pendergrass et al., 2015).
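The scaling that underlies these comparisons can be illustrated with a short sketch. It assumes a regular latitude-longitude grid with cosine-latitude area weights and uses made-up numbers; the function names are hypothetical and not taken from the MPI-GE processing chain.

```python
import math

def global_mean(field, lats):
    """Area-weighted mean of field[lat][lon] using cos(latitude) weights."""
    weights = [math.cos(math.radians(lat)) for lat in lats]
    zonal_means = [sum(row) / len(row) for row in field]
    return sum(w * z for w, z in zip(weights, zonal_means)) / sum(weights)

def scale_by_warming(change, warming):
    """Express a change pattern per degree of global-mean warming."""
    return [[value / warming for value in row] for row in change]

# Toy temperature-change patterns (K) on a 3 x 2 grid (assumed values).
# The strong-warming pattern has the same shape but 2.5 times the amplitude.
lats = [-60.0, 0.0, 60.0]
change_weak = [[1.5, 1.7], [0.8, 1.0], [2.0, 2.4]]
change_strong = [[2.5 * value for value in row] for row in change_weak]

scaled_weak = scale_by_warming(change_weak, global_mean(change_weak, lats))
scaled_strong = scale_by_warming(change_strong, global_mean(change_strong, lats))

for row_weak, row_strong in zip(scaled_weak, scaled_strong):
    print([round(v, 3) for v in row_weak], [round(v, 3) for v in row_strong])
```

Two scenarios that differ only in the magnitude of the warming collapse onto the same scaled pattern, so any remaining difference between scaled patterns reflects a change in shape rather than in amplitude.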

To determine where the pathway dependence of precipitation is most important, we compare the scaled precipitation patterns from the last 10 years of the strong warming (RCP8.5) and weak warming (RCP2.6) scenarios (Figure 8). We find that the differences are largest in the Pacific Ocean, Eastern South America, and the North Atlantic. The standard deviation of the pattern difference (Figure 8d) is used to investigate whether the pattern differences are greater than the noise from internal variability. By subtracting the internal variability from the absolute value of the ensemble mean, we can specifically identify regions where this signal is larger than the inter-member variability (Figure 8e; red regions). We find that the western Pacific Ocean, eastern South America, and parts of the African and southern Asian land masses show differences between the scenarios that are larger than the inter-member variability, indicating that the scenario differences matter in these regions. This shows that multiple scenarios are beneficial, but emphasizes that in some regions, they are not necessary.
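The signal-versus-noise test described above can be sketched as follows. The sketch assumes that a pattern difference is available for each ensemble member (how members of the two scenarios are paired is not specified here and is treated as given), and it uses the sample standard deviation as the measure of inter-member variability; both choices are assumptions for illustration.

```python
import statistics

def robust_difference_mask(member_differences):
    """For each grid point, test whether the absolute ensemble-mean difference
    exceeds the standard deviation of the difference across members."""
    n_points = len(member_differences[0])
    mask = []
    for g in range(n_points):
        values = [member[g] for member in member_differences]
        signal = abs(statistics.fmean(values))   # magnitude of the mean difference
        noise = statistics.stdev(values)         # inter-member spread (n - 1 divisor)
        mask.append(signal - noise > 0)
    return mask

# Toy data (assumed values): 4 members, 3 grid points.
differences = [
    [1.0, 0.1, -2.0],
    [1.2, -0.3, -1.8],
    [0.8, 0.4, -2.2],
    [1.0, -0.2, -2.0],
]
print(robust_difference_mask(differences))  # [True, False, True]
```

Grid points flagged `True` correspond to the regions where the scenario difference is larger than the inter-member variability; the middle point, whose mean difference is near zero, is not flagged.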

5.2 Identifying the Forced Response Under High Variability
Due to the high variability of the atmospheric circulation, projected changes of the tropospheric eddy-driven jets and of the stratospheric polar vortex are highly uncertain and have traditionally been made over longer periods (30- to 50-year averages; Barnes & Polvani, 2013; Manzini et al., 2014; Simpson et al., 2018). Future projections on these time scales have shown a weakening of the stratospheric vortex under anthropogenic warming (Manzini et al., 2014; Simpson et al., 2018). However, this weakening, along with a strengthening of the Brewer-Dobson circulation, may already be occurring, but may be masked by internal variability (Fu et al., 2015).
We take advantage of MPI-GE to estimate the historical and scenario evolution in time of changes in circulation indices representing key aspects of the subtropical jet, the stratospheric vortex, and the eddy-driven jet position (Figure 9). To this end, we calculate time series of yearly differences in the ensemble means of the considered circulation indices minus their ensemble means at 1850. In so doing, we can identify future circulation changes with respect to the preindustrial state. To quantify the forced response, a 3-year running mean is applied to each ensemble member (to reduce the high internal variability) before estimating the forced response. These questions are different from asking if a change has occurred during the time period for which we have reanalysis (≈1955 onward). Indeed, we cannot directly compare the modeled changes to observations because we are computing yearly changes as deviations of ensemble mean quantities from a mean reference state at each point in time (the ensemble mean at a specific year). This procedure cannot be reproduced with a single observational time series due to the high interannual variability of the circulation indices.
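A minimal sketch of this procedure is given below. It assumes that the reference state is the first value of the smoothed ensemble mean (so the "1850" reference is in practice the first 3-year window) and uses a synthetic circulation index; the significance testing is not reproduced.

```python
import random

def running_mean(series, window=3):
    """Running mean over consecutive windows; output is shorter by window - 1."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

def forced_change(members, window=3):
    """Ensemble mean of the smoothed members, as deviation from its first value."""
    smoothed = [running_mean(m, window) for m in members]
    n_members = len(smoothed)
    ens_mean = [sum(m[t] for m in smoothed) / n_members
                for t in range(len(smoothed[0]))]
    reference = ens_mean[0]
    return [value - reference for value in ens_mean]

# Synthetic index (assumed values): a forced trend of 0.02 m/s per year
# buried in interannual noise with a standard deviation of 2 m/s.
rng = random.Random(1)
n_members, n_years = 100, 150
members = [[0.02 * year + rng.gauss(0.0, 2.0) for year in range(n_years)]
           for _ in range(n_members)]

change = forced_change(members)
print(f"forced change at end of record: {change[-1]:+.2f} m/s")
```

With 100 members the noise in the ensemble mean is reduced by roughly a factor of 10, so a change of about 3 m/s that would be invisible in any single noisy realization becomes identifiable at each point in time.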

The changes in the subtropical jet (zonal-mean zonal wind change at 100 hPa averaged over 20–40°N) are significant at the 95% level for all scenarios and after 1925 in the historical simulation (Figure 9a), with distinct divergence of the three projection scenarios occurring by 2075. The changes in the stratospheric vortex (zonal-mean zonal wind change at 10 hPa averaged over 70–80°N) are more complex (Figures 9b and 9c). Although all scenarios indicate a weakening of the vortex, the behavior is nonlinear, as can be particularly seen in the 1% CO2 scenario, where the vortex first weakens and then strengthens in the second half of the scenario, despite the monotonic increase in GMST/atmospheric CO2 (Figure 9c). This nonlinear behavior is described in more detail in Manzini et al. (2018). This behavior means that the mean state is very similar under both low and high radiative forcing (e.g., beginning and end of the 1% CO2 scenario). The tendency for a weaker stratospheric vortex during present day indicates that the strengthening of the Brewer-Dobson circulation could indeed be already underway, with respect to preindustrial conditions. Although within the period for which we can compare to observations (1980–2016), MPI-GE shows a trend in the forced change, this trend is weak compared to the confidence intervals, consistent with other large ensemble estimates (Seviour, 2017).
In contrast to the projected changes of the subtropical jet, the projected changes of the stratospheric vortex (Figure 9b) and the tropospheric eddy-driven jets (Figures 9d and 9e) cannot be distinguished between the three scenarios, due to high internal variability. In addition, contrasting influences in different regions lead to different shifts of the latitude of the tropospheric eddy-driven jets. In the North Atlantic, the projected equatorward shifts (Figure 9d) suggest that the stratospheric influence dominates, given that the stratospheric vortex weakens (Kidston et al., 2015). These projected changes are significant with respect to preindustrial conditions. While the changes are significant in this case, it is likely that if compared to present day, they would no longer be significant, as suggested by other large ensemble estimates (Kwon et al., 2018). In the North Pacific, the poleward shift (Figure 9e) instead indicates that the tropospheric response (Figure 9a) dominates in this region with little influence of the stratosphere. Similarly, the lack of change in the North Atlantic jet for the 1% CO2 simulation is likely due to opposing stratospheric and tropospheric changes, while the North Pacific projection for this simulation clearly branches off and evolves similar to the RCP8.5 projection scenario. MPI-GE therefore clearly illustrates that different processes dominate the forced response of these jets to climate change in different regions.
5.3 Are Changes in Variability Time Dependent?
With projections of a decreasing AMOC strength under anthropogenic warming (e.g., Collins et al., 2013), the North American and European climates are expected to change significantly (e.g., Sutton & Hodson, 2005). While a weakening of the AMOC is associated with stronger global warming, internal variability also plays an important role in driving the climate response (Maroon et al., 2018). In a large ensemble, ensemble members with a stronger AMOC due to internal variability are likely to show increased surface warming, compared to ensemble members with a weaker AMOC (Maroon et al., 2018). By utilizing the ensemble dimension (similar to Herein et al., 2017; Maher et al., 2018) and computing the standard deviation of any given quantity across the ensemble, we can assess whether projected changes in AMOC variability are time-dependent.
MPI-GE captures the observed AMOC reasonably well, with the short observed time series largely fitting within the MPI-GE ensemble spread (Figure 10a). Similar to previous studies (e.g., Collins et al., 2013), MPI-GE projects a decrease (from 19 to 14 Sv) in the AMOC strength under strong future warming scenarios (Figure 10a). The CMIP5 models show a weakening of the AMOC in RCP8.5 and a weakening in RCP4.5 in the first half of the 21st century with a recovery thereafter (Cheng et al., 2013), similar to the forced response found in MPI-GE. The forced response of the AMOC in CESM-LE at 26.5°N has also been investigated and is shown in Figure 1 of Maroon et al. (2018). There are some differences between the two large ensembles. MPI-GE has a more realistic AMOC strength in the historical period, whereas in CESM-LE, it is somewhat too strong. The forced response in the historical period shows changes in CESM-LE that do not exist in MPI-GE, and the recovery in the AMOC in RCP4.5 only occurs in MPI-GE. However, both models show overall similar trends in the AMOC forced response, with a weakening in both RCP4.5 and RCP8.5 scenarios.

CMIP5 projections suggest that under global warming, the internal variability of the AMOC will decrease (Cheng et al., 2016). The CMIP5 projections were completed by comparing the period 2100–2300 in each future forcing scenario to the preindustrial control. Using MPI-GE, we can determine the time dependence of the projected variability change. The 1% CO2 run shows a 30% drop in the internal variability in the first 80 years of the simulation, stabilizing around 0.9 Sv. All three RCP scenarios show a drop in variability after 2000 and a stabilization after 2050, with the RCP2.6 stabilizing at the highest variability and RCP8.5 stabilizing at the lowest variability, with a similar stabilization value to the 1% CO2 run. By using MPI-GE, we can clearly demonstrate that the changes in projected AMOC variability have a high time dependence.
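Computing variability along the ensemble dimension can be sketched as follows. The synthetic "AMOC" members are an assumption loosely shaped after the 1% CO2 description above (spread falling by about 30% over 80 years to roughly 0.9 Sv); they are not MPI-GE output.

```python
import random
import statistics

def ensemble_spread(members):
    """Standard deviation across members at each time step."""
    n_steps = len(members[0])
    return [statistics.stdev([m[t] for m in members]) for t in range(n_steps)]

def noise_amplitude(year):
    """Assumed internal variability (Sv): 1.3 at the start, 0.9 after 80 years."""
    return 1.3 - 0.4 * min(year, 80) / 80

rng = random.Random(2)
n_members, n_years = 100, 140
members = [[17.0 + rng.gauss(0.0, noise_amplitude(year)) for year in range(n_years)]
           for _ in range(n_members)]

spread = ensemble_spread(members)
early = statistics.fmean(spread[:10])
late = statistics.fmean(spread[-10:])
relative_change = (late - early) / early
print(f"early spread {early:.2f} Sv, late spread {late:.2f} Sv, "
      f"change {100 * relative_change:.0f}%")
```

Because the standard deviation is taken across members at a fixed time, the forced response does not need to be removed first, and the variability estimate is available for every year rather than only for long time windows.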
When considering the interplay of the forced response and variability, we can see that even though the forced response of the AMOC varies in each of the future scenarios, all three scenarios could have the same AMOC at any given year due to internal variability. When considering the 30-year running mean, all three scenarios are very similar until 2050. By 2100, the scenarios have diverged; however, there is still overlap of the 30-year means of RCP2.6 and RCP4.5, and even RCP4.5 and RCP8.5 have a few ensemble members that could have a similar 30-year mean (Figure 10a). This demonstrates how internal variability is important in determining possible observable futures in a single forcing scenario.
5.4 Assessing the Required Ensemble Size
Deser et al. (2012) have previously used the 40-member Community Climate System Model 3 ensemble with SRES A1B forcing to ask the question of whether the forced response can be estimated with fewer than 40 members. When considering SLP trends from 2005 to 2060, they find a wide variety of SLP responses in individual realizations across the 40 members and argue that this demonstrates the need for an ensemble of size 20–30 members to accurately quantify the forced response. Given that this estimate of necessary ensemble size is close to the actual ensemble size of 40 members, a different answer may be found with a larger ensemble. We use the 100 members from MPI-GE to investigate this question and determine whether this result is robust when a larger ensemble is used. The SLP trend and variability patterns from MPI-GE (Figures 11a and 11b) over the period 2007–2099 in RCP4.5 are qualitatively similar to the results of Deser et al. (2012), indicating that both models have a similar forced response and variability of the future trend in SLP and that we can use MPI-GE to build on the results of Deser et al. (2012).

To assess the ensemble size needed to robustly isolate the pattern of the forced trend in SLP, we compare the ensemble mean trend of a subset of the ensemble to the 100-member ensemble mean trend by computing a pattern correlation. A low pattern correlation indicates that a subset does not capture the pattern of the forced trend. A large spread of trends in the subsets indicates that the ensemble mean trend in the smaller ensemble is still dominated by internal variability. We compute the pattern correlation for different ensemble sizes by randomly subsampling the 100 members. For each ensemble size, the subsampling is repeated 30 times to investigate if an ensemble of a given size can robustly isolate the spatial pattern of the forced trend (Figure 11c). The ensemble size is sufficient when the pattern correlation for all random subsets of a given ensemble size is high. For the trend in SLP, we see that the pattern correlation increases with increasing ensemble size and the spread of different samples reduces, with around 40–50 members sufficient to capture the ensemble mean pattern of the full ensemble. Similar to Deser et al. (2012), the error in the pattern is much lower above 20–30 members than below.
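The subsampling procedure can be sketched as follows. The sketch assumes sampling without replacement, an unweighted (not area-weighted) Pearson pattern correlation, and synthetic trend patterns; none of these details are confirmed by the text above.

```python
import math
import random

def pattern_correlation(a, b):
    """Unweighted Pearson correlation between two flattened spatial patterns."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def ensemble_mean(patterns):
    """Mean pattern over a list of member patterns."""
    return [sum(p[g] for p in patterns) / len(patterns)
            for g in range(len(patterns[0]))]

def subsample_correlations(trends, size, n_repeats=30, seed=0):
    """Correlate subset-mean trend patterns with the full-ensemble mean pattern."""
    rng = random.Random(seed)
    full_mean = ensemble_mean(trends)
    return [pattern_correlation(ensemble_mean(rng.sample(trends, size)), full_mean)
            for _ in range(n_repeats)]

# Synthetic member trend patterns (assumed): a smooth forced pattern plus noise.
rng = random.Random(3)
n_members, n_points = 100, 50
forced = [math.sin(2 * math.pi * g / n_points) for g in range(n_points)]
member_trends = [[f + rng.gauss(0.0, 1.5) for f in forced] for _ in range(n_members)]

for size in (5, 20, 50, 100):
    corr = subsample_correlations(member_trends, size)
    print(f"size {size:3d}: min r = {min(corr):.2f}, max r = {max(corr):.2f}")
```

An ensemble size would be judged sufficient when even the lowest correlation among the 30 subsets is high. The correlation of exactly 1 at size 100 is a property of the resampling, since every subset then equals the full ensemble.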
We then investigate the ensemble size needed to quantify the magnitude of the forced trend in SLP. To do this, we investigate two different regions (boxes in Figure 11; A: high variability and B: low variability). We estimate the “true” magnitude of the forced trend using the full ensemble of 100 members. To estimate the reliability of trend magnitudes for smaller ensemble sizes, we use resampling from the 100-member ensemble to create a set of 30 smaller ensembles for each ensemble size. We find that more ensemble members are needed in box A than in box B based on the faster reduction of the spread of subsets with increasing ensemble size (Figure 11d).
The true trend in box A is positive (0.56 hPa/93 years), while the true trend in box B is negative (−0.26 hPa/93 years). We find that with small ensembles, the trend in both boxes A and B could be estimated as either positive or negative (Figure 11d), meaning that with only a small ensemble, the sign of the forced trend could be misidentified. To correctly determine the sign of the forced trend, five members are needed in box B, and 40 members are needed in box A. To determine the actual ensemble size needed to quantify the trend, one must first decide on the size of the acceptable error. The error is estimated as the largest difference between the true trend and the subsampled trends for each ensemble size. In box B, the error is already small at 10 members, while in box A, there is still a noticeable error at 50 members. In box A, it is unclear whether the reduction of the error above 50 members is due to resampling or because the ensemble size is large enough.
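The error estimate for a regional trend can be sketched in the same way. The member trends below are synthetic: they are centered on the two trend values quoted above, but the spreads (2.0 and 0.3 hPa per 93 years) are assumed purely for illustration.

```python
import random

def max_subsample_error(member_trends, size, n_repeats=30, seed=0):
    """Largest absolute difference between the full-ensemble mean trend
    and the mean trend of randomly drawn subsets of a given size."""
    rng = random.Random(seed)
    true_trend = sum(member_trends) / len(member_trends)
    errors = []
    for _ in range(n_repeats):
        subset = rng.sample(member_trends, size)  # without replacement
        errors.append(abs(sum(subset) / size - true_trend))
    return max(errors)

# Synthetic regional trends for 100 members (assumed spreads).
rng = random.Random(4)
box_a = [rng.gauss(0.56, 2.0) for _ in range(100)]   # high-variability region
box_b = [rng.gauss(-0.26, 0.3) for _ in range(100)]  # low-variability region

for size in (5, 10, 20, 50, 80):
    print(f"size {size:2d}: error A = {max_subsample_error(box_a, size):.2f}, "
          f"error B = {max_subsample_error(box_b, size):.2f}")
```

A subset size could be considered adequate for the sign of the trend once the error falls below the magnitude of the full-ensemble trend, and adequate for the magnitude once it falls below a tolerance chosen by the user.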
As we approach the maximum ensemble size (n=100), the overlap between different random samples increases; therefore, the spread is reduced. This reduction of the spread due to resampling impedes identification of a spread reduction due to a larger ensemble size. Further research on this topic and the effects of resampling is needed to determine whether this limitation in the interpretation can be overcome and how we can use larger ensembles to quantify the error in smaller ensembles. It is clear, however, that in box A, the error is reduced by having 40–50 members compared to 20–30, demonstrating the utility of using a 100-member ensemble for this analysis.
This example demonstrates how the 100-member ensemble can be used to estimate the ensemble size needed for a specific application. While MPI-GE can be used to address this question, the answer will depend on the question asked, how large an error is acceptable to the user, and likely the model used. When the ensemble size needed approaches the size of the ensemble itself, it can be difficult to determine whether the ensemble is large enough or the apparent error is only reduced because of resampling from a limited sample. MPI-GE is by far the largest ensemble currently available and is hence currently the best tool to investigate the ensemble size problem when ensemble sizes needed approach the actual size of smaller ensemble projects. We suggest that MPI-GE can be used to inform the design of new ensembles or the choice of which currently available ensembles might be suitable for a given application.
6 Summary and Conclusions
MPI-GE has been presented, and its power has been demonstrated. First, due to its large size of 100 members, events with long return periods and quantities with high internal variability can be investigated. The initialization strategy means that most quantities can be investigated from the beginning of each simulation because the distribution of internal variability is adequately sampled from the beginning of the ensemble. Second, MPI-GE is the only large ensemble currently available with three future scenarios and a 1% CO2 simulation allowing investigation into the targets set by the Paris Agreement and early twentieth century warming, something that could not be done with previous ensembles due to their later start dates.
We have demonstrated in this paper three ways to evaluate a large ensemble using observations. The first method can be used for observations with a long time series of observational coverage and uses where the observations sit within the ensemble spread as well as a rank histogram to evaluate the model. The second method is appropriate when good observational coverage is only available for a short time period (such as the satellite era). Here we demonstrate how to compare a single observational estimate with a histogram from a large ensemble. The third method provides a novel way to assess where on the globe internal variability in the model agrees well with observations by considering the agreement of the whole distribution. Additionally, this allows us to delve into specific regions and determine why there is a disagreement between the model and the observations and determine if the forced response and internal variability are realistic. We have additionally provided an example of how the model can be used to contextualize observations, by looking at the decadal variability of Arctic sea ice trends.
MPI-GE has then been used to complete four novel analyses that can only be undertaken using a large ensemble. The first addresses the question of whether there is a pathway dependence of temperature and precipitation responses under differing future scenarios, something that can only currently be addressed using the multiple future scenarios of MPI-GE. We find regional pathway dependence for precipitation, but not for temperature, that is larger than the model's internal variability. The second analysis asks whether there are forced changes in the highly variable atmospheric circulation. We demonstrate that these changes could already be occurring and that the tropospheric response dominates the North Pacific, whereas both the stratospheric and tropospheric changes are important in the North Atlantic. The third analysis asks whether changes in AMOC variability are time-dependent. We demonstrate that the projected decrease in AMOC variability largely occurs in the first half of the 21st century, indicating a strong time dependence. Finally, we give an example of how MPI-GE can be used to investigate the ensemble size needed for a given problem and demonstrate its utility for a problem such as forced SLP trends, where the ensemble size needed appears to be close to or larger than the ensemble size that is available for other large ensembles. For quantities that need fewer ensemble members, we recommend that multiple large ensembles should be used to make such an estimate to account for possible model dependence of forced trends.
Overall, due to its large size and multiple scenarios, MPI-GE is a powerful tool that can be used to address uncertainties both due to internal variability and the unknown future pathway. Much can be learnt from using this ensemble alone, in combination with observations and with other large ensemble projects. The data are now publicly available, and we urge potential users to access them. Future studies that combine multiple large ensembles and in particular compare the magnitude of model uncertainty to internal variability will be vital to additionally address model uncertainty and to build on the work completed with single-model ensembles.
Acknowledgments
We thank the Max Planck Society for the core funding that made this project possible. We are indebted to T. Schulthess and the Swiss National Computing Centre (CSCS) for providing the computational resources for the historical simulations and the 1% CO2 experiment. The RCP scenario simulations were performed with the facilities at the German Climate Computing Centre (DKRZ). We would like to thank Karsten Peters, Katharina Berger, Heinz-Dieter Hollweg and Fabian Wachsmann for their work in making the data publicly available. We also thank Veronika Gayler for her work with the JSBACH data and Irene Stemmler for her input on the HAMOCC data. Additionally, we thank Florian Ziemen for conducting an internal review, Helmuth Haak for his input on the difference between MPI-ESM and MPI-ESM1.1 in the ocean, and Thorsten Mauritsen for providing the equilibrium climate sensitivity. We thank Gábor Drótos and Tamás Bódai as well as the two anonymous reviewers for their comments on this manuscript. Nicola Maher was supported by the Alexander von Humboldt Foundation. Yohei Takano and Lena Boysen are supported by the European Union's Horizon 2020 research and innovation program under grant agreement 641816 (CRESCENDO). Rohit Ghosh and Elisa Manzini are partly supported by the European Union's Horizon 2020 research and innovation program under grant agreement 727852 (Blue-Action). Information on the publication of the Max Planck Institute Grand Ensemble (MPI-GE) output can be found on our website (https://www.mpimet.mpg.de/en/grand-ensemble/).