Historical Simulations With HadGEM3‐GC3.1 for CMIP6

We describe and evaluate historical simulations which use the third Hadley Centre Global Environment Model in the Global Coupled configuration 3.1 (HadGEM3‐GC3.1) and which form part of the UK's contribution to the sixth Coupled Model Intercomparison Project, CMIP6. These simulations, run at two resolutions, respond to historically evolving forcings such as greenhouse gases, aerosols, solar irradiance, volcanic aerosols, land use, and ozone concentrations. We assess the response of the simulations to these historical forcings and compare against the observational record. This includes the evolution of global mean surface temperature, ocean heat content, sea ice extent, ice sheet mass balance, permafrost extent, snow cover, North Atlantic sea surface temperature and circulation, and decadal precipitation. We find that the simulated time evolution of global mean surface temperature broadly follows the observed record but with important quantitative differences which we find are most likely attributable to strong effective radiative forcing from anthropogenic aerosols and a weak pattern of sea surface temperature response in the low to middle latitudes to volcanic eruptions. We also find evidence that anthropogenic aerosol forcings play a role in driving the Atlantic Multidecadal Variability and the Atlantic Meridional Overturning Circulation, which are key features of the North Atlantic ocean. Overall, the model historical simulations show many features in common with the observed record over the period 1850–2014 and so provide a basis for future in‐depth study of recent climate change.


Introduction
This paper describes and evaluates the historical simulations which form part of the UK's contribution to the sixth Coupled Model Intercomparison Project (CMIP6 Eyring et al., 2016), organized under the auspices of the World Climate Research Programme's Working Group on Coupled Modeling. These historical simulations, combined with the Diagnostic, Evaluation and Characterization of Klima (DECK) simulations described elsewhere in this special issue, form the UK's entry card to model participation in CMIP6.

Experiment Description
Two historical ensembles were run from 1 January 1850 to 31 December 2014, one ensemble per model resolution, GC31-LL and GC31-MM. Each ensemble consists of four members which may help mitigate the impact of model internal variability when analyzing the ensemble mean fields. The influence of resolution on the ability to simulate climate phenomena can be assessed by comparing the two ensembles.

Initialization
The historical experiments start on 1 January 1850 with initial conditions generated from the piControl experiments. The GC31-LL historical experiments used initial conditions from various points in the numerical integration of the GC31-LL piControl run, and, for the GC31-MM historical experiments, from the GC31-MM piControl run Menary et al., 2018). Ideally, the ocean initial conditions for each experiment are chosen to maximize the possibility of ensemble members evolving independently. This reduces the possibility that significant responses to historical forcings will be confounded by long-term internal variability common across ensemble members. To this end, we decided to minimize the chance of experiments starting from similar oceanic states by considering the two leading modes of multidecadal sea surface temperature (SST) variability, the Interdecadal Pacific Oscillation (IPO; Power et al., 1999;Zhang et al., 1997) and the Atlantic Multidecadal Oscillation (AMO; Kerr, 2000), which naturally arise in the piControl runs (Menary et al., 2018). Both the IPO and AMO can have impacts on climate up to the global scale. We cannot prevent chance alignment of these modes later in the simulations, but we can try to ensure that this does not occur in the early stages, since the initial state of these modes can affect the simulation for several decades. The leading Empirical Orthogonal Function of the deseasonalized, detrended, and low-pass filtered Pacific basin SSTs, representing the IPO, for each of the piControl runs is shown in Figures 1a and  1c. The leading mode of variability for the North Atlantic basin, calculated as the area-weighted mean time series of the deseasonalized, detrended, and low-pass filtered North Atlantic SSTs, representing the AMO, is shown in Figures 1b and 1d. Low-pass filtering the data prior to calculation minimizes the influence of El Niño-Southern Oscillation. The monthly evolution of the IPO and AMO indices is plotted on phase-space diagrams in Figure 2. The phase-space diagrams are used to identify initial conditions which differ in terms of ocean state, differ in trajectory (not shown) through the phase-space, and which are separated by at least 20 years. The GC31-LL historical experiments were initialized with initial conditions from 1850, 1885, 1930, and 1970 of the GC31-LL piControl experiment and for the GC31-MM historical experiment from 1850, 1885, 1930, and 1960 of the GC31-MM piControl experiment. The selection of oceanic modes for GC31-LL initial conditions includes positive and negative IPO and AMO phases. For GC31-MM the initial conditions were chosen prior to the completion of the piControl experiment, resulting in IPO and AMO phases that appear less significant when considering the entire 500-year piControl experiment.

Temperature, OHC, and Precipitation
The response of the climate system to evolving historical forcings, in particular the combination of greenhouse gases, atmospheric aerosols and volcanic eruptions, is evident in the historical observational record of global mean surface temperature ( Figure 3; Hartmann et al., 2013, Figure 2.20). To assess the ability of our historical experiments to reproduce features such as centennial temperature trends and rapid near-present-day warming, we examine the global mean surface air temperature evolution. The cooling immediately following volcanic eruptions is analyzed in section 5.  We also explore the upper OHC of the historical simulations. The heat absorbed and released by the ocean is an important factor when considering the Earth's energy budget. The relatively short ocean subsurface observational record, which covers the near-present-day warming period, provides a benchmark against which the model ocean heat uptake (OHU) can be assessed.
Furthermore, we examine decadal variability of precipitation in our experiments to understand whether notable historical wet and dry epochs can be replicated, which would then suggest a link to historically evolving forcings.

Global Mean Near-Surface Air Temperature
Global mean surface air temperatures (GMSAT) at 1.5 m from the historical simulations are compared with an observed temperature record in Figure 3. HadCRUT4 is a gridded blended land air and SST data set, covering the period 1850 to the present day (HadCRUT.4.6.0.0 Morice et al., 2012). The GC31-LL and GC31-MM data are processed following Jones et al. (2013), that is, monthly 1.5-m air temperatures are regridded onto the HadCRUT4 spatial grid, anomalies found with respect to the 1961-1990 period removing the seasonal cycle, masked by the HadCRUT4 coverage so that only grid points with observed data are retained, anomalies found again with respect to the , where 50% of data are needed to create the mean for the reference period. Global annual means are then calculated for the model and observed data and the average over the 1880-1919 period removed to find anomalies relative to that period. The uncertainties on the HadCRUT4 global annual means are given with the data set as 95% error bounds for measurement, grid box sampling, and bias uncertainties. Here they are shown relative to the  period. As the model data are sampled the same as the observations, coverage uncertainty is not included in the figure.
It is common for studies to attempt to remove model drift from data in plots of transient changes in model diagnostics, with different approaches having pros and cons (Sen Gupta et al., 2013). Many studies do not try to remove the drift, as what may be assumed to be model drift could be dominated by internal model variability on long times (Jones et al., 2013). Here an estimate of the models' surface air temperature drifts is subtracted from the global mean historical temperatures. The piControl simulations (Menary et al., 2018) have long-term trends of 0.026 K/100 years (GC31-LL piControl) and 0.110 K/100 years (GC31-MM piControl), with full spatial coverage. To account for the changing spatial coverage of HadCRUT4, the following is done. The spatial trend through the first 500 years of the piControl experiments (Menary et al., 2018) is deduced. The trend over the 1850-2014 period is reconstructed and then processed as for the historical simulated data above. The global mean is removed from the historical temperatures, thus detrending with an estimate of the models' drifts.
The GC31-LL historic simulations warm 0.79 ± 0.32 K for the 2000-2014 period relative to 1880-1919. GC31-MM warm 0.64±0.07 K over the same period. This compares with the observed warming of 0.83±0.05 K. Uncertainties are ±2 standard deviations.
While the overall trend over the historic period is similar between the observations and the models, both models show less warming than observed over the first half of the twentieth century and somewhat more rapid warming than observed over the last 20 years of the period. One has to be careful in how to interpret differences between the simulations and observed temperature record, especially when not examining a sample of models (Jones et al., 2013). The choice of reference period can make a substantial difference in when differences between models and observations are apparent (Hawkins & Sutton, 2009).
Superimposed onto the longer-term trends in GMSAT are multiyear temperature dips associated with the injection of volcanic aerosols into the atmosphere during major tropical volcanic eruptions, in particular 1883 (Krakatau), 1902(Santa Maria), 1963(Mount Agung), 1982, and 1991 (Mount Pinatubo). Analysis of the impact of these eruptions is discussed further in section 5.

Global Mean Land Near-Surface Air Temperature
The global land surface air temperature (GMSAT-Land) evolution of the two historical experiments is shown in Figure 4. The processing is the same as that for GMSAT in Figure 3, with an additional step of applying a land mask to the data, before the global mean is calculated. The land mask is constructed from GC3-LL's land area fraction, regridded onto the 5 • × 5 • grid and set to missing data value where land area is less than 25%. The land air temperature data set, CRUTEM4, is used (CRUTEM.4.6.0.0 Jones et al., 2012), and the models are masked by its spatial coverage. For consistency, CRUTEM4 is also masked by the same land mask used by the models. As for HadCRUT4, the uncertainties on the CRUTEM4 global annual means are given with the data set as 95% error bounds for measurement, grid box sampling, and bias uncertainties.
The GMSAT-Land increase, between the 1880-1919 and 2000-2014 periods, is smaller for the models than observed. GC31-LL warms by 1.03 ± 0.48 K and GC31-MM by 0.85 ± 0.20 K, compared to the observed warming of 1.27 ± 0.22 K. Both models, again, show a stronger cooling over the 1950-1970 period, followed by a stronger warming than observed. The models also show a strong cooling between 1850 and 1879. A comparison of the GC31-LL cooling trend over the period 1850-1879 with the equivalent 30-year segments from the piControl (30 years following the date of each initial condition), using a difference of means Student's t test, indicates that the cooling is marginally significant (p = 0.04). CRUTEM4 has limited coverage before 1900, being concentrated over North America and Europe; hence, it is probable that this is a representation of regional changes rather than of a global signal. A larger ensemble size would be necessary to confirm that this is more than an artifact of small ensemble size and poor observational coverage in the early part of the instrumental record.

Global Ocean Heat Uptake
About 93% of the current Earth's energy imbalance enters the world ocean as an increase in ocean heat content (OHC). This makes OHC one of the most relevant metrics of climate change, arguably more relevant than surface temperature records (Von Schuckmann et al., 2016).
The fifth Coupled Model Intercomparison Project (CMIP5) General Circulation Models displayed a wide range of OHC change in the upper 700 m over the late twentieth century, ranging from 30 to 280 ZJ for the period 1971-2005(Flato et al., 2013.17), with the observational data sets indicating between 100 and 120 ZJ, where 1 ZJ is 10 21 Joules. Differences across the General Circulation Models in the representation  d) in the 0-to 700-m layer (a,c) and the 0-to 2,000-m layer (b, d). Ensemble means in bold color (blue for GC31-LL, red for GC31-MM); individual ensemble members in thin colored lines; various observational products in thick gray and black lines. All time series normalized to the year 1971. In the 0-to 2,000-m panels the NCEI data are 5-year running means. and parameterization of OHU processes explain this wide range. We have used various observational data sets of OHC to evaluate the two historical experiments against. These are the following: EN4.2.1, an updated version of Good et al. (2013); NCEI, an updated version of Levitus et al. (2012); ISH, an updated version of (Ishii et al., 2017); CHG, Cheng et al. (2017); and DOM+LEV, an updated combination of Domingues et al. (2008) and Levitus et al. (2012). We consider time series of global and basin-wise OHC in the upper 700 m and upper 2,000 m which are readily available for the observational data sets.
We discuss the OHC drift in the piControl simulations (Menary et al., 2018) first. In the global average, both GC31-MM and GC31-LL are warming at all levels in the piControl: 0-700 m, 0-2,000 m, and the full-depth global ocean volume (not shown). For the full-depth global ocean, the warming rates in the piControl simulations appear fairly linear, amounting to about 200 ZJ/100 years in GC31-LL and 400 ZJ/100 years in GC31-MM. Hence, we decided to use linear detrending of the historical simulations. For each simulation and layer depth individually we estimated the linear trend from the matching part of the piControl and subtracted it.
For the global OHU in the layer 0-700 m and a reference year of 1971, both sets of the GC31 simulations do not warm in the 1970s and 1980s, as opposed to the observational data sets that show a warming of about 50 ZJ during the 1970s (Figures 5a and 5c). After 1993, the simulations warm at a faster rate than the observations suggest. For the overall time period from 1971 to 2014, both GC31-LL and GC31-MM simulate an OHC change very close to observations, especially given the wide spread among the CMIP5 models (Flato et al., 2013, Figure 9.17). The GC31-LL ensemble mean warms by about 25 ZJ more than the GC31-MM ensemble mean in this time period, which is consistent with a warmer global mean near-surface temperature and a higher near-surface air temperature warming rate for GC31-LL compared to GC31-MM ( Figure 3).  d;red) in the 0-to 700-m layer (a, c) and the 0-to 2,000-m layer (b, d). The narrow gray bars indicate ±1 standard deviation of the ensemble. Orange bars: observational estimate of OHU from the NCEI data set.
In the layer 0-2,000 m (Figures 5b and 5d), the GC31-MM ensemble mean shows zero OHC change between 1971 and 1993, and the GC31-LL ensemble mean even shows a slight heat loss, while the observations suggest a continuous warming of about 100 ZJ in this time. This is in line with the higher warming rate simulated in GC31-MM near-surface air temperature in this period. From 1993 to 2014, the simulated warming rates are very close to the observed ones. Since too much heat is taken up in the 0 to 700 m layer in this period, it follows that the warming rate below 700 m must be smaller than observed.
For GC31-LL, we note that the four-member ensemble members are split into two pairs with distinctly different warming rates, especially in the 0 to 700 m layer ( Figure 5a). Andrews et al. (2019) noted the accompanying, distinctly different development of Southern Ocean sea ice cover in these two simulation pairs (their Figure 8). We believe that this is related to a temporary increase (blip) in the Southern Ocean deep water formation rate, revealed by an associated positive blip in the strength of the ACC, compare with Figure  12 in Menary et al. (2018), and followed by a more stably stratified Southern Ocean. Two of the GC31-LL simulations branch off the piControl before this blip and two after. The simulation pair with the smaller warming rate in Figure 5a is the one branching off after the blip, indicating a more stable stratification and hence less OHU in the Southern Ocean. We also find that the pair with the smaller warming rate in the late twentieth century has a smaller cooling rate in the years 1850 to about 1950, indicating again a more stable ocean stratification and hence less heat loss.

Regional Ocean Heat Uptake
The globally integrated trends in OHC discussed above mask distinctly different trends in the individual basins. To analyze these regional trends, we have separated the three major basins (Atlantic, Pacific, and Indian) into a northern and southern parts, cutting at the equator and disregarding marginal seas (e.g., Hudson Bay and the Mediterranean). Figure 6 shows the total change in OHC from 1971 to 2014 in these six individual basins.
In the 0 to 700 m layer (Figures 6a and 6c) the simulated global OHU is rather close to the observational estimate, with GC31-LL taking up about 15% more heat than GC31-MM and having a larger spread as discussed above. In both models there is a tendency to underestimate the OHU in the South Indian Ocean and to overestimate it in the South Pacific. The larger OHU in GC31-LL happens mostly in the North and South Atlantic basins. This might well correlate with the stronger Atlantic Meridional Overturning Circulation (AMOC) in GC31-LL in the late 20th and early 21st centuries ( Figure 17; see also discussion in section 4).
By contrast, both models show almost exactly the same OHU in the 0 to 2,000 m layer globally (Figures 6b  and 6d), with the simulated values being much smaller (25%) than in the observations. In both models, the underestimation of the OHU is largest in the South Indian, North Pacific, and North Indian oceans, while in the two Atlantic basins the underestimation is small or even insignificant.

Decadal Precipitation
Variability in precipitation is one of the most important ways in which climate can impact human societies, to the extent that dry epochs in semiarid regions can even lead to life there becoming unviable. Assessment of the extent to which the historical model ensembles reproduce these variations, however, is limited by the availability of observational data. Before the era of satellite observations (prior to the late 1970s), there were essentially no marine precipitation estimates, so comparisons are limited to land-based data derived from rain gauges. In addition, the availability of even these data becomes confined to fewer regions further back in time (Becker et al., 2013). As such, it is not possible to make comparisons with the model simulations over their full period from 1850. Here we present an analysis for the period from 1910 to 2010. The focus is on decadal anomalies as these highlight eras of persistent wetness or drought which have more chance of being a response to the imposed external forcings than year-to-year precipitation variations, which are likely to have a substantial component of internal variability.
Observed decadal mean precipitation anomalies, derived from the Global Precipitation Climatology Centre (GPCC) monthly data set (Schneider et al., 2011), are shown alongside the GC31-LL and GC31-MM ensemble means for equivalent decades in Figure 7. It is noticeable that observed regional precipitation variability is generally larger and more spatially heterogeneous than in the ensemble means of the historical simulations. The overall level of correspondence between the model means and observations is generally low, although the correspondence between the simulations at the two model resolutions is much greater. This suggests that the two model resolutions respond to a common influence, namely, the external forcings. In contrast, the dissimilarity of the observed decadal means suggests that decadal rainfall anomalies are mainly a result of internal climate variability (Scaife et al., 2009). This is consistent with previous work, which has suggested that decadal or multidecadal modes of variability are important in generating decadal regional precipitation variability (e.g., Folland et al., 1986;Power et al., 1999). Despite the low level of agreement between the models and observations, there are some similarities; the drying of the African Sahel region seen from the 1970s is present to some extent. This suggests at least a contribution from forcings, as in CMIP5 historical model simulations (Biasutti, 2013) and as proposed in response to anthropogenic aerosols (Booth et al., 2012). In the simulations presented here, this feature appears to be part of a wider drying of many parts of the tropics and subtropics between the 1960s and 1990s. To examine regional decadal rainfall in more detail, we show area-averaged decadal precipitation for largeand regional-scale areas in Figure 8. For the globe as a whole, the Northern Hemisphere (NH), the Tropics, North America, Africa, and Asia in the models (at both resolutions), there is a tendency for the first half of the twentieth century to be wetter than the 10-decade average, with a drier-than-average second half of the century. At the end of the period there is a recovery to near average or wetter conditions. This typically does not correspond well with the temporal changes in many of these regions, which tend to show wetter conditions following a dry early twentieth century. Exceptions include the Tropics and Africa, although the onset of drying occurs earlier in the models. For South America, observations show periods of wetter and drier conditions, but the models show a long-term drying trend, albeit with a high ensemble spread, such that precipitation anomalies in most decades fall within the model range. For Europe, there is very little decadal variability in the model, inconsistent with observations that show climate tending wetter since the 1940s.
At the smaller scale, we examine regions that have a tendency for drought. As already mentioned, Sahel rainfall is suppressed in the models in the 1970s (and less clearly in the 1980s), although to a much lesser degree than in observations. The simulations also show the previous two decades to be dry, however, in contrast to wetter-than-average conditions in observations. A similar relationship is found in West Africa. There is a simulated increase in precipitation in the 2000s in East Africa, whereas the observations show continued drying. In the Indian monsoon region, there is also a positive trend in rainfall simulated in recent decades that is not seen in observations. In central China, there is some indication of the observed drying from the 1940s to the 1960s in the models, but the recovery in the 1980s and 1990s is not reproduced. There is no indication of the South East Australian drought of the early 2000s in the model simulations, although observations are not outside the spread of the ensemble for at least one model resolution. The famous Dust Bowl drought of the 1930s on the North American Great Plains is also inconsistent with the forced responses of the models, as is the moist epoch of recent decades. Rainfall in the Brazilian Nordeste region shows large multidecadal variations that have been linked to the Atlantic Multidecadal Variability (AMV; Knight et al., 2006). Although the simulations capture part of the observed AMV (see section 4), in the model this does not appear to act as a significant control on rainfall in these regions, as is implied by observations. Decadal rainfall over the Amazon does not appear to reflect the occurrence of unprecedented drought in specific years in the 2000s, but the simulations do contain the forced decline in rainfall that is seen for the South American continent as a whole.

Sea Ice, Ice Sheets, Permafrost, and Snow Cover
Aspects of the cryosphere are sensitive to changes in global and local temperature variations on a range of timescales. For example, over recent years there is evidence of a reduction in Arctic sea ice extent, a decline in the mass of the Greenland ice sheet, a reduction in snow cover, and a decline in permafrost extent. This has a potential impact on the Earth's climate system by influencing the surface energy budget, sea levels, and atmospheric and oceanic circulations. See Vaughan et al. (2013) for a comprehensive assessment of trends relating to the cryosphere and the impacts on the climate system.
Here we consider a number of components of the cryosphere in the historical simulations, which include sea ice extent and volume, Antarctic and Greenland ice sheet surface mass balances, permafrost, and snow cover extent. We investigate whether the simulated response to climate variability is similar to observations.

Near-Present-Day Sea Ice, 1990-2009
Figures 9 and 10 compare 1990-2009 average summer and winter sea ice concentration and extent, in both hemispheres, for GC31-LL and GC31-MM, respectively with the HadISST.2.2 data set of Titchner and Rayner (2014). The modeled and observed sea ice extent-the total area of ocean with at least 15% ice cover-are included as annotation. . Areas with sea ice less than 15% are masked out. Also shown (orange lines) are corresponding 15% sea ice concentration contours from the HadISST.2.2.0.0 data set (Titchner & Rayner, 2014). Inlaid boxes show the observed and modeled sea ice extent (units: 10 6 km 2 ).
In the Arctic there is generally good agreement between the modeled (shading) and the observed (orange lines) location of the ice edge as depicted by the 15% concentration contour. In the summer the Arctic sea ice is in very good agreement with HadISST.2.2 within the Arctic Ocean basin, except for in the eastern Kara Sea where the sea ice cover in both resolution models is too low. The winter ice extent in the GC31-LL configuration is also in good agreement with HadISST.2.2, but there are a few differences in the GC31-MM configuration, which has too little winter ice in the Sea of Okhotsk and the Labrador Sea and a larger overestimation of ice north of Iceland. The GC31-MM model has a more realistic path for the North Atlantic current (Menary et al., 2018) and the deep convection occurs principally in the Labrador Sea, while in GC31-LL the deep convection occurs south of Svalbard . Consequently, the GC31-MM winter sea ice is less extensive in the Labrador Sea and more extensive in the Greenland Sea.
In the Antarctic the GC31-MM model sea ice is under extensive, which is caused by the Southern Ocean warm bias being much higher in the ORCA025 ocean used in the GC31-MM configuration (Hyder et al., 2018). The GC31-LL model has a weaker Southern Ocean warm bias leading to more ice and hence better agreement with HadISST.2.2.
Although much more realistic than for the higher-resolution GC31-MM configuration, Southern Hemisphere sea ice in the GC31-LL configuration is still a little less extensive than observed. In summer ice is low in the Amundsen Sea, the north and eastern Weddell Sea, and most of east Antarctica; in winter ice cover is low in the eastern Ross Sea and east of the Weddell Sea sector south of Africa-the regions of strongest Southern Ocean warm bias .  Figure 9 but for GC31-MM. Note that the observed extent values are different than those reported in Figure 9 because they are calculated independently for each resolution using the relevant model grid. Figure 11 shows the evolution of summer and winter sea ice extent in both hemispheres. Total basin-wide NH sea ice extent ( Figure 11a) from GC31-LL agrees well with HadISST.2.2 in both March and September. Winter NH extent in GC31-MM is lower than for GC31-LL and is low compared with HadISST.2.2. This is consistent with Figures 9 and 10, which show that GC31-MM has too little winter ice in the Sea of Okhotsk and the Labrador Sea. This offset is apparent throughout the whole 1850-2014 historical period, in line with the offset in the piControls. However, GC31-MM March sea ice is close to the NSIDC sea ice index (Fetterer et al., 2017) and so these are within observational uncertainty. The September ice extent is very similar for both model resolutions. NH sea ice extent shows some evidence of multidecadal response to the forcing throughout the historical period but is mostly within 1 standard deviation of the 500-year mean of the piControl, until around 1995 when the ice starts to decline.

Evolution of Sea Ice Extent and Volume, 1850-2014
Trend lines and gradients are also included on Figure 11. The 1985-2014 trends in GC31-LL are always higher than in GC31-MM. The summer decline is much larger than the winter decline in both models, consistent with observations. However, the difference between winter and summer trends is not as high as observed. GC31-MM agrees well with observations in the winter while GC31-LL is too large; meanwhile in summer GC31-LL agrees well with observations while GC31-MM is too low.
The Southern Hemisphere sea ice extent (Figure 11b) is roughly constant, within 1 standard deviation of the piControl, throughout most of the historical period until around 1985 when the winter ice starts to decline. This is contrary to the observational data sets, which show a small increase in Antarctic extent for this period. The extent increase associated with the mean observed trends over the 1985-2014 period is generally smaller March (lower). In all panels individual members are plotted with thin colored lines and ensemble means are the thick colored lines. Colored shading is added between the ensemble minimum and maximum. The long-term mean from the corresponding piControl is plotted as the colored dotted line; ±1 standard deviation is indicated by the colored chain lines with gray shading in between. Linear regressions for the ensemble mean 30-year period 1985-2014 are included; trend slopes with 95% confidence intervals are displayed to the right of each panel. (unit: 10 6 km 2 ). than 1 standard deviation of the piControl simulations, except in winter (September) when the HadISST.2.2 increase is above 1 standard deviation of the GC31-LL piControl. However, the extent increase associated with the 95% confidence interval of the observed trends is quite large and spans the piControl standard deviation.
Arctic sea ice volume is an important climate indicator because it can be considered a consequence of the integrated surface energy balance of the Arctic Ocean. As sea ice volume depends on both ice thickness and areal extent, it is therefore more directly tied to climate forcing than the areal extent alone. Figure 12 shows annual mean NH sea ice volume for both model configurations compared with the PIOMAS reanalysis of Zhang and Rothrock (2003). NH sea ice volume increases through the historical period, which is a consequence of the aerosol-induced cooling of the planet common to both GC31 configurations Mulcahy et al., 2018). With extent remaining fairly level, this means that thickness is increasing until around 1985 when it starts to decline rapidly. Over the 30-year period from 1985-2014, the volume trend is larger than for the PIOMAS reanalysis. There is much larger internal variability in sea ice volume than extent, attributable to the fact that Arctic winter ice extent is constrained by the coastline and SSTs.
Year-to-year volume is well correlated with hemispheric and global atmospheric temperature (not shown).

Antarctic and Greenland Ice Sheets
Ice sheets gain mass through snowfall and deposition and lose mass through sublimation, surface melt, and subsequent meltwater runoff. This component is known as the surface mass balance (SMB) and is positive when the ice sheet is gaining mass (Lenaerts et al., 2019). The Greenland and Antarctic ice sheets in  Figure 11 but for Northern Hemisphere annual mean sea ice volume compared with the PIOMAS reanalysis (Zhang & Rothrock, 2003). (unit: 10 3 km 3 ).
HadGEM3-GC3.1 are static, that is, the model does not include ice sheet dynamics. The surface snow is depicted by a multilayer-layer scheme with prognostic grain size and associated albedo (Walters et al., 2019). The snow is initialized to several hundred meters depth to permit continued melt in all future scenarios. Lacking an underlying ice component, if the snow were to be depleted, then no further runoff from melt would be generated. The ice sheet loss, mechanisms of ice shelf basal melt, and iceberg calving are depicted as a freshwater flux at the ice shelf front and a source of Lagrangian icebergs at glacier and ice shelf fronts. The spatial distribution of ice shelf basal melt and iceberg calving is based on climatological observations. The overall fluxes are then weighted to balance the SMB of the piControl. The weighted fluxes are held constant throughout the historical and future scenarios as there is no model mechanism to predict changes. Through this mechanism, the global ocean maintains mass and salinity throughout the control. If the ice sheet SMB changes in historical and future scenarios, then the ocean mass and salinity respond accordingly. In Antarctica there is little runoff because the environment is extremely cold (Lenaerts et al., 2012), and as a consequence the SMB is positive. An estimation of the Antarctica SMB, through regional models forced by ERA-Interim, is approximately 2,200 Gt/year (Agosta et al., 2019). It is anticipated that increasing regional temperatures will increase snowfall and the SMB will consequently increase. In Greenland, runoff is an important process of mass removal (Ettema et al., 2009). Between 1961and 1990, the SMB is calculated to be approximately 450 Gt/year (Ettema et al., 2009). Rising temperatures are expected to increase melting faster than increasing snowfall, and the Greenland SMB is expected to decline, likely becoming negative this century.
The SMB calculated from the historical simulations is shown in Figure 13 for both the Greenland and Antarctic ice sheets and for both model resolutions. Variations in the SMB are indicators of global temperature change which influences both snowfall and melt. In observations, Greenland shows a sharp decline in SMB from an average of 422 ± 10 Gt/year prior to 1990 and an average of 286 ± 20 Gt/year from 2000-2018, indicating increased melting (Mouginot et al., 2019). Over the same historical periods, GC31-MM has 431 and 302 Gt/year (2000Gt/year ( -2014 and GC31-LL has 307 and 191 Gt/year, respectively. The Antarctic SMB remains constant through the historical period with observations indicating a value of 2,200 ± 111 Gt/year (Agosta et al., 2019). Over this period, the SMB of GC31-LL is 1,842 Gt/year and GC31-MM is 2,049 Gt/year. Thus, GC31-MM agrees well for both ice sheets, within the observational uncertainty, but GC31-LL is producing a rather low SMB. This pattern is consistent with increased orographic snowfall over the steeper coastal slopes at higher resolution (Ettema et al., 2009).

Permafrost and Snow Cover
The presence or absence of permafrost is, to first order, determined by the annual mean air temperature. Chadburn et al. (2017) constructed a broad relationship between the probability of permafrost occurrence and the annual mean air temperature. The relationship showed that, on average, for any grid cell with a mean annual air temperature less than −4.3 • C there is greater than 50% probability of finding permafrost in an equilibrium climate. The WFDEI 2-m air temperature data set (Weedon et al., 2014) has an area of around 15.2 × 10 6 km 2 (1995-2014) which tallies fairly well with the observed extent of permafrost (Chadburn et al., 2017). GC31-LL has an area of 16.6-17.9 × 10 6 km 2 and GC31-MM generally has a slightly smaller area of 16.4-16.7 × 10 6 km 2 across the ensemble members over the same period. This suggests that the GC31-MM historical experiment is slightly warmer in the Arctic than the GC31-LL historical experiment. However, both experiments have a colder Arctic than the observations. Figure 14 shows the change in area of the land surface where the mean annual air temperature is less than −4.3 • C. In the regions where mean annual air temperature is less than −4.3 • C, it is more than 50% probable that permafrost will exist (Chadburn et al., 2017). The change is not significant in the control runs but is significant in the transient runs and in the observations. The observations show a decrease in area of 0.55-0.58 × 10 6 km 2 per decade. In comparison, the decrease in the model is approximately double, with a loss of 0.9-1.7 × 10 6 km 2 per decade. The GC31-MM experiment warms slower in the Arctic than the GC31-LL experiment, with a slightly smaller reduction in land area with a temperature less than −4.3 • C. These results are derived using air temperature alone, but the findings of Chadburn et al. (2017) suggest that they are a suitable proxy for the presence of large-scale permafrost. Therefore, on the basis of air temperature alone, the historical experiments may well predict a greater loss of permafrost than might be expected over the twentieth century. It should be noted that variable land surface characteristics will moderate the deep soil temperatures and hence the local presence or absence of permafrost. This is particularly relevant in the regions with discontinuous or sporadic permafrost. For example, deeper snow will insulate the soil from the cold air in the winter and reduce the probability of permafrost occurrence. Other influencing factors include topography, hydrology, soil properties, and vegetation. Many of these factors are included in the land surface component.  There has been an observed significant reduction in northern high latitudes snow cover extent of 0.7 × 10 6 km 2 per decade in March-April-May (Estilow et al., 2015) during the period between 1972 and 2012. The modeled loss of snow cover extent ( Figure 15) is slightly lower than the observed loss (Robinson & Mote, 2014) with reductions ranging from 0.28 to 0.71 × 10 6 km 2 per decade.

AMV
We now turn our attention to the North Atlantic in order to consider the representation of the observed AMV. The observed North Atlantic SST has generally evolved differently to the rest of the globe (Sutton et al., 2018). Specifically, it has gone through periods where it warmed faster than the global mean SST (e.g., the 1920-1950s and 1990s-2010s) and periods where it cooled (e.g., the 1960s-1980s). Furthermore, AMV has been linked to many important climate impacts, including rainfall anomalies over Northern Europe and the United States, the strength of monsoons worldwide, and changes in the number of hurricanes (Monerie et al., 2019;Sutton & Dong, 2012;Sutton & Hodson, 2005;Zhang & Delworth, 2006). Most models, including HadGEM3-GC3.1, simulate AMV-like modes of variability spontaneously in their control simulations (Menary et al., 2018;Wills et al., 2019). However, there is evidence that radiative forcing changes have played an important role in shaping the AMV over the twentieth century (Booth et al., 2012;Chang et al., 2010;Watanabe & Tatebe, 2019). Therefore, here we explore evidence for a forced response of AMV in the GC31-LL and GC31-MM historical simulations.
A simple measure of AMV is the basin mean SST anomalies, which we compute over the domain 7.5:75 • W and 0:60 • N. As simple North Atlantic basin mean AMV indices do not discriminate between local and globally forced signals (Qasmi et al., 2017;Trenberth & Shea, 2006), we also compute an AMV index after removing a global mean signal (hereafter called AMV-GMT). We compute a near-global mean ensemble mean SST index (GMT) by taking the area-weighted mean of all grid points outside the AMV box (7.5:75 • W, 0:60 • N) and between 60 • S and 60 • N (hence excluding the poles to permit a more robust comparison with observations). We then remove this ensemble mean GMT index from the ensemble mean AMV index, by linear regression, to produce an ensemble mean AMV-GMT index. We compute the same indices in the historical simulations and in observations (HadISST.1.1 Rayner et al., 2003). The comparison of both AMV indices allows us, therefore, to quantify how much of the variability in the North Atlantic is related to global or local changes.
Using simple basin mean AMV indices, we find that the GC31-LL and GC31-MM historical simulations appear to simulate much of the observed AMV index. Figures 16a and 16b show that both resolutions simulate pronounced multidecadal SST variability within the North Atlantic. The forced basin mean AMV signal, represented by the ensemble mean, is large compared to the ensemble spread. Furthermore, the evolution and timing of the simulated basin mean AMV index is also similar to HadISST, with the correlation of model and observed indices of r = 0.7 for both resolutions. However, the basin mean AMV index is not reproduced perfectly. For example, the magnitude of the multidecadal variability is smaller than in HadISST, particularly during the 1940-1960s, and the 1930s warming also appears to occur earlier (by approximately a decade) when compared to observations. Furthermore, the observations lay outside the ensemble spread during the cooling in the 1920s, and the warming between 1930 and 1950. These discrepancies could show model deficiencies in simulating AMV (i.e., an underestimation of the forced response or internal variability or both). But we cannot rule out that these decades represent an exceptional phase of variability not captured by the small ensemble.
Once the globally forced signal is removed, however, the agreement between simulated and observed AMV (i.e., the AMV-GMT) is much weaker (see Figure 16c). Removing the globally forced signal from the models reduces the magnitude of the forced signal of AMV. In other words, a large component of the total forced AMV signal in the simulations is not unique to the Atlantic. However, this is clearly not the case in observations, where the timing of phases of AMV (i.e., the warm period between 1940 and 1960s and the cool period from 1970s-1990s) is largely unchanged in the AMV-GMT. As a result, correlations between the model and observed AMV-GMT indices drop to r = 0.2.
When we consider the spatial fingerprint of the Atlantic local forced response of AMV (i.e., after the globally forced signal is removed), there is evidence for a forced response in the tropical North Atlantic, consistent with previous studies (i.e., Booth et al., 2012;Chang et al., 2010). Figures 16g and 16h show the grid point linear regression of ensemble mean SST-GMT (SST with the global mean signal (GMT) removed at each gridpoint via linear regression) on the ensemble mean AMV-GMT (Figure 16c) in both models. Both model resolutions show a clear forced signal in the tropical North Atlantic (i.e., with large regression coefficients), and the ensemble mean AMV-GMT index explains more than 40% of the local variance in ensemble mean SST-GMT. In terms of spatial pattern, the tropical fingerprint compares well with that observed (see Figure  16i-here the observed AMV-GMT (Figure 16c) is removed from observed SST via linear regression). However, in the extratropical region, the spatial pattern does not compare well with observations. In the models, the main anomalies are focussed in the Gulf Stream extension region, with little or no signal in the subpolar North Atlantic (between 50 • N and 65 • N). However, the subpolar North Atlantic is the dominant region of the observed AMV. The extratropics also explains a much smaller fraction of the ensemble mean variance (i.e., < 10%). Given that the internal AMV fingerprint in the piControl simulations of HadGEM3-GC3.1 is extratropical (i.e., Menary et al., 2018) in both resolutions, these results support the view that tropical AMV has a locally forced component but that internal variability is more important for the extratropical AMV (Watanabe & Tatebe, 2019).
There is also evidence for a forced response in another important metric of the North Atlantic, the AMOC. Figures 17a and 17d show that the AMOC at 40 • N increases from 1850-1990 in both resolutions (GC31-LL: 0.23 Sv per decade, GC31-MM: 0.16 Sv per decade). The simulations also suggest that there is a multidecadal forced AMOC signal that goes beyond the longer timescale forced AMOC trends found in earlier studies (Delworth & Dixon, 2006;Menary et al., 2013). This increase is large compared to the internal variability and moves the AMOC at 40 • N outside the range of variability seen in the piControl. Following 1990, both model resolutions show a decline in the AMOC. The decline in the AMOC is larger in GC31-MM than in GC31-LL (−1.8 and −0.85 Sv per decade, respectively). Nevertheless, the evolution of the ensemble mean AMOC on decadal timescales is similar between resolutions (correlation r = 0.7 after linear detrending and smoothing with a 5-year running mean). AMOC changes are difficult to verify due to a lack of observations, but  (2004)(2005)(2006)(2007)(2008)(2009)(2010)(2011)(2012). This is consistent with recent studies (Roberts et al., 2014;Yan et al., 2018) that suggest that the AMOC variability in models may be smaller than is observed. However, the overlapping confidence intervals of the models and observations present some uncertainty in this conclusion.
AMOC changes are usually associated with changes in upper ocean temperatures, and Figures 17b and 17e also show that there is a forced response in the subpolar gyre (SPG) in the North Atlantic (40-60 • N) 0 to 700 m average temperatures. At low resolution (GC31-LL, Figure 17b), the upper ocean temperature follows the pattern of cooling and warming seen in the wider AMV-GMT (Fig 16b)-a cooling until the 1970s of about 0.4 K, followed by a rapid warming of about 0.5 K. Interestingly, this evolution compares relatively well with the changes in the observations post-1950, but the ensemble mean does not capture the recent cooling since 2005, which has been in part attributed to a weakening AMOC (Robson et al., 2016). The high-resolution experiment (GC31-MM, Figure 17e) appears to undergo a long-term warming trend modulated by periods of cooling, including after 2000. GC31 does not capture the observed decadal salinity changes in the SPG (not shown); both resolutions show clear long-term trends not seen in observations and do not capture the post-1970s Great Salinity Anomalies. The relationship between upper ocean SPG temperature and salinity will be part of an extended analysis of the SPG for a future study. It is important to note that the upper ocean temperature changes are not only reflecting changes in AMOC (compare Figures 17a  and 17b, for example); changes in surface fluxes may also be important. The apparent lag between AMOC changes and changes in SPG mean temperature suggests that the SPG temperatures are driven by a rapid response to surface shortwave fluxes, combined with a lagged response to ocean heat transport.
The forced response in the North Atlantic in the historical experiments (at both resolutions) is largely consistent with previous studies examining the competition between anthropogenic aerosols and the long-term greenhouse gas forcing. For example, Figures 17c and 17f show that the net shortwave absorbed over the subpolar North Atlantic decreased substantially between the 1850s and 1980s, before subsequently increasing. These changes in net shortwave are consistent with sulfate aerosol precursor emissions, which increased over North American and Europe up until the 1980s, before a large subsequent reduction (Booth et al., 2012). Therefore, as in previous studies, we find evidence that anthropogenic aerosol forcings have played a role in driving AMV (Booth et al., 2012;Watanabe & Tatebe, 2019) and AMOC (Delworth & Dixon, 2006;Menary et al., 2013;Undorf et al., 2018) in historical simulations. However, based on our analysis alone, we cannot rule out a role for natural forcing factors (Ottera et al., 2010). We note that, as with previous models (Zhang et al., 2013), HadGEM3-GC3.1 does not explain all the observed changes in AMV. However, we cannot rule out that these results are affected by the strong aerosol forcing in HadGEM3-GC3.1 Mulcahy et al., 2018), or a strong response to aerosol forcing, which may explain the strong correlation between Atlantic and Global forced responses in Figure 16. If the real world had a smaller role for aerosol forcing than that simulated in this model, then we would expect any forced response to have explained a smaller fraction of real-world variability. However, we also note that the comparisons with observations are subject to various uncertainties, including uncertainty in estimating the forced response due to the presence of internal variability in both models and observations and due to the uncertainty in producing the observations themselves.

Influence of Volcanic Eruptions on Surface Temperatures
Major tropical volcanic eruptions have a large impact on global climate for several years and should be included in historical experiments in order to capture an important component of subdecadal timescale climate variability (Smith et al., 2016). In this section, we attempt to quantify the impact of volcanic eruptions on global surface temperature. We use the approach of Driscoll et al. (2012) who looked at the impacts of volcanoes in CMIP5 models and include the five volcanoes that they used: Krakatau (erupted August 1883), Santa Maria (October 1902), Mount Agung (March 1963), El Chichón (April 1982, and Mount Pinatubo (June 1991). These are all large tropical volcanic eruptions. We avoid extratropical volcanoes as they appear to have a different impact on the circulation in the NH winter (Stenchikov et al., 2006). For each volcano, we calculate a climatology from the years preceding the volcanic eruption, the same approach as used in Driscoll et al. (2012), and subtract the climatology from the years following the volcanic eruption to obtain anomalies. These anomalies are a mixture of the impacts of the volcano and internal variability in the climate system, so we average over all volcanoes to reduce the influence of the latter. Figure 18 shows how the global mean surface air temperature anomalies develop after a large volcanic eruption. In both GC31-LL and GC31-MM, the experiments cool, on average, approximately 0.2 Kafter 12 months. Subsequently, the temperature returns to close to normal about 4 years after the eruptions. There is a lot of variability across the range of ensemble members and volcanic eruptions, with some events not appearing to cool. The observed temperature anomalies from HadCRUT4 follow a very similar evolution to the models. The biggest difference with the model is in the first 12 months when models appear to cool more quickly, but given the large variability in model responses, this difference is likely due to chance. In conclusion, both GC31-LL and GC31-MM have realistic amounts of cooling in the global mean surface air temperature compared to HadCRUT4.
We now consider the pattern of surface temperature anomalies in the two NH winters following the eruptions. Past studies have shown that these feature counter-intuitive warming at high latitudes (Driscoll et al., 2012;Stenchikov et al., 2006) and in the tropical East Pacific (Khodri et al., 2017;Stevenson et al., 2017). Figure 19 shows surface temperature anomalies averaged over the first two winters (December-January-February) after the eruptions. As expected, there is widespread cooling in the tropics for both model resolutions and HadCRUT4. The exception is the tropical East Pacific, which has no significant anomalies in GC31-LL and some positive anomalies in HadCRUT4. If this tropical response is real and not due to chance in the observations, then these integrations do not capture it fully. In the extratropics, the main feature in HadCRUT4 is a warming greater than one degree over Scandinavia and northern Russia and a cooling of slightly less than a degree in the Middle East and South East Asia. A similar response is captured by both GC31-LL and GC31-MM, but the model anomalies are weaker and not as widespread. In conclusion, both GC31-LL and GC31-MM capture the patterns of the temperature response in the first two NH winters seen in HadCRUT4, but the magnitude of the anomalies is often weaker.

Influence of the Solar Cycle on Atmospheric Temperatures and Precipitation
The 11-year SC influences the entire atmosphere throughout the historical period, either directly by absorption of spectral solar irradiances (SSI) or indirectly through complex coupling processes involving chemistry and dynamics (Gray et al., 2010;Seppala et al., 2014). While the total solar irradiance varies by merely ∼1 W/m 2 (approximately 0.1% total solar irradiance) within the 11-year SC, the largest SSI variability is observed in the shorter wavelengths, causing significant stratospheric heating and ozone production (Haigh, 2013). This direct radiative-chemical interaction warms the upper stratosphere significantly and leads to subsequent dynamical changes which are proposed to transfer the solar signal to the surface, influencing the climate in the North Atlantic region (Andrews et al., 2015;Gray et al., 2013). We detect the SC contribution in the historical simulations with a multiple linear regression model that takes into account the SC, volcanic eruptions, El-Niño-Southern Oscillation, Quasi-Biennial Oscillation (Baldwin et al., 2001), and greenhouse gases as described in Misios et al. (2016). Given the periodic nature of the SC fluctuation, long-term background trends (insignificant in the stratosphere) do not bias the detection of solar signatures and hence we did not detrend the time series prior to analysis. The model calculates lagged SC regression coefficients by lagging the SC predictor by −5 to +5 years. The null hypothesis of zero regression coefficients is rejected with a P value smaller than 0.1 using a two-tailed t test.
Given that the same SSI and ozone concentrations are prescribed in both GC31-LL and GC31-MM, we expect very similar direct atmospheric heating in response to the SC in both model versions. Figure 20 shows this to be the case. The global atmospheric warming at solar maximum is about 0.7 K peaking in the upper stratosphere/lower mesosphere. There is no evidence of a time lag in the stratosphere and mesosphere, as anticipated by a direct heating. Analysis of zonal mean air temperature reveals a band of significant warming between 5 and 0.1 hPa, with a maximum in the upper tropical stratosphere of 0.8 K. The pattern of atmospheric warming shown in Figures 20b and 20d is consistent with the previous high-top version of the model (HadGEM2-CC) used for CMIP5 experiments (Mitchell et al., 2015), despite the different top-of-atmosphere (TOA) SSI and ozone concentrations that were specified.
The global mean atmospheric warming is considerably weaker in the troposphere. It does not exceed 0.2 K and is characterized by a time lag of 1 year (Figures 20a and 20c). A similar time lag is also found in the global mean surface temperature (not shown), presumably associated with the ocean absorption of the visible and infrared spectral range, leading to direct warming at the Earth's surface. This warming, however, is not distributed evenly over the Earth's surface because dynamical amplification through atmosphere-ocean interactions has the potential to amplify or dampen surface responses.
As shown in Figure 21, we find that the surface temperature response to the SC does not depend critically on the horizontal resolution given that both model configurations show surface patterns of stronger warming in the tropics. GC31-LL shows a warmer tropical Pacific and Atlantic Oceans with magnitude exceeding 0.2 K at lag +1 year. The surface response gets stronger in GC31-MM, where we find a broad warming over the Pacific taking a horseshoe shape in the North Pacific similar to the SC signature detected in the HadCRUT data set (Misios et al., 2016). Recent model simulations and analysis of historical precipitation records have identified a pattern of increased precipitation in the western Pacific at solar maximum (Misios et al., 2019). Figure 21 shows that a similar pattern of precipitation response is simulated at solar maximum. Both model configurations show increased precipitation of about 1 mm/day at lag +1 year in the equatorial Pacific. In GC31-LL, increased precipitation is found concentrated in the western Pacific, while GC31-MM shows increased rainfall over a broader region in the equatorial Pacific. This pattern of precipitation response is likely to be related to a slow down of the Walker Circulation associated with global changes in the hydrological cycle at solar maximum, but further analysis is needed to disentangle the actual mechanisms. Figure 20. SC regression coefficients of annual global mean air temperature at different time lags (panels a, c) and zonal mean air temperature at lag 0 year (panels b, d). Units in K for a typical SC. Shading denotes statistical significance for p values ≤ 0.1.

Understanding the Model's Global Climate Response
In this section we use a simple energy balance framework to investigate plausible interpretations of the model's GMSAT time series (Figure 3). The model changes are within the envelope of the CMIP5 historical simulations, but we wish to develop a deeper understanding of what is driving the model GMSAT variations and their differences from observations. It is important to note that a successful simulation of the historical GMSAT time series does not guarantee that the model is correctly simulating the processes responsible for the historical changes (Golaz et al., 2013) nor that the model will correctly project the response of the climate system to future forcing (Gregory et al., 2019). Nonetheless, by identifying likely causes of model-observation differences, we develop insight into their likely impact on future projections and inform future process-based model development.
We focus on changes over certain periods to avoid the issue with choice of reference period mentioned in section 2.1. In particular, we wish to understand the weaker than observed warming from around 1930-1950, the stronger than observed cooling from around 1950-1980, and the stronger than observed warming from around 1980-2014.
Our analysis is based on the simple energy balance model: Figure 21. SC regression coefficients of annual mean surface air temperature (panels a, b) and total precipitation (panels c, d) at time lag +1 year. Units in K and mm/day for a typical SC. Shading denotes statistical significance for p values ≤ 0.1.
where F(t) is the effective radiative forcing (ERF; Myhre et al., 2013), N(t) is the TOA heat flux, T(t) is the global mean surface temperature anomaly, and (t) is a time-or state-dependent climate feedback parameter (inversely proportional to effective climate sensitivity). Taken over the whole period, there is a potential compensation between forcing and feedback. Specifically, a model with strong aerosol forcing (cooling) and high effective climate sensitivity could produce a similar temperature change over the whole historical period to a model with weak aerosol forcing and low effective climate sensitivity (Kiehl, 2007). By splitting the historical period into a few multidecadal periods, each characterized by different forcing trends, it may be possible to differentiate better between the drivers of the model-observation differences. A similar framework was used recently by (Golaz et al., 2019) to interpret the historical simulations in E3SM, another CMIP6 atmosphere-ocean general circulation model (AOGCM). Here we extend the approach of Golaz et al. (2019) by allowing to vary with time. This is motivated by the analysis of Gregory et al. (2019) who show that typically varies by a factor of 2 through the historical period in the CMIP5 historical runs. We ask in turn: Can the differences between observed and modeled GMSAT over the three periods be explained by differences in F, in , or in internal climate variability, between the real world and the model?

Effective Radiative Forcing F(t)
ERF is a measure of the radiative forcing associated with particular climate forcing agents, including quasi-instantaneous feedbacks due to rapid adjustment of the atmosphere. Figure 22 shows the time series of F for the GC31-LL model, derived and described in Andrews et al. (2019)    is consistent with both of these ranges but several tenths of W/m 2 below (i.e., stronger than) the modal value of both. We are not aware of uncertainty estimates for the aerosol ERF at intermediate times within the twentieth century, but we note that the discrepancy with the AR5 central estimate is particularly large around 1950-1980. Note that the strong aerosol ERF in GC31-LL results in a total ERF that is decreasing between around 1930 and 1960. Because the TOA heat flux is always partially adjusted to the history of the ERF, the TOA flux can become zero during this period, even though the ERF is still positive relative to 1850. This accounts for the relatively constant OHC in the decades preceding 1960 ( Figure 5).
Overall, we conclude that it seems likely (though by no means certain) that the modeled aerosol ERF is somewhat too strong through the twentieth century and especially in the 1950-1980 period. Such a bias would be consistent with the strong GMSAT cooling during that period.

Climate Feedback Parameter (t)
We calculated (t) for the GC31-LL historical ensemble and for the corresponding AMIP-piForcing run (which has constant preindustrial forcing and prescribed SST from observations; see Webb et al., 2017), using the method of Gregory et al. (2019) (Figure 23). Both GC31-LL and AMIP-piForcing show a similar time variation of (t) as seen for the CMIP5 models by Gregory et al. (2019) (their Figure 5c) but with overall values of around 50% greater in GC31-LL than in CMIP5. The 30-year mean value of at the end of the GC31-LL run is 0.8 W m −2 K −1 , while for the GC31-LL AMIP-piForcing run it is 1.8 W m −2 K −1 . Assuming an ERF due to doubled CO 2 concentration of 4.05 W/m 2 , these correspond to effective climate sensitivities of 5.1 • C and 2.2 • C, respectively. The corresponding effective climate sensitivities for CMIP5 are 3.5 • C and 1.7 • C, respectively. Thus, we see that the apparently strong effective climate sensitivity in recent decades likely reflects both the known higher-than-average climate sensitivity of GC31 Bodas-Salcedo et al., 2019) and time variations in driven by discrepancies between modeled and observed SST patterns in recent decades (Gregory et al., 2019), with the time variation being the dominant factor. We discuss the time variation further in the following paragraph. Gregory et al. (2019) suggest that the lower recent values of in the CMIP5 AOGCMs, compared with AMIP-piForcing, are due to the recent IPO-like observed SST pattern in the Pacific Ocean, which is associated with large but is not seen in the AOGCMs. They propose that the discrepancy is either because the models' internal IPO variability is too low or because the SST response pattern to the volcanic forcing of the 1960s-1990s, which resembles the IPO pattern, is too weak. During the assessment of GC31's IPO variability (section 1.4), we found no evidence that the model variability was weaker than the (limited) observational record. However, the pattern of the short-term response to volcanic forcing is much weaker in the model than in the observations (see Figure 19, section 5). This suggests that for GC31, an underplayed volcanic response may be contributing to the underestimate of in recent decades, as proposed for CMIP5 by Gregory et al. (2019). Further investigation using partially forced runs would be needed to confirm the cause but is beyond the scope of the present paper.

Internal Variability
The stronger than observed cooling between about 1950 and 1980 and stronger than observed warming between about 1980 and 2014 are seen in all ensemble members for GC31-LL and GC31-MM (eight members in all), and in all ensemble members of the UKESM1 Earth System Model, which uses GC31-LL as its physical climate core (Sellar et al., 2019) (16 further ensemble members). It is therefore unlikely that these features can be explained by internal climate variability, unless the models are underestimating the internal GMSAT variability of the real climate system.
In the piControl simulations of GC31-LL and GC31-MM we see interdecadal variations in GMSAT with detrended standard deviations of 0.055 and 0.062 • C, respectively, similar in magnitude to the variations in the first part of the historical observed data set, before strong forced responses are expected (Figure 3). The corresponding figures for the NH only are 0.084 • C and 0.088 • C, which is comparable to the interdecadal variations seen in reconstructions of whole Northern Hemisphere surface temperature over the past 2,000 years (Masson-Delmotte et al., 2013). So there is no obvious evidence to suggest that the internal interdecadal variability of the model GMSAT is too low.
For the earlier period 1930-1950 a few members of the larger UKESM1 ensemble do appear to capture the warming from 1930-1950((Sellar et al., 2019, Figure 33), so internal variability is a plausible explanation of this difference.

Summary of Factors Affecting GMSAT Response
Based on our analysis above, we can draw some tentative conclusions about the causes of the discrepancies with the observed GMSAT time series: 1. The weaker-than-observed warming from around 1930-1950 in the GC31 historical ensembles is most likely due to some combination of strong ERF from anthropogenic aerosols and internal climate variability that is undersampled by our ensembles. We have been unable to separate these two causes in our analysis, but note that some members of the larger UKESM1 ensemble do capture the observed warming in this period. We cannot rule out a role for other forcing agents, for example, ozone which shows an accelerated increase during this period. 2. Internal variability is unlikely to explain the multidecadal differences between the models and observations since 1950. The most likely explanation of the strong cooling between 1950 and 1980 appears to be aerosol forcing that is stronger in the models than in the AR5 central estimate. However, we note the large uncertainty in the AR5 aerosol ERF estimate, as well as in more recent estimates. 3. In common with CMIP5 models (Gregory et al., 2019), the climate feedback parameter (t) is too low (effective climate sensitivity too high) during the decades since 1980, when compared with estimates of made using observed SSTs. The too low is likely contributing to the strong warming since 1980. It seems most likely that a too muted SST pattern response to the volcanic eruptions of the 1960s-1990s is contributing to the erroneous SST patterns, but this would require further integrations to confirm. In any case the strong time dependence of during this period suggests that warming over recent decades may 10.1029/2019MS001995 not be a good indicator of the response to substantially higher levels of greenhouse gases over the 21st century (Gregory et al., 2019). 4. The strong modeled rate of warming since 1990 is partly a result of the strong cooling spike following the 1991 Mount Pinatubo eruption (around 0.2 • C stronger than observed). Around 0.1 • C of this difference may be attributable to the moderate-to-strong El Niño observed in 1991-1992(Folland et al., 2013. Since the UKESM1 model (Sellar et al., 2019) uses GC31-LL as its physical climate core and its simulated GMSAT evolution is very similar to the runs studied here, we believe that our analysis applies also to UKESM1. Deeper study is needed to confirm our conclusions and is planned for a later paper.

Summary
We briefly summarize the key results arising from the analysis of the historical ensembles, GC31-LL, and GC31-MM. Various aspects of the simulated climate system have been assessed against available observations and shown to produce a reasonable representation of historical climate variability. The overall warming trends in the global mean surface air temperature between 1880-1891 and 2000-2014 are similar to observations; this is despite global mean land surface air temperature trends being smaller than observed. On average, the global temperature response to major tropical volcanic eruptions after 12 months exhibits realistic amounts of cooling, although internal variability complicates the comparison for individual eruptions. The pattern of warming in the high-latitude NH for the first two winters after these eruptions is captured, albeit weaker than observed. The observed low-to middle-latitude pattern of SST response to volcanic eruptions is weaker in the models, and this may have important implications for recent simulated rates of warming. The atmospheric response to SC variability shows a warming in the upper stratosphere-lower mesosphere at solar maximum consistent with the CMIP5 response. The two model resolutions exhibit similar temperature responses to the combination of historically evolving greenhouse gases, aerosols, volcanic eruptions, and solar variability.
The overall time series of global temperature in the historical runs resembles the observed time series but with exaggerated cooling from around 1930-1980 and stronger warming from 1980-2014. A simple analysis of these features suggests that a strong ERF from aerosols may be playing an important role in the cooling phase, while errors in the simulation of SST patterns since 1980 result in a too strong effective climate sensitivity during this period, possibly influenced by the above-mentioned weak SST response to the volcanic eruptions of the 1960s-1990s. Internal climate variability in the observations complicates the comparison before 1950 and after 1990. Work is in progress to develop a revised version of the model with improved representation of processes affecting aerosol forcing and will be reported elsewhere.
The historical ensembles generally do not tend to reproduce observed decadal mean precipitation in an ensemble mean sense, supporting the view that observed variability is internally generated and does not have much contribution from external forcings. Nonetheless, this conclusion is dependent on the fidelity of the forcings and responses in the models discussed here and others more generally. Precipitation variability is often mediated by dynamical changes (e.g., to monsoons), for which the sensitivity to climate forcings can be difficult to simulate. This is perhaps shown most clearly by the very large uncertainties in model projections of the 21st century regional precipitation, but the same differences in model processes could also affect the simulation of the 20th century. A common theme in the model ensembles is a tropical/subtropical drying from the 1950s to 1990s. This approximately coincides with the period when the models produce a cooling or hiatus of global mean temperatures. A resulting lack of evaporation and thereby lack of atmospheric moisture availability is consistent with the decline in rainfall. In some regions, observed rainfall variability is far in excess of the spread of the model's ensemble (e.g., West Africa), suggesting deficiencies in the model's ability to generate variability, or even perhaps that the forced response could actually be underestimated in the simulations. Also of note is the forced response to the 11-year SC that results in enhanced precipitation in the equatorial Pacific.
The OHC, an important indicator of the Earth's energy budget imbalance, shows a warming in the 0 to 700 m layer consistent with observations over the period 1971-2014. The lack of warming compared to observations in the 1970s-1980s is compensated by a warming rate larger than observed after 1993. The 0 to 2,000 m layer also exhibits no warming during the early period and warms at a similar rate to observations after 1993. This suggests that the post-1993 warming of the 0 to 700 m layer is too large and for the 700 to 2,000 m layer too little, compared to observations. The Arctic summer sea ice extent for 1990-2009 is in agreement with observations for both resolutions of the model. Differences arise in the winter extent; the GC31-LL experiment is in good agreement with observations, whereas GC31-MM is underextensive in the Sea of Okhotsk and Labrador Sea. This may be linked to the differences in the location of deep convection and the path of the North Atlantic current in each model. In general, the GC31-MM Arctic winter sea ice extent is less extensive than in GC31-LL over the entire historical period but is within near-present-day observational uncertainties. The negative trends in the near-present-day sea ice extent are reasonably consistent with observations at both resolutions.
For the Antarctic sea ice, both resolutions are underextensive in winter and summer for 1990-2009, more so for GC31-MM than for GC31-LL. This is linked to the Southern Ocean warm bias present in both models, which is greater for GC31-MM than for GC31-LL. The slightly positive trends in observations of the near-present-day sea ice extent are not reflected by the experiments.
The magnitude of the SMB of the Greenland and Antarctic ice sheets in the GC31-MM historical experiment is consistent with observations, suggesting that the constant SMB of the Antarctic ice sheet and the near-present-day decline in the Greenland ice sheet is captured by the higher resolution model, despite the model lacking ice sheet dynamical processes. The GC31-LL experiment exhibits a smaller SMB than observations, which may be linked to weaker coastal orographic snowfall at low resolution, but still manages to capture the magnitude of the near-present-day decline in the Greenland ice sheet. The area of permafrost, which is related to air temperature, in the low and medium resolution models is slightly larger than for observations. This is likely due to a colder Arctic in the model. The near-present-day permafrost loss rate is approximately double that of observations. Conversely, the loss of snow cover is lower than but within an order of magnitude of observations.
The principal mode of variability of North Atlantic SSTs, the AMV, appears to be associated with trends in the global mean surface air temperature in the historical simulations, unlike for the observed AMV. When trends are removed, the correlation between the model and observed AMV is greatly reduced. However, regression of North Atlantic SSTs with the AMV, having removed the influence of global mean SSTs, indicates that a significant proportion of the variance in the tropical Atlantic can be explained by forcing, that is, they are consistent across the historical ensembles. This is consistent with observations, supporting the idea that the observed tropical AMV has a forced component. It also suggests that internal variability is more important for the extratropical AMV. The evolution of the AMOC on decadal timescales is strongly correlated between the low and medium resolution experiments, exhibiting an increasing circulation up to 1990 followed by a decline. This correlation suggests that AMOC variability is somehow being forced, possibly linked to upper ocean temperatures and shortwave surface fluxes. The variations in net shortwave radiation are linked to sulfate aerosol precursor emissions (Booth et al., 2012), providing a potential link between anthropogenic aerosol forcings and variability in the AMV and possibly AMOC.
Is model resolution an important factor when interpreting the results of the historical experiments? The answer differs depending upon which phenomenon is under scrutiny. The response of the GC31-LL and GC31-MM experiments is similar in some cases and divergent in others. For example, phenomena that have a direct influence on atmospheric and surface air temperatures, such as major volcanic eruptions, solar variability, greenhouse gas concentrations, and aerosols, appear to invoke similar responses at each resolution. Equally, the similar precipitation response of both ensembles suggests that the models are responding to external forcings in a consistent way. It should be noted that the small number of ensemble members, four at each resolution, may influence this interpretation. It is possible that larger ensembles may reveal significant differences between the resolutions.
Model resolution appears to be an important consideration for sea ice extent, reflecting known differences in the mean sea ice state in the model piControl runs. In the Arctic winter, the extent of the sea ice is greater in the lower resolution experiment and is suspected to be influenced by the different centers of deep circulation in northern oceans. In the Antarctic, for both winter and summer, the sea ice is underextensive for GC31-MM compared to GC31-LL and is thought to be related to a stronger Southern Ocean warm bias in GC31-MM.
In conclusion, the results of the historical simulations performed using the HadGEM3-GC3.1 model at N96/ORCA1 and N216/ORCA025 resolutions show many features in common with the observed record over the period 1850-2014. The simulations contribute, as part of the wider CMIP6 multimodel effort, to Figure A1. Global mean surface air temperature (GMSAT) evolution for GC31-MM historical experiments with (blue) and without (red) ozone redistribution. Shaded region indicate ±2 standard deviations. The piControl trend over 500 years was removed, and the spatial anomalies found with respect to 1961-1990, prior to anomalizing to the 1880-1919 period. No observational masking has been applied.
understanding the causes of observed climate change since 1850. Future in-depth analysis of the mechanisms of change in models and observations will lead to an improved understanding of how the historical record can be used to better constrain model projections of future climate.

Appendix A: Ozone Redistribution in GC31-MM Historical Experiments From 1951 to 2014
The physical HadGEM3-GC3.1 CMIP6 model experiments use prescribed monthly ozone concentrations derived from the CMIP6 ozone concentration data sets as detailed in Eyring et al. (2016). While the climate sensitivity experiments were running, it was discovered that the height variation of the ozone tropopause, implicit in the prescribed ozone data set, did not match the changing tropopause height in the experiments. This was also found to occur, to a much lesser extent, in the near-present-day portion of the historical experiments. This mismatch results in increased concentrations of stratospheric water vapor which has a radiative influence on the surface climate sensitivity. The solution was the implementation of an ozone redistribution scheme within the model which acts to align the prescribed ozone tropopause with that of the running experiment. Hardiman et al. (2019) describe the implementation of this scheme. The inclusion of the scheme in the GC31-LL piControl simulation was shown by Hardiman et al. (2019) to have negligible impact, whereas the impact on the abrupt-4xCO2 experiment was significant due to increasing tropopause height as the climate warmed. The historical simulations are more closely aligned with the piControl than the more extreme abrupt-4xCO2 experiments, with only a small increase in tropopause height occurring toward the end of the historical period.
The GC31-LL historical simulations described in this paper use the ozone redistribution scheme throughout, from 1850 to 2014. The computationally intensive GC31-MM historical simulations could not be restarted from 1850 with this scheme due to CMIP6 delivery timescales. Hence, the ozone redistribution scheme was introduced partway through the simulation, from 1 January 1950, becoming effective 1 January 1951, over a decade prior to the impact of the Mount Agung volcanic eruption and a few decades prior to the near-present-day increase in global mean temperature.
Here we provide a brief impact assessment of introducing the ozone redistribution scheme partway through the GC31-MM historical simulation. The evolution of the ensemble mean global mean near surface temperature is shown in Figure A1. The original historical ensemble without ozone redistribution (1850-2014) is shown in red and the historical ensemble with ozone redistribution  in blue. The ensemble members of the latter are initialized using 1 January 1950 initial conditions generated by the non-ozone redistributed experiments. For clarity, the DECK historical ensemble delivered for CMIP6 is a combination of the red curve prior to 1950 and the blue curve from 1950 onward. The numerical integration of the two ensembles diverges due to the influence of the ozone redistribution scheme, all other factors being equal. Comparing the ensemble means of both experiments, the DECK experiment warms slightly prior to 1959 and thereafter remains cooler, although these differences are not significant, which is a result consistent with the piControl results described in Hardiman et al. (2019). A comparison of the mean difference over the period 1951-1955 (−0.01 K) with that of 2010-2014 (+0.16 K) suggests that the influence of ozone redistribution becomes increasingly important toward the end of the historical period, as anticipated.
We also consider the influence of the ozone redistribution scheme on the temperatures in the troposphere and stratosphere. The zonal mean zonal temperature differences between the original and the DECK historical experiments over the periods 1951-1955, and 2010-2014, shown in Figure A2, are attributable to a combination of the ozone redistribution scheme and internal variability associated with the numerical divergence of the two ensembles. A small impact on ozone concentration and zonal mean temperature is inevitable in the first few years after the scheme is introduced ( Figure A2a), but is not significant for zonal mean zonal temperature in the context of interannual variability. Comparison with the equivalent Figure  7a of Hardiman et al. (2019), demonstrates that the response in the early 1950s is commensurate with that of the piControl experiment. There is more significant impact in the near-present day ( Figure A2b), with a larger stratospheric warming response, although this is still weaker than the abrupt-4xCO2 experiment (see Figure 7b of Hardiman et al., 2019). This demonstrates that the "correction" provided by the ozone redistribution scheme is working as intended.
It should be noted that the 1950-2014 GC31-MM historical experiment without the ozone redistribution scheme, which was used by the CMIP6 Decadal Climate Prediction Project (DCPP), does not form part of the UK-CMIP6 DECK contribution. It is likely that the 1960 onward GC31-MM historical ensemble without ozone redistribution will be made available as part of DCPP.
Data Availability. The simulation data used in this study are archived on the Earth System Grid Federation (ESGF) node (https://esgf-index1.ceda.ac.uk/). The model Source IDs are HadGEM3-GC31-LL (Ridley et al., 2019a) and HadGEM3-GC31-MM (Ridley et al., 2019b). The four members of each historical ensemble are identified by the following Variant Labels: r1i1p1f3 to r4i1p1f3.