Volume 121, Issue 12 p. 6969-6992
Research Article

Uncertainties in the attribution of greenhouse gas warming and implications for climate prediction

Gareth S. Jones (Corresponding Author), Peter A. Stott, John F. B. Mitchell

Met Office Hadley Centre, Exeter, UK

Correspondence to: G. S. Jones, [email protected]

First published: 13 June 2016

Abstract

Using optimal detection techniques with climate model simulations, most of the observed increase of near-surface temperatures over the second half of the twentieth century is attributed to anthropogenic influences. However, the partitioning of the anthropogenic influence to individual factors, such as greenhouse gases and aerosols, is much less robust. Differences in how forcing factors are applied, in their radiative influence and in models' climate sensitivities, substantially influence the response patterns. We find that standard optimal detection methodologies cannot fully reconcile this response diversity. By selecting a set of experiments to enable the diagnosing of greenhouse gases and the combined influence of other anthropogenic and natural factors, we find robust detections of well-mixed greenhouse gases across a large ensemble of models. Of the observed warming over the twentieth century of 0.65 K/century we find, using a multimodel mean not incorporating pattern uncertainty, a well-mixed greenhouse gas warming of 0.87 to 1.22 K/century. This is partially offset by cooling from other anthropogenic and natural influences of −0.54 to −0.22 K/century. Although better constrained than in recent studies, the attributable trends across climate models still span a wide range, with implications for observationally constrained estimates of transient climate response. Some of the uncertainties could be reduced in future by having more model data to better quantify the simulated estimates of the signals and natural variability, by designing model experiments more effectively, and by better quantifying the climate model radiative influences. Most importantly, how model pattern uncertainties are incorporated into the optimal detection methodology should be improved.

Key Points

  • Twentieth century near-surface temperature detection and attribution results vary depending on the CMIP5 model
  • Results are sensitive to inclusion of weak signals and to the large diversity in response patterns
  • Needed in future: careful consideration of CMIP design and advances in the detection methodology

1 Introduction

Detection of climate change and attribution of its causes are important for the understanding of climate change, for evaluating climate models, and for helping to constrain predictions of anthropogenic climate change. Formal detection studies have examined observed climate changes over a variety of time and spatial scales [see, e.g., Bindoff et al., 2013]. These analyses compare observations with spatiotemporal patterns produced from climate models that represent the responses to different climate forcing factors, such as well-mixed greenhouse gases, aerosols, solar irradiance, and volcanic aerosol influences. Measures of internal variability are incorporated into sophisticated statistical analysis to attempt to detect the influence of the different forcing factors over internal climate variability.

The use of spatiotemporal patterns of temperature in detection studies has been reported to improve the consistency of the attribution of greenhouse gas warming across different models [Stott et al., 2006]. Such an approach attempts to compensate for gross errors in the models' climate sensitivity and in the applied radiative forcings by scaling up underresponsive models and scaling down overresponsive models to better match the observations [Allen et al., 2000]. The technique has also been used to constrain 21st century temperature predictions [Allen et al., 2000; Stott and Kettleborough, 2002; Kettleborough et al., 2007], with the result of bringing different model projections of future temperature change closer together than if observations were not used as a constraint. This type of agreement across models, with results predominantly sensitive only to the observations, has been termed “Stable Inference from Data” [Stott et al., 2006; Kettleborough et al., 2007].

However, this viewpoint has been somewhat challenged by results from recent detection studies. Gillett et al. [2013], Jones et al. [2013], and Ribes and Terray [2013] have taken advantage of large numbers of climate model simulations made available from the Coupled Model Intercomparison Project (CMIP5) [Taylor et al., 2012] to deduce the contributions to changes in near-surface temperatures over the last 60–150 years. These studies concluded that attributed trends for greenhouse gases over the last 60 or so years vary substantially, depending on which climate models were used in the detection analyses [Bindoff et al., 2013, Figure 10.4]. This sensitivity of results to the model used could indicate that observationally constrained estimates of 21st century temperatures may not be as robust as previously thought.

Two approaches have been proposed to address this criticism. First, it has been argued that using a multimodel average gives a more robust assessment of past global temperature variations as model errors and biases may be averaged out [Hegerl and Zwiers, 2011]. A measure of model pattern uncertainty has also been incorporated into some analyses [Huntingford et al., 2006]. Analyses using multimodel averages have detected well-mixed greenhouse gas influences, but other anthropogenic (non-well-mixed greenhouse gas) influences were not robustly detected [Jones et al., 2013; Gillett et al., 2013]. Second, examining the net anthropogenic influence, rather than separating it into greenhouse gas and other anthropogenic influences, gives consistent results across a range of climate models with different analysis methodologies [Gillett et al., 2013; Ribes and Terray, 2013]. The attributed net trend of anthropogenic influences deduced from these studies [Bindoff et al., 2013] was close to, or a little more than, the observed trend seen over the last 60 years. Bindoff et al. [2013] stated that results from using the net anthropogenic approach were “much more robustly constrained” and used the optimal detection results to provide evidence to support the statement that it is “extremely likely” (greater than 95% likelihood) that human activities caused more than half of the observed increase in near-surface temperatures from 1951 to 2010. An earlier study, using a different but related detection methodology, also found that anthropogenic influences were robustly detected in observed temperature trends across several models, but that the greenhouse gas and sulfate aerosol components could not be robustly distinguished [Hegerl et al., 2000]. Nevertheless, the question remains why recent analyses, using ensembles of CMIP5 climate models, give such wide ranges of warming attributable to greenhouse gases and cooling attributable to other anthropogenic factors.

This study investigates why using different CMIP5 climate models in detection studies leads to such a variety of results. We use as our basis the main analysis of Jones et al. [2013], which applied an optimal detection analysis to variations of near-surface temperatures over the period 1901 to 2010, using eight CMIP5 models to deduce the contributions from changes in well-mixed greenhouse gas concentrations, other anthropogenic influences, and natural factors. Here we use a period ending in 2005, which enables the examination of a wider range of models than has been looked at before. The differences and similarities between the spatiotemporal response patterns of the models are explored in detail. We look at what choices in the basic methodology can be made to try to increase the robustness of the results.

This paper is structured as follows. The sources of observed and model data are described in the next section together with what model simulations were used. Section 3 describes the optimal detection methodology. The temperature responses and radiative forcings due to the different forcing factors are described in section 4, noting any major differences which might lead to a diversity in the detection results. The results of the optimal detection analyses on the CMIP5 models are given in section 5. Section 6 describes the implications of the results on techniques to constrain transient climate response. There follows a discussion section and the conclusions.

2 Data

The HadCRUT4 [Morice et al., 2012] data set of blended land air temperatures and sea surface temperatures is used for the observations. The data set has a sophisticated error model, a component of which is an ensemble of realizations sampling one source of observational uncertainty. We use the median of this ensemble in this study. Including HadCRUT4's error model in an optimal detection analysis will be investigated in a future study.

Monthly mean near-surface air temperatures from climate models were retrieved from the archive of the fifth phase of the Coupled Model Intercomparison Project (CMIP5) [Taylor et al., 2012]. Data from four CMIP5 experiments were obtained (Table 1). We obtained piControl data, to characterize internal climate variability, from 23 models that had at least 500 years of data available (Table 2; model details can be found in Table 9.A.1 in Flato et al. [2013]). We also obtained data from the 15 of these models that have historical, historicalGHG, and historicalNat experiments covering the 1900–2005 period (Table 2).

Table 1. Experiment Definitions (a)

Experiment       Definition
piControl        Constant preindustrial forcing factors
historicalGHG    Variations in well-mixed greenhouse gas concentrations (b)
historicalNat    Variations in solar irradiance and volcanic stratospheric aerosols
historical       Variations in historic anthropogenic (c) and natural radiative forcings
historicalOA     Variations in other (non-well-mixed greenhouse gas) anthropogenic factors
historicalGOA    Variations in all anthropogenic factors
historicalOAN    Variations in other anthropogenic and natural factors

  • (a) Experiment definitions used in this study. piControl, historical, historicalGHG, and historicalNat follow CMIP5 nomenclature [Taylor et al., 2012]. Experiments historicalOA, historicalGOA, and historicalOAN are not CMIP5 experiments. They are defined for this study as simple linear combinations of the historical, historicalGHG, and historicalNat experiments.
  • (b) Well-mixed greenhouse gases include carbon dioxide, methane, nitrous oxides, and CFCs/hydrochlorofluorocarbons. Some models also include variations in ozone concentration in the experiment (Table 2).
  • (c) Anthropogenic forcings include well-mixed greenhouse gases, ozone, sulfate aerosols, carbonaceous aerosols, and land use changes.

Table 2. CMIP5 Models (a)

Model Index  Model              TCR (K)  SI  O3  LU  Ensemble Members (historical/historicalNat/historicalGHG)
A            BNU-ESM            2.6      N   N   N   1/1/1
B            CCSM4              1.8      N   N   Y   6/4/3
C            CNRM-CM5           2.1      Y   Y   N   10/6/6
D            CSIRO-Mk3-6-0      1.8      Y   N   N   10/5/5
E            CanESM2            2.4      Y   N   Y   5/5/5
F            GFDL-CM3           2.0      Y   Y   Y   5/3/3
G            GFDL-ESM2M         1.3      N   N   Y   1/1/1
H            GISS-E2-H          1.7      Y   N   Y   6/5/5
I            GISS-E2-R          1.5      Y   N   Y   6/5/5
J            HadGEM2-ES         2.5      Y   N   Y   4/4/4
K            IPSL-CM5A-LR (b)   2.0      Y   Y   Y   6/3/3
L            MIROC-ESM          2.2      Y   Y   Y   3/3/3
M            MRI-CGCM3          1.6      Y   Y   Y   3/1/1
N            NorESM1-M          1.4      Y   Y   Y   3/1/1
O            bcc-csm1           1.7      N   N   N   3/1/1

  • (a) Details of CMIP5 models used in analysis. The 15 models (A–O) with historical, historicalNat, and historicalGHG experiments are listed together with their transient climate response (TCR) [Forster et al., 2013]; whether, yes (Y) or no (N), indirect aerosol effects are included (SI); whether ozone influences are included in the historicalGHG experiment (O3); whether anthropogenic land use changes are included in the historical experiment (LU); and the number of initial condition ensembles for each experiment. In addition to models A–O, a further eight models were used that had over 500 years of piControl available: ACCESS1-3, CESM1-BGC, GFDL-ESM2G, MIROC5, MPI-ESM-LR, MPI-ESM-MR, MPI-ESM-P, and inmcm4, giving a total of 23 models.
  • (b) Model K simulated the radiative effects of stratospheric volcanic aerosol by varying incoming total solar irradiance in the historical and historicalNat experiments. Model K also included land use changes in the historicalGHG experiment.

Which forcing factors are included, and how they are implemented, differs across the models [Collins et al., 2013; Jones et al., 2013]. Here all models included variations of the main well-mixed greenhouse gases (CO2, CH4, N2O, and CFCs) in the historical and historicalGHG experiments. Variations in tropospheric and stratospheric ozone concentrations are included in all the historical simulations and in some models' historicalGHG experiments (Table 2). The direct radiative effects (also known as aerosol-radiation interactions [Myhre et al., 2013]) of SO4 and carbonaceous aerosols are included in all the models. Indirect aerosol effects (also known as aerosol-cloud interactions [Myhre et al., 2013]) are incorporated within most of the models but not all (Table 2). Similarly, in some of the models, anthropogenic land use changes are not simulated. The radiative effects of changes in solar irradiance and stratospheric volcanic aerosols are included in all the models' historical and historicalNat experiments.

The experiments can be used to extract the climate change patterns associated with well-mixed greenhouse gases, other anthropogenic factors, and the influence of natural radiative forcing in an optimal detection analysis. For example, historicalOA (Table 1) responses are deduced by finding the difference between the ensemble mean of the historical simulations and the sum of the ensemble means of the historicalGHG and historicalNat simulations. We can include more models than used in earlier studies [Jones et al., 2013; Gillett et al., 2013; Ribes and Terray, 2013] as we use data ending in 2005 rather than 2010. The models drawn from CMIP5 used here, a so-called “ensemble of opportunity,” may not representatively sample the full range of possible model uncertainties [Hegerl and Zwiers, 2011], so there are limitations to the statistical interpretation of such an ensemble [Knutti, 2010; Jones et al., 2013].
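
The linear combinations described above and in Table 1 can be sketched as follows. This is a minimal illustration, assuming each experiment for one model is held as a NumPy array whose first axis indexes ensemble members; the function and variable names are hypothetical. Only the historicalOA expression is stated explicitly in the text. The expressions for historicalGOA and historicalOAN are inferred here from the definitions in Table 1 under an assumption of linear additivity of the responses.

```python
import numpy as np

def derived_experiments(historical, historical_ghg, historical_nat):
    """Return ensemble-mean patterns for the non-CMIP5 derived experiments.

    Each argument has shape (n_members, ...) and holds the ensemble
    members of one CMIP5 experiment for a single model.
    """
    h = np.mean(historical, axis=0)       # all forcings
    g = np.mean(historical_ghg, axis=0)   # well-mixed greenhouse gases
    n = np.mean(historical_nat, axis=0)   # natural forcings
    return {
        # Stated in the text: historical minus (historicalGHG + historicalNat).
        "historicalOA": h - (g + n),
        # Assumed: all anthropogenic factors = historical minus natural.
        "historicalGOA": h - n,
        # Assumed: other anthropogenic plus natural = historical minus GHG.
        "historicalOAN": h - g,
    }
```

Because historicalOA is a difference of three noisy ensemble means, it inherits internal variability from all of them, which is why the text repeatedly notes its extra noise contamination.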

3 Optimal Detection Analysis Methodology

We use multiple linear regression—in this case total least squares—of the simulated climate patterns xi against observed climate y (equation 1), optimized by projecting filtered patterns onto an estimate of the leading orthogonal modes of internal climate variability, which have been “whitened” to normalize the noise characteristics and down-weight patterns with low signal-to-noise ratios [Allen and Stott, 2003].
$$ y = \sum_{i} \beta_i \, (x_i - \nu_i) + \nu_0 \qquad (1) $$

The βi in equation 1 are the regression scaling factors, and νi and ν0 are the internal variability components for the ith pattern and observations, respectively. The details of the implementation are identical to those used in Jones et al. [2013], except for the periods considered and the specific models used.
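
As an illustration of how total least squares yields scaling factors, the sketch below uses the textbook singular value decomposition solution. It is not the implementation of Jones et al. [2013]: it assumes the patterns and observations have already been whitened and scaled so that the noise in every column has equal variance, it omits the uncertainty analysis entirely, and the function name is hypothetical.

```python
import numpy as np

def tls_scaling_factors(X, y):
    """Total least squares estimate of the scaling factors.

    X : (n, m) array, one whitened model response pattern per column.
    y : (n,) array, whitened observations.
    Assumes equal noise variance in every column of [X, y] and n >= m + 1.
    """
    Z = np.column_stack([X, y])
    # Rows of vt are right singular vectors, ordered by decreasing
    # singular value; the last one spans the direction of least variance.
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    v = vt[-1]
    # Setting Z @ v = 0 and solving for y gives y = X @ (-v[:m] / v[m]).
    return -v[:-1] / v[-1]
```

Unlike ordinary least squares, this treats the model patterns as noisy too, which matters when ensemble means are built from only a few members.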

Here the basis of orthogonal modes are the empirical orthogonal functions (EOFs) calculated from estimates of internal variability of the models, normally based on the model's own piControl. There are not long enough piControl experiments to produce EOFs for each model so we have used a common EOF basis approach, where the piControls from the CMIP5 models (Table 2) are combined to enable an estimate of a common EOF basis to be calculated [Jones et al., 2013, and references therein].
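
A common EOF basis and the associated whitening step might be sketched as below. This is a simplified illustration under stated assumptions: the control segments from all models are stacked as rows of anomalies (no further centring is applied), the EOFs are taken from a singular value decomposition of that stacked matrix, and the function name and array layout are hypothetical rather than taken from the study.

```python
import numpy as np

def common_eof_whitener(control_segments, n_eofs):
    """Build a whitening operator from pooled control-run segments.

    control_segments : (n_segments, n_features) array of anomalies, each
        row one filtered spatiotemporal segment from some model's piControl.
    n_eofs : truncation, i.e. the number of leading EOFs retained.
    Returns P with shape (n_eofs, n_features); P @ pattern gives the
    pattern's coordinates in the truncated, whitened EOF space.
    """
    C = np.asarray(control_segments, dtype=float)
    n_segments = C.shape[0]
    _, s, vt = np.linalg.svd(C, full_matrices=False)
    eofs = vt[:n_eofs]                       # orthonormal rows
    eigvals = s[:n_eofs] ** 2 / n_segments   # variance along each EOF
    # Dividing by the standard deviation gives unit noise variance per mode.
    return eofs / np.sqrt(eigvals)[:, None]
```

Applying the same operator to the observations and to every model pattern puts them in a space where the estimated internal variability is approximately uncorrelated with unit variance.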

A detection is deduced by testing the null hypothesis that the scaling factor has a value of 0. If a pattern's regression scaling factor has an uncertainty range that does not cross 0, we conclude that the fingerprint pattern is detected [Hegerl et al., 2007]. A step toward attribution requires deducing if responses to the expected forcing factors are consistent with the observed change [Mitchell et al., 2001]. Previously, attribution consistency was examined by seeing if scaling factors have uncertainty ranges consistent with a value of 1, indicating that the responses from the models do not need to be significantly scaled up or down [Hasselmann, 1997; Hegerl et al., 2007]. More recently, the constraint of scaling factors being consistent with 1 has been considered unnecessary for attribution, as long as any discrepancies are understandable within an expert judgment of the uncertainties [Hegerl and Zwiers, 2011; Bindoff et al., 2013]. An important stage toward attribution is to exclude other plausible factors as alternative explanations of the changes [Hasselmann, 1997; Mitchell et al., 2001]. If many patterns are included in a multivariate regression, there is an increased risk of overfitting to the observations, which would bias the results. Patterns that are correlated with each other risk causing the scaling factors to be degenerate, with wide uncertainties. Using smaller numbers of patterns in the regression can lessen these problems. We should note, however, that using smaller numbers of patterns makes stronger assumptions about the equality of scalings on forcings: if the patterns used are composed of contributions from different forcing factors, then the scaling factors for those components are effectively assumed to be equal to each other. Using smaller numbers of patterns, and thus predictors, also increases the risk of underfitting to the predictand. One conservative test for underfitting/overfitting, and an important step for attribution, is to examine the residuals of the regression and test if the variability is consistent with estimates of internal variability [Allen and Stott, 2003].
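
The two checks on a scaling factor described above reduce to simple interval tests. The sketch below is illustrative only; the function name is hypothetical, and the residual consistency test is not included.

```python
def assess_scaling_factor(lower, upper):
    """Classify a scaling factor from its uncertainty range.

    lower, upper : bounds of the uncertainty range (for example 5-95%).
    """
    # Detected: the range excludes 0, so the null hypothesis is rejected.
    detected = not (lower <= 0.0 <= upper)
    # Consistent with 1: the model response needs no significant rescaling.
    consistent_with_one = lower <= 1.0 <= upper
    return {"detected": detected, "consistent_with_one": consistent_with_one}
```

For example, a hypothetical range of 1.1 to 1.8 would count as detected but not consistent with 1, suggesting a model response that needs scaling up to match the observations.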

4 Climate Response Patterns and Radiative Forcing

The model data analyzed in this section are not masked by observational spatial coverage, as we are only concerned with an intermodel comparison at this stage. All the models show a gradual increase in historicalGHG global temperatures, with ensemble mean linear trends over the 1906–2005 period varying between 0.81 and 1.65 K/century, and with the rate of change at the end of the twentieth century greater than at the start of the century (Figure 1). The historicalNat simulations show a much smaller warming and then cooling over the twentieth century, with overall trends over the whole period ranging from −0.21 to 0.07 K/century. The historical simulations all warm, with ensemble mean linear trends varying from 0.29 to 1.17 K/century, but with more multidecadal variability than historicalGHG.
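
Trends of this kind can be computed with an ordinary least squares fit. The sketch below assumes a global mean annual temperature series and a hypothetical function name; the study's exact trend calculation is not specified in this section.

```python
import numpy as np

def linear_trend_per_century(years, temps):
    """Least squares linear trend of a temperature series, in K/century.

    years : 1-D array of calendar years (e.g. 1906..2005).
    temps : 1-D array of temperatures or anomalies (K), same length.
    """
    years = np.asarray(years, dtype=float)
    # Centring the years improves the numerical conditioning of the fit.
    slope, _intercept = np.polyfit(years - years.mean(), temps, 1)
    return 100.0 * slope
```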

Figure 1. Global, 10 year running mean, near-surface temperature variations for each model, anomalies relative to the 1880–1919 period. Model ensemble means shown as thick dark lines and individual ensemble members as thin light lines. The “Mod.Avg.” is the multimodel average, calculated as the mean of the model ensemble means. The historical, historicalGHG, and historicalNat experiments are the CMIP5 experiments, with historicalOA inferred from the difference between historical and the sum of historicalGHG and historicalNat.

The ranges in the historicalOA trends over 1906–2005 are wider than the other experiments, varying from −1.25 to 0.01 K/century (Figure 1). Some models (A, E, G, and O) gradually cool to reach a minimum in temperatures around 1980 before then warming; others cool then stabilize after the 1980s (C, I, K, L, and N), and some models continue decreasing in temperature to 2000 (D, F, H, J, and M).

Differences in the models' climate sensitivities and in the radiative forcing will contribute to the wide range in climate responses. One model index, the transient climate response (TCR), defined as the global mean temperature change at doubling of CO2 in a simulation with a 1% per annum increase in CO2, is considered a useful model metric of climate sensitivity to compare multidecadal scale changes in model response [Flato et al., 2013]. However, in practice, the temperature response to different forcing factors will be more complex across models than the differences in TCR may suggest. There is no clear simple relationship between the models' TCR and the historicalGHG trend (Figure 2a) [Gillett et al., 2013]. One reason for this is that those models including ozone variations in the historicalGHG experiments (Table 2) may warm more, as ozone has a net warming radiative influence [Myhre et al., 2013], which varies across the models [Eyring et al., 2013]. For instance, models F and K warm more than other models with similar TCR (Figure 2a), but models C, L, M, and N do not indicate an obvious warm bias. Gillett et al. [2013] came to a different conclusion, based on a different assessment of which CMIP5 models included ozone in their historicalGHG simulations, and suggested that the models' varying responses to non-CO2 greenhouse gases will have contributed to the diversity in historicalGHG responses. There is no clear relationship between model TCR and historicalOA cooling (Figure 2b), reflecting a large spread in radiative forcing across the models, the range of model TCR, and noise contamination due to the way historicalOA is derived.
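
The TCR definition above can be made concrete with a short calculation. The sketch below assumes the conventional diagnostic of a 20 year mean centred on year 70 of the 1% per annum run, taken relative to the same years of the control run; the window choice and function names are assumptions, not details given in the text.

```python
import math
import numpy as np

def years_to_double(rate=0.01):
    """Years for CO2 to double under compound growth at `rate` per annum."""
    return math.log(2.0) / math.log(1.0 + rate)

def transient_climate_response(temps_1pct, temps_control,
                               centre_year=70, window=20):
    """Temperature change (K) around the time of CO2 doubling.

    temps_1pct, temps_control : annual global means, index 0 = year 1.
    Assumed convention: mean over `window` years centred on `centre_year`.
    """
    start = centre_year - window // 2   # index 60 corresponds to year 61
    stop = start + window               # exclusive; last index is year 80
    return float(np.mean(temps_1pct[start:stop])
                 - np.mean(temps_control[start:stop]))
```

With a 1% per annum increase the doubling time is just under 70 years, which is consistent with the 70 year separation of 20 year means mentioned in the caption of Figure 2.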

Figure 2. Comparison of transient climate response (TCR), Table 2, with the (a) mean historicalGHG and (b) historicalOA temperature trends over the 1906–2005 period. The 2.5–97.5% uncertainty ranges in TCR are estimated from the variability in 20 year means separated by 70 years in each model's piControl. The 2.5–97.5% ranges in temperature trends are estimated from the variability of 100 year trends in the model's piControl, scaled to account for the number of ensemble members. Diamond symbols signify models which do not include ozone forcing as part of the historicalGHG experimental design and square symbols where ozone is included as a forcing in historicalGHG. Infilled symbols, in Figure 2b, represent models which do not include aerosol indirect effects in the historical simulations. The dashed lines in both panels are linear regression lines through the data, passing through 0 in both axes, and are included as a guide.

There are large differences in the effective radiative forcing (ERF, adapted from Forster et al. [2013]) across the models (Figure 3). In particular, historicalOA has a wider range (−2.11 to 0.10 W m−2) than historicalGHG (1.41 to 3.16 W m−2) by the end of the twentieth century. As for the temperature response, the historicalOA ERF will have some extra noise contamination due to the way it is derived.

Figure 3. Global, 10 year running mean effective radiative forcing (ERF), W m−2, as calculated by Forster et al. [2013] (described as “adjusted forcing” in that study). ERF is calculated relative to the model's own piControl. The Mod.Avg. is the multimodel average, calculated as the mean across the models. For model A the required data to estimate ERF were not available.

The largest radiative influence that is not included in all models (Table 2) is due to indirect aerosol effects. Models that simulate the effect should tend to have a stronger cooling influence from aerosols than those models that do not. The historicalOA ERF for models that include indirect effects has a range of −2.11 to −0.58 W m−2, while the models not including these aerosol processes have a smaller magnitude radiative influence by 2000 of −0.46 to 0.10 W m−2. The models not incorporating aerosol indirect effects (A, B, G, and O) have generally smaller magnitude historicalOA cooling trends over the twentieth century (−0.40 to 0.01 K/century) than the models that do include indirect effects (−1.25 to −0.38 K/century). It has been previously noted that models which include indirect effects have historical simulations that appear to match the observed near-surface temperature variations more closely than models that do not [Jones et al., 2013; Wilcox et al., 2013; Ekman, 2014].

Estimates of radiative forcing for ozone (based on a different methodology to that in Forster et al. [2013]) suggest a range of 0.17 to 0.44 W m−2 for a small number of CMIP5 models [Shindell et al., 2013]. Another choice that is potentially important is whether or not land use changes are implemented (Table 2). While not estimated by Forster et al. [2013], the Intergovernmental Panel on Climate Change (IPCC) assessment of land use radiative forcing by 2005 was −0.15 ± 0.10 W m−2 [Myhre et al., 2013]. Those models not including land use variations in the historical simulations may warm more than those that do.

The implementation of aerosol physics differs substantially across the models [Collins et al., 2013, Table 12.1] from being prescribed in some models to varying degrees of interactivity in others. Despite the same source emission data set being used [Lamarque et al., 2010] in most models, the differences in physics lead to the models having differently evolving concentrations of the aerosol species [Wilcox et al., 2015]. The aerosol optical depth (AOD) is a measure of the optical effects of sulfate aerosols as well as a number of other anthropogenic and natural aerosol species, and gives an indication of the evolution of the aerosols [Wilcox et al., 2013; Shindell et al., 2013; Flato et al., 2013]. The increase in global mean AOD from the 1940s to the end of the century (Figure 4a) is common between the models with some showing much stronger increases than others. Most models show a peak in AOD by the 1980s/1990s, but others increase to the end of the century. The differences between the models are more marked when comparing the contrast between a region covering North America, North Atlantic, and Europe and the South East Asia region (Figure 4b). Some models (e.g., F) have a very strong change from the former to latter region, whereas others (e.g., M) have a much smaller change.

Figure 4. Aerosol optical depth (AOD) at 550 nm for the models with available diagnostics. (a) Global mean, (b) difference between region covering North America to Europe (120°W–50°E and 30°N–70°N) and the South East Asia region (50°E–150°E and 0°N–30°N).

All the historicalGHG responses show the typical higher-latitude warming due to greenhouse gas increases, with most models showing an asymmetry of larger warming in the Northern Hemisphere (Figures 5 and 6). The historicalOA responses show somewhat more complex evolution, although the forced patterns of change will be more contaminated by noise because of the way historicalOA is derived. Some models show strong cooling in higher northern latitudes, but other models show more spatially uniform cooling which levels out near the end of the century.

Figure 5. Zonal, 10 year running means, near-surface temperatures for the ensemble means of (first to third columns) historical, historicalGHG, and historicalNat and (fourth column) historicalOA. Models A to H shown. Temperatures (K) given as anomalies relative to 1880–1919 period.

Figure 6. As in Figure 5 but for models I to O and Mod.Avg.

Differences in the model responses can be demonstrated by looking at three models: B, D, and F. Models B and D have the same TCR (Table 2), with similarities between the historicalGHG global mean responses (Figure 1), albeit with the spatiotemporal pattern being somewhat more symmetric in D than in B (Figure 5). Model F has only a slightly higher TCR than models B and D but has much more warming in historicalGHG, especially over the high northern latitudes, reflecting a larger ERF (Figure 3) to which the presence of ozone in model F's historicalGHG contributes. The historicalOA responses are very different for the three models. Model B shows little if any cooling globally or spatially, related to its lack of indirect aerosol effects. Model D has strong cooling that is again notably symmetric across the hemispheres, while model F has the strongest cooling globally, with most of the response across high northern latitudes.

One way to quantify the similarities and differences across the models is to look at the cross correlations of the spatiotemporal data shown in Figures 5 and 6. The historicalGHG experiment has the most similar patterns between models, with correlation coefficients varying between 0.68 and 0.97 (Figure 7). In contrast, both the historicalNat and historicalOA experiments have intermodel correlations that vary considerably, from −0.40 to 0.72 for historicalNat and from −0.35 to 0.87 for historicalOA, an indication of large differences between the response patterns. Noise from internal variability will mask some of the forced response, contributing to the pattern differences, particularly when the forced component of the response is weak, such as for historicalNat.
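
An area-weighted cross correlation between two zonal-mean spatiotemporal patterns could be computed along the following lines. This is a sketch under assumptions not stated in the text: the weights are taken as the cosine of latitude, the correlation is the centred (Pearson) form, and the function name and array layout are hypothetical.

```python
import numpy as np

def area_weighted_correlation(a, b, lats_deg):
    """Weighted Pearson correlation of two (n_times, n_lats) patterns.

    a, b     : zonal-mean temperature patterns, e.g. 10 year means.
    lats_deg : (n_lats,) latitudes in degrees; weights are cos(latitude).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    w = np.broadcast_to(np.cos(np.deg2rad(lats_deg)), a.shape)
    w = w / w.sum()                       # normalised weights
    a_anom = a - np.sum(w * a)            # remove the weighted mean
    b_anom = b - np.sum(w * b)
    cov = np.sum(w * a_anom * b_anom)
    return float(cov / np.sqrt(np.sum(w * a_anom ** 2)
                               * np.sum(w * b_anom ** 2)))
```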

Figure 7. Area-weighted spatiotemporal cross correlations between models, for the four experiments historical (bottom left of left panel), historicalNat (top right of left panel), historicalGHG (bottom left of right panel), and historicalOA (top right of right panel). The data used in the correlations are sampled from Figures 5 and 6 for the 1906–2005 period using independent 10 year means.

For the historicalGHG experiment all the models have high correlations (>0.8) with the multimodel average, which are generally larger than the correlations with other models. This suggests that the multimodel average is representative of the response patterns of all the models. This is not the case for historicalNat where the biggest correlation of any of the individual models with the Mod.Avg. is 0.49. For historicalOA most of the individual models have higher correlations with the multimodel average than they have with the majority of the other models. For the two models where this is not the case (models B and G) the correlations with the Mod.Avg. are very low, 0.15 and −0.19, respectively. Only five of the models have correlations with the Mod.Avg. that are greater than 0.8, suggesting that the multimodel average is only representative of some of the models' other anthropogenic response patterns. Thus, the multimodel average pattern for historicalOA may contain errors and biases due to the inclusion of some models with diverse and inconsistent response patterns.

5 Results of an Optimal Detection Analysis

We examine large spatial and temporal scales over the 1906–2005 period, using the same standard spatiotemporal filtering as described in Jones et al. [2013] and the optimal detection methodology described in section 3. We project all the model data onto the same spatial grid as HadCRUT4, calculate nonoverlapping 10 year means, mask the model data to have the same spatial coverage as the observations, and project the result onto spherical harmonics (T4) to filter out spatial scales smaller than 5000 km [Stott and Tett, 1998].
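
The masking and decadal averaging steps can be sketched as below. The regridding and spherical harmonic projection are not shown. The sketch assumes annual fields already on the observational grid, with missing observations marked as NaN, and it applies the mask before averaging; that ordering, the function name, and the array layout are assumptions rather than details confirmed by the text.

```python
import numpy as np

def masked_decadal_means(model_annual, obs_annual, block=10):
    """Nonoverlapping `block`-year means of model data with obs coverage.

    model_annual, obs_annual : (n_years, n_lat, n_lon) arrays on the same
        grid; NaN in obs_annual marks a missing observation.
    Returns an array of shape (n_years // block, n_lat, n_lon).
    """
    # Give the model the same coverage as the observations.
    masked = np.where(np.isnan(obs_annual), np.nan, model_annual)
    n_blocks = masked.shape[0] // block
    trimmed = masked[: n_blocks * block]
    blocks = trimmed.reshape(n_blocks, block, *masked.shape[1:])
    # Average over the years available in each block, ignoring gaps.
    return np.nanmean(blocks, axis=1)
```

For the 1906–2005 period this gives ten decadal means per grid box. A real analysis would typically also require a minimum number of valid years per decade, which this sketch omits.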

Combining 250 years from each of the 23 models' piControls (Table 2) creates a common EOF basis with 69 degrees of freedom. A separate 250 years from each model's piControl is used for the uncertainty analysis and residual consistency testing. The scaling factors can be sensitive to the choice of truncation of the EOF space. Including higher-order EOFs allows a higher proportion of the variance of the original data to be retained. This comes at the cost of including variability modes that the required forced response patterns do not project strongly upon, thus adding noise but not any signal, i.e., reducing the signal-to-noise ratio [Hasselmann, 1997]. As such, a balance needs to be struck between maintaining a strong signal and retaining patterns that are relatable to the original data [Tett et al., 2002]. An EOF truncation of 40 retains 97% of the variance of HadCRUT4. Using 40 EOFs also captures 97 to 99%, 96 to 98%, 84 to 97%, and 78 to 93% of the variance of the historicalGHG, historical, historicalNat, and piControl simulations, respectively. With a high proportion of the variability of the patterns captured by the first 40 EOFs, this seems a reasonable truncation choice for the following analysis, but the sensitivity to the choice is investigated (Figure S1 in the supporting information).
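
The fraction of a pattern's variance retained at a given truncation can be illustrated as the squared length of its projection onto the leading EOFs relative to its total squared length. This is a sketch assuming orthonormal (unwhitened) EOFs stored as rows and a pattern expressed as anomalies; the function name is hypothetical and the study's exact variance measure may differ.

```python
import numpy as np

def variance_retained(pattern, eofs, k):
    """Fraction of a pattern's variance captured by the first k EOFs.

    pattern : (n_features,) filtered spatiotemporal pattern (anomalies).
    eofs    : (n_modes, n_features) array with orthonormal rows,
              ordered by decreasing control variance.
    """
    pattern = np.asarray(pattern, dtype=float)
    coeffs = np.asarray(eofs, dtype=float)[:k] @ pattern
    return float(np.sum(coeffs ** 2) / np.sum(pattern ** 2))
```

Evaluating this for a range of k shows the trade-off discussed above: the retained fraction can only grow with k, while the added modes carry progressively less of the forced signal.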

The first analysis we look at is the three-way regression that uses the historical, historicalGHG, and historicalNat simulations to produce the scaling factors: βhistorical, βhistoricalGHG, and βhistoricalNat (equation 1). These are then transformed to the required scaling factors for well-mixed greenhouse gas forcings (βG), other anthropogenic forcings (βOA), and natural forcings (βN), following equation (2) in Jones et al. [2013] (also see supporting information). This approach has been used on CMIP5 data previously [Gillett et al., 2013; Jones et al., 2013; Ribes and Terray, 2013]. Similar studies in the past have used different experiments to deduce the G, OA, and N scaling factors [e.g., Allen et al., 2000; Tett et al., 2002; Stott et al., 2006].
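The transformation can be illustrated under the assumption that the responses add linearly, so that the historical pattern is the sum of the G, OA, and N responses. Collecting terms in βhistorical·x_historical + βhistoricalGHG·x_G + βhistoricalNat·x_N then gives the factors below. This is a sketch of the algebra only; equation (2) of Jones et al. [2013] is not reproduced in this section and may be presented differently, and the function name is hypothetical.

```python
import numpy as np

def transform_three_way(b_hist, b_ghg, b_nat):
    """Map regression factors on (historical, historicalGHG, historicalNat)
    to factors on (G, OA, N), assuming x_hist = x_G + x_OA + x_N."""
    return {"G": b_hist + b_ghg, "OA": b_hist, "N": b_hist + b_nat}

# Numerical check with arbitrary synthetic patterns and factors.
rng = np.random.default_rng(42)
x_g, x_oa, x_n = rng.standard_normal((3, 25))
x_hist = x_g + x_oa + x_n                      # linear additivity assumption
b_hist, b_ghg, b_nat = 0.8, 0.3, -0.2          # arbitrary illustrative values

beta = transform_three_way(b_hist, b_ghg, b_nat)
direct = b_hist * x_hist + b_ghg * x_g + b_nat * x_n
transformed = beta["G"] * x_g + beta["OA"] * x_oa + beta["N"] * x_n
```

The same mapping applies to best estimates or to individual samples from the joint distribution of the factors; uncertainty ranges cannot be obtained by simply adding the ranges of the individual factors, because the factors are correlated.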

Of the 15 models examined there are only eight cases where G is detected, with βG>0 (Figure 8a), most of them across a wide range of EOF truncations (supporting information Figure S1). In the cases where G is detected a test of the residual variability being consistent with an estimate of the internal variability [Allen and Tett, 1999; Allen and Stott, 2003] is passed at the two-sided 10% level. Only five of the models (C, I, J, L, and O) have G scaling factors consistent with 1, suggesting that those models do not need significant scaling up or down.
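The two criteria used in this paragraph can be written as a small check on the 5–95% range of a scaling factor. The reading assumed here is that a signal is "detected" when the whole range lies above zero and is "consistent with 1" when the range also spans unity; the residual consistency test is a separate step and is not represented. The function name and example ranges are hypothetical.

```python
def classify(beta_low, beta_high):
    """Return (detected, consistent_with_one) for a 5-95% scaling factor range."""
    detected = beta_low > 0.0
    consistent_with_one = detected and (beta_low <= 1.0 <= beta_high)
    return detected, consistent_with_one

examples = {
    "range spans 1": classify(0.4, 1.3),
    "range spans 0": classify(-0.2, 0.9),
    "needs scaling up": classify(1.2, 2.5),
}
```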

Figure 8. Optimal detection analysis for G, OA, and N. (a) Scaling factors and (b) scaled linear trends for the standard analysis for the 1906–2005 period, using 10 year means, spatial meaning of spherical harmonics (T4), and an EOF truncation of 40. Uncertainties given as 5–95% ranges. Squares in Figure 8a indicate where the residual of regression fails an F test when compared with a measure of internal variability [Allen and Stott, 2003]. The horizontal line and shaded band in Figure 8b are the observed trend of 0.65 K/century, together with an estimate of 5–95% uncertainty range due to internal variability, estimated from the CMIP5 piControl variability [Tett et al., 2002].

Of the eight analyses on individual models that detect G, other anthropogenic (OA) factors are detected in five (using models D, F, J, L, and O), none with values consistent with 1, although this is somewhat sensitive to EOF truncation choice (Figure S1). Natural factors (N) are detected in seven of the cases that detect G, with five having values consistent with 1. Only for the analyses using models F, J, L, and O are all three patterns detected simultaneously. In the analysis using the multimodel average—calculated as the average of the models' ensemble means (Mod.Avg. in Figure 8)—G and N are detected, both with scaling factors consistent with 1, but OA is not detected. This is a robust result across choice of EOF truncations. It should be noted that the multimodel average analysis does not include a measure of model pattern uncertainty [e.g., Huntingford et al., 2006].

The patterns and scaling factors can be used to reconstruct the scaled global mean temperatures of the different forcing factor responses, i.e., βi(xi − νi) [Allen and Stott, 2003]. Where G is detected, there is a wide range of scaled trends for the 1906–2005 period (Figure 8b), from 0.36 to 0.82 K/century (model E) to 1.32 to 4.22 K/century (model D). Confidence in the observed changes, with a trend of 0.65 K/century, being mostly attributable to G will be strengthened, for an individual model, if the scaling factors are consistent with 1 (Figure 8a).

For the model analyses that also detect it, OA also has wide linear trend ranges, from −3.69 to −0.71 K/century (model D) to −0.48 to −0.09 K/century (model J). Where N is detected, the trends are very close to 0 K/century, with the largest magnitude cooling trend around −0.33 to −0.09 K/century (model E). The analysis of Mod.Avg. produces scaled trends for G of 0.60 to 1.16 K/century, for OA of −0.45 to 0.12 K/century, and for N of −0.07 to −0.02 K/century.

These results are generally in line with the analysis in Jones et al. [2013], which examined the 1901–2010 period with a smaller number of CMIP5 models. What differences there are between the studies appear to be due to the different choices of period and EOF truncation. Using the 1901–2010 period, for these models, gives very similar results to Jones et al. [2013], even though the common EOF basis is constructed from a different sample of CMIP5 models (supporting information Figure S2).

These results can also be compared with those of Gillett et al. [2013] and Ribes and Terray [2013] which also applied an optimal detection methodology to CMIP5 models and HadCRUT4 to deduce the scaling factors for G, OA, and N. While both studies used the same spatiotemporal filtering as used here, there are differences in the analyses. For instance, different sets of the CMIP5 models were used; Gillett et al. [2013] used the period 1861–2010 for their main analysis, and Ribes and Terray [2013] applied an alternative optimal fingerprinting method in addition to the more standard approach. These differences, as well as other methodological choices such as choice of EOF truncation, mean that the results will differ. However, the results shown in Figure 8 support these studies in showing little consistency in the magnitude of the scaled greenhouse gas warming across a sample of CMIP5 models. Using the multimodel average was considered to be the most robust result [Jones et al., 2013; Gillett et al., 2013], but it is then legitimate to question the confidence of the magnitude of the attributed greenhouse gas warming when another important forcing factor with known strong radiative effects is not detected at the same time. As other anthropogenic influences are not robustly detected, is the factor not important for twentieth century temperature changes? Are there errors or biases in the other anthropogenic response patterns? Are other important factors not being included? Or is the detection analysis methodology flawed?

One of the expectations of optimal detection techniques is that they should bring the magnitude of the model responses into closer agreement, as long as there are no major differences in the spatial patterns and temporal evolution of the models' responses [e.g., Hegerl et al., 2007]. However, as also seen in recent studies using CMIP5 models [Jones et al., 2013; Gillett et al., 2013; Ribes and Terray, 2013], the range of G and OA trends after scaling (Figure 9b) is larger than before scaling (Figure 9a).

Figure 9. Relationship between G and OA trends for 1906–2005 period (K/decade) for the G, OA, and N analyses. (a) Relationship between unscaled OA and unscaled G trends with uncertainties representing 5–95% range due to internal variability. (b) Relationship between scaled OA and scaled G trends, with uncertainties representing 5–95% range due to scaling factor ranges. Bars on axis represent minimum-maximum range of best estimates before scaling (inner lighter colored bar) and after scaling (outer darker colored bar). Dotted line used as a guide to show where the sum of trends is equal to the observed trend of 0.65 K/century.

One issue that has been often overlooked in recent detection and other related regression studies is a statistical examination of the signals being used as predictor variables to deduce if their inclusion is optimum. The historical (GOAN) and G patterns for most models are strongly correlated with each other and with the HadCRUT4 pattern with values typically around 0.8 (Table 3). With very strong correlations between GOAN and G there is the danger of degeneracy within a regression and large (compensating) uncertainties in the scaling factors for the patterns concerned [Allen and Tett, 1999]. The N signal, in contrast, is only weakly correlated with GOAN and G, reducing the risk of degeneracy. However, N is also weakly correlated with HadCRUT4, which suggests that it has lower importance as an explanatory variable.

Table 3. Correlations of Regression Patterns^a
Model HadCRUT4 HadCRUT4 HadCRUT4 GOAN GOAN G
      GOAN     G        N        G    N    N
A 0.82 0.83  0.12  0.87  0.15  0.05
B 0.89 0.87  0.16  0.95  0.10 −0.08
C 0.87 0.85  0.11  0.89  0.10 −0.10
D 0.76 0.82  0.14  0.56  0.56 −0.08
E 0.82 0.85 −0.18  0.87 −0.10 −0.43
F 0.65 0.82  0.09  0.52  0.42 −0.09
G 0.68 0.75 −0.03  0.69  0.13 −0.14
H 0.83 0.86  0.18  0.86  0.26 −0.08
I 0.83 0.84  0.07  0.83  0.13 −0.16
J 0.58 0.83  0.00  0.51  0.23 −0.16
K 0.86 0.84  0.02  0.93  0.06 −0.16
L 0.85 0.84  0.37  0.79  0.37  0.16
M 0.64 0.78 −0.09  0.53  0.24 −0.17
N 0.89 0.79  0.14  0.83  0.11 −0.08
O 0.87 0.77  0.11  0.89  0.12 −0.03
Mod.Avg. 0.87 0.85  0.11  0.88  0.20 −0.16
  • a Cross correlations between the different patterns used in the optimal detection analyses and the observations, HadCRUT4. Shown are Pearson correlation coefficients between patterns named in the first row and patterns named in the second row, as used in the regression (equation 1) following filtering and after the data have been whitened. For example, the cross correlation between HadCRUT4 and N for the A model is 0.12.

Predictor patterns with weak magnitudes relative to the noise present can lead to biased estimates (when using ordinary least squares) or unbiased but very uncertain results (when using total least squares, as used here) [Allen and Stott, 2003]. The strengths of the underlying signals relative to the internal variability can be measured by the signal-to-noise ratios (SNR) (see Table 4, calculated following the technique in Tett et al. [2002]). All models have GOAN and G patterns with SNRs over 2, many considerably larger. In contrast, no model has an N signal with SNR above 2, and seven have an SNR less than 1.5. For five of those seven models, the optimal detection analyses produce poorly constrained scaling factors for all three signals (Figure 8a). This illustrates the consequence of including low SNR patterns in a multivariate regression [Allen and Stott, 2003], which can increase uncertainties in the scaling factors. It also suggests that, for some purposes, it would not be appropriate to include N as a separate predictor pattern due to its generally weak SNR.
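The contrast between the two estimators can be seen in a single-predictor toy problem in which the predictor pattern is contaminated by noise of the same variance as the noise in the predictand (a signal-to-noise ratio of about 1). All numbers are synthetic, and the equal-variance, already-whitened setup is an assumption of this sketch rather than the configuration used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta_true = 5000, 1.0

signal = rng.standard_normal(n)                  # noise-free response pattern
x = signal + rng.standard_normal(n)              # predictor with sampling noise
y = beta_true * signal + rng.standard_normal(n)  # "observations"

# Ordinary least squares: ignores noise in x, so the slope is biased low.
beta_ols = float(x @ y / (x @ x))

# Total least squares: the right singular vector of [x, y] with the smallest
# singular value, v, satisfies v[0]*x + v[1]*y ~ 0, hence beta = -v[0]/v[1].
_, _, vt = np.linalg.svd(np.column_stack([x, y]), full_matrices=False)
v = vt[-1]
beta_tls = float(-v[0] / v[1])
```

With equal signal and noise variances the ordinary least squares slope is expected to be close to half the true value, whereas the total least squares estimate is centred near the true value, at the price of a wider sampling spread when the signal is weak.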

Table 4. Pattern Signal-to-Noise Ratios^a
Model GOAN G N OA GOA OAN
A 3.04 3.62 1.34 1.14 2.20 1.21
B 6.40 4.71 1.43 0.75 4.11 1.12
C 5.41 6.53 1.39 1.94 3.40 2.67
D 4.28 6.85 1.63 3.34 2.02 4.71
E 5.53 8.27 1.88 1.92 4.31 3.10
F 3.95 7.57 1.77 3.84 2.24 5.23
G 2.29 2.49 1.68 1.18 1.92 1.34
H 4.66 6.70 1.36 1.93 3.07 2.72
I 4.33 6.06 1.52 1.85 3.02 2.57
J 2.97 6.68 1.93 3.11 2.27 4.06
K 7.32 7.63 1.55 1.87 4.36 2.76
L 3.28 5.03 1.51 1.93 2.17 2.27
M 2.22 2.69 0.89 1.42 1.17 1.99
N 2.91 2.89 1.01 1.16 1.61 1.52
O 4.34 3.04 1.17 0.96 2.29 1.17
Mod.Avg. 12.75 16.86 3.45 4.72 7.97 7.13
  • a Signal-to-noise ratios (SNRs) for ensemble mean patterns used in the optimal regression, after filtering, and projected onto first 40 EOFs. The historical all forcings (GOAN), well-mixed greenhouse gases (G), and natural (N) forcing responses are directly included in the regressions. The OA, GOA, and OAN signals are not directly involved in the regressions but calculated here to demonstrate the relative strength of the inferred patterns.

In view of these factors, it is usual in multivariate analyses to consider if one or more of the variables could be discarded from the regression. However, the use of the emergent responses to forcing factors from climate models should provide further confidence in the results and is a major advantage of this approach over other regression approaches to attributing causes to past climate changes [Bindoff et al., 2013]. So rejecting the inclusion of a pattern, when there are good physical reasons for it to be included in an analysis, could be considered unreasonable [Allen et al., 2006]. Thus, excluding a factor's climate response from the analysis should not be done lightly.

One set of techniques [Tett et al., 1999] that has previously been regularly used in optimal detection climate studies is based on degeneracy tests [Mardia et al., 1979]. Several tests examine the component patterns within the predictor variables and give some measure of how many patterns should be allowed in the regression. The aim is to avoid having too many signals, which can increase the chance of overfitting in the regression. We use principal component analysis tests [Mardia et al., 1979, pp. 243-245], implemented as in Tett et al. [1999], that examine the independence of the predictor variables to deduce which are of less importance and thus could potentially be discarded from the regression. None of the tests support using more than two signals for any of the models. Hence, we consider all combinations of two or fewer predictors that can be constructed from the available historical, historicalGHG, and historicalNat experiments. This enables the examination of alternative plausible causal climate factors. If any of these are ruled out, the confidence in the attribution of a dominant greenhouse gas influence is strengthened [Hasselmann, 1997].

5.1 Regression With GOA and N

This combination is based on the two-way regression using historical and historicalNat to deduce anthropogenic (GOA) and natural (N) scaling factors (Figure 10).

Figure 10. As in Figure 8 but for two-way regression of GOA and N. Note the smaller trend range displayed in Figure 10b than that in Figure 8b.

How the signals are transformed and how sensitive the results are to choice of EOF truncation are summarized in the supporting information and Figure S3. The derived anthropogenic spatiotemporal responses (historicalGOA) are generally quite similar across the models (supporting information Figure S9), apart from models M and G, which have fairly low spatiotemporal correlations with other models. GOA is detected by all the model analyses, although in three of the cases (using models F, J, and M) the residual consistency test fails, suggesting possible underfitting/overfitting to the observations. Only when using three models is βGOA found to be consistent with 1. N is detected in nine of the model analyses. This signal combination was also examined by Gillett et al. [2013] and Ribes and Terray [2013] and was found to give more robust results than the G, OA, and N combination. The GOA signal was detected in the analysis of all nine models considered by Gillett et al. [2013] and in the analyses of six out of the 10 models considered by Ribes and Terray [2013]. Ribes and Terray [2013] also analyzed global 10 year means as a climate index and found that the analyses of nine out of 10 models detect GOA (as reported by Bindoff et al. [2013]).

The scaled temperature trends for GOA are tightly constrained, being near to the observed trend, varying from 0.40 to 0.59 K/century (model L) to 0.64 to 0.77 K/century (model G). The Mod.Avg. analysis detects both GOA and N with values consistent with 1 and with scaled trends for GOA of 0.57 to 0.73 K/century and N of −0.05 to −0.01 K/century for the 1906–2005 period. The scaling factors for the individual models and Mod.Avg. are largely in line with what was found by Gillett et al. [2013] and Ribes and Terray [2013] despite the differences between the analyses. The reconstructed trends from the Gillett et al. [2013] and Ribes and Terray [2013] studies [in Bindoff et al., 2013, Figure 10.4] also show a fairly consistent attributed trend due to total anthropogenic influences that is near the observed warming trend, albeit for the different period of 1951–2010.

On this evidence the two-way regression of anthropogenic (GOA) and natural (N) signals seems to be a more robust technique for attributing past climate. However, there are two issues to be considered. First, while the scaled GOA trends of the CMIP5 models are in close agreement, not all of the models have scaling factors consistent with 1 or pass the residual consistency test. This may reduce the confidence in the attribution of the observed changes to anthropogenic influences for those models. Second, the models will have quite different contributions from G and OA to produce the same scaled GOA trends. To demonstrate this second point, we can estimate individual contributions to the scaled GOA trends from G and OA with a simplifying assumption. The reconstructed scaled anthropogenic temperatures can be deduced from equation (1) as βGOA(xGOA − νGOA). If we assume that νGOA does not contribute much to the trend of GOA, this can be expanded to βGOAxG+βGOAxOA, enabling an estimate of the contributions from G and OA. Figure 11 shows the estimated G, OA, and N contributions to the scaled trends deduced from the G, OA, and N and the GOA and N analyses. Inspection of Figures 8b and 11b demonstrates that this approximation of the G, OA, and N trends is reasonable. The contributions from G and OA to the scaled GOA (Figure 11b) show that even when the scaled net anthropogenic warming is fairly consistent across the models (Figure 10b), the G and OA contributions are far from consistent. The estimated contribution from G (βGOAxG) across the models ranges from 0.63 to 0.83 K/century (model B) to 1.84 to 3.04 K/century (model F). The Mod.Avg. result, for the GOA and N analysis, suggests a larger magnitude G warming and OA cooling than the G, OA, and N analysis does.
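The decomposition described above can be sketched numerically: a single GOA scaling factor is applied to a model's separate G and OA responses and a linear trend is fitted to each. The series, the scaling factor, and the function name below are hypothetical, and the noise term is neglected, as in the simplifying assumption stated in the text.

```python
import numpy as np

def trend_per_century(series, years):
    """Least squares linear trend of a time series, in units per century."""
    slope_per_year = np.polyfit(years, series, 1)[0]
    return 100.0 * slope_per_year

years = np.arange(1910.5, 2005.0, 10.0)       # ten decade centres, 1906-2005

# Hypothetical decadal global mean responses (K) for one model.
x_g = 0.010 * (years - years[0])              # G response, 1.0 K/century
x_oa = -0.004 * (years - years[0])            # OA response, -0.4 K/century
beta_goa = 0.9                                # hypothetical GOA scaling factor

g_contribution = beta_goa * trend_per_century(x_g, years)
oa_contribution = beta_goa * trend_per_century(x_oa, years)
goa_trend = beta_goa * trend_per_century(x_g + x_oa, years)
```

Because the trend is a linear operation, the two contributions always sum to the scaled GOA trend; two models can therefore agree on the net anthropogenic trend while disagreeing strongly on the split between G and OA.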

Figure 11. Comparisons of estimated contributions of G, OA, and N (red, green, and blue bars, respectively) to the scaled trends deduced from the different optimal detection analyses. (a) G, OA, and N analysis: Trends of βGXG, βOAXOA, and βNXN. (b) GOA and N analysis: Trends of βGOAXG, βGOAXOA, and βNXN. (c) G and OAN analysis: Trends of βGXG, βOANXOA, and βOANXN. Dashed line is the 1906–2005 HadCRUT4 trend.

The net anthropogenic warming result raises some pertinent questions. Because of the range in the magnitude of the responses, it will often not be possible to have self-consistency (β≈1) for all the models at the same time as a similarity of scaled trends between the models. Thus, as a step toward formal attribution, it may not be a requirement that scaling factors are consistent with 1 for all the model analyses at the same time. This is consistent with the views that β≈1 should not be a strong constraint for attribution [Hegerl and Zwiers, 2011; Bindoff et al., 2013] and that agreement between models' scaled trends is important for robust attribution [Hegerl et al., 2007; Bindoff et al., 2013]. However, given the varied G and OA contributions to the consistent scaled GOA trends across the models (Figure 10b), together with the limited number of models with scaling factors consistent with 1 (Figure 11a), it must be of concern whether the agreement is an artifact or not [Allen et al., 2006, section 6.1.2].

5.2 Regression With GOAN

When using only the historical experiment, the single pattern of GOAN (supporting information Figure S4) is detected in each of the 15 model analyses, with the scaled trends generally close to the observed trend. Several of the analyses produce lower scaled trends and fail the residual consistency test. This indicates that some models' GOAN patterns cannot be matched to the observed pattern by scaling alone.

5.3 Regression With G

When using only the historicalGHG experiments, the single pattern of G (supporting information Figure S5) is detected in each of the 15 model analyses, with scaling factors almost always below 1. This is not surprising as the scaled trends are very close to the observed warming, and there is no factor to partially offset the greater G warming [Allen et al., 2006].

5.4 Regression With N

The single pattern of N, derived from historicalNat alone (supporting information Figure S6), is either not detected or fails the residual consistency test across all the model analyses. This is important in the attribution of anthropogenic influences as it means that natural external factors alone cannot explain the observed changes.

5.5 Regression With G and N

Using the experiments historicalGHG and historicalNat in a two-way regression, both G and N (supporting information Figure S7) are robustly detected except for the analyses of a few models, with G having low scaling factors as N is unable to offset much of the G warming. When different combinations of patterns are available in a regression of a physical process, it is simplest to choose those that incorporate all the known major forcing factors [Allen et al., 2006].

5.6 Regression With G and OAN

The analysis of G and OAN uses the regression of the historicalGHG and historical experiments to transform the scaling factors for G and OAN (supporting information). This combination has previously been investigated, albeit for precipitation changes, in Wu et al. [2013]. In each of the 15 model analyses G is detected, with 10 of the cases having G scaling factors consistent with 1 (Figure 12a). OAN is detected in all but one of the analyses but is only consistent with a scaling factor of 1 in five of the cases. G and OAN are robustly detected across EOF truncations for each of the 15 models, with 10 of the cases detecting G across all truncations (supporting information Figure S8). The scaled G temperature trends (Figure 12b) are much more in agreement than in the G, OA, and N analysis. There is still a substantial range, however, with the trend of G varying from 0.57 to 0.78 K/century (model G) to 1.12 to 1.67 K/century (model N). Similarly, the OAN trends are closer together than the OA trends were in the G, OA, and N analysis, with ranges varying from −0.10 to 0.08 K/century (model G) to −1.00 to −0.49 K/century (model N). For the analysis using the CMIP5 model mean, Mod.Avg., both G and OAN are detected with scaling factors consistent with 1, with attributed trends of 0.87 to 1.22 K/century for G and −0.54 to −0.22 K/century for OAN.

Figure 12. As in Figure 8 but for the two signal regression of G and OAN.

An examination of the cross correlations of the spatiotemporal patterns gives a range of −0.11 to 0.92 but with most models appearing to have similar historicalOAN patterns (supporting information Figure S9), apart from models B, G, and O. The marginally closer agreement among the historicalOAN patterns than among the historicalOA patterns is partially due to the higher SNR of the former patterns (Table 4). While the trend of scaled G (Figure 13b) is much more constrained than in the G, OA, and N analysis, it still has about the same range as the unscaled G trend (Figure 13a). However, the OAN scaled trend is in closer agreement than the unscaled OAN trend.

Figure 13. Relationship between G and OAN trends for 1906–2005 period (K/decade) for G and OAN analysis. (a) Relationship between unscaled OAN trend and unscaled G trend. (b) Relationship between scaled OAN trend and scaled G trend. See Figures 12 and 9.

The scaling factors for G and OAN can be used to estimate the contributions from G (βGxG), OA (βOANxOA), and N (βOANxN) to the scaled trends (Figure 11c). The G and OAN analysis has the lowest spread of contributions from G and OA for the three analyses shown. This could suggest that this combination may be a more robust technique than either the three-way regression (G, OA, and N) or two-way regression (GOA and N). However, close consistency of scaling factors does not necessarily mean that the results are more accurate, as there may be consistent biases in the methods. For instance, any differences in the uncertainty in the magnitude of forcing and response to natural and other anthropogenic factors in the G and OAN analysis cannot be accounted for.

As there is only one reality to test the models against, it is not easy to check which technique is more robust. One option is to do perfect model tests, where the patterns being investigated are known in advance, for instance, using climate models as surrogates for the observations [e.g., Stott et al., 2003; Ribes and Terray, 2013]. One can then examine how well the models attribute the forcing components in other models.

5.7 Perfect Model Results

From the 15 CMIP5 models used in this study there are 72 historical simulations that can be used as predictands or surrogate observations. For each of the 15 models being used as predictors or “detector” models, the analysis is repeated on each of the surrogate observations, not including the historical simulations from the same model. Then in each case the scaled temperature trends can be compared to the surrogate model's own estimate of that forcing contribution. The fraction of detections and scaled trends consistent with the expected trend (when the residual test for consistency is also passed) is calculated for each detector model (Figures 14a and 14b) as well as for each surrogate observation model (Figures 14c and 14d). Thus, it is possible to examine how faithful each model is in attributing influences in the other models and how well each model's own forced components are attributable by the other models.
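The bookkeeping for the detector-model view (Figures 14a and 14b) can be sketched as a leave-one-model-out loop. The function `analyse` stands in for a full optimal detection analysis and is a hypothetical callback, as are the toy inputs; only the counting logic is illustrated.

```python
def perfect_model_fractions(simulations, analyse):
    """Fractions of surrogate observations, per detector model, for which a
    signal is detected and for which its scaled trend matches the expected one.

    simulations: dict mapping model name -> list of historical runs.
    analyse(detector, run): hypothetical callback returning the booleans
        (detected, trend_consistent, residual_ok).
    """
    fractions = {}
    for detector in simulations:
        n_cases = n_detected = n_consistent = 0
        for truth_model, runs in simulations.items():
            if truth_model == detector:
                continue                      # never test a model on itself
            for run in runs:
                detected, consistent, residual_ok = analyse(detector, run)
                n_cases += 1
                n_detected += bool(detected and residual_ok)
                n_consistent += bool(detected and consistent and residual_ok)
        fractions[detector] = (n_detected / n_cases, n_consistent / n_cases)
    return fractions

# Toy stand-in: each "run" is just a number and the detector is ignored.
toy_runs = {"A": [0.5, -0.1], "B": [0.7], "C": [0.2, 0.3, -0.4]}
toy_analyse = lambda detector, run: (run > 0, run > 0.25, True)
result = perfect_model_fractions(toy_runs, toy_analyse)
```

Swapping the roles of the two loops gives the surrogate-observation view of Figures 14c and 14d.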

Figure 14. Summary of perfect model study results for (a, c) the G, OA, and N analysis and (b, d) the G and OAN analysis. Shown are fractions of cases where the signal is detected (light color bars) and where the scaled signal trend is consistent with the expected trend of the signal for the model being used as surrogate observation (dark color bars), both where the residual passes the consistency test. Figures 14a and 14b show, for each predictor (or detector) model, the fractions of surrogate observations (up to 72) drawn from historical simulations from the other 14 models. Figures 14c and 14d show, for each model that is used as a predictand or surrogate observation (our “truth” model if you will), the fraction of predictor models (up to 14 times the number of ensemble members for that predictand model).

For the G, OA, and N combination (Figure 14a) G is detected in the surrogate models more than 80% of the time for only three of the models, and only once is OA detected in the surrogate models more than 80% of the time. In contrast, G and OAN are detected when using 11 and 10, respectively, of the models more than 80% of the time in the analyses of the G and OAN combination (Figure 14b). Several models are very poor, <10%, at providing patterns that can detect G in the G, OA, and N analysis (models B, D, H, and M). In contrast, in none of the model analyses is G detected less than 60% of the time for the G and OAN combination. Based on this analysis the G and OAN combination is more discriminating than using the G, OA, and N combination.

The fraction of scaled trends consistent with the expected trends is generally higher in the G and OAN analysis than the G, OA, and N analysis, but there are a number of models where that is not the case. For instance, for the G, OA, and N combination, when six of the detector models are used, there are higher G consistency fractions than when the same models are used for the G and OAN analysis. From the viewpoint of the models acting as surrogate observations (Figures 14c and 14d), for any given model acting as the predictand, there is a higher fraction of the predictors that make detections and, except for one predictand model, a higher fraction of predictor trends consistent with the expected trends in the G and OAN analysis than in the G, OA, and N analysis.

In other words, some models are better at being detectors than others, and some models are better able to have their component signals detected than others. Overall, the G and OAN analysis appears to be more skilful in detecting and estimating the true G trend in the surrogate observations than the G, OA, and N analysis. As noted for the analysis of G, OA, and N against the observations, the analyses on models with low SNR, <1.5, for historicalNat (Table 4), tend to detect G with lower frequencies than analyses on models with higher SNR.

6 Implications for Observationally Constrained TCR

Optimal detection analysis results have been used to attempt to provide observationally constrained estimates of future warming (often called the “ASK” approach after Allen et al. [2000], Stott and Kettleborough [2002], and Kettleborough et al. [2007]). This applies scaling factors for G to future warming trends by assuming that a model that is overresponding/underresponding in the past will do the same in the future. This technique can also be applied to the transient climate response [Stott and Forest, 2007], i.e., βGTCR, and has been used to provide estimates of observationally constrained TCR [Stott and Jones, 2012; Gillett et al., 2012; Stott et al., 2013; Gillett et al., 2013]. Gillett et al. [2013] estimated the optimal detection constrained TCR, based on the multimodel mean analysis, to be 0.9 to 2.3 K (5–95%), which contributed to the IPCC's assessment that TCR “is likely in the range 1°C to 2.5°C” [Bindoff et al., 2013; Collins et al., 2013].
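In its simplest form the scaling step amounts to multiplying a model's TCR by the bounds of the G scaling factor range. The sketch below uses hypothetical numbers and a hypothetical function name, and it neglects uncertainty in the model's own TCR and in the relationship between historical G trends and TCR.

```python
def scaled_tcr_range(tcr, beta_low, beta_high):
    """Observationally constrained TCR range, assuming the model's past
    over- or under-response to greenhouse gases carries over unchanged."""
    return beta_low * tcr, beta_high * tcr

# Hypothetical model: TCR of 2.0 K with a G scaling factor range of 0.5 to 0.9.
low, high = scaled_tcr_range(2.0, 0.5, 0.9)
```

A scaling factor range below 1, as in this example, pulls the constrained TCR below the model's raw value.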

As expected, the wide, often unconstrained, ranges of βG in the G, OA, and N analysis produce wide ranges of scaled TCR (Figure 15a). Where G is detected, the scaled TCR has ranges that vary from 0.72 to 1.65 K (model E) to 2.53 to 8.12 K (model D). The lack of consistency across models is similar to what was found by Gillett et al. [2013]. Constraining future warming on the basis of an analysis of a single climate model could be misleading if the impact of model diversity is not acknowledged.

Figure 15. Transient climate response (TCR) deduced from each of the optimal detection analyses. (a) G, OA, and N; (b) GOA and N; and (c) G and OAN. Uncertainty ranges, 5–95%, for TCR for each model before scaling (black lines) calculated from variability from each model's piControl. Uncertainty ranges, 5–95%, after scaling (red lines) calculated from regression scaling factors. (d) The multimodel average (Mod.Avg.) of TCR before and after scaling for different analyses, as well as the scaled multimodel average TCR from Gillett et al. [2013, hereafter G13]. The uncertainty range of the Mod.Avg. unscaled TCR is estimated by bootstrapping the means from each of the models to give an estimate of the “true” multimodel range.

The regression analysis used by Bindoff et al. [2013] to support the anthropogenic warming attribution statement was the two-way case of GOA and N. In principle the scaling factors for GOA, from a GOA and N analysis [Gillett et al., 2013], could also have been used to constrain the TCR if one assumes that βGOA=βG. The ranges of scaled TCR (βGOATCR) are considerably varied (Figure 15b) across the models, from 1.09 to 1.65 K (model G) to 3.17 to 5.03 K (model D).

The ranges of scaled TCR (βGTCR) deduced from the G and OAN analysis (Figure 15c) are less spread than in the two previous analyses, from 0.98 to 1.44 K (model G) to 1.91 to 2.75 K (model C). There is, however, still limited consistency across the models, and it is arguable that TCR after scaling is not better constrained than before scaling.

The scaled multimodel average of TCR for the G, OA, and N analysis (Figure 15d), 1.07 to 2.06 K, is similar to that reported in Gillett et al. [2013]. Some of the uncertainties associated with the historicalGHG trends not having simple relationships with TCR (Figure 2) were included by Gillett et al. [2013] but not by us, so the spread in scaled TCR here is slightly smaller. The scaled multimodel average TCR is higher for the GOA and N and G and OAN analyses, 1.84 to 2.40 K and 1.54 to 2.17 K, respectively.

The wide variety of scaled TCR values across models and different choices in the analysis has been seen previously [Gillett et al., 2012; Stott and Jones, 2012; Gillett et al., 2013]. The results presented here strongly imply that considerable uncertainties exist with the specific use of a three-way regression which includes N as a separate pattern. This highlights the importance of not putting too much emphasis on one result when there are sensitivities to analysis choice and to the uncertainty in the climate response to given forcing factors.

7 Discussion

The method of optimal detection relies on a number of assumptions. The most important is that patterns can be linearly combined [Gillett et al., 2004] and that while response magnitudes are uncertain there are no errors in the spatiotemporal patterns. The examination of response patterns in section 4 clearly demonstrates that there is a wide variety of responses spatially and temporally across the models, especially for non–greenhouse gas anthropogenic factors. The limited consistency in scaled temperature trends across the models in the observational analysis and in the perfect model tests is of concern. As some of the model responses are inconsistent with each other, the historicalOA multimodel average is not as representative of the models as the historicalGHG multimodel average, and as such its use will not give results as robust as was once considered [Gillett et al., 2013; Jones et al., 2013].
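
The linear-combination assumption can be made concrete with a deliberately simplified sketch: observations are modeled as a sum of two fixed response patterns, each multiplied by an unknown scaling factor, and the factors are estimated by ordinary least squares. This is not the optimal detection method itself, which also weights by an estimate of internal variability and works in a truncated EOF space. The function name `fit_two_patterns` and all data values are synthetic assumptions.

```python
def fit_two_patterns(y, x1, x2):
    """Ordinary least squares estimate of (b1, b2) in y ~ b1*x1 + b2*x2."""
    def dot(a, b):
        return sum(p * q for p, q in zip(a, b))

    # Normal equations for two predictors, solved with Cramer's rule.
    a11, a12, a22 = dot(x1, x1), dot(x1, x2), dot(x2, x2)
    r1, r2 = dot(x1, y), dot(x2, y)
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("patterns are (nearly) collinear; scaling factors are degenerate")
    b1 = (r1 * a22 - a12 * r2) / det
    b2 = (a11 * r2 - a12 * r1) / det
    return b1, b2


# Synthetic decadal-mean anomalies (K); illustrative only, not model output.
g = [0.05, 0.10, 0.18, 0.28, 0.40, 0.55, 0.72, 0.90]           # greenhouse gas pattern
oan = [0.00, -0.05, -0.12, -0.22, -0.30, -0.33, -0.35, -0.36]  # other anthropogenic + natural
obs = [0.9 * a + 1.2 * b for a, b in zip(g, oan)]              # noise-free pseudo-observations

beta_g, beta_oan = fit_two_patterns(obs, g, oan)
print(f"beta_G = {beta_g:.2f}, beta_OAN = {beta_oan:.2f}")
```

Because the pseudo-observations are built from the same patterns used in the fit, the scaling factors are recovered. If the true response had a different shape from the pattern supplied, the regression could only adjust its amplitude, which is the limitation discussed above.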

Clearly, further work is required to address the issues raised in this paper: the importance of experimental design and analysis methodology and how to incorporate pattern uncertainty. Other approaches related to optimal detection may be helpful to respond to these concerns. One technique that has been used in multimodel approaches is error in variables [Huntingford et al., 2006]. The method outlined in Huntingford et al. [2006] uses intramodel variability to estimate the uncertainties in the shape, but not the magnitude, of the spatiotemporal patterns which can be included as an extra component in the uncertainty analysis. A related approach that is claimed to be an improvement on the Huntingford et al. [2006] technique and that can incorporate different sources of uncertainty may be a promising development [Hannart et al., 2014]. However, techniques that try to include measures of model uncertainty often rely on sampling an ensemble of opportunity (e.g., CMIP5), which may not be representative of the population of all physically plausible models [Hegerl and Zwiers, 2011]. Another approach related to the optimal detection described here is the recently proposed regularized optimal fingerprinting technique [Ribes et al., 2013]. The technique derives a regularized estimate of the covariance matrix that is more accurate than other methods [Bindoff et al., 2013], with “a trade-off between bias and variance” [Ledoit and Wolf, 2004]. Techniques which attempt to account for model/observational discrepancies [Harris et al., 2006], apply Bayesian methods [Lee et al., 2005], or screen models depending on a measure of their quality [Santer et al., 2012] may be helpful.
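
The idea behind a regularized covariance estimate can be sketched as linear shrinkage of the sample covariance toward a scaled identity matrix, the form of estimator proposed by Ledoit and Wolf [2004]. In the sketch below the shrinkage intensity `gamma` is supplied by hand, which is an assumption made for brevity; the published estimator derives it from the data. The function names and the sample values are hypothetical.

```python
def sample_covariance(samples):
    """Sample covariance matrix (rows = realizations, columns = variables)."""
    n, p = len(samples), len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(p)]
    return [
        [
            sum((row[i] - means[i]) * (row[j] - means[j]) for row in samples) / (n - 1)
            for j in range(p)
        ]
        for i in range(p)
    ]


def shrink_to_identity(cov, gamma):
    """Return (1 - gamma) * cov + gamma * mu * I, with mu = trace(cov) / p."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    p = len(cov)
    mu = sum(cov[i][i] for i in range(p)) / p
    return [
        [(1.0 - gamma) * cov[i][j] + (gamma * mu if i == j else 0.0) for j in range(p)]
        for i in range(p)
    ]


# Three hypothetical control segments of three variables each: with as many
# variables as realizations, the sample covariance is singular.
control_segments = [[0.10, -0.20, 0.05], [-0.10, 0.10, 0.00], [0.05, 0.00, -0.10]]
cov = sample_covariance(control_segments)
shrunk = shrink_to_identity(cov, gamma=0.3)  # gamma chosen by hand for illustration
for row in shrunk:
    print([round(v, 4) for v in row])
```

Shrinkage preserves the total variance (the trace) while damping the off-diagonal terms, which trades a small bias for a better conditioned, invertible estimate when few control segments are available.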

The optimal detection analysis applied in this study uses a common EOF basis, as the different CMIP5 models did not have enough available piControl data to characterize each model's internal variability well enough. As the CMIP5 models sample a wide range of interannual to interdecadal temperature variations due to internal variability [Jones et al., 2013, Figure 5], using a common EOF basis will be suboptimal for analyses on individual models. A preferred approach, for a regression analysis with an individual model, would be able to characterize the internal variability of that model, which could be done with much longer length piControls [Jones et al., 2013]. The techniques to investigate if inclusion of predictor signals is superfluous or not should be considered not only for optimal detection studies but also for alternative regression analyses that have also been used to estimate contributions to past climate [e.g., Bindoff et al., 2013, section 10.3.1.1.3] to avoid misinterpretation of results.

A set of model experiments has been recommended by Ribes et al. [2015] for inclusion in the next phase of the Climate Model Intercomparison Project, CMIP6 [Meehl et al., 2014]. The proposed experiments differ from those used in previous optimal detection analyses: anthropogenic and natural forcings together, natural only, and aerosol-only forcing factors. Ribes et al. [2015] state that these would enable the deduction of contributions from greenhouse gases, aerosol, and natural forcing factors. By better characterizing the aerosol climate response, it is claimed, the “greenhouse gas” attributable warming would be better constrained. Ribes et al. [2015] recommended that large numbers of ensemble members of historicalNat would be preferable in future analyses. They concluded that given a total number of 25 simulations to apportion to the different historical forcing experiments in their perfect model analysis, results were more reliable with order 10 historicalNat ensemble members. However, this is far more than was provided by any institution for CMIP5 (Table 2). It is doubtful that many institutions would have the resources for this recommendation, especially given the general increase in the number of experiments that could be done for CMIP6 [Meehl et al., 2014]. In their perfect model analysis, Ribes et al. [2015] were limited to using models with strong aerosol cooling effects. They explain that the optimum number of ensemble members could have been different if models with different magnitude aerosol cooling were considered. Ribes et al. [2015] also did not examine the impact of the detector model having different patterns than the truth model. It is thus unclear how appropriate the recommendations are in practice.

To encourage as many models to be included in CMIP6 for detection studies as possible, more practical recommendations of numbers of experiments and of ensemble members should be considered, such as focusing on initial condition ensembles of historicalGHG and historicalOAN. Experimental design should take into account the likely strength of the response patterns. If a very large number of ensemble members would be needed to reasonably characterize the weak response of some forcing factors, it should be reconsidered if the experiment is a high priority or not. To also help better characterize the responses to forcing factors of particular interest, experiments should be designed to limit having to derive response patterns from other experiments. The impact this may have on the total number of simulations that would be produced will need to be accounted for. Planning multimodel experimental designs (such as CMIP6) [Meehl et al., 2014] will also require considering the importance of a range of variables, on different time and spatial scales, that may have responses with differing SNR. For instance, the historical global precipitation response to natural forcing factors is much stronger than that to well-mixed greenhouse gases [Bindoff et al., 2013] and analyses with different choices of spatial/temporal filtering may produce patterns with much stronger SNR.
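
The link between response strength and ensemble size can be sketched with a back-of-envelope calculation. It assumes that ensemble members have independent internal variability of equal standard deviation, so that the noise in an ensemble mean falls as one over the square root of the number of members. The function name `members_needed` and the numbers are hypothetical.

```python
import math


def members_needed(signal, noise_std, target_snr):
    """Smallest ensemble size whose mean reaches target_snr = |signal| / (noise_std / sqrt(n))."""
    if signal == 0:
        raise ValueError("a zero signal cannot reach any positive SNR")
    # Solve |signal| / (noise_std / sqrt(n)) >= target_snr for n.
    n = (target_snr * noise_std / abs(signal)) ** 2
    return max(1, math.ceil(n))


# Hypothetical trend responses (K/century) with the same internal variability.
print(members_needed(signal=1.0, noise_std=0.5, target_snr=2.0))   # strong response
print(members_needed(signal=0.25, noise_std=0.5, target_snr=2.0))  # weak response
```

Under these assumptions the required ensemble size grows with the inverse square of the signal strength, so a response four times weaker needs sixteen times as many members for the same SNR.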

Why earlier studies (such as reported in Hegerl et al. [2007]) gave more consistent scaled trends for different models, when deducing the contributions of greenhouse gases, other anthropogenic, and natural factors to twentieth century observed warming, is still an open question. Before CMIP5, total solar irradiance data sets were used that had larger increases over the twentieth century, which would have led to stronger natural response patterns. There were simpler experimental designs that improved the SNR of the other anthropogenic factors, and the model response patterns may have been less varied than those in CMIP5.

8 Conclusions

A wider range of climate models has been included in an optimal detection analysis of large spatial and temporal scale variations of twentieth century surface temperature than ever before. Only the analyses of eight of the 15 CMIP5 models examined detect a separate greenhouse gas signal, and of those only five also detect the influence of other anthropogenic factors in a standard analysis design. The scaled greenhouse gas trends show a wide range across the models with little consistency, supporting previous studies. This variety of results appears to be largely down to two factors: (1) the inclusion of patterns with low signal-to-noise ratio introduces noise into the regression and (2) differences in the spatiotemporal patterns across the models which are irreconcilable in the optimal detection analysis as it is currently designed. In particular, the temporal and spatial patterns of response to the non–greenhouse gas anthropogenic factors are more varied across the CMIP5 models than the response to greenhouse gases. Using an alternative analysis design, by using historical and historicalGHG experiments, we find that well-mixed greenhouse gases are detected with all models and other anthropogenic and natural influences combined in the vast majority of models. Of the observed warming of 0.65 K/century, the multimodel average analysis (not including a measure of model pattern uncertainty) [Huntingford et al., 2006; Hannart et al., 2014] attributes a well-mixed greenhouse gas warming of 0.87 to 1.22 K/century and other anthropogenic and natural cooling of −0.54 to −0.22 K/century. However, with the models included in the fifth phase of the Climate Model Intercomparison Project sampling a wider range of physical and chemical processes than ever before, the more distinct responses of the models still make it difficult to get consistent attributed trends.

There are a number of ways in which the utility of future model intercomparison projects could be improved for attribution studies of large-scale temporal and spatial variations of near-surface temperatures. The first would be to ensure as much model data are available as possible, i.e., to have long (thousands of years) piControl simulations to better constrain the multidecadal variability from each model and to have large initial condition ensembles to better characterize the forced responses. Second, if very large initial condition ensembles are not possible for most models, to consider designing experiments that avoid predictors with low signal-to-noise ratio. Third, to have a set of experiments which do not require the differencing of two or more experiments to deduce the response to wanted forcing factors. This will reduce uncertainties in the responses, for instance, helping to distinguish the other anthropogenic or aerosol response patterns from greenhouse gases. However, depending on the specifics of the experimental design, this may have a cost with regard to the total number of simulations to be produced. Fourth, to have as consistent application as possible, across the models, of what forcing factors are included in each experiment, such as indirect aerosols being applied by all models. This will aid the interpretation of the experiments' responses, although it may be technically challenging in some Earth system models. Finally, a thorough quantification of the radiative forcing from the major individual forcing factors contributing to the experiments for each model would greatly contribute to the understanding of model response patterns [Forster et al., 2013; Andrews, 2014]. Of course, in the planning of future model intercomparison projects, consideration should also be given to the requirements of other analyses which examine other variables on different time and space scales than examined in this study.
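
The cost of deriving a response by differencing two experiments can be sketched as follows. The sketch assumes that both experiments have independent internal variability with the same standard deviation `sigma` and that responses add linearly; the function names and numbers are hypothetical.

```python
import math


def noise_std_direct(sigma, n_members):
    """Noise in the ensemble mean of a single dedicated experiment."""
    return sigma / math.sqrt(n_members)


def noise_std_difference(sigma, n_a, n_b):
    """Noise in a pattern derived as the difference of two ensemble means."""
    return sigma * math.sqrt(1.0 / n_a + 1.0 / n_b)


sigma = 0.1  # hypothetical internal variability of a trend (K/century)
direct = noise_std_direct(sigma, 4)
derived = noise_std_difference(sigma, 4, 4)
print(f"direct: {direct:.3f}, derived by differencing: {derived:.3f}")
```

With equal ensemble sizes the differenced pattern carries about 1.4 times the noise of a directly simulated one, so under these assumptions matching the noise of a four-member dedicated experiment would take eight members in each of the two differenced experiments.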

Given the wide range of model responses to not only greenhouse gases but also to other anthropogenic factors, consistency in the attribution of trends across models is unlikely with the current formulation of the analysis. There is a challenge to the detection and attribution community to improve how spatiotemporal pattern uncertainty is dealt with in the standard optimal detection methodology and how to interpret the variety of results when using an ensemble of opportunity that may poorly sample model diversity [Hegerl and Zwiers, 2011]. Approaches that try to account for pattern uncertainties [Huntingford et al., 2006; Hannart et al., 2014] should be examined further. The results presented here do not throw into question that well-mixed greenhouse gases are the dominant influence on changes in near-surface temperatures over the last 100 years or so. But a return to more thorough use of techniques in regression studies and a better understanding of differences between the models will help to constrain what has happened in the past and give more confidence in climate projections.

Acknowledgments

We acknowledge the Program for Climate Model Diagnosis and Intercomparison and the World Climate Research Programme's Working Group on Coupled Modelling, which is responsible for CMIP, and we thank the climate modeling groups for producing and making available their model output. We thank Ben Booth, Laura Wilcox, and Annica Ekman for useful discussions about the CMIP5 aerosol modeling. We thank Tim Andrews for providing CMIP5 TCR data and Piers Forster for ERF data. We are grateful to Natalie Mahowald for providing aerosol optical depth data for the CCSM4 model. We wish to thank the reviewers of this manuscript for their useful comments. The CMIP5 data used in this study were obtained from http://cmip-pcmdi.llnl.gov/cmip5/ and were up-to-date as of March 2013. All models used were “p1” physics versions. Version numbers of the data retrieved are available on request from the lead author ([email protected]). HadCRUT4 data were obtained from http://www.metoffice.gov.uk/hadobs, version 4.1.1.0, retrieved March 2013. The work of the authors was supported by the Joint UK DECC/Defra Met Office Hadley Centre Climate Programme (GA01101).