Comparison of Climate Model Large Ensembles With Observations in the Arctic Using Simple Neural Networks
Abstract
Evaluating historical simulations from global climate models (GCMs) remains an important exercise for better understanding future projections of climate change and variability in rapidly warming regions, such as the Arctic. As an alternative approach for comparing climate models and observations, we set up a machine learning classification task using a shallow artificial neural network (ANN). Specifically, we train an ANN on maps of annual mean near-surface temperature in the Arctic from a multi-model large ensemble archive in order to classify which GCM produced each temperature map. After training our ANN on data from the large ensembles, we input annual mean maps of Arctic temperature from observational reanalysis and sort the prediction output according to increasing values of the ANN's confidence for each GCM class. To attempt to understand how the ANN is classifying each temperature map with a GCM, we leverage a feature attribution method from explainable artificial intelligence. By comparing composites from the attribution method for every GCM classification, we find that the ANN is learning regional temperature patterns in the Arctic that are unique to each GCM relative to the multi-model mean ensemble. In agreement with recent studies, we show that ANNs can be useful tools for extracting regional climate signals in GCMs and observations.
Key Points
- An artificial neural network is trained to identify which climate model produced an annual mean map of near-surface temperature in the Arctic
- The classification network is evaluated using input from atmospheric reanalysis as a method of comparing climate models and observations
- An explainability method reveals regional temperature patterns the artificial neural network is using to classify observations with different climate models
Plain Language Summary
Due to many complex processes in the climate system, the Arctic is warming more rapidly than other parts of the globe. To understand the impacts of these changes in the Arctic, it is important to evaluate climate model projections. While other statistical methods exist for assessing differences between climate model simulations, we introduce a machine learning approach for comparing climate models and observations using a tool called artificial neural networks (ANNs). We set up our problem by inputting yearly maps of temperature in the Arctic and then task the ANN with classifying which climate model produced each map. To understand how the ANN learns where the temperature map is coming from, we utilize a visualization method to peer into the machine learning black box. After training our ANN on data from different climate models, we then input maps of Arctic temperature from observations to evaluate which climate model is classified for every year in the historical record. Using this setup, we find that the ANN is leveraging regional patterns of temperatures, and not just overall warm and cold biases, in order to make its climate model and observation predictions.
1 Introduction
The Arctic is warming at a rate more than three times that of the globally averaged surface temperature trend (Druckenmiller et al., 2021). This dramatic warming, otherwise known as Arctic amplification, is accompanied by long-term losses of Arctic sea-ice extent and thickness (Kacimi & Kwok, 2022; Parkinson & DiGirolamo, 2021; Schweiger et al., 2019), reductions in the ice mass of glaciers and the Greenland Ice Sheet (Mouginot et al., 2019; Tepes et al., 2021), thawing permafrost and boreal wildfires (McCarty et al., 2021; Miner et al., 2022), changes to deep ocean heat content and biogeochemistry (Solomon et al., 2021; Timmermans et al., 2018), shifts in high latitude phenology (Myers-Smith et al., 2020), and other possible connections to local and remote extreme weather (Cohen et al., 2020; Graham et al., 2017). As summarized in Previdi et al. (2021) and P. C. Taylor et al. (2022), local CO2 forcing and other positive feedbacks in the Earth system contribute to Arctic amplification, such as increases in atmosphere-ocean poleward energy transport, changes in clouds and water vapor, the ice-albedo feedback, Planck and lapse rate feedbacks, and other radiative energy imbalances. To further understand the contributions to Arctic amplification and its far-reaching impacts, it is necessary to evaluate climate models of varying orders of complexity (Dutta et al., 2021; Hahn et al., 2022; Henry et al., 2021; Holland & Landrum, 2021). Moreover, fully coupled atmosphere-ocean global climate models (GCMs) are needed for comparing future assessments of Arctic climate change. However, there are large mean state biases across the Arctic between different GCMs (Davy & Outten, 2020), such as in Coupled Model Intercomparison Project 5 and 6 (CMIP5/6). For example, most CMIP6 models are still too cold over sea ice during the boreal winter (Davy & Outten, 2020). Large internal variability also needs to be accounted for in the high latitudes, especially when considering dynamical changes to the atmospheric circulation (M. England et al., 2019; Peings et al., 2021; Swart et al., 2015). To address some of these issues, one opportunity is to use large ensembles from different GCMs, which account for both internal variability and structural model uncertainty when comparing historical and future Arctic climate change simulations (Deser et al., 2020; Landrum & Holland, 2020).
Improving credibility, understanding, and trust in climate models requires constant evaluation of historical and future projections, especially for considering them in adaptation and mitigation planning in the Arctic. In fact, previous assessment reports from the Intergovernmental Panel on Climate Change have devoted entire chapters to climate model evaluation for summarizing GCM performance and other associated diagnostics (e.g., Flato et al., 2013; Randall et al., 2007). A number of scientific institutions have also developed automated statistical toolboxes, such as the Program for Climate Model Diagnosis and Intercomparison Metrics Package (Gleckler et al., 2016; Lee et al., 2021) and the National Center for Atmospheric Research Climate Variability Diagnostics Package (Phillips et al., 2014, 2020), to assist in methodologically comparing GCMs. These types of software packages usually compare sets of relative skill metrics or rankings for CMIP5/6 models across different mean climate fields, modes of internal variability, trends, extreme events, and teleconnections.
At a basic level, climate model evaluation considers sets of skill metrics, such as measures of bias, variance, pattern correlation, and root-mean-square error (RMSE), for comparing differences between GCMs and observations. The scalar metrics are often then presented in summary displays, such as through Taylor diagrams (K. E. Taylor, 2001) or portrait diagrams of relative error (Gleckler et al., 2008). In recent years, more advanced statistical methods have also been applied to mean climate benchmarks, such as through bias correction, emergent constraints, and model independence and performance-based weighting schemes (Brunner et al., 2020; Eyring et al., 2019; Knutti et al., 2017; Lauer et al., 2020; Merrifield et al., 2020). This includes leveraging output from newly designed GCM large ensembles (Maher et al., 2021). However, these common relative error and emergent constraint measures are not without issues (Chai & Draxler, 2014; Sanderson et al., 2021); in some cases, they may even underestimate the skill of climate models (Willmott et al., 2017). Most of these benchmarks also only consider point-by-point statistics, rather than considering potential (non)linear patterns across space or time. As a result, it is worth exploring new approaches for climate model evaluation, especially considering the growing interest in applying deep learning methods in the geosciences (Nichol, Peterson, Fricke, & Peterson, 2021; Nowack et al., 2020; Reichstein et al., 2019).
Although the use of machine learning methods is still fairly new in climate science applications (Boukabara et al., 2021; Rasp et al., 2019), several studies have already demonstrated their utility over traditional multiple linear regression for identifying mechanistic processes and extracting patterns of climate change and variability (e.g., Barnes et al., 2020; Nichol, Peterson, Peterson, et al., 2021; Pasini et al., 2017). In this study, we use a form of deep learning called artificial neural networks (ANNs) for classifying Arctic maps of temperature data according to different GCMs. We also leverage an explainable machine learning method to identify regional climate patterns that the ANN is using to make its classification.
Overall, the ANN is quickly able to learn which climate model produces each annual mean map of near-surface temperature by using regional patterns that are unique to each large ensemble simulation, especially relative to the multi-model mean large ensemble. The machine learning explainability method then reveals these relevant regional pattern fingerprints of temperature for each climate model. One motivation for this work is that we are interested in applying inputs from observationally derived maps to compare with GCMs using the ANN classification scheme and evaluate whether our method produces similar results relative to other climate model evaluation techniques. Here, the methodological difference is that by using ANNs we can also consider potential regional nonlinear relationships across the entire Arctic map, rather than only computing point-by-point statistics. Notably, we find that although the ANN is using these regional patterns, the classification results for comparing with observations resemble other simple evaluation methods.
2 Data
2.1 Multi-Model Large Ensemble Archive
To train our ANN on climate model data, we use a collection of single model initial-condition large ensemble simulations from the multi-model large ensemble archive (MMLEA) (Deser et al., 2020; NCAR, 2020). The MMLEA consists of seven CMIP5-class GCMs, which range in ensemble size from 16 to 100 members. Specifically, we use the Canadian Earth System Model Large Ensemble (CanESM2; Kirchmeier-Young et al., 2017), Max Planck Institute Grand Ensemble (MPI; Maher et al., 2019), Commonwealth Scientific and Industrial Research Organization Large Ensemble (CSIRO-MK3.6; Jeffrey et al., 2013), EC-Earth Consortium Large Ensemble (EC-Earth; Hazeleger et al., 2010), Geophysical Fluid Dynamics Laboratory Large Ensemble (GFDL-CM3; Sun et al., 2018), Geophysical Fluid Dynamics Laboratory Earth System Model Large Ensemble (GFDL-ESM2M; Rodgers et al., 2015), and the Community Earth System Model Large Ensemble Community Project (LENS; Kay et al., 2015).
We include only the first 16 ensemble members from each simulation, since this is the minimum number of ensemble members available to equally weight all seven GCMs (i.e., EC-Earth includes 16 ensemble members) when training and testing our ANN. The GCMs also differ by their initialization protocol (Hawkins et al., 2016; Stainforth et al., 2007) and utilize micro perturbations (i.e., small roundoff error in the atmospheric initial conditions: EC-Earth, GFDL-CM3, LENS), macro perturbations (i.e., different coupled atmosphere-ocean states: MPI, CSIRO-MK3.6, GFDL-ESM2M), or a combination of these two methods (CanESM2). All of the simulations in the MMLEA use historical forcing until 2005 and Representative Concentration Pathway 8.5 (RCP8.5) forcing thereafter (Riahi et al., 2011; K. E. Taylor et al., 2012). Although RCP8.5 is described as an unrealistically high emissions scenario (e.g., Hausfather & Peters, 2020; Peters & Hausfather, 2020), we focus on data from the observational record (1950–2019) and discuss the broader conclusions of using explainable neural networks to compare maps of climate data between different GCMs. Given that the individual RCP scenarios do not substantially diverge until later in the 21st century (van Vuuren et al., 2011), the use of this future emissions scenario does not affect the interpretation of our results.
Large ensembles are useful for disentangling the effects of internal variability relative to external climate forcing, especially in regions such as the Arctic. Recently, the MMLEA has been used in studies for evaluating Arctic amplification (e.g., M. R. England, 2021; Holland & Landrum, 2021; Landrum & Holland, 2020), detection and attribution of extreme events in Siberia and Alaska (e.g., Ciavarella et al., 2021; Weidman et al., 2021), comparing projections of Arctic sea ice (e.g., Bonan et al., 2021; Topál et al., 2020), and identifying extratropical teleconnections (e.g., McCrystall & Screen, 2021; McKenna & Maycock, 2021). The high number of realizations per GCM is also particularly valuable for addressing deep learning and climate science applications, where large sample sizes are required for creating training data sets and improving overall ANN performance. As a recent example, Maher et al. (2022) leveraged the MMLEA and compared different supervised machine learning methods for classifying El Niño-Southern Oscillation events according to their spatial pattern.
In this work, we use monthly near-surface temperature (T2M) data and calculate annual means in each large ensemble simulation. To compare the results of our ANN with observations, we evaluate the 1950–2019 temporal period, which overlaps across all of the large ensembles and observations. Since the ANN requires the input maps to be the same size, all climate model data are regridded onto a common spatial grid of 1.9° latitude by 2.5° longitude using a bilinear interpolation scheme. A brief summary of the large ensemble simulations can be found in Table S1 in Supporting Information S1.
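To make this preprocessing concrete, below is a minimal sketch of the annual-mean and regridding step using xarray and xESMF, assuming a monthly T2M file with standard latitude/longitude coordinates. The file name, variable name, and target grid construction are illustrative only; the actual preprocessing in this study was performed with NCO, CDO, and NCL (see the Open Research section).

```python
import numpy as np
import xarray as xr
import xesmf as xe

# Annual means from monthly T2M (hypothetical file and variable names)
ds = xr.open_dataset("t2m_monthly.nc")
t2m_annual = ds["t2m"].groupby("time.year").mean("time")

# Common ~1.9° latitude x 2.5° longitude target grid (illustrative construction)
target = xr.Dataset({
    "lat": (["lat"], np.arange(-88.6, 90.0, 1.9)),
    "lon": (["lon"], np.arange(0.0, 360.0, 2.5)),
})

# Bilinear interpolation onto the shared grid
regridder = xe.Regridder(ds, target, method="bilinear")
t2m_common = regridder(t2m_annual)
```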
2.2 Atmospheric Reanalysis
We primarily use ERA5 reanalysis to evaluate how the ANN would classify maps of T2M from observations after training the network on only the climate model large ensembles. ERA5 is the fifth generation of atmospheric reanalysis from the European Center for Medium-Range Weather Forecasts (ECMWF) and provides hourly output on a 31 km horizontal grid with 137 vertical levels (up to 0.01 hPa; Hersbach et al., 2020). ERA5 is based on ECMWF's Integrated Forecast System Cycle 41r2 and uses four-dimensional variational analysis (4D-Var) as a data assimilation scheme. Output from ERA5 is available from 1979 to near real-time and is constrained by numerous satellite and in situ observations, such as from meteorological stations, ships, buoys, radiosonde profiles, and aircraft. To further extend the available observations back in time, we use the preliminary ERA5 back extension (BE), which is described in Bell et al. (2021).
In addition to being used as one of the primary data sets for monitoring Earth's global mean surface temperature (Dunn et al., 2021), ERA5 has been widely adopted for studies on Arctic climate change and variability (e.g., Cai et al., 2021; Davy & Outten, 2020; Nygård et al., 2021; R. Zhang, Wang, Fu, et al., 2021). Detailed assessments of ERA5's representation of Arctic surface temperature can be found in Graham, Hudson, and Maturilli (2019), Wang et al. (2019), and Yu et al. (2021), but in general, ERA5 suffers from a small warm bias over sea ice when compared to buoy observations and other in situ measurements. This bias may result from underestimating surface inversions and the simulation of turbulent and radiative heat flux exchanges, especially during the boreal winter (Graham, Cohen, et al., 2019).
Although ERA5 is a modeled product, its mean long-term trends and interannual variability of T2M compare well with other station-based data sets in the Arctic (Figure S1 in Supporting Information S1). However, in the Supporting Information, we also evaluate the ANN results using a separate observational data set from the National Oceanic and Atmospheric Administration/Cooperative Institute for Research in Environmental Sciences/Department of Energy Twentieth Century Reanalysis (20CR) version 3 (20CRv3; Slivinski et al., 2021, 2019). The difference between the annual mean Arctic T2M for ERA5-BE and 20CRv3 is within 1°C in most years, and both data sets fall in the warmer envelope of the range in MMLEA mean climate states (Figure S4 in Supporting Information S1). However, there are notable regional differences in T2M across the Arctic between ERA5-BE and 20CRv3 (Figures S2–S3 in Supporting Information S1), especially in the vicinity of sea ice and Greenland. The implications of these differences for the ANN output will be further discussed in Sections 5–7. Overall, we focus on these atmospheric reanalysis products as they provide both temporally and spatially complete gridded data (i.e., no missing data) during our period of interest.
For comparison with the climate model results, we first bilinearly interpolate all reanalysis data onto the slightly coarser 1.9° latitude by 2.5° longitude grid. We then calculate annual mean maps of T2M from monthly output over the period of 1950–2019. A summary of the reanalysis data can be found in Table S2 in Supporting Information S1.
3 Methods
3.1 Artificial Neural Network Architecture
In this work, we are interested in whether an ANN can correctly identify which climate model simulated an input map of Arctic T2M. As previously discussed, ANNs are useful in the geosciences for approximating nonlinear relationships in data-intensive problems (Boukabara et al., 2021; Irrgang et al., 2021). In climate science, this type of data problem often involves maps of climate variables that are available from large data sets, such as satellite data, gridded observational products, or climate models. If provided with enough training data, and without overfitting, the ANN can then make correct predictions on data it has not seen before. Pairing ANNs with explainability methods can also provide insight into their predictions and help assess the trustworthiness of the ANN through scientific intuition for the specific application. An introduction to ANNs and other deep learning methods can be found in Goodfellow et al. (2016), Lecun et al. (2015), and Neapolitan and Jiang (2018).

Figure 1. Schematic of the artificial neural network (ANN) used in this study for classifying which climate model large ensemble (output layer) produced a single map of Arctic near-surface temperature averaged over a given year (input layer). The ANN consists of two hidden layers that both contain 10 hidden nodes. The output layer includes a softmax activation function.
In addition to the method of processing data using absolute T2M, we also conduct a set of experiments by first removing the mean temperature of each Arctic map before standardizing and allowing the training process to begin. In this set of results, the ANN cannot simply rely on differences in the overall mean state of each climate model large ensemble for making correct predictions. This method has also been successfully utilized in other previous studies for using ANNs to reveal regional indicator patterns of climate change (Barnes et al., 2020; Labe & Barnes, 2021).
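As a concrete illustration of this alternative preprocessing, the short sketch below removes the Arctic-wide mean of each input map before any standardization. The array shapes, the function name, and the use of cosine-latitude weighting for the map mean are our assumptions.

```python
import numpy as np

def remove_map_mean(t2m, lat):
    """t2m: (samples, nlat, nlon) annual mean Arctic maps; lat: (nlat,) degrees."""
    w = np.cos(np.deg2rad(lat))[None, :, None]                    # latitude weights
    map_mean = (t2m * w).sum(axis=(1, 2)) / (w.sum() * t2m.shape[2])
    return t2m - map_mean[:, None, None]                          # per-map anomalies
```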
As will be discussed later, this overall classification problem is simple for the ANN to learn (100% accuracy), and small changes to the proportions of splitting ensemble members do not affect our results. For training, we use a stochastic gradient descent optimizer (Ruder, 2016) with Nesterov momentum turned on (momentum = 0.9; Nesterov, 1983), a learning rate of 0.001, and a batch size of 32, and we apply early stopping to set the number of epochs. Early stopping is a technique to help prevent overfitting. Here, the ANN is finished training if the validation loss does not decrease for five epochs in a row. The ANN is then restored to the iteration with the best model weights, which is generally reached in fewer than 200 epochs for our application. In addition to early stopping, we also apply ridge regularization (L2; Friedman, 2012) to the first hidden layer in order to reduce overfitting. By limiting the sensitivity of the ANN to outlier weights, L2 helps to reduce the spatial autocorrelation that may exist in fields of climate data, such as T2M, and it is associated with smoother fields for interpreting our explainability maps. Our L2 is set to 0.1, although we explore the results of testing observational data using different ridge parameters in Figures S14–S15 in Supporting Information S1.
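For readers who wish to reproduce this setup, a minimal sketch of the network and training configuration is given below using tf.keras (the study used TensorFlow 1.15; see the Open Research section). The layer sizes, softmax output, optimizer settings, L2 penalty, and early-stopping rule follow the text; the ReLU activations and all variable names are our assumptions.

```python
from tensorflow import keras

N_INPUT = 2016   # flattened Arctic T2M map (grid points per input)
N_CLASSES = 7    # one class per climate model large ensemble

model = keras.Sequential([
    keras.layers.Dense(10, activation="relu", input_shape=(N_INPUT,),
                       kernel_regularizer=keras.regularizers.l2(0.1)),
    keras.layers.Dense(10, activation="relu"),
    keras.layers.Dense(N_CLASSES, activation="softmax"),
])

model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=0.001,
                                   momentum=0.9, nesterov=True),
    loss="categorical_crossentropy",
    metrics=["categorical_accuracy"],
)

# Stop if the validation loss does not decrease for five epochs in a row,
# then restore the iteration with the best model weights
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

# x_train/y_train and x_val/y_val (hypothetical names) hold standardized
# T2M maps and one-hot GCM labels, split by ensemble member:
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           batch_size=32, epochs=500, callbacks=[early_stop])
```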
3.2 Layer-Wise Relevance Propagation
To evaluate how the ANN is classifying each temperature map with the correct GCM, we use a method of explainable machine learning called layer-wise relevance propagation (LRP; Bach et al., 2015; Montavon et al., 2017, 2018). First introduced by Toms et al. (2020) for applications in the geosciences, LRP has now been used in a wide range of studies across atmospheric and climate sciences for attempting to understand the decision-making process of neural networks (e.g., Gordon et al., 2021; Hilburn et al., 2020; Mayer & Barnes, 2021; Retsch et al., 2022; Sonnewald & Lguensat, 2021). Importantly for its use in this work, LRP has also been shown to be an effective technique for extracting regional patterns of forced climate change that are collectively found between climate models and observations (e.g., Barnes et al., 2020; Labe & Barnes, 2021; Madakumbura et al., 2021; Rader et al., 2022). Despite a growing number of other machine learning explainability methods (e.g., Hedström et al., 2022), we find that LRP is well suited for the complexity of our simple neural network problem and geospatial input data.
LRP is a form of post hoc feature attribution, where its output describes the contribution of each input pixel to the overall prediction of the neural network. In other words, LRP returns a heatmap, with the same dimensions as the input, that describes the relevance (unitless) of each input feature. Specifically, in this study, LRP returns a vectorized heatmap of the relevance value at every latitude and longitude grid point across the Arctic (2016 units per map) for inputs of T2M. Thus, we can make individual composites of LRP heatmaps for every classification output in order to learn the patterns the ANN used to recognize each GCM.
Although overviews of LRP are described in numerous other studies (e.g., Montavon et al., 2019; Toms et al., 2020), we also summarize its implementation here to help improve clarity. After an ANN has been trained, the weights and biases are frozen, and a single input is passed through the network in forward mode to make a prediction. Next, prior to the softmax activation function, the winning output node (i.e., highest likelihood class) is backpropagated through the ANN using a set of decomposition rules. After propagating backward through the ANN to the input layer, we can then obtain relevance values for each input pixel. This entire process is repeated for each prediction, and therefore, we have a relevance heatmap for every annual mean temperature input.
We use the LRPz method, but there are several other forms of LRP following different backpropagation rules (Bach et al., 2015; Samek et al., 2019) and available using the iNNvestigate package (Alber et al., 2019). In a recent comparison of LRP methods for geoscience applications, Mamalakis et al. (2021) demonstrated that LRPz performed well compared to the ground truth using a benchmark data set with similar characteristics to our climate model large ensemble data. We also compare our results using LRPz with two other explainability methods, LRPϵ (Bach et al., 2015) and Integrated Gradients (Sundararajan et al., 2017), and find similar relevance spatial patterns (Figure S6 in Supporting Information S1).
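For reference, a sketch of how the LRPz analyzer can be constructed with iNNvestigate v1.x (the package cited above) is shown below. Stripping the softmax layer before analysis follows the procedure described in the previous paragraph; the input variable name is hypothetical.

```python
import innvestigate
import innvestigate.utils

# Backpropagate relevance from the pre-softmax winning output node
model_wo_sm = innvestigate.utils.model_wo_softmax(model)   # trained Keras model
analyzer = innvestigate.create_analyzer("lrp.z", model_wo_sm)

# x_inputs: standardized T2M maps, shape (samples, 2016); the returned
# relevance heatmaps have the same dimensions as the inputs
relevance = analyzer.analyze(x_inputs)
```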
Finally, while explainability techniques like LRP are useful for assessing whether a neural network is making predictions based on coherent and physically based processes, we note that their output is still subject to user interpretation. The LRP patterns here can only be used to identify the local temperature patterns unique to each GCM that are important for the ANN's decision-making process. However, we cannot directly assess how the ANN may be (non)linearly leveraging and weighting combinations of these regional temperature patterns together.
To improve the visual clarity of our LRP output, we normalize each heatmap sample to have a maximum of one and then scale each figure composite by its maximum relevance. We elected to concentrate on positive relevance output for this analysis, which highlights areas that contribute positively to the final ANN classification. This also helps to simplify the interpretation of the explainability results for each of the climate model large ensembles considered here. In summary, locations of higher relevance indicate regions of temperature that are more important for the ANN to make its GCM classification.
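A minimal sketch of this visualization scaling, under the assumption that it is applied sample by sample with simple NumPy operations, is given below.

```python
import numpy as np

def normalize_relevance(rel):
    """rel: (samples, nlat, nlon) LRP heatmaps for one GCM class."""
    rel = np.where(rel > 0.0, rel, 0.0)              # keep positive relevance only
    peak = rel.max(axis=(1, 2), keepdims=True)       # per-sample maximum
    return rel / np.where(peak == 0.0, 1.0, peak)    # scale to a maximum of one
```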
4 Classifying Climate Model Large Ensembles
To begin exploring the differences between each GCM in the MMLEA, we first analyze their raw composites of annual mean T2M over the historical period in Figure S5 in Supporting Information S1. Unsurprisingly, all of the GCMs capture a similar spatial pattern of temperatures between sea-ice covered regions, open water in the North Atlantic and North Pacific, the Greenland Ice Sheet, and across other land areas. However, there are some notable differences in the mean T2M, especially for CSIRO-MK3.6, which is at least 3°C colder across most of the Arctic Ocean (Figure S5c in Supporting Information S1). This is likely associated with an unrealistic sea-ice mean state (i.e., higher sea-ice concentration) and a slower rate of sea-ice decline over the last one to two decades (Topál et al., 2020; Uotila et al., 2013). It could also be due to biases in albedo, cloud processes, and other atmospheric dynamics, as decomposed for CESM1 by Park et al. (2014). Figure 2 shows that all of the GCMs capture higher interannual variability of T2M across the marginal ice zone in the North Atlantic, such as in the Barents Sea region, in agreement with ERA5-BE observations (Figure 2a). However, there is greater variability along and north of Siberia for CanESM2 (Figure 2b) and GFDL-CM3 (Figure 2f), which is likely again in response to differences in sea-ice variability. In summary, despite some differences in average T2M and spatial patterns of variability, all of the GCMs capture the general annual mean climatological characteristics of the Arctic.

Figure 2. (a) Standard deviation of annual mean T2M (contour interval of 0.1°C) for ERA5-BE calculated over the 1950–2019 period. (b–h) Standard deviation of annual mean T2M for the mean of the ensemble members calculated over the 1950–2019 period for CanESM2, MPI, CSIRO-MK3.6, EC-EARTH, GFDL-CM3, GFDL-ESM2M, and LENS, respectively.
We now turn to our ANN to see if it can correctly identify which GCM simulated every input map of annual mean T2M from 1950 to 2019. Recall that we train our ANN on 12 ensemble members from each GCM and then test the skill of the ANN using three ensemble members. The ANN is quickly able to learn how to identify each T2M map with the correct GCM and achieves a categorical accuracy of 100% on testing data. We hypothesize that this perfect accuracy reflects the relative ease of the task, since the only differences between the training, testing, and validation data are due to the selected ensemble members. Thus, the systematic differences among the GCMs may be larger and spatially more persistent in both training and testing data than the differences due to internal variability alone (i.e., only considering the differences between ensemble members for each GCM class). To further elucidate this point, Figures 3h–3n show the T2M differences for each GCM relative to the overall multi-model ensemble mean. This more clearly reveals the colder mean state in CSIRO-MK3.6 (Figure 3j), along with other regional differences among the other GCMs, especially across the North Atlantic, Greenland, and Canadian Arctic Archipelago.

Figure 3. (a–g) Composite heatmap of layer-wise relevance propagation (LRP) for correct testing data predictions averaged over the 1950–2019 period for CanESM2, MPI, CSIRO-MK3.6, EC-EARTH, GFDL-CM3, GFDL-ESM2M, and LENS, respectively. (h) Composite of differences in T2M for CanESM2 minus the multi-model ensemble mean averaged over 1950–2019. (i–n) As in (h), but for MPI, CSIRO-MK3.6, EC-EARTH, GFDL-CM3, GFDL-ESM2M, and LENS, respectively.
We identify the regions that the ANN is leveraging to make its accurate predictions using the LRP explainability method in Figures 3a–3g. The LRP heatmaps are composited separately for each GCM class across all testing ensemble members and years (1950–2019). Comparing the areas of higher relevance (i.e., locations that are more important for the ANN to make a prediction) in Figures 3a–3g with the differences in T2M for each GCM minus the multi-model ensemble mean (Figures 3h–3n) reveals clear similarities in the spatial patterns. This suggests that the ANN is learning characteristics of each GCM to make its classification. Importantly though, the relevance patterns indicate that the ANN is not simply using the entire map of T2M differences relative to the multi-model mean. For example, GFDL-CM3 is several degrees warmer than the multi-model ensemble mean in the Barents Sea region (Figure 3l). Yet, the LRP composite in Figure 3e suggests instead that T2M patterns in Alaska and the North Pacific are more relevant for the ANN to make a final prediction. In contrast, sometimes it is the case that the larger T2M differences correspond to areas of higher relevance, such as for LENS when comparing Figure 3g with the colder anomalies in Figure 3n over the Canadian Arctic Archipelago.
Overall, we interpret that the locations of higher relevance show that the ANN is spatially leveraging patterns of T2M that result in a unique set of characteristics or differences between each GCM class. To check that our interpretations of the LRP results are not sensitive to the choice of backpropagation rule, we compare relevance composites using the epsilon-rule (LRPϵ) and Integrated Gradients method in Figure S6 in Supporting Information S1. The relevance composites are nearly indistinguishable across the three explainability methods for all GCM classes. Given that the ANN is learning distinctive patterns of T2M to characterize each respective GCM, we now turn to observations to consider classifying each year with a GCM as a method of climate model evaluation.
5 Evaluating Observations With Climate Model Large Ensembles
We first calculate the mean T2M bias for each GCM relative to ERA5-BE (Figure S7 in Supporting Information S1). All of the GCMs reveal a cold bias over the sea-ice covered portions of the Arctic Ocean, which has been a persistent issue for several generations of fully coupled climate models (Chapman & Walsh, 2007; Davy & Outten, 2020). There are also other regional differences in T2M biases between GCMs, especially over Greenland and the Canadian Arctic Archipelago. To test the ANN on inputs from observations (Figure S8a in Supporting Information S1), we first rescale each map by subtracting the training mean (Figures S8b and S8e in Supporting Information S1) and dividing by the training standard deviation. In other words, the data are processed in the same manner as the climate model large ensembles (Section 3.1). Figure S8c in Supporting Information S1 shows the difference of ERA5-BE minus the training mean, which again shows the Arctic Ocean cold bias in the climate model data. Although there are some small differences in magnitude, especially over Greenland, we find similar results for rescaling observations using 20CRv3 (Figures S8d–S8f in Supporting Information S1). Composites of the rescaled T2M observations over three time periods display a persistent spatial pattern of T2M anomalies, apart from the long-term background warming associated with Arctic amplification (Figure S9 in Supporting Information S1).
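The key detail in this rescaling is that the observations are standardized with the training statistics of the climate model data, not with statistics derived from the observations themselves. A minimal sketch, with assumed array shapes and names, is shown below.

```python
import numpy as np

def rescale_obs(t2m_obs, train_mean, train_std):
    """t2m_obs: (years, nlat, nlon); train_mean/train_std: (nlat, nlon)
    grid-point statistics computed from the GCM training samples."""
    return (t2m_obs - train_mean) / train_std
```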
Finally, after rescaling the observational maps of annual mean T2M, we input them into the ANN to see which GCM is classified from 1950 to 2019. As discussed in Section 3.1, the ANN outputs the confidence (or likelihood) of a single T2M map belonging to each of the GCM classes (Figure 4a). After applying the softmax operator, we sort these confidence values and display the resulting rankings in Figure 5 separately for every map year. Accordingly, the class with the highest confidence value is the GCM ultimately selected for each year and hence given a rank of "1". If the confidence value is below that of random chance (1/7), the GCM is given a ranking of "7". For ERA5-BE, we find that GFDL-CM3, EC-EARTH, and MPI are most frequently classified with the highest confidence in a single year. Interestingly, we also see a temporal evolution of these three models, with EC-EARTH more frequently classified in earlier years prior to 1979, MPI generally classified between 1979 and 2012, and GFDL-CM3 selected in the last few years. We hypothesize that this temporal evolution may be related to the long-term warming of the Arctic, which closely mirrors the Arctic mean T2M in Figure S1 in Supporting Information S1. GFDL-CM3 also exhibits the largest recent warming trends in the Arctic (not shown).

Figure 4. (a) Confidence values (after a softmax operator) from a single seed artificial neural network (ANN) for each global climate model (GCM) class after inputting an annual mean map of T2M from ERA5-BE over the period of 1950–2019. The line color and marker shading are darker for the GCM class with the highest confidence in each year. (b) Frequency of MPI (dark green line) and GFDL-CM3 (pink dashed line) classes receiving the highest confidence prediction output for each annual mean T2M map from ERA5-BE. The frequency is derived by training 100 ANNs with different combinations of training, testing, and validation data and random initialization seeds. (c–d) As in (a–b), but after removing (RM) the annual mean of each T2M map from every grid point before inputting the observations into the ANN.

Figure 5. Ranking the order of the artificial neural network (ANN) confidence values (after a softmax operator) for each global climate model (GCM) class after inputting an annual mean map of T2M from ERA5-BE over the period of 1950–2019. A value of 1 indicates that the GCM received the highest confidence (i.e., winning predicted category) for each yearly T2M map. If the confidence value of the ANN output is lower than random chance (≈1/7), the ranking is then set to 7.
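A sketch of the ranking rule used in Figure 5, assuming the softmax confidences are stored as a simple array, might look like the following; the function and variable names are illustrative.

```python
import numpy as np

def rank_gcms(confidence, n_classes=7):
    """confidence: (years, n_classes) softmax output for observational maps."""
    order = np.argsort(-confidence, axis=1)            # classes by descending confidence
    ranks = np.empty_like(order)
    years = np.arange(confidence.shape[0])[:, None]
    ranks[years, order] = np.arange(1, n_classes + 1)  # 1 = winning class
    ranks[confidence < 1.0 / n_classes] = n_classes    # below random chance -> 7
    return ranks
```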
To test the robustness of these results, we train 100 separate ANNs using unique random initialization seeds and different combinations of training, testing, and validation data (ensemble members). After training each of these 100 ANNs, we then input the same T2M maps from ERA5-BE and show the frequency of classifying MPI and GFDL-CM3 in Figure 4. Similar to the single seed ANN predictions in Figure 4a, MPI is frequently predicted for the observational maps across the distribution of the 100 ANNs (Figure 4b). However, there are also small differences in the observational predictions, which suggest that there is some uncertainty due to the choice of training ensemble members and ANN initialization states.
As briefly mentioned in Section 3.1, to assess whether the network is simply using a smaller mean state bias in a GCM when making predictions for observations, we train a new ANN experiment after first processing the climate model large ensembles to remove the annual mean T2M of the entire Arctic map from every grid point and for every year. In this case, by design, the ANN needs to learn regional patterns in order to make its classification. Here, the ANN once again quickly learns unique spatial characteristics of each GCM and achieves perfect accuracy on the testing data. We similarly evaluate the ERA5-BE maps by removing the annual mean T2M from each grid point and year (Figure S10 in Supporting Information S1). After sorting the confidence values of this new ANN (Figure 4c), we rank the GCMs in Figure S11 in Supporting Information S1 for every year of observations. In this case, we find that MPI receives the highest confidence in nearly every year. Again, testing the sensitivity of the observational predictions of this new ANN to the ensemble members selected for training, we compute 100 ANNs and show the frequency of MPI and GFDL-CM3 receiving the highest confidence in Figure 4d. Notably, processing the data with the annual map mean first removed results in MPI being labeled much more frequently than with the methodology used in Figure 4b. This suggests that the ANN is instead leveraging regional temperature patterns to more consistently make observational predictions of MPI.
Naturally, a next question is how closely the ANN results compare with traditional relative error metrics for comparing climate models and observations. As a baseline comparison, we calculate the pattern correlation and RMSE between ERA5-BE and each GCM in Figure 6. The correlations and RMSEs are first computed between the observations and each ensemble member and then averaged together to get an ensemble mean. Most GCMs achieve a high pattern correlation (>0.9), which is unsurprising given the results in Figure S5 in Supporting Information S1. The lowest pattern correlation (and highest RMSE) is found for CSIRO-MK3.6, which is related to its cold bias and extensive sea-ice mean state. Turning to RMSE, we find that MPI has the lowest error in most years of ERA5-BE. Notably, this is largely consistent with the ANN results in Figure 5. Finally, Figure S12 in Supporting Information S1 shows temporal correlations calculated at each grid point between ERA5-BE and the GCMs. Using this metric, GFDL-CM3 has the highest correlation over the Arctic Ocean, but most of the other GCMs show a similar spatial pattern (>0.5) owing to the long-term warming trend.

Figure 6. (a) Pattern correlation coefficient of T2M computed for each year between ERA5-BE and the climate model large ensembles from 1950 to 2019. Correlations (area weighted) are first calculated for each ensemble member separately and then averaged across ensemble members. (b) Root-mean-square error (RMSE) of T2M for each year between ERA5-BE and the climate model large ensembles from 1950 to 2019. RMSEs (area weighted) are first calculated for each ensemble member separately and then averaged across ensemble members.
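For clarity, the sketch below shows one way to compute the area-weighted pattern correlation and RMSE between a single observational map and a single ensemble member for a given year; the cosine-latitude weighting and the function name are our assumptions.

```python
import numpy as np

def weighted_pattern_stats(obs, model, lat):
    """obs, model: (nlat, nlon) T2M maps for one year; lat: (nlat,) degrees."""
    w = np.cos(np.deg2rad(lat))[:, None] * np.ones_like(obs)
    w = w / w.sum()                                    # normalized area weights
    mo = (w * obs).sum()                               # weighted means
    mm = (w * model).sum()
    cov = (w * (obs - mo) * (model - mm)).sum()
    r = cov / np.sqrt((w * (obs - mo) ** 2).sum() *
                      (w * (model - mm) ** 2).sum())
    rmse = np.sqrt((w * (obs - model) ** 2).sum())
    return r, rmse
```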
We consider observations from 20CRv3 to assess how sensitive the GCM prediction results are to the choice of observational data set. Following the same steps, Figure S13 in Supporting Information S1 shows the sorted ANN predictions of 20CRv3 maps according to increasing confidence values for each GCM class. MPI is frequently classified for each year of 20CRv3. However, in this exercise, we do not find any years with confidence above random chance for EC-EARTH. Although this differs from the results of ERA5-BE in Figure 5, this is not overly surprising given the mean state differences between the two observational data sets found in Figures S2–S4 in Supporting Information S1.
Taken together, it is evident that the spatial patterns of T2M are important for the ANN's prediction. This could be related to our choice of L2 regularization, since a larger L2 can effectively reduce spatial variability and irregularities in the input data. We test the effect of different L2 parameters on the observational predictions for ERA5-BE in Figure S14 in Supporting Information S1. Here, we find that a larger L2 does in fact result in different GCM labels for the T2M maps, which could result from smoothing out the regional patterns that were originally important for the ANN with our L2 choice of 0.1. Interestingly, we find that repeating this L2 parameter exercise for ANNs with the annual map mean first removed results in more consistent predictions for observations (Figure S15 in Supporting Information S1). In summary, these findings further illustrate that the ANN is learning both information about the mean climate state and regional patterns that are associated with an individual GCM and observations. Moreover, the ANN is particularly sensitive to regional differences in T2M when classifying observations with a GCM.
6 Identifying Regional Climate Patterns
So far, we have shown that an ANN can detect differences in regional T2M patterns that are unique to a particular GCM. We have also shown that observations can be evaluated in the ANN for identifying a GCM with each year in the historical record. This tends to result in observational predictions that are still fairly consistent with traditional climate model evaluation metrics like RMSE.
Now we can leverage our LRP explainability method by applying it to the observational data in order to more clearly see where the ANN is looking to make its predictions. Figures 7a–7g show the LRP results composited separately for each GCM that is ultimately classified from 1950 to 2019 (i.e., rankings of 1 in Figure 5). We compare these relevance heatmaps to T2M composites of the rescaled ERA5-BE input data in Figures 7h–7n. Although at first glance the patterns of the rescaled T2M composites look fairly similar, it is clear that the ANN is using small regional differences to make its classification, as reflected by the relevance patterns in Figures 7a–7g. For example, one year of observations is classified as LENS, which is likely due to the large cold anomaly over the Canadian Arctic Archipelago (Figure 7n) that is similarly reflected as an area of higher relevance in Figure 7g.

Figure 7. (a–g) Composites of layer-wise relevance propagation (LRP) heatmaps for each global climate model (GCM) classification after inputting annual mean maps of T2M from ERA5-BE into the artificial neural network (ANN). Higher values indicate greater relevance for the ANN's prediction. (h–n) Composites of T2M from ERA5-BE that are first scaled by the training data mean and training data standard deviation. Maps are then composited according to each predicted GCM class for every year. Maps that are gray indicate that the GCM was never classified, and the number in the upper left-hand corner indicates the number of times the GCM was classified from 1950 to 2019.
Due to some differences in the GCM predictions for 20CRv3 compared to ERA5-BE (Section 5), we show the LRP results for 20CRv3 testing predictions in Figure S16 in Supporting Information S1. For the composites of 20CRv3, we see higher relevance predominantly over Greenland. Uncertainties in T2M are particularly large over Greenland for many reanalysis and other gridded observational data sets (e.g., Delhasse et al., 2020; Jack et al., 2017; W. Zhang, Wang, Smeets, et al., 2021), and this is found to be true for both ERA5-BE and 20CRv3 (Figure S2 in Supporting Information S1). Therefore, this may help to explain the differences found for the observational predictions in Section 5.
We can also explore the LRP results of the ANN experiment using T2M data with the annual mean of the Arctic first removed before training and testing. These relevance composites are shown in Figure S17 in Supporting Information S1. While MPI is selected for most observational years by this ANN (Figure S17b in Supporting Information S1), we can still see spatial differences in the relevance regions compared with GFDL-CM3 (Figure S17e in Supporting Information S1) particularly over northwestern Canada and eastern Siberia.
Finally, returning to the LRP results of the climate model large ensembles, Figure 8 shows the relevance heatmaps of each GCM from the ANN trained on data with the annual mean of the map first removed (RM). Comparing Figure 8 with the original LRP results in Figures 3a–3g shows that the ANN is still using many of the same regional T2M signals, such as the cold anomaly signatures over the Barents Sea in CSIRO-MK3.6 and near Iceland in GFDL-ESM2M. However, there are also some differences in the higher relevance areas, like those found in the heatmap composites for CanESM2 over Siberia and the North Atlantic. These results support our interpretation that the ANN is making predictions by weighting regional patterns of T2M that are unique to each GCM for comparing with observational data.

Figure 8. (a) Composite heatmap of layer-wise relevance propagation (LRP) averaged over 1950–2019 for correct testing data predictions after removing the annual mean of each CanESM2 map before it is fed into the artificial neural network (ANN). (b–g) As in (a), but for MPI, CSIRO-MK3.6, EC-EARTH, GFDL-CM3, GFDL-ESM2M, and LENS, respectively.
7 Discussion and Conclusions
There are many existing methods for ranking the skill of climate models against observations (Eyring et al., 2019; Gleckler et al., 2008). This exercise is particularly important for climate-sensitive regions, such as the Arctic, which have large spreads and uncertainties in future projections and where guidance for weighting climate model projections is not always straightforward (Knutti et al., 2017). Some of the advantages of exploring deep learning methods for comparing climate models with observations include their ability to leverage spatial patterns and relationships and to approximate nonlinearities. We attempt to evaluate climate model large ensembles and observational data sets in the Arctic using a simple ANN classification framework. That is, we trained ANNs on maps of near-surface temperature (T2M) from the MMLEA and then used the neural network for predicting data from atmospheric reanalysis to see which climate model is classified for each year from 1950 to 2019. To understand the ANN's prediction, we leveraged an explainability method called layer-wise relevance propagation, which revealed that the ANN is using regional temperature patterns, rather than only mean state biases, in order to make each climate model selection.
Although it is quite simple for the ANN to correctly learn which climate model simulated a map of T2M, it is more challenging to interpret the ANN's utility on observations. Here, MPI is most frequently classified by the ANNs for the T2M maps taken from observations, which is likely a result of its mean climate state and patterns of spatial variability that compare closely with ERA5 over both land and ocean areas in the Arctic. Interestingly, we find that this climate model classification for each year of observations produces results rather similar to traditional evaluation metrics, such as comparing with climate models that receive lower root-mean-square errors. One advantage of our approach is that the ANN can also learn regional relationships across spatial patterns, rather than only computing point-by-point relative error statistics. Further, the relevance maps can be used as tools for highlighting regional pattern fingerprints unique to individual climate models. This is especially true for areas around large temperature gradients. For example, the explainability maps reveal that differences in T2M near Greenland and the marginal ice zone of the North Atlantic are often important for the ANN to correctly identify many of the climate model large ensembles. This is consistent with recent analysis of CMIP6 models (e.g., Cai et al., 2021), which notes that climate model differences may be due to their simulation of Atlantic poleward heat transport.
In future work, it may be interesting to use convolutional neural networks for comparing spatial differences in different climate variables or to try the classification architecture on GCMs prescribed with different future emission scenarios, but that is beyond the scope of this preliminary work. Importantly, we note that the output of this approach depends on the selection of preprocessing steps, but these choices can be aligned with the overall scientific question one is interested in addressing. For instance, preliminary work has shown some (albeit lower) skill in classifying maps of temperature anomalies that are calculated with respect to a common baseline or by using data with the ensemble mean first removed. Despite these limitations, this study demonstrates that ANNs have the ability to extract regional patterns that are consistent between climate models and observations, but the overall practicality of translating this approach to existing climate evaluation toolboxes should be further investigated.
Acknowledgments
The authors thank two anonymous reviewers and the editor for their constructive comments and suggestions, which helped us to improve this manuscript. This study was supported by NOAA MAPP grant NA19OAR4310289 and by the Regional and Global Model Analysis program area of the U.S. Department of Energy's (DOE) Office of Biological and Environmental Research (BER) as part of the Program for Climate Model Diagnosis and Intercomparison project. The authors would like to acknowledge the US CLIVAR Working Group on Large Ensembles and high-performance computing support from NCAR's Computational and Information Systems Laboratory's (CISL) Cheyenne (https://doi.org/10.5065/D6RX99HX) for the development of the MMLEA (https://www.cesm.ucar.edu/projects/community-projects/MMLEA/). Lastly, the authors would like to acknowledge support for the Twentieth Century Reanalysis Project version 3 (20CRv3) data set provided by the U.S. DOE Office of Science BER, by the NOAA Climate Program Office, and by the NOAA Physical Sciences Laboratory.
Conflict of Interest
The authors declare no conflicts of interest relevant to this study.
Open Research
Data Availability Statement
Climate model large ensemble data used in this study are freely available from the NCAR Climate Data Gateway (https://www.earthsystemgrid.org/dataset/ucar.cgd.ccsm4.CLIVAR_LE.html), which is supported by the U.S. National Science Foundation (NSF). Atmospheric reanalysis data are openly available for ERA5 (https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-single-levels-monthly-means?tab=overview) and the preliminary version of the ERA5 BE (https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-single-levels-monthly-means-preliminary-back-extension?tab=overview), which are both supported by the Copernicus Climate Change Service (C3S; Thépaut et al., 2018) Climate Data Store (CDS). Twentieth Century Reanalysis Project version 3 (20CRv3) data are provided by the NOAA/OAR/ESRL PSL, Boulder, Colorado, USA (https://psl.noaa.gov/data/gridded/data.20thC_ReanV3.html). References for the data sets are available in Tables S1–S2 in Supporting Information S1.
Preprocessing steps were completed using NCO v4.9.3 (Zender, 2008), CDO v1.9.8 (Schulzweida, 2019), and NCL v6.2.2 (NCAR, 2019). Computer code for the ANN architecture, figures, and other exploratory data analysis is available at https://zenodo.org/record/6564106. Python v3.7.6 (Van Rossum & Drake, 2009) packages used for this analysis include Numpy v1.19 (Harris et al., 2020), SciPy v1.4.1 (Virtanen et al., 2020), and Scikit-learn v0.24.2 (Pedregosa et al., 2011). Additional open source software used for development of the ANN and LRP heatmaps include TensorFlow v1.15.0 (Abadi et al., 2016) and iNNvestigate v1.0.8 (Alber et al., 2019). Matplotlib v3.2.2 (Hunter, 2007) was used for plotting figures, and colormaps were provided by cmocean v2.0 (Thyng et al., 2016), Palettable's cubehelix v3.3.0 (Green, 2011), and Scientific v7.0.0 (Crameri, 2018; Crameri et al., 2020).