The Signal-to-Noise Paradox for Interannual Surface Atmospheric Temperature Predictions
Abstract
The “signal-to-noise paradox” implies that climate models are better at predicting observations than at predicting themselves. Here, it is shown that this apparent paradox is expected when the relative level of predicted signal is weaker in models than in observations. In the presence of model error, the paradox only occurs in the range of small model signal-to-noise ratio, and it occurs for even smaller model signal-to-noise ratio as the model error increases. This paradox is always a signature of prediction unreliability. Applying this concept to noninitialized simulations of Surface Atmospheric Temperature (SAT) from the CMIP5 database, under the assumption that prediction skill is associated with persistence, shows that global-mean SAT is marginally less persistent in models than in observations. At the local scale, however, the analysis suggests that ∼70% of the globe exhibits the signal-to-noise paradox for interannual SAT forecasts and that the paradox occurs especially over the oceans.
Key Points
- Signal-to-noise paradox is a signature of an underconfident, unreliable prediction system
- CMIP5 models are unable to reproduce observed persistence of surface atmospheric temperature
- In CMIP5 models 70% of the globe exhibits the signal-to-noise paradox in SAT
1 Introduction
There is an increasing demand for accurate and reliable climate predictions on time scales from seasons to decades. This has led to the development of operational climate prediction systems with multiple models (Meehl et al., 2015), which allow for skillful seasonal predictions of hydrology (Svensson et al., 2015), energy supply (Clark et al., 2017), transport system disruption (Palin et al., 2016), and hurricane activity (Smith et al., 2010).
However, along with these promising results came a somewhat unexpected property: climate prediction systems can be more accurate at predicting the real climate than at predicting themselves (Kumar et al., 2014). Scaife et al. (2014) found that seasonal North Atlantic Oscillation predictions are prone to this apparent paradox. Eade et al. (2014) also described the paradox for interannual predictions of Surface Atmospheric Temperature (SAT), mean sea level pressure, and precipitation in decadal predictions from DePreSys (the Decadal Prediction System of the UK Met Office; Smith et al., 2007). Similarly, Dunstone et al. (2016) verified its existence in interannual predictions of the North Atlantic Oscillation index. This behavior has been interpreted as a higher level of unpredictable components being present in the model than in the observations (Siegert et al., 2016). Using an independent operational prediction system (PROCAST), Sévellec and Drijfhout (2018) observed the same property in interannual predictions of global-mean SAT. This discrepancy between model and real-world prediction capability has been named the signal-to-noise paradox (see the review by Scaife & Smith, 2018, for further details). To explain the paradox, a range of hypotheses has been put forward, such as the nonstationarity and the sampling uncertainty of predictions (Weisheimer et al., 2019).
2 Idealized Statistical Model
2.1 Definition and Prediction Accuracy Metrics
This stochastic model closely follows the one suggested by Siegert et al. (2016) with a small but crucial modification. Here the difference between observations and model is explicitly incorporated into the predictable component of the model rather than implicitly incorporated into the unpredictable component (i.e., noise) of the observations. Hence, in our analysis the observations are split into two terms representing the components that are predictable and unpredictable by nature, whereas in Siegert et al. (2016) observations are split into two terms representing the components that are predictable and unpredictable by the model. This means that the Predictable Component defined by Eade et al. (2014) or Siegert et al. (2016) is actually the predicted component (i.e., what is predictable by the model), which is by construction smaller than or equal to the Predictable Component by nature. (Indeed, what is predictable by the model is smaller than or equal to what is predictable by nature.) The modification we introduce allows for a more fundamental approach, independent of model skill (i.e., independent of the model's ability to accurately predict the full predictable component).
The Coefficient of Determination measures the similarity between the pseudo-observations and the model outputs (i.e., the predicted component). Multiplied by 100, it indicates the percentage of variance of the observations explained by the prediction. A Coefficient of Determination of 1 thus indicates a perfect prediction, a value of 0 indicates no prediction skill, and a value of 0.5 indicates that 50% of the variance of the observations is represented by the prediction. The Reliability, on the other hand, measures the consistency between the prediction error and the model standard deviation (i.e., whether the unpredicted component is well captured by the ensemble spread). Our formulation of Reliability follows the definition of Ho et al. (2013) but is generalized to nonstationary statistics following Sévellec and Drijfhout (2018). A value of 1 indicates perfect consistency between the intrinsic prediction error (numerator) and the assessed prediction uncertainty (measured by the ensemble spread, denominator). Values different from 1 indicate an unreliable prediction system: a value of 2 means that the prediction uncertainty is half as large as the prediction error (an overconfident prediction system), whereas a value of 0.5 means that the prediction uncertainty is twice as large as the prediction error (an underconfident prediction system).
To test the ability of the model to predict its own simulations (i.e., perfect model approach), these diagnostics are used after replacing the pseudo-observations by a certain model realization (a single member of the ensemble). The choice of the realization does not matter since all the model ensemble members have the same statistical behavior (and statistics have converged for our choice of time iterations).
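As a minimal illustration (not the authors' code; function and variable names are ours), the following sketch computes both diagnostics in a simplified, stationary form for a synthetic ensemble:

```python
import numpy as np

rng = np.random.default_rng(0)

def coefficient_of_determination(obs, forecast):
    """R2: fraction of observed variance explained by the forecast."""
    return 1.0 - np.mean((obs - forecast) ** 2) / np.var(obs)

def reliability(obs, ensemble):
    """RMS error of the ensemble mean over RMS ensemble spread (1 = reliable)."""
    mean = ensemble.mean(axis=0)                     # ensemble mean, per time step
    rms_error = np.sqrt(np.mean((obs - mean) ** 2))
    rms_spread = np.sqrt(np.mean(ensemble.var(axis=0)))
    return rms_error / rms_spread                    # > 1 overconfident, < 1 underconfident

# Toy check with a well-calibrated 50-member ensemble around a common signal
n_steps, n_members = 100_000, 50
signal = rng.standard_normal(n_steps)
obs = signal + rng.standard_normal(n_steps)                  # signal + unit noise
ensemble = signal + rng.standard_normal((n_members, n_steps))
print(coefficient_of_determination(obs, ensemble.mean(axis=0)))  # ~0.5
print(reliability(obs, ensemble))                                # ~1
```

Replacing obs by a single ensemble member (and excluding it from the ensemble mean) reproduces the perfect model test described above.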
2.2 Results
2.2.1 Impact of SNR on Prediction Accuracy
Using this statistical model, the prediction skill is diagnosed for a variety of SNRs of both the model and the pseudo-observations (αm and αo, respectively). We first assume the absence of systematic model error between model and pseudo-observations (i.e., β=0), so that they only differ through their relative level of predictable signal. In this case, the predictable and predicted components are equal, and we recover the results of Siegert et al. (2016), which are summarized below for completeness.
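Since equations (1) and (2) are not reproduced here, the sketch below rests on one plausible reading of the setup (in particular, the normalization of the error term by √(1+β²) is our assumption): a unit-variance predictable signal shared by pseudo-observations and model, scaled by the respective SNRs, plus independent unit-variance noise:

```python
import numpy as np

rng = np.random.default_rng(1)
n_steps, n_members = 100_000, 40
alpha_o, alpha_m, beta = 1.0, 0.5, 0.0      # pseudo-obs SNR, model SNR, model error

s = rng.standard_normal(n_steps)            # predictable signal, shared by both
d = rng.standard_normal(n_steps)            # systematic model-error process
obs = alpha_o * s + rng.standard_normal(n_steps)
predicted = alpha_m * (s + beta * d) / np.hypot(1.0, beta)    # model predictable part
ensemble = predicted + rng.standard_normal((n_members, n_steps))

forecast = ensemble[1:].mean(axis=0)        # ensemble-mean forecast (members 1..39)
target = ensemble[0]                        # one member as perfect-model target
r2_obs = 1 - np.mean((obs - forecast) ** 2) / np.var(obs)
r2_self = 1 - np.mean((target - forecast) ** 2) / np.var(target)
print(r2_obs, r2_self, r2_obs - r2_self)    # paradox when the difference is > 0
```

With β=0 and αo>αm this reproduces the paradox (ΔR2>0), and the perfect-model skill converges to the αm²/(1+αm²) law quoted below.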
Hence, within a perfect model approach, the skill, R2, increases as a function of the model SNR through an αm²/(1+αm²) law (Figure S1a), implying that increasing the relative amplitude of the predictable component leads to more skillful predictions. We also find that the reliability is always 1 (i.e., the model is perfectly reliable in predicting itself; Figure S1b in the supporting information), showing that the model behavior is accurately sampled in a statistical sense.
For predicting pseudo-observations, the skill is always improved by increasing the pseudo-observation SNR, regardless of the model SNR (Figure 2a). However, the skill decreases when the model SNR increases beyond the pseudo-observation SNR, which is clearly visible when the pseudo-observation SNR is <1. In such cases, the prediction can have no skill at all (R2<0).
The skill improvement when predicting pseudo-observations instead of the model itself can be expressed as the difference between the two coefficients of determination (ΔR2 = R2obs − R2mod, Figure 2b). An improvement of skill (ΔR2 > 0) always occurs when the pseudo-observation SNR is higher than the model SNR, as suggested by Eade et al. (2014), because the Ratio of Predictable Components (RPC) is then larger than 1. This means that the signal-to-noise paradox is a natural outcome for models featuring a weaker signal in the prediction than is present in the observations, as long as model and (pseudo-)observations share the same predictable signal, that is, as long as the model error is zero (β=0).
To further understand the impact of differing model and pseudo-observation SNRs, we computed the prediction reliability (red contours in Figure 2b). An important property emerges from this diagnostic: a reliability of 1 can only be achieved if RPC = 1. If the signal-to-noise paradox occurs (RPC > 1), the reliability of the prediction is degraded. In this case, the ensemble spread is overdispersive (Figure 2b; Scaife et al., 2014; Siegert et al., 2016). This is a crucial result since, by measuring the plausibility and consistency of the prediction error, reliability is arguably the most important property of a prediction system, in particular if one wants to produce probabilistic predictions and provide risk assessments (Weisheimer & Palmer, 2014).
2.2.2 Impact of Systematic Model Error
These results change in the presence of a systematic model error (Figures 2c–2f). When β≠0, the predictable components (associated with αo and αm for observations and model simulations, respectively) differ from the predicted components.
The role of systematic model error is illustrated by setting β in equation (2) to 1 and to 2 (i.e., a systematic error as large as, and twice as large as, the predictable component, respectively). For these two levels of systematic model error, we find that the signal-to-noise paradox (ΔR2 > 0) is still possible (Figures 2d and 2f), but the regime of its occurrence shrinks for larger model error. As in the case without systematic model error, the paradox occurs when the pseudo-observation SNR (αo) is larger than the model SNR (αm), but now with a threshold (upper bound) that limits the paradox to cases of low model SNR, depending on the level of model error (Figures 2d and 2f). These upper bounds move to lower model SNR and larger RPC with increasing model error. This threshold/upper bound breaks the direct relation between the signal-to-noise paradox and the RPC. However, even for a strong model error (twice as large as the predictable component) a regime exists where the signal-to-noise paradox occurs. Since models always have some kind of systematic error (potentially significant), we conclude that the signal-to-noise paradox is both a signature of a relatively too low model SNR (i.e., a high RPC) and a signature that the model SNR itself is weak (i.e., <1).
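Under the same assumed stochastic model as in the sketch above (our reading of equations (1) and (2), not the authors' code), the shrinking paradox regime can be mapped in closed form in the infinite-member limit:

```python
import numpy as np

def delta_r2(alpha_o, alpha_m, beta):
    """Delta R2 = R2(pseudo-obs) - R2(perfect model), infinite-member limit."""
    gamma = np.hypot(1.0, beta)                     # sqrt(1 + beta^2), our normalization
    err_obs = (alpha_o - alpha_m / gamma) ** 2 + (alpha_m * beta / gamma) ** 2 + 1
    r2_obs = 1 - err_obs / (alpha_o ** 2 + 1)
    r2_self = alpha_m ** 2 / (alpha_m ** 2 + 1)     # the alpha_m^2/(1+alpha_m^2) law
    return r2_obs - r2_self

alpha_m = np.linspace(0.01, 2.0, 200)
for beta in (0.0, 1.0, 2.0):
    dr2 = delta_r2(alpha_o=1.0, alpha_m=alpha_m, beta=beta)
    bound = alpha_m[dr2 > 0]
    print(f"beta={beta}: paradox for alpha_m up to ~{bound.max():.2f}"
          if bound.size else f"beta={beta}: no paradox at alpha_o=1")
```

Under these assumptions and for αo=1, this prints paradox upper bounds near αm ≈ 1.0, 0.56, and 0.31 for β = 0, 1, and 2, illustrating a paradox regime that is confined to low model SNR and shrinks with model error.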
The reliability of an erroneous model can still be 1, but this occurs only for RPC values larger than 1. In such cases, while the condition αo>αm applies, the prediction uncertainty and the model ensemble uncertainty can become statistically equivalent. As in the case without systematic model error, the best reliability is achieved when the signal-to-noise paradox is absent (Figures 2b, 2d, and 2f), regardless of the level of model error. We also find that, even under a significant level of systematic model error, the occurrence of the signal-to-noise paradox corresponds to an overdispersive regime in terms of the predicted ensemble spread. Hence, the conclusion that the signal-to-noise paradox is the signature of an underconfident and unreliable prediction system is robust to the level of model error, even for a large error of two (twice the predictable component), where the prediction skill is very weak (Figure 2e). It also appears that, in the presence of model error, a perfect RPC (=1) corresponds to an underdispersive ensemble and is thus the signature of an unreliable, overconfident prediction system.
3 Application to Climate Models
We now apply the framework of the signal-to-noise paradox to evaluate models from the CMIP5 (5th phase of the Coupled Model Intercomparison Project) archive through their long forced historical simulations. Such simulations are not initialized with observations and are not designed for interannual prediction. Ideally, however, they still feature statistical behavior similar to that of the observations. The question we want to answer is whether the signal-to-noise paradox occurs in the models used in predictive systems. To this end, we compute the persistence of their annual-mean SAT between 1881 and 2004 for global and local spatial averages. Persistence is often used as the null hypothesis of climate prediction and corresponds to assuming that the temperature will not change. Hence, the rate of persistence is an underestimate of the predicted component, and we will assume that the former can be used to approximate the latter to diagnose the signal-to-noise paradox (as suggested by Strommen & Palmer, 2018). In reality, predictability, and the role of the SNR in it, will depend on the state of the system. Investigating this in detail requires initialized predictions, which is beyond the scope of the present study; our main focus here is an illustration of the concept developed above in coupled climate models. It is worth noting, however, that the signal-to-noise paradox has been shown for global mean SAT with two different state-of-the-art prediction systems (DePreSys and PROCAST, from Eade et al., 2014, and Sévellec & Drijfhout, 2018, respectively). We apply our persistence analysis to 1- to 5-year hindcast lags (beyond 5 years, persistence in models and observations becomes unskillful).
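Since equations (4a) and (4b) are not reproduced here, the sketch below assumes one common convention for persistence skill, the squared lag-τ autocorrelation of the annual-mean anomaly series (the R2 of a linear persistence forecast); series and parameter names are illustrative:

```python
import numpy as np

def persistence_r2(sat, tau):
    """Squared lag-tau autocorrelation of a 1-D annual anomaly series."""
    x = sat - sat.mean()
    r = np.corrcoef(x[:-tau], x[tau:])[0, 1]
    return r ** 2

# Example with a red-noise (AR1) surrogate for 1881-2004 annual means
rng = np.random.default_rng(2)
phi, n_years = 0.6, 124
sat = np.zeros(n_years)
for t in range(1, n_years):
    sat[t] = phi * sat[t - 1] + rng.standard_normal()
print([round(persistence_r2(sat, tau), 2) for tau in range(1, 6)])  # lags 1-5 yr
```

For the AR(1) surrogate, the expected skill decays as φ^(2τ), mimicking the loss of persistence skill beyond a few years.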
Focusing on the global scale and using equation (4a), we find that, for global-mean SAT (GMT), observations are on average more persistent than models (Figure 3a), with good consistency among the three sets of observations. However, the model-observation difference only becomes significant after 2 years of prediction. Looking in more detail at individual models (Figures 3b and 3c), it appears that in three models (CCSM4, IPSL-CM5A-LR, and MPI-ESM-LR), and to a lesser degree in FIO-ESM, the persistence of GMT is comparable to observations. This leaves five other models with too weak a persistence in GMT, suggesting that these models exhibit the signal-to-noise paradox for globally averaged SAT. This means that prediction systems based on these models are potentially underconfident and so unreliable. It also implies a GMT spectrum that is less red in these five models than in the observations. However, our analysis overall reveals rather good agreement between observations and climate models (Figure 3a, with a relative error of 34% on average), suggesting a rather weak signal-to-noise paradox for GMT predictions.
To characterize the signal-to-noise paradox at local scales, we computed the skill of local persistence following equation (4b) for τ=1, 2, and 5 years (Figures 4 and S2). Because the local persistence for 2 and 5 years is extremely weak, we mainly concentrate on τ=1 year. The results for the local coefficients of determination (R2) are summarized by two indices. The first, the Level of Agreement, measures for each climate model the relative area of the globe (in %) where the local R2 is within ±10% of each of the three observational values. The second, the Level of Paradox, measures for each climate model the relative area of the globe (in %) where the local R2 is smaller than each of the three observational values, suggesting the occurrence of the signal-to-noise paradox.
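Both indices can be computed compactly on a latitude-longitude grid. In this sketch (our names, not the authors' code), the ±10% window is read as relative to the observed value, which is an assumption, and the indices are evaluated against one observational field at a time, as in the text:

```python
import numpy as np

def area_indices(r2_model, r2_obs, lat):
    """Level of Agreement and Level of Paradox, in % of global area.

    r2_model, r2_obs : 2-D (lat x lon) fields of local persistence R2
    lat              : 1-D latitudes in degrees, for cos(lat) area weighting
    """
    w = np.cos(np.deg2rad(lat))[:, None] * np.ones_like(r2_model)
    w /= w.sum()                                     # normalized area weights
    agree = np.abs(r2_model - r2_obs) <= 0.10 * np.abs(r2_obs)  # relative +/-10% (assumed)
    paradox = r2_model < r2_obs                      # weaker persistence in the model
    return 100 * w[agree].sum(), 100 * w[paradox].sum()
```

Area weighting by the cosine of latitude ensures the percentages refer to actual fractions of the globe.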
The analysis shows a Level of Agreement that is extremely low, with a value below 10% for all nine models (Figure 4m). Analyses made with each of the three observational data sets show excellent consistency. This suggests that climate models do not accurately represent the observed SAT persistence at local scales (grid-box size). Consistently, the Level of Paradox is extremely high (Figure 4n), with 60% to 80% of the globe exhibiting weaker persistence in climate models than in observations (i.e., featuring the signal-to-noise paradox). Again, this analysis is consistent across the three observational data sets. Hence, the high Level of Paradox suggests that prediction systems based on these models would be mainly underconfident and so unreliable. The 1-year persistence in observations (Figures 4a–4c) is mainly localized over the ocean. This suggests that the signal-to-noise paradox takes its source from the ocean rather than from land, where the spectrum is less red. This means that, compared to observations, models are relatively too noisy over the ocean (not enough variance at annual and longer time scales relative to shorter time scales) and feature a too-white SAT spectrum. Because longer time scales in SAT over the ocean arise predominantly from ocean variability, this suggests that ocean variability or the ocean-to-atmosphere forcing is too weak in models. At longer time scales, the absence of persistence in both observations and models mechanically increases the Level of Agreement (models and observations having no skill in more regions). This leads to the trivial result that with weaker prediction skill the Level of Paradox is also weaker (Figure S3). However, a nonnegligible Level of Paradox of between 30% and 70% of the globe remains at longer time scales.
4 Summary and Conclusions
In this analysis, we first tested the signal-to-noise paradox through a simple stochastic statistical representation of observations and model simulations. In particular, we set the statistical properties of pseudo-observations and model outputs to be virtually indistinguishable. The analysis confirmed that, in the absence of model error, the signal-to-noise paradox occurs when the relative part of the predictable component is weaker in the model than in the observations. However, when systematic error is explicitly acknowledged, this relation breaks down. Indeed, a perfect Ratio of Predictable Components (RPC = 1) then leads to an unreliable, overconfident prediction system, and the signal-to-noise paradox is restricted to low model SNR. Hence, a weaker predictable component in models relative to observations remains a necessary condition for the signal-to-noise paradox to occur but is no longer a sufficient one. The necessary and sufficient condition becomes that the relative part of the predicted component is weaker in the model than in the observations.
Hence, by adapting the RPC, we introduce the Ratio of Predicted Components: RΠC = ΠCobs/ΠCmod (= Robs/Rmod), where ΠCobs and ΠCmod are the predicted components in the observations and in the model, respectively. By definition, RΠC > 1, and so ΠCobs > ΠCmod, is then a signature of the signal-to-noise paradox. Since ΠCmod = PCmod (i.e., a model can estimate its own predictability) and ΠCobs ≤ PCobs (i.e., models underestimate the actual predictability of the observations), we have RPC ≥ RΠC, so that the RPC overestimates the actual occurrence of the signal-to-noise paradox. (Note that in the absence of systematic model error the predictable and predicted components are strictly identical.) The greater accuracy of RΠC over RPC demonstrates that the signal-to-noise paradox is a consequence of a given prediction system, rather than a fundamental property of the observations and their predictability. Our analysis also confirms the result of Eade et al. (2014) that the signal-to-noise paradox is the signature of an underconfident (overdispersive) prediction system. Hence, predictions can still be accurate but need a large number of members for the noisy unpredictable component to average out (Kumar, 2009). This also means that the signal-to-noise paradox is a sign of the weak reliability of a prediction system and could be used to estimate the system's reliability.
Applying this new definition to the earlier work of Eade et al. (2014) on the North Atlantic Oscillation with a seasonal prediction system (GloSea5), we have ΠCobs = 0.6 and RΠC = ΠCobs/ΠCmod = 2.3. This leads to PCmod = ΠCmod ≈ 0.26 and to αm ≈ 0.27. It remains impossible to estimate PCobs exactly, but we know that PCobs ≥ 0.6 (because PCobs ≥ ΠCobs), which leads to αo ≥ 0.75. From our analysis (Figure 2), this regime leads to the paradox for all tested errors and to the conclusion that GloSea5 is an underconfident prediction system (consistent with Eade et al., 2014; Scaife et al., 2014).
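The arithmetic behind these numbers can be checked under the standard relation between a (predicted or predictable) component and the corresponding SNR for unit-variance noise, PC = α/√(1+α²), that is, α = PC/√(1−PC²); this relation is our assumption, consistent with the stochastic model above:

```python
from math import sqrt

pc_obs_pred = 0.6                    # Pi_C_obs from Eade et al. (2014)
r_pic = 2.3                          # R_Pi_C = Pi_C_obs / Pi_C_mod
pc_mod = pc_obs_pred / r_pic         # = PC_mod, since Pi_C_mod = PC_mod
alpha_m = pc_mod / sqrt(1 - pc_mod ** 2)                 # model SNR
alpha_o_min = pc_obs_pred / sqrt(1 - pc_obs_pred ** 2)   # lower bound, PC_obs >= 0.6
print(round(pc_mod, 2), round(alpha_m, 2), round(alpha_o_min, 2))  # 0.26 0.27 0.75
```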
Despite this encouraging result, it is important to note that our stochastic model and the subsequent persistence analysis only deal with anomalies. More common interannual prediction methods, which deal with short-term variations in which the signal has a nonzero mean, also have other sources of error that were not considered here (e.g., systematic bias). To acknowledge these other types of error, the stochastic model presented here will have to be adjusted; this will be part of a follow-up study.
To diagnose the signal-to-noise paradox in state-of-the-art climate models, we computed the SAT persistence in nine climate models from CMIP5 and compared it to the SAT persistence in three observational data sets. We find that CMIP5 models have a pronounced tendency to underestimate SAT persistence, conducive to the signal-to-noise paradox. This is particularly true over oceanic regions and at smaller spatial scales. This low level of persistence suggests that models can be improved by enhancing SAT variance at longer time scales, especially over the ocean. Such improvement would most likely also lead to more reliable forecasts for the associated prediction systems. In light of this CMIP5 model analysis, investigating the SNR through initialized predictions in a range of state-of-the-art prediction systems would be a sensible and worthwhile effort.
The weaker persistence in models compared to observations could also be due to inaccurate observations showing too much persistence. Indeed, the three sets of observations tested are reconstructed from sparse and irregular in situ and remote observations. The reconstruction methods, which rely heavily on large-scale spatial and temporal correlations to fill gaps and to extrapolate/interpolate missing data, can overweight slow variability and persistence in the reconstructed observational products. This should be investigated in the future.
Another hypothesis to explain the signal-to-noise paradox, which is especially visible over the ocean, is a genuine lack of SAT persistence in climate models. Ocean dynamics and ocean-atmosphere coupling are the most likely sources of this lack of persistence and should be targeted to improve the agreement between climate models and observations. To test the robustness of our analysis regarding the lack of persistence in SAT, we computed the 10 leading empirical orthogonal functions (EOFs) of SAT in observations and in CMIP5 models. This analysis reveals that models with more variance explained by their 10 leading EOFs are less inclined to exhibit the signal-to-noise paradox (Figure S4), especially at longer time scales. This suggests that the signal-to-noise paradox may stem from a lack of coherent large-scale modes of SAT variability in the models and from the models featuring too much small-scale variability. A too weak ocean feedback onto the atmosphere has been found before (e.g., Haarsma et al., 2016), and since this aspect improves significantly in higher-resolution models (Foussard et al., 2019; Minobe et al., 2008; Su et al., 2018), the next generation of climate models and their associated prediction systems may suffer less from the signal-to-noise paradox, becoming more reliable and more useful for operational probabilistic forecasts. However, it should be emphasized that the signal-to-noise paradox is not only a matter of model resolution; it also points to more fundamental shortcomings in terms of physical processes that are incompletely or inadequately represented in those models.
Acknowledgments
This research was supported by the UK Natural Environment Research Council (SMURPHS, NE/N005767/1) and by the DECLIC and Meso-Var-Clim projects funded through the French CNRS/INSU/LEFE program. The authors acknowledge the World Climate Research Programme's Working Group on Coupled Modelling, which is responsible for CMIP, and we thank the climate modeling groups for producing and making available their model output (listed in Appendix A of this paper). For CMIP, the U.S. Department of Energy's Program for Climate Model Diagnosis and Intercomparison provided coordinating support and led the development of software infrastructure in partnership with the Global Organization for Earth System Science Portals.
Appendix A: Method
The model SAT was estimated from nine CMIP5 historical simulations restricted to the period 1881-2004 (Taylor et al., 2012). The nine models, with the number of members used in square brackets, are as follows: “CCSM4” [6], “CNRM-CM5” [10], “CSIRO-Mk3-6-0” [10], “CanESM2” [5], “HadGEM2-ES” [5], “IPSL-CM5A-LR” [6], “FIO-ESM” [3], “MPI-ESM-LR” [3], and “MIROC5” [5]. These models were selected from the CMIP5 database because they provide at least three members and the required data fields. We set three ensemble members as the minimum for obtaining model result uncertainties and thus for testing their robustness. For the observations, the GISS, NOAA, and HadCRUT4 temperature data sets were used.