Are we near the predictability limit of tropical Indo-Pacific sea surface temperatures?
Abstract
The predictability of seasonal anomalies worldwide rests largely on the predictability of tropical sea surface temperature (SST) anomalies. Tropical forecast skill is also a key metric of climate models. We find, however, that despite extensive model development, the tropical SST forecast skill of the operational North American Multi-Model Ensemble (NMME) of eight coupled atmosphere-ocean models remains close both regionally and temporally to that of a vastly simpler linear inverse model (LIM) derived from observed covariances of SST, sea surface height, and wind fields. The LIM clearly captures the essence of the predictable SST dynamics. The NMME and LIM skills also closely track and are only slightly lower than the potential skill estimated using the LIM's forecast signal-to-noise ratios. This suggests that the scope for further skill improvement is small in most regions, except in the western equatorial Pacific where the NMME skill is currently much lower than the LIM skill.
Key Points
- Seasonal tropical SST forecast skill of operational North American Multi-Model Ensemble (NMME) is close to that of simpler linear inverse model
- Since operational SST forecast skill is only slightly lower than the estimated potential skill, we may be near the predictability limit
- In the western Pacific, improvement of operational ensemble is possible since its skill is much lower than the linear inverse model skill
1 Introduction
The North American Multi-Model Ensemble (NMME) global prediction system [Kirtman et al., 2014] was recently developed to harness the idea that combining forecasts from several state-of-the-art coupled atmosphere-ocean general circulation models (CGCMs) could yield better skill than forecasts using any single CGCM [Hagedorn et al., 2005; Jin et al., 2008; Kirtman and Min, 2009; DelSole et al., 2014]. And indeed the grand NMME mean forecasts are significantly more skillful at predicting El Niño–Southern Oscillation (ENSO)-related sea surface temperature (SST) anomalies than ensemble mean forecasts using the individual models in the ensemble [Becker et al., 2014; Barnston et al., 2015]. The NMME system, used for seasonal predictions since 2011, was made an operational forecast system in 2016.
Given this recent improvement in ENSO prediction, one might wonder if substantial room for further improvement exists (e.g., in model development or initialization techniques) or if seasonal forecast systems are now near an intrinsic predictability limit associated with the chaotic nature of the coupled tropical ocean-atmosphere system. Previous studies in a perfect model framework had suggested that ENSO events might be predictable by some measures up to 2 years ahead [Collins et al., 2002; Chen et al., 2004; Luo et al., 2008; Wittenberg et al., 2014; Larson and Kirtman, 2017]. Some recent studies have been less optimistic in this regard, basing their predictability estimates on multimodel ensembles of retrospective CGCM forecasts (hindcasts) and identifying predictable signals with the ensemble mean forecasts and unpredictable noise with the ensemble spread [Chen et al., 2015; Kumar et al., 2016]. All such predictability estimates are of course model dependent, limited by the CGCMs' ability to realistically represent not only the predictable signals but also the unpredictable noise.
In this study we adopt an alternative empirical approach to estimate tropical Indo-Pacific SST predictability using a linear inverse model [Penland and Sardeshmukh, 1995] (LIM). The LIM is a statistically stationary, stochastically forced, multivariate linear model whose parameters are derivable in principle from the known nonlinear physics of the coupled tropical atmosphere-ocean system [Penland and Sardeshmukh, 1995; Moore and Kleeman, 1999] but can also be estimated more simply from observed covariances of system components as done here. Earlier studies demonstrated that such a LIM captures observed tropical dynamics including the evolution of different types of ENSO events [Capotondi et al., 2013] and ENSO dynamical feedback processes [Newman et al., 2011b], as well as the spectral characteristics of ENSO variability from weekly to multidecadal time scales [Newman et al., 2009; Ault et al., 2013], supporting its use here to benchmark both the actual and potential forecast skill.
Tropical SST predictability studies often focus on specific regional ENSO indices, but comprehensive assessments should be basin wide, given the potentially important pattern and amplitude differences among individual ENSO events [Capotondi et al., 2013] and also the potential for Indian and western Pacific SST anomalies to have global impacts [Barsugli and Sardeshmukh, 2002]. Therefore, we compare the spatial patterns as well as year-to-year variations of the skills of the LIM, the ensemble mean forecasts of the individual models in the NMME, and the grand NMME mean forecasts over the period 1982–2016. Their similarity to each other and to the potential skill estimated in a perfect LIM framework suggests that the seasonal predictability of tropical Indo-Pacific SST anomalies is effectively linear, apart from minor differences likely due to model deficiencies of the specific LIM and CGCMs utilized here, and that we may now be near the intrinsic limit of tropical Indo-Pacific SST predictability.
2 Forecast Models and Metrics
2.1 Linear Inverse Model
Penland and Sardeshmukh [1995] constructed a LIM of tropical SSTs of the form

dx/dt = Bx + Fs, (1)

where x is the anomaly state vector, B is a stable linear dynamical operator, and Fs represents white noise forcing, by estimating both B and the statistics of Fs from observations. In principle, a LIM could represent the entire climate anomaly state vector in x, but in practice a truncated set of variables is used to (1) focus on the climate subsystem under consideration (here the tropics), (2) minimize redundant information provided by additional closely related variables, and (3) robustly estimate B using limited observational data sets. Penland and Sardeshmukh included only SST anomalies (SSTA) in x, reasoning that on seasonal time scales the SST-induced changes in surface winds drive an immediate thermocline response, and wave dynamics allow rapid deeper ocean adjustment to the SST changes [Neelin, 1991; Neelin and Jin, 1993]. Subsequent studies [Johnson et al., 2000; Xue et al., 2000; Newman et al., 2011b] showed that explicitly including some oceanic heat content measure in x improves the LIM by capturing ENSO recharge-discharge physics [Neelin et al., 1998] and other deeper ocean dynamics not enslaved to SSTA on seasonal time scales. Our approach follows Newman et al. [2011b], modifying their x by using (1) monthly instead of seasonal anomalies, taken from the observational data sets of the 1958–2010 period listed in Table S1 in the supporting information [Kalnay et al., 1996; Rayner et al., 2003; Ishii et al., 2005; Smith et al., 2008; Balmaseda et al., 2013], (2) sea surface height (SSH) instead of thermocline depth anomalies, given the interest in SSH forecasts for their own sake [Chowdhury and Chu, 2015], and (3) 200 and 850 hPa horizontal winds instead of surface zonal wind stresses to represent the atmospheric ENSO component. The anomaly fields were filtered in each field's empirical orthogonal function (EOF) space defined in the global tropical strip 24°S–24°N, retaining the leading 18/6/4 EOFs representing about 85/63/25% of the domain integrated SST/SSH/wind anomaly variances.
The 28 × 28 matrix B was then estimated using the 0 and 1 month lag covariance matrices of the 28-component state vector x.
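The two-step estimation described above (lagged covariances, then a matrix logarithm) can be sketched as follows. This is a minimal illustration, not the study's code: the names (`estimate_B`, `tau0`) are ours, and the demo uses an arbitrary two-component damped operator rather than the 28-component tropical state vector.

```python
import numpy as np
from scipy.linalg import logm

def estimate_B(X, tau0=1):
    """Estimate B from the 0- and tau0-lag covariance matrices of X.

    With C(tau0) = <x(t+tau0) x(t)^T> and C(0) = <x(t) x(t)^T>, the
    propagator is G(tau0) = C(tau0) C(0)^{-1} = expm(B * tau0), so
    B = logm(G(tau0)) / tau0.  X has shape (n_components, n_samples).
    """
    X0, X1 = X[:, :-tau0], X[:, tau0:]
    C0 = X0 @ X0.T / X0.shape[1]          # lag-0 covariance
    Ctau = X1 @ X0.T / X0.shape[1]        # lag-tau0 covariance
    G = Ctau @ np.linalg.inv(C0)          # forecast propagator
    return logm(G).real / tau0

# Demo on synthetic data from a known damped operator (Euler-Maruyama):
rng = np.random.default_rng(0)
B_true = np.array([[-0.5, 0.3], [-0.3, -0.5]])
dt, n = 0.1, 200_000
x, X = np.zeros(2), np.empty((2, n))
for t in range(n):
    x = x + dt * (B_true @ x) + np.sqrt(dt) * rng.standard_normal(2)
    X[:, t] = x
B_est = estimate_B(X[:, ::10], tau0=1)    # samples spaced 1 time unit apart
```

The recovered operator should be close to `B_true`, up to sampling and discretization error.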
B and hindcast skill were cross-validated by excluding 5 year periods (i.e., 10% of the data) at a time, estimating B using the remaining data, and generating hindcasts for the independent 5 year periods. Other key details of LIM construction, including the important “tau test” [Penland and Sardeshmukh, 1995] for the validity of the linear approximation (Figure S1), are described in the supporting information; none of these details affect our conclusions. Note that the LIM used here and its dynamics are consistent with other related studies [Newman et al., 2011b; Vimont et al., 2014; Capotondi and Sardeshmukh, 2015].
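The leave-5-years-out cross-validation can be sketched as below; `cross_validated_folds` is a hypothetical helper name, and the fold logic is a minimal rendering of the procedure described in the text.

```python
import numpy as np

def cross_validated_folds(years, block=5):
    """Yield (train_years, test_years) pairs, withholding successive
    5-year blocks; B is refit on train_years and verified on test_years."""
    years = np.asarray(years)
    for start in range(0, len(years), block):
        test = years[start:start + block]
        train = np.concatenate([years[:start], years[start + block:]])
        yield train, test

# The 1958-2010 training period gives ten 5-year folds plus one 3-year fold:
folds = list(cross_validated_folds(np.arange(1958, 2011)))
```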
2.2 NMME Hindcasts
The LIM's hindcast skill was compared to that of the NMME models used operationally by National Centers for Environmental Prediction (NCEP) (Table S2); including models from earlier stages of the NMME project did not alter our conclusions. All CGCM forecast models drift toward their own climate at long forecast lead times, causing a forecast bias. We removed each model's bias from each of its ensemble members as a function of both calendar month and forecast lead time, using the 1982–2010 hindcast data set and the LIM's cross-validation intervals. Following Barnston et al. [2015], the grand multimodel ensemble mean (NMME mean) forecasts were then determined using such bias-corrected ensemble members of all the models. The bias correction improves the skill of each model and of the NMME mean [Barnston et al., 2015].
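A minimal sketch of such a lead- and month-dependent mean bias correction, assuming hindcast and verification arrays already aligned by start year, calendar start month, and lead (array shapes and names are illustrative, not from the NMME processing):

```python
import numpy as np

def bias_correct(hindcast, obs_verif):
    """Remove a model's mean drift as a function of start month and lead.

    hindcast:  (n_years, 12, n_leads) ensemble-member or ensemble-mean values
    obs_verif: (n_years, 12, n_leads) verifying observations
    """
    bias = (hindcast - obs_verif).mean(axis=0)   # (12, n_leads) mean drift
    return hindcast - bias[None, :, :]
```

In cross-validated use, the bias would be computed from the training years only, mirroring the LIM's withheld 5-year intervals.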
The NMME forecasts were initialized on or near (for staggered starts) the first day of each month. The “Month 0.5” forecast was then the mean of the first month of the forecast run, i.e., centered in the middle of the calendar month. The equivalent LIM forecast was initialized with the monthly mean observations centered on the previous month, so that the 1 month lead LIM forecast and the NMME Month 0.5 forecast were verified at the same time. We refer to both of these as the “Month 1” forecast, and so on for increasing forecast leads.
2.3 Evaluating Forecast Skill and Predictability
We used two skill measures: anomaly correlation (AC) and root-mean-square error skill score (RMS skill score) defined as 1 − ε, where ε = σ/σobs, σ is the RMS forecast error, and σobs is the observed climatological RMS value. The overall skill over the tropical Indo-Pacific domain was assessed using the standardized RMS error and pattern correlation between the forecast and verification SSTA fields in the region 24°S–24°N, 30°E–60°W.
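Under these definitions, the two measures can be sketched as follows for 1-D anomaly series at a grid point (function names are ours; a score of 1 is perfect and 0 matches a climatological forecast of zero anomaly):

```python
import numpy as np

def anomaly_correlation(fcst, verif):
    """Correlation between forecast and verifying anomalies (AC)."""
    return np.corrcoef(fcst, verif)[0, 1]

def rms_skill_score(fcst, verif):
    """1 - sigma/sigma_obs, with sigma the RMS forecast error and
    sigma_obs the observed climatological RMS; negative values mean
    the forecast is worse than climatology."""
    sigma = np.sqrt(np.mean((fcst - verif) ** 2))
    sigma_obs = np.sqrt(np.mean(verif ** 2))
    return 1.0 - sigma / sigma_obs
```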
Potential skill in a chaotic system may be defined as the skill of a perfect model with infinitesimal errors in initial conditions. For infinite-member ensemble perfect model forecasts, it can be shown that the average AC between forecasts at lead time τ and the corresponding verification is a function of the forecast signal-to-noise ratio [Sardeshmukh et al., 2000]. The distinction between the predictable forecast signal and unpredictable noise is explicit in equation (1) and yields explicit expressions for the predictable forecast signal covariance and forecast error covariance matrices which can be used to determine the potential LIM skill in terms of both AC skill [Newman et al., 2003] and RMS skill scores (see supporting information).
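A hedged sketch of this calculation: for a LIM with stationary covariance C0 and propagator G(τ) = expm(Bτ), the standard LIM expressions give the forecast signal covariance G C0 Gᵀ and the forecast error covariance C0 − G C0 Gᵀ, and the expected AC of an infinite-member perfect-model forecast is S/sqrt(1 + S²), with S² the local signal-to-noise variance ratio [Sardeshmukh et al., 2000]. The function name is illustrative.

```python
import numpy as np
from scipy.linalg import expm

def potential_ac(B, C0, tau):
    """Pointwise potential AC skill at lead tau for a perfect LIM."""
    G = expm(B * tau)
    signal_var = np.diag(G @ C0 @ G.T)        # predictable signal variance
    noise_var = np.diag(C0 - G @ C0 @ G.T)    # unpredictable error variance
    S2 = signal_var / noise_var
    return np.sqrt(S2 / (1.0 + S2))           # = S / sqrt(1 + S^2)
```

As expected, the potential AC approaches 1 at short leads (noise has not yet grown) and decays toward 0 as the predictable signal damps away.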
3 Tropical Indo-Pacific SST Forecast Skill and Predictability
Figure 1 shows Month 6 local AC skill of the LIM forecasts, each individual NMME model's ensemble mean forecasts, and the NMME mean forecasts. The pattern and amplitude of the NMME mean and LIM skills are very similar, with equatorial skill maxima east of the dateline, in the central Indian Ocean, and in a broad region east of the Philippines. Skill minima are found in a V-shaped region extending northeastward and southeastward from the equator at about 160°E and in the southeast Indian Ocean. The LIM and NMME mean skills are closer to each other than to any of the individual NMME models, for all forecast lead times (not shown). Note that this similar skill occurs despite initializing the LIM only within its limited EOF space.

Overall, the potential LIM skill (also in Figure 1) has a very similar pattern to that of the actual NMME and LIM skill and only slightly higher values. The highest potential skill is not in the eastern equatorial Pacific, where both the SST forecast signals and noise are large, but rather in the central Pacific where large forecast signals are accompanied by much weaker noise. Not surprisingly, the potential skill is smallest along nodal lines of the dominant ENSO pattern where the forecast signals are generally weakest but is larger again farther westward associated with stronger ENSO-related signals over the Indian Ocean where the noise is also weaker.
Along the equator, LIM and NMME mean RMS skill scores vary similarly with longitude and forecast lead time (Figure 2a). The LIM and NMME mean have nearly equal skill in the central Pacific and Indian Ocean. The LIM has less skill in the far eastern Pacific and far more skill in the western Pacific where the NMME mean skill (and the skill of all its constituent models; cf. Figure 1) is negative. The potential RMS skill score along the equator (dotted line in Figure 2a) closely tracks the LIM and NMME mean hindcast skill. These potential skill estimates are derived from a LIM trained on a limited observational record and so are uncertain to some extent. To estimate their sampling uncertainties, we generated 26,500 years of synthetic data using equation (1) [Penland and Matrosova, 1994; Newman et al., 2011a] and estimated the LIM parameters and associated potential skill using each of its five hundred 53 year segments to determine 90% confidence intervals for the potential skill, shown as the green band in Figure 2a. (See supporting information for details and Figure S2 for the confidence intervals of the skill estimates in Figure 1). The LIM's hindcast skill is within this confidence interval west of the dateline but not in the far eastern Pacific where it is significantly lower than both the potential skill and the NMME mean skill.
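The synthetic-data step can be sketched as a stochastic integration of equation (1), with the noise covariance Q obtained from the fluctuation-dissipation relation Q = −(B C0 + C0 Bᵀ). This is a minimal Euler-Maruyama illustration under assumed names, not the integration scheme actually used.

```python
import numpy as np

def integrate_lim(B, C0, n_steps, dt=0.1, seed=0):
    """Generate synthetic LIM data consistent with operator B and
    stationary covariance C0, via Euler-Maruyama integration."""
    rng = np.random.default_rng(seed)
    Q = -(B @ C0 + C0 @ B.T)            # fluctuation-dissipation noise covariance
    L = np.linalg.cholesky(Q)           # draws correlated white noise
    x = np.zeros(B.shape[0])
    out = np.empty((n_steps, B.shape[0]))
    for t in range(n_steps):
        x = x + dt * (B @ x) + np.sqrt(dt) * (L @ rng.standard_normal(len(x)))
        out[t] = x
    return out
```

Long synthetic records generated this way can then be cut into 53 year segments, each refit as a LIM, to bracket the sampling uncertainty of the potential skill.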

The LIM and the NMME mean also have very similar year-to-year skill variations, and again both are similar to the potential LIM skill variations. Figure 2b shows the 13 month running mean pattern correlations between the predicted and observed tropical Indo-Pacific SSTA fields for lead times of 3, 6, and 9 months. The LIM and NMME mean skill series are highly correlated for all these leads. The potential skill is largely realized by both forecast systems at Month 3, to a lesser extent at Month 6, but poorly at Month 9 after 2000. Overall, the NMME mean skill is slightly higher than the LIM skill. Note also that in a few cases, both the NMME mean and LIM skills are higher than the potential skill. This is not a contradiction. The potential skill represents an expected upper skill bound, but actual forecast skill in individual forecast cases can be higher by chance depending on the magnitude of the actually realized unpredictable noise in those cases.
There is a hint of lower skill for several years following the 1997–1998 ENSO event [Barnston et al., 2012], which some studies have suggested as possibly due to an altered tropical base state in a changing climate, with reduced ENSO predictability [Zhao et al., 2016]. Note, however, that since our LIM's dynamical operator B is fixed, all variations of potential LIM skill result only from random variations in initial conditions and consequent variations in predictable anomaly growth. Given the correspondence between potential skill and model skill, as well as the similarity between LIM and NMME mean skill, the realized skill variations in Figure 2b are therefore likely due to differing noise realizations as opposed to base state changes, consistent with other recent model and forecast studies [Wittenberg et al., 2014; Kumar et al., 2015; Lee et al., 2016].
It is striking that our B matrix, despite having no seasonal variation, captures the essence of the seasonal variation of the NMME mean skill, shown in Figure 3 for three commonly used ENSO indices. Note that for most lead times, the LIM and NMME mean skills vary more with verification month than initialization month. For example, Niño3 and Niño4 forecasts for the summer months are generally much less skillful than for the winter months. The LIM also captures the substantially reduced eastern Pacific (Niño1.2 and Niño3) skill, at any given lead time, of forecasts for spring compared to forecasts for winter, sometimes referred to as the springtime predictability barrier [e.g., Levine and McPhaden, 2015].

The NMME's poor skill just west of the dateline (Figure 2a) contributes to its poorer Niño4 skill (Figure 3). The basic reason for this poor skill is that the NMME mean predicted pattern of eastern Pacific warming during El Niño events (and cooling during La Niña events) extends too far west of the dateline compared to observations. This is confirmed by comparing the dominant pattern (leading EOF) of the observed SSTA with that in the Month 6 LIM and NMME mean SSTA hindcasts (Figure 4). While a typical ENSO pattern in a LIM forecast matches observations, with SSTAs of opposite signs in the western and eastern Pacific, the NMME mean typically predicts an ENSO anomaly with the same sign across the entire equatorial Pacific. This pattern is particularly dominant for the individual NMME models with the lowest western Pacific SSTA skill (not shown, but see Figure S3a), which also generally have the highest central Pacific SSTA skill. While the erroneous westward extension of ENSO variability is common in long climate model simulations [e.g., Joseph and Nigam, 2006; Li and Xie, 2014], it is notable that this error develops rapidly in 1–2 months from initial conditions in the NMME forecasts (Figure S3c) but not in the LIM forecasts (Figure S3b).

Finally, we depict in Figure 5 on a Taylor diagram [Taylor, 2001] the overall decrease of skill with increasing forecast lead time as color-coded curves for the LIM, NMME mean, and individual NMME model hindcasts. The angular coordinate of the plotted values for each model represents the average pattern correlation of the hindcast SSTA fields with the verification fields over the Indo-Pacific domain, and the radial coordinate (distance from the origin) represents the RMS magnitude of the hindcast fields standardized by that of the verification fields. The plotted point for the verification fields is, by definition, the reference point (REF). The distance of the model points from REF is the standardized RMS forecast error. With increasing forecast lead time, the skill trajectory of all model hindcasts recedes monotonically from REF.

It can be shown (see supporting information) that the skill trajectory of any perfect model whose forecast errors are on average orthogonal to its forecast signals should follow a universal semicircular trajectory on a Taylor diagram (the blue curve in Figure 5), as is indeed true for the potential LIM skill trajectory. The actual LIM hindcast skill is very close to this curve at short lead times but diverges from it at longer lead times. The NMME mean skill roughly parallels the LIM skill, with slightly higher AC and larger forecast RMS magnitudes, which indeed exceed even the observed magnitudes [Barnston et al., 2015] for leads of a few months. The net effect is that the RMS forecast error of the NMME mean is slightly better (~3–4%) than the LIM, but both are larger than the potential LIM error.
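The geometry behind this semicircle can be checked numerically. On a Taylor diagram the standardized RMS error E, standardized forecast amplitude σ̂, and pattern correlation ρ obey the law of cosines E² = 1 + σ̂² − 2σ̂ρ; if forecast errors are orthogonal to forecast signals, then σ̂ = ρ, so E² = 1 − ρ² and the skill points lie on the semicircle of diameter 1 through the origin and REF. A small numerical illustration with synthetic anomaly fields (names are ours):

```python
import numpy as np

def rms(x):
    return np.sqrt(np.mean(x ** 2))

def standardized_error(sigma_hat, rho):
    """Law of cosines on the Taylor diagram: E^2 = 1 + s^2 - 2*s*rho."""
    return np.sqrt(1.0 + sigma_hat ** 2 - 2.0 * sigma_hat * rho)

# Synthetic forecast/verification pair (uncentered statistics, as for anomalies):
rng = np.random.default_rng(1)
verif = rng.standard_normal(100_000)
fcst = 0.6 * verif + 0.4 * rng.standard_normal(100_000)
sigma_hat = rms(fcst) / rms(verif)                       # standardized amplitude
rho = np.mean(fcst * verif) / (rms(fcst) * rms(verif))   # pattern correlation
E_direct = rms(fcst - verif) / rms(verif)                # standardized RMS error
```

`E_direct` and `standardized_error(sigma_hat, rho)` agree to machine precision, and replacing `fcst` with the pure signal component of `verif` makes σ̂ ≈ ρ, placing the point on the semicircle.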
4 Discussion
The similarity of the NMME mean skill to the LIM skill could conceivably be due to NMME model misrepresentations of predictable nonlinear dynamics. Some model errors that are plain in long climate runs have persisted through several generations of CGCMs, along with continuing difficulties in capturing the correct balance of ENSO processes [Bellenger et al., 2014]. Averaging across multiple imperfect CGCMs, as in the NMME mean forecasts, may wash out a possibly large nonlinear part of the forecast signals on which there is model disagreement and emphasize the linear part, yielding skill similar to the LIM skill. However, the fact that both the LIM and the NMME mean outperform all the individual models in the NMME ensemble would then suggest that such predictable nonlinear signals are not captured by any of the individual models, which seems unlikely given extensive CGCM developments over the past several decades. A much simpler interpretation of our results is that the nonlinear interactions are not necessarily weak but are essentially unpredictable (as is indeed assumed by the LIM) and cause random forecast errors that are averaged out better in the NMME mean forecasts than in the individual-model ensemble mean forecasts. Still, if future CGCM improvements do lead to a better consensus on the predictable nonlinear signals, then the NMME mean skill could exceed LIM skill in cases in which those signals are large, possibly in the eastern equatorial Pacific.
On the other hand, our LIM could also be improved. As an empirical model, it is limited by the availability of adequate training data, particularly of SSH and related measures of the ocean state. The LIM could also account better for seasonality, especially in the far eastern Pacific [Mitchell and Wallace, 1992], although the available observational record is likely too short for this [Johnson et al., 2000; Newman et al., 2009], and the LIM already largely captures seasonal variation of NMME mean skill. Including additional variables in the LIM to improve its representation of coupled tropical dynamics poses similar data challenges. It is also important to remember that our particular LIM is derived from a single 53 year observed segment of an essentially stochastically driven multivariate linear process whose parameters might require hundreds of years of data to be determined with sufficient accuracy [Wittenberg, 2009], with consequent uncertainty in the estimated potential skill as shown in Figure 2.
But these issues do not diminish our central result: two fundamentally different forecast systems, one a low-dimensional empirical linear inverse model and the other a multimodel ensemble of comprehensive high-dimensional nonlinear coupled GCMs, have very similar spatially and temporally varying tropical Indo-Pacific SST skill. This is strong evidence that the predictable dynamics of tropical Indo-Pacific SST variations are essentially linear, and a linear model can therefore be used to quantify that predictability. When used for this purpose, our LIM predicts upper bounds on the spatial/temporal skill variations of both the NMME and LIM forecast systems as a direct consequence of the spatial/temporal variations of its forecast signal-to-noise ratio, that is, of its (effectively infinite-member) ensemble mean forecasts and ensemble spread. The fact that the NMME mean skill and the LIM skill are consistently just below these upper bounds suggests that we may be near the limit of tropical Indo-Pacific SST predictability in most regions, except in the western equatorial Pacific where the NMME mean skill is currently much lower than the LIM skill. One might thus reasonably expect the impacts of future improvements in seasonal tropical Indo-Pacific SST forecast skill to be small, unless the current NMME model error in the western Pacific has a large negative impact on global seasonal predictions.
Acknowledgments
We acknowledge the agencies that support the NMME-Phase II system, and we thank the participating climate modeling groups (Environment Canada, NASA, NCAR, NOAA/GFDL, NOAA/NCEP, and University of Miami) for producing and making their model output available. NOAA/NCEP, NOAA/CTB, and NOAA/CPO jointly provided coordinating support and led development of the NMME-Phase II system. All NMME hindcast data used in this paper are available from the IRI/LDEO Data Library, https://iridl.ldeo.columbia.edu. LIM hindcasts will be made available from NOAA/ESRL/PSD, https://www.esrl.noaa.gov/psd. This work was supported by NOAA/CPO-MAPP and NSF AGS 1463643.