Winter Precipitation Forecast in the European and Mediterranean Regions Using Cluster Analysis
Abstract
The European climate is changing under global warming. The Mediterranean region in particular has been identified as a hot spot for climate change, with climate models projecting a reduction in winter rainfall and a very pronounced increase in summertime heat waves. These trends are already detectable over the historic period. Hence, it is beneficial to forecast seasonal droughts well in advance so that water managers and stakeholders can prepare to mitigate deleterious impacts. We developed a new cluster-based empirical forecast method to predict precipitation anomalies in winter. This algorithm considers not only the strength but also the pattern of the precursors. We compare our algorithm with dynamical forecast models and a canonical correlation analysis-based prediction method, demonstrating that our prediction method performs better in terms of time and pattern correlation in the Mediterranean and European regions.
Plain Language Summary
We have applied a new forecasting technique that involves machine learning to the problem of seasonal prediction. By recognizing related and recurring patterns in both the predictors and the predictands, our new technique shows improved accuracy in predicting winter precipitation in the European and Mediterranean regions. Our demonstrated technique outperforms both statistical and dynamical models over comparable historical periods.
1 Introduction
European climate is characterized by four distinct climate zones: Mediterranean climate in southern Europe, continental climate in eastern Europe, maritime climate in western Europe, and a hybrid maritime/continental climate in central Europe (Hess & Tasa, 2011). These climate regions are sensitive to the large-scale circulation of the atmosphere. Even relatively minor modifications of the general circulation result in shifts of the midlatitude storm tracks and substantial changes in the climate (Giorgi, 2006; Kröner et al., 2017; Lionello et al., 2006; Mbengue & Schneider, 2013). The Mediterranean climate is especially vulnerable to climate change due to its unique topography and geographical location and is therefore considered one of the primary climate change hot spots, since strong climate changes are projected in this region (Diffenbaugh & Giorgi, 2012; Dubrovsky et al., 2014; Giorgi, 2006; Intergovernmental Panel on Climate Change (IPCC), 2013). The Mediterranean climate is generally characterized by hot and dry summers and mild winters, with winter rainfall more than 3 times larger than summer rainfall (Ducrocq et al., 2014; Flaounas et al., 2013). Winter rainfall is mostly determined by synoptic storms coming from the North Atlantic (Giorgi & Lionello, 2008). In general, winter Mediterranean precipitation is related to the North Atlantic Oscillation (NAO) over its western areas and to the East Atlantic (EA) and other patterns over its northern and eastern areas (Giorgi & Lionello, 2008; Kröner et al., 2017; Lionello et al., 2006; Mbengue & Schneider, 2013; Ullbrich et al., 2006). Other studies argue that the El Niño Southern Oscillation (ENSO) also affects precipitation in the Mediterranean region (Park & Leovy, 2004; Shaman & Tziperman, 2011). Both Park and Leovy (2004) and Shaman and Tziperman (2011) suggest that a Rossby wave train originating from the Pacific in autumn might affect the Mediterranean winter climate (Gurmy et al., 2012).
Other potentially important factors are sea ice concentration and snow cover extent with Eurasian snow cover in autumn significantly correlating with the wintertime Arctic Oscillation and mean sea level pressure (Cohen & Jones, 2011; Furtado et al., 2016).
Hoerling et al. (2012) show that the Mediterranean region is already drying. Between 1988 and 2008 the region had 10 of the 12 driest winter seasons (Hoerling et al., 2012). Anomalously high seasonal temperatures or low precipitation have recently been observed in the Mediterranean region, decimating agricultural yields and causing damage of up to hundreds of millions of dollars each year (Ducrocq et al., 2014). Future climate simulations project an even stronger trend toward drying (Giorgi, 2006; IPCC, 2013; Paeth et al., 2017; Xoplaki et al., 2004). Although it is well known that anthropogenic greenhouse gas forcing leads to enhanced warming in the Mediterranean region, there is uncertainty in projected changes in precipitation due to large internal variability and a relatively small forced signal (Hoerling et al., 2012; Mariotti et al., 2015). However, both Hoerling et al. (2012) and Mariotti et al. (2015) show that the probability distribution of future precipitation anomalies is shifted toward drier conditions.
Early forecasts of seasonal drought, based on techniques such as multimodel ensemble predictions, dynamical or statistical downscaling, and empirical forecast approaches, could play an important role in mitigating possible future impacts. Examples of empirical forecast approaches are multiple linear regressions or canonical correlation analysis (CCA), which is a form of linear multiple regression applied to multivariate pattern predictands (Barnett & Preisenberger, 1987; Barnston et al., 1996; Chu et al., 2008; Doblas-Reyes et al., 2000; Eden et al., 2015; Hwang et al., 2001; Yatagai et al., 2014). However, global circulation models have little or no skill in predicting European precipitation during December-January-February (DJF) (Doblas-Reyes et al., 2009; Weisheimer & Palmer, 2014). Weisheimer and Palmer (2014) classify seasonal forecasts of wintertime rainfall over the Mediterranean region as only marginally useful for decision makers and policymakers. Likewise, previous empirical forecast approaches such as multiple linear regressions or CCA have essentially no skill over this region for predicting rainfall (Barnston & Smith, 1996; Eden et al., 2015). For example, Eden et al. (2015) developed a simple empirical system for predicting seasonal surface air temperature and precipitation across the globe using global and local atmospheric and oceanic fields. In particular, they used CO2 concentration to predict the climate change signal and additional predictors describing large-scale modes of variability in the climate system (e.g., ENSO) to forecast the variability in the climate system. The hindcast-observation correlation for the time range 1961–2013 is generally low over the globe, with positive skill in some parts of northern Eurasia. The mean correlation over Europe and the Mediterranean region is almost zero. Likewise, the CCA-based algorithm of Barnston et al. (1996) for the Northern Hemisphere has only weak skill over Europe and the Mediterranean region. Barnston et al. (1996) used a reconstructed sea surface temperature (SST) data set as the only predictor to hindcast near-global SST and seasonal mean surface temperature and precipitation based on the 1950–1992 period. The hindcast skill for Europe is generally poor, and the average skill for DJF is roughly 0.1 for zero season lead time. The weak skill for Europe does not imply that statistical methods are unsuitable for seasonal forecasts. Possible reasons also include the chosen predictors, the chosen predictor regions, or the chosen time lags.
Here we propose a novel empirical prediction system that is more skillful and hence could possibly ease the decision-making process of stakeholders interested in seasonal prediction (Barriopedro et al., 2011). The novelty of our method to forecast winter European and Mediterranean precipitation is that it accounts not only for the amplitude of predictors but also for the geographical patterns using clustering techniques. Similar to CCA, clusters in our algorithm were used to describe the dominant patterns of the precipitation anomalies over Europe and the Mediterranean region with the advantage that those states do not have to be orthogonal to each other. The forecast algorithm calculates precipitation anomalies in winter with the analyzed precursors in autumn.
2 Data and Methods
2.1 Data
In this study, we calculated detrended precipitation anomalies from a gridded data set of precipitation provided by the “European Climate Assessment and Data Set Project” (Haylock et al., 2008). This data set is on a 0.5° × 0.5° grid over the area from 25° to 75° latitude and −20° to 45° longitude for the winter time period (December-January-February, DJF) 1967 to 2016. The anomaly fields are smoothed using a Gaussian filter (σx = 2.7, σy = 2.7). In addition, we used several detrended precursor fields in autumn (September-October-November, SON) for the overlapping period 1966 to 2015, including sea ice concentration (sic) from the Met Office Hadley Centre (at 2.5° × 2.5°) for the area from 60° to 90° latitude and 0° to 180° longitude. We further include snow cover extent (sce) provided by NOAA (Robinson et al., 2012) using the area from 30° to 60° latitude and 0° to 180° longitude. The choice of this area is motivated by the snow advance index, which computes the snow cover extent over the same area (Cohen & Jones, 2011). Furthermore, we include sea surface temperatures (sst) from the Met Office Hadley Centre (at 1° × 1°) using three different regions: the tropical Pacific (−40° to 20° latitude and 130° to 290° longitude), the North Atlantic (0° to 65° latitude and −35° to 6° longitude), and the Mediterranean region (30° to 50° latitude and −6° to 45° longitude) (Rayner et al., 2013). In addition, we include geopotential height (gph) at 500 mb using the area from −20° to 90° latitude and 0° to 360° longitude (Kalnay et al., 1996). The same area is used for sea level pressure (slp). Atmospheric data are from the National Centers for Environmental Prediction (NCEP)/National Center for Atmospheric Research (NCAR; at 2.5° × 2.5°) (Kalnay et al., 1996).
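The preprocessing described above (detrending followed by Gaussian smoothing) might be sketched as follows. This is a minimal illustration rather than the processing code used in the study: the function names are hypothetical, a linear trend is assumed, σ is assumed to be given in grid cells, the reflecting boundary treatment is an assumption, and missing values are not handled.

```python
import numpy as np

def detrended_anomalies(field):
    """Remove the least-squares linear trend (and hence the mean) at each grid point.

    field: array of shape (n_years, n_lat, n_lon) without missing values.
    """
    n_years = field.shape[0]
    t = np.arange(n_years, dtype=float)
    flat = field.reshape(n_years, -1)
    # polyfit fits all grid points at once; each result has shape (n_points,)
    slope, intercept = np.polyfit(t, flat, deg=1)
    trend = np.outer(t, slope) + intercept
    return (flat - trend).reshape(field.shape)

def gaussian_smooth(field2d, sigma=2.7, truncate=4.0):
    """Separable Gaussian smoothing of one map; sigma in grid cells (assumption)."""
    radius = int(truncate * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    kernel /= kernel.sum()
    out = np.asarray(field2d, dtype=float)
    for axis in (0, 1):
        pad = [(0, 0), (0, 0)]
        pad[axis] = (radius, radius)
        # reflect at the domain edges (an assumed boundary treatment)
        padded = np.pad(out, pad, mode="reflect")
        out = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="valid"), axis, padded
        )
    return out
```

Each winter map would be passed through `gaussian_smooth` after `detrended_anomalies` has been applied to the full (year, latitude, longitude) array.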
In addition, we calculated the ensemble model mean for nine models of hindcast experiments provided by the North American Multimodel Ensemble (NMME: CMC1-CanCM3, CMC2-CanCM4, NCAR-CESM1, NCEP-CFSv2, COLA-RSMAS-CCSM3, COLA-RSMAS-CCSM4, NASA-GMAO, IRI-ECHAMP4p5-DirectCoupled, and IRI-ECHAMP4p5-AnomalyCoupled) (Kirtman et al., 2014). NMME System Phase II data (https://www.earthsystemgrid.org/search.html?Project=NMME) were used in these analyses. The NMME is a multiagency project under the guidance of the United States National Oceanic and Atmospheric Administration (NOAA). The NMME System is designed to leverage coupled models from a number of United States and Canadian modeling centers in an ensemble of opportunity supporting seasonal forecasting experiments (Kirtman et al., 2014). NMME models are coupled to ocean models, and most of the NMME models have an ice component model. Some models also use a land component model including soil moisture and snow cover (Collins et al., 2005; De Witt, 2005; Gent et al., 2011; Merryfield et al., 2013; Saha et al., 2014). The real-time and retrospective forecasts are issued on the fifteenth of each month. For example, a November 2010 monthly mean forecast issued on 15 November 2010 has a 0.5 month lead, the December 2010 monthly mean forecast issued on 15 November has a 1.5 month lead, and so on. The hindcast start times should include all 12 calendar months. However, the specific day of the month and the ensemble generation strategy are left to the forecast provider. Hence, different models are initialized on different start days; for example, the model CMC2-CanCM4 initializes all ensemble members on the first of a month, whereas CFSv2 initializes four members (0000, 0600, 1200, and 1800 UTC) every fifth day. In the present work we evaluated NMME forecasts issued on 15 November and 15 December for the DJF period.
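The lead-time convention quoted above can be written down in a few lines. The helper name `lead_months` is hypothetical, and extending the rule beyond the two examples given in the text (the "and so on") is an assumption.

```python
def lead_months(issue_month, target_month):
    """Lead time in months for a forecast issued on the 15th of issue_month.

    Months are numbered 1-12; a target month that comes earlier in the
    calendar than the issue month is taken to lie in the following year.
    """
    return (target_month - issue_month) % 12 + 0.5

# Leads of the three DJF months for a forecast issued on 15 November
djf_leads = {month: lead_months(11, month) for month in (12, 1, 2)}
```

Under this convention, the 15 November forecast evaluated here reaches the DJF months at leads of 1.5, 2.5, and 3.5 months.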
2.2 Clustering-Based Forecast Approach
To obtain the prediction for the winter precipitation anomaly, we proceed as follows (further details as well as an example using a toy problem are provided in the supporting information, SI). We calculate the cluster structures by applying hierarchical clustering to the winter precipitation anomalies over the domain of interest. Hierarchical clustering (e.g., Cheng & Wallace, 1993; Feldstein & Lee, 2014; Horton et al., 2015; Kretschmer et al., 2017; Lee & Feldstein, 2013) is a common and powerful clustering analysis procedure. The precipitation anomaly at each season is arranged as a vector data point. The algorithm then constructs a hierarchy of clusters by merging one pair of nearest data points or clusters of points at each step (Wilks, 2011). Standard measures are used to determine when to stop the merging and therefore the appropriate number of clusters, Nclusters (Figure S3 in the SI).
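The merging procedure can be illustrated with a naive agglomerative clustering routine. This is a sketch under stated assumptions, not the implementation used in the study: the linkage (average) and distance (Euclidean) are assumptions because the text does not specify them, the function name is hypothetical, and the number of clusters is passed in directly rather than derived from the stopping measures.

```python
import numpy as np

def agglomerative_clusters(data, n_clusters):
    """Naive agglomerative clustering with average linkage (an assumption).

    data: (n_seasons, n_gridpoints), one flattened anomaly map per row.
    Returns one integer cluster label per season.
    """
    n = data.shape[0]
    # Pairwise Euclidean distances between seasons via the Gram matrix
    sq = (data ** 2).sum(axis=1)
    dist = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * data @ data.T, 0.0))
    clusters = [[i] for i in range(n)]
    while len(clusters) > n_clusters:
        best = (np.inf, 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # average distance between all members of the two groups
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # merge the nearest pair and drop the absorbed cluster
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = np.empty(n, dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels
```

With roughly 50 winter seasons the cubic cost of this loop is negligible; a library routine would be preferable for larger samples.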
We use bold upper case variable names to denote clusters and composites, and lower case bold variable names to denote time series data. Given the current state of the precursors, we now produce the prediction as follows. First, we find the projection of the state of the predictors averaged over the autumn of year t (denoted precursorSON(t)) on the predictor composites. Each combined predictor composite is associated with a precipitation cluster and therefore provides information about the amount and spatial structure of winter precipitation anomaly expected given the autumn predictor composite. This allows us to calculate the expected precipitation pattern due to the projection of the current state of predictors on each cluster. We expand the current precursor state in terms of the precursor composites as
$$\mathbf{precursor}_{SON}(t) \approx \sum_{i=1}^{N_{clusters}} a_i(t)\,\mathbf{COMPOSITE}_i \qquad (1)$$
The expansion may only be approximate because the composites are not necessarily a complete set of vectors. To find the expansion coefficients ai(t), we multiply equation 1 by precursor composite COMPOSITEj and solve the equation for the coefficients ai(t) at every time step (year t) in the data using the SVD-based pseudoinverse (SI).
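The expansion step and the resulting forecast might be sketched as below. The function and argument names are hypothetical, and the forecast is assumed to be the coefficient-weighted sum of the precipitation clusters. `np.linalg.pinv` computes the Moore-Penrose pseudoinverse via SVD, which yields the same least-squares coefficients as projecting onto each composite and solving the resulting system whenever that system is well conditioned.

```python
import numpy as np

def cluster_forecast(precursor_son, composites, clusters):
    """Expand the autumn precursor state in the composites, then weight the clusters.

    precursor_son: (n_precursor_points,) autumn precursor state for one year
    composites:    (n_clusters, n_precursor_points) precursor composites
    clusters:      (n_clusters, n_precip_points) winter precipitation clusters
    """
    # Least-squares solution of precursor_son ~= sum_i a_i * composites[i]
    coeffs = np.linalg.pinv(composites.T) @ precursor_son
    # Assumed forecast: clusters weighted by the expansion coefficients
    forecast = coeffs @ clusters
    return coeffs, forecast
```

Because the composites need not be orthogonal, the coefficients are solved for jointly instead of being read off from separate inner products.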
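The predicted winter precipitation anomaly is then the sum of the precipitation clusters weighted by the expansion coefficients:
$$\mathbf{prcp}_{DJF}(t) = \sum_{i=1}^{N_{clusters}} a_i(t)\,\mathbf{CLUSTER}_i \qquad (2)$$
2.3 Canonical Correlation Analysis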
CCA is a statistical technique that identifies the linear associations among two data sets of variables, that is, it relates variations in predictor fields to variations in predictand fields (Barnett & Preisenberger, 1987; Barnston et al., 1996; Wilks, 2011; Xoplaki et al., 2004). By construction, the identified linear combinations of variables are maximally correlated. We apply this method in order to compare our algorithm with the already established pattern-based method CCA. For comparison, we use the same input predictors as used for the clustering-based method.
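A compact way to illustrate CCA on gridded fields is to truncate both fields to their leading EOFs first and then take the SVD of the cross-product of the normalized principal components. This is a sketch of that general prefiltering approach, not necessarily the exact configuration used here; the function name and the default of three EOFs are assumptions for illustration.

```python
import numpy as np

def cca_with_eof_truncation(x, y, n_eofs=3):
    """CCA after truncating both fields to their leading EOFs.

    x: (n_years, n_predictor_points), y: (n_years, n_predictand_points).
    Returns the canonical correlations and the paired canonical time series.
    """
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)
    # Left singular vectors are the normalized principal-component time series
    ux = np.linalg.svd(xc, full_matrices=False)[0][:, :n_eofs]
    uy = np.linalg.svd(yc, full_matrices=False)[0][:, :n_eofs]
    # With orthonormal PCs, the canonical correlations are the singular
    # values of the cross-product matrix between the two PC sets
    a, corr, bt = np.linalg.svd(ux.T @ uy)
    return corr, ux @ a, uy @ bt.T
```

The first columns of the two returned time series form the most strongly correlated pair of linear combinations, and the singular values give the correlations in decreasing order.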
3 Results and Discussion
3.1 Clusters and Composites
The appropriate number of clusters is found to be three, based on standard measures (Figure S3), and the clusters are shown in Figure 1, ordered by their frequency: 48% of the winter seasons fall within cluster 1 (Figure 1a), 28% within cluster 2 (Figure 1b), and 24% in cluster 3 (Figure 1c).

To explore the possible precursors associated with each cluster pattern, composites for sea ice concentration (sic), snow cover extent (sce), sea surface temperature in the Mediterranean region (sstMedi), Atlantic (sstAtl), and tropics (sstTropics), and geopotential height (gph) and sea level pressure (slp) are calculated (Figures S5–S7). Examples are shown for the precursors sic, sce, sstMedi, and sstAtl for cluster 2, because (as shown below) those precursors give the best forecast skill across all clusters (Figure 2). We calculated the precursor anomalies to show the different patterns. In the algorithm we do not use precursor anomalies but the actual precursor values. The prediction of seasonal precipitation anomalies is then obtained by the procedure that is described in section 2.2 and schematically shown in Figure S8.
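A composite in this sense can be sketched as the average of the autumn precursor fields over the years whose following winter falls in a given cluster. The function name and the flattened array layout are assumptions made for illustration.

```python
import numpy as np

def precursor_composites(precursor, labels):
    """Average the autumn precursor fields over the years of each cluster.

    precursor: (n_years, n_points) autumn fields, aligned year by year with labels
    labels:    (n_years,) cluster index of the following winter
    Returns an array of shape (n_clusters, n_points), ordered by cluster index.
    """
    labels = np.asarray(labels)
    cluster_ids = np.unique(labels)
    return np.stack([precursor[labels == k].mean(axis=0) for k in cluster_ids])
```

Several precursor fields could be combined by concatenating their flattened maps along the second axis before calling the function.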

All three clusters of precipitation anomalies exhibit distinct properties: Cluster 1 reveals a weak drying structure with positive precipitation anomalies across the north of Europe, whereas cluster 2 corresponds primarily to a positive NAO (Figure 1). The corresponding composites of cluster 2 reveal patterns that are associated with a positive NAO pattern. The composites of cluster 3 reveal patterns that are associated with a negative NAO pattern. The typical patterns of the precursors for a positive NAO pattern are shown in Figure 2: sce exhibits more negative snow anomalies, sic exhibits more positive sea ice concentration anomalies, sstMedi exhibits more positive temperature anomalies, and sstAtl shows a tripole temperature anomaly pattern.
Other precursors were investigated but found to be less skillful than the set of precursors sic, sce, sstMedi, and sstAtl, which achieves the highest skill score.
The physical mechanisms by which the three types of precursors lead to more precipitation are as follows. The Mediterranean Sea is a major moisture source (Lionello et al., 2006). In late October and early November low pressure systems develop in the Mediterranean region due to the convergence of maritime tropical air from the Atlantic, maritime polar air from the North Atlantic and northwest Europe, maritime Arctic and continental Arctic air from the Arctic and northern Russia, and continental tropical air from the Sahara. The cyclogenesis is energized by the sea surface temperature, which enhances evaporation and atmospheric transport and brings the winter precipitation (Smithson et al., 2013). The Alps deflect the water-saturated wind, which can lead to more rainfall.
Positive North Atlantic SST anomalies across the midlatitudes and negative North Atlantic SST anomalies in the subtropics lead to a southward shift of storm tracks from western Europe toward the Mediterranean region. The combination of both the shift of the storm tracks and the local cyclogenesis produces the spatial distribution of the precipitation pattern (positive precipitation anomalies over the western and central Mediterranean region) (Xoplaki et al., 2004).
Sea ice loss in September and October warms the atmosphere and leads to an increase of the geopotential height, which forces the jet stream southward over east Siberia. This southward shift of the jet stream is associated with a southward shift of the storm tracks, leading to more Eurasian snow cover in October. In addition, the ice-free ocean contributes to an increased moisture flux into the atmosphere, which precipitates as snow farther south over Siberia. The anomalously high Eurasian snow cover cools the surface, which increases the surface pressure and reduces the geopotential heights in the lower and middle troposphere. This planetary wave configuration enhances vertical wave propagation from the troposphere into the stratosphere, which weakens the stratospheric polar vortex and results in a stratospheric warming event. In January and February, the lower stratospheric anomalies propagate downward into the troposphere, inducing a negative phase of the NAO and hence an equatorward shift of the polar jet and storm track. These displacements are followed by a southward shift in the storm tracks across the midlatitudes and wetter conditions across the Mediterranean region (Cohen et al., 2014).
3.2 Forecast Using Cluster Analysis and Comparison With NMME and CCA
We calculated the cross-validated correlation between the hindcasts and observations of winter precipitation anomalies for the time period 1967 to 2016, as well as for the time period 1982 to 2010 in order to compare the results with the NMME ensemble that is provided for these years. We also present the results of the CCA empirical prediction method. For the cross-validation, we used all data except those from the year to be predicted to compute the clusters, the composites, and finally the hindcast. Such a prediction is performed for all years. Figure 3a shows the correlation for 1967 to 2016. Significant values (P < 0.05) according to the two-sided Student t test are hatched. The correlation is in general positive, except for some parts of southern Sweden, Morocco, some regions of northern Algeria and Libya, as well as Georgia. The mean correlation is 0.22.
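The leave-one-year-out procedure and the grid point correlation can be sketched generically. Here `fit_predict` is a hypothetical callback standing in for any forecast method (cluster-based or CCA); it is not a function from the study, and the significance test is omitted.

```python
import numpy as np

def leave_one_out_hindcasts(predictors, observations, fit_predict):
    """Hindcast each year from a model trained on all other years.

    fit_predict(train_x, train_y, test_x) is a hypothetical callback that
    trains on the given years and returns the forecast map for test_x.
    """
    n_years = observations.shape[0]
    hindcasts = np.empty_like(observations, dtype=float)
    for year in range(n_years):
        keep = np.arange(n_years) != year  # withhold the target year
        hindcasts[year] = fit_predict(
            predictors[keep], observations[keep], predictors[year]
        )
    return hindcasts

def gridpoint_correlation(hindcasts, observations):
    """Pearson correlation over time at each grid point; inputs are (n_years, n_points)."""
    h = hindcasts - hindcasts.mean(axis=0)
    o = observations - observations.mean(axis=0)
    return (h * o).sum(axis=0) / np.sqrt((h ** 2).sum(axis=0) * (o ** 2).sum(axis=0))
```

Averaging the output of `gridpoint_correlation` over the domain gives a single mean correlation of the kind quoted in the text.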

In contrast, the correlation between the CCA forecast and observations is mixed, with weak positive correlations in Central Europe, weak negative correlations at the margins of Europe and the Mediterranean region (Figure 3b), and a mean correlation of 0.05. Both the clustering and CCA approaches use the entire fields of the predictor and predictand, so one might expect them to perform similarly in terms of prediction skill. It is possible that the CCA performed less well because it is based on an empirical orthogonal function (EOF) expansion of the predictor and predictand, while the clusters represent common patterns that are not necessarily orthogonal and are thus less restrictive. We truncated the expansion of the predictor and predictand at three EOFs, although the skill with only two EOFs was nearly as good. There are possibly additional refinements to the CCA analysis that could have been used, but a more thorough analysis of the difference between the two approaches is beyond the scope of this paper.
The mean correlation of the NMME is 0.13 whereas the cluster-based method for the same time range exhibits a mean correlation of 0.20 (compare Figure 3c and Figure 3d). The cluster-based method has high skill over Central Europe, the Iberian Peninsula, and the Eastern Mediterranean. The NMME forecast shown in Figure 3d is based on an initialization on 15 November, while the CCA and cluster forecast, being based on seasonal averages, use data from September, October, and November. An NMME forecast similarly issued on 1 December is not available, and we present instead the results of the forecast issued on 15 December in the supporting information (Figure S10). The mean correlation in that case is 0.21, marginally better than that of the cluster analysis (0.20) although in this case the NMME prediction is based on December data and provides a December prediction, explaining the good skill in this case.
To investigate whether the Gaussian filter plays a role in our method, we show in Figure S9a the correlation for the precursors sic, sce, sstMedi, and sstAtl, but without using the Gaussian filter. The correlation structure in central Europe is almost the same, but the Gaussian filter smooths the field, leading to higher correlation values. Most negative correlations vanish due to the smoothing. We also show correlations for other precursor combinations, namely sstMedi and sstAtl (Figure S9b), sce and sic (Figure S9c), and all three sst regions (Figure S9d), for the years 1967 to 2016.
Those plots reveal that the precursors sstMedi and sstAtl are likely more important in predicting prcp anomalies for southern, central, and eastern Europe, whereas the precursors sic and sce are more relevant to predict prcp anomalies in the southeastern and northern part of the Mediterranean region (compare Figure S9b and Figure S9c). While the sstMedi and sstAtl as precursors have moderate correlation with observational data, all three chosen sst regions have a low correlation and are less skillful than sic and sce in predicting prcp anomalies.
Finally, we compared the pattern correlation of our method for two different time ranges, 1967–2016 and 1982–2010, with the NMME and CCA (Figure 4). The pattern correlation of the cluster-based method is positive in most years of the longer time range and negative in only a few, with a mean pattern correlation of 0.20 (black line in Figure 4a; the red line represents the mean value). The dashed black line shows the mean pattern correlation of CCA (0.01). The gray line shows the pattern correlation of the NMME, with a mean pattern correlation of 0.05, and the red dashed line shows the mean pattern correlation of our hindcast method for the same time range (0.18). The pattern correlation of the NMME hindcasts issued on 15 December, shown in Figure S10, is also lower (0.14) than that of the clustering method.
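For reference, a pattern correlation for a single year can be computed as the centered spatial correlation between the hindcast map and the observed map. The sketch below assumes both maps share a grid and contain no missing values, and it applies no area weighting, which may differ from the computation behind Figure 4; the function names are hypothetical.

```python
import numpy as np

def pattern_correlation(hindcast_map, observed_map):
    """Centered spatial correlation between two maps on the same grid."""
    h = np.ravel(hindcast_map) - np.mean(hindcast_map)
    o = np.ravel(observed_map) - np.mean(observed_map)
    return float(h @ o / np.sqrt((h @ h) * (o @ o)))

def yearly_pattern_correlations(hindcasts, observations):
    """One pattern correlation per year for arrays shaped (n_years, ...)."""
    return np.array(
        [pattern_correlation(h, o) for h, o in zip(hindcasts, observations)]
    )
```

The mean of the yearly values corresponds to the mean pattern correlations compared in this section.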

The results of the pattern correlations show that the cluster method resembles the observations more closely than NMME or CCA. We plotted the hindcast with the highest pattern correlation (year 2010) and the hindcast with the lowest pattern correlation (year 2003) as well as the observed data for those hindcasts (Figures 4b–4e).
Comparing the time plot of the clusters in Figure S3 with the pattern correlation plot (Figure 4a) reveals that clusters 2 and 3 have the best forecast skill. This likely stems from the fact that these clusters represent, respectively, a clear positive and negative NAO state, whereas the other one has a more complicated geographically distributed structure. This result suggests that extreme NAO states have better predictability than intermediate states.
4 Conclusion
This study presents a new cluster-based method to predict the precipitation anomalies in the European and Mediterranean regions using autumn precursors. The advantage of this approach is that both the magnitude and spatial structure of the precursors are utilized in generating the predictions. Applying hierarchical clustering, we identified three clusters describing the dominant patterns of the precipitation anomalies over Europe. From those clusters we calculated the composites of different precursors. To predict precipitation anomalies, we first computed the projection of the current state of the predictors onto the composites. Each predictor composite is associated with a precipitation cluster and provides information about the amount and spatial structure of winter precipitation expected given the autumn predictor composite. Thus, we can calculate the expected precipitation pattern by multiplying each cluster by its projection coefficient and summing all products.
The cluster-based method achieves higher forecast skill in time and pattern correlation than a CCA-based prediction algorithm using the same predictor fields for both methods. In addition, the cluster-based method performs better than the NMME models in terms of pattern and time correlation.
Our algorithm also achieves higher skill than other empirical methods used in the past, such as the multiregression model developed by Eden et al. (2015) or the CCA-based algorithm used by Barnston et al. (1996).
The method could be applied to temperature and precipitation anomalies in other regions or even possibly to forecast extreme weather.
Acknowledgments
The work was supported by the German Federal Ministry of Education and Research, grant 01LN1304A (S. M. and D. C.). J. C. is supported by the National Science Foundation grants AGS-1303647 and PLR-1504361. E. T. is supported by the National Science Foundation climate dynamics program, grant AGS-1622985, and thanks the Weizmann Institute for their hospitality during parts of this work. This work was supported by the National Science Foundation Large-Scale and Climate Dynamics Program (grants AGS-1303647 and AGS-1303604) and the National Science Foundation Division of Polar Programs (grant PLR-1504361). The research project resulted from S. M. visiting J. C., and S. M. would like to thank AER and Harvard for hosting. NCEP Reanalysis-derived data provided by the NOAA/OAR/ESRL PSD, Boulder, Colorado, USA, are available through their website at http://www.esrl.noaa.gov/psd/. SCE is available through the website https://climate.rutgers.edu/snowcover/. NMME System Phase II data are available through the website.





