# Probabilistic downscaling approaches: Application to wind cumulative distribution functions

## Abstract

[1] A statistical method is developed to generate local cumulative distribution functions (CDFs) of surface climate variables from large-scale fields. Contrary to most downscaling methods, which produce continuous time series, our “probabilistic downscaling method” (PDM), named “CDF-transform”, is designed to deal with and provide local-scale CDFs through a transformation applied to large-scale CDFs. First, our PDM is compared to a reference method (Quantile-matching), and validated on a historical time period by downscaling CDFs of wind intensity anomalies over France, for reanalyses and simulations from a general circulation model (GCM). Then, CDF-transform is applied to GCM output fields to project changes in wind intensity anomalies for the 21st century under the A2 scenario. Results show a decrease in wind anomalies for most weather stations, ranging from less than 1% (in the South) to nearly 9% (in the North), with a maximum in the Brittany region.

## 1. Introduction

[2] A robust general circulation or “climate” model (GCM) is characterized (at least) by its ability to simulate key climate variables with correct statistical properties: modes, variability, extreme event return levels or periods, et cetera. Although GCMs are useful tools to generate spatially and temporally coherent large-scale statistics, computational limitations currently prohibit GCMs from performing global simulations at the high spatial resolution required to generate useful climate information at regional or local scales [*Wilks and Wilby*, 1999], which is indispensable to drive climate impact studies [*Giorgi et al.*, 1990]. Dynamical or statistical downscaling methods aim at bridging this gap. Regional Climate Models (RCMs) constitute the dynamical approach [*Chen et al.*, 2003]. Resolving physical equations of the atmospheric regional dynamics, RCMs are meteorologically consistent [*Wood et al.*, 2004] but are also computationally expensive and therefore restricted in their applications to a few runs. By contrast, because of their low computational cost and their flexibility (e.g., for extremes, uncertainty), statistical downscaling methods (SDMs) have recently received a surge of interest. Transfer functions [*Wilby et al.*, 2002; *Cannon and Whitfield*, 2002], stochastic weather generators [*Wilks and Wilby*, 1999; *Semenov et al.*, 1998], and weather typing approaches [*Vrac et al.*, 2007] are the three main SDM categories. Those are usually applied to GCM outputs or reanalyses to statistically generate local climate variables such as temperature or precipitation. However, they can also be applied to large-scale climate statistics to provide local-scale climate statistics [*Pryor et al.*, 2005]. This latter context is retained for this study. Hence, since our goal here is to downscale statistical characteristics, and not directly to provide local-scale values as in a usual SDM approach, we will speak of probabilistic downscaling methods (PDMs).
While classical SDMs assume direct relationships between large- and local-scale climate, PDMs model relationships between their associated statistical properties. In the present work, cumulative distribution functions (CDFs) are used. In other words, the basic question we are trying to answer is: from a CDF describing a climate variable (say the wind intensity) at a large (GCM) scale, can we model the equivalent CDF at a finer scale, say at a weather station? If so, how should we proceed? Note that, if the statistical characteristics of a variable can be downscaled (i.e., CDFs in this work), local values can easily be generated to create realistic local-scale time series.

[3] Modeling this link between large- and local-scale statistics brings up two problems: (1) it can be highly non-linear and difficult to build; (2) predictands and predictors are often non-trivial and generally do not belong to a well-known distribution family such as the Gaussian family. Thus, an idea shared by the two methods presented in this work is to make no assumption about either the shape of the relationship to be modeled or the family of the CDFs, but rather to use non-parametric correspondences between the predictor and predictand CDFs.

[4] In the next section, we first remind the reader of a known PDM generating local-scale quantiles, and extend it to a non-parametric approach capable of modeling stationwise CDFs based on large-scale CDFs. In section 3, the data used in this work are introduced, and the two PDMs are validated on present climate, before applying the extension method to a future climate simulation. Some conclusions and perspectives are then given in the last section.

## 2. Two PDMs for CDFs

[5] Two probabilistic downscaling approaches, with the same philosophy, are presented. In this section, two time periods are considered; one corresponding to the calibration period, and the other one to the validation period for which local-scale CDFs have to be downscaled. In a climate change context, these time periods would correspond respectively to present and future periods.

### 2.1. Quantiles-Matching Method

[6] The Quantiles-matching approach (hereafter “Q-matching”) has been known for a while [see, e.g., *Panofsky and Brier*, 1958; *Haddad and Rosenfeld*, 1997] but only a few climate studies have applied it [e.g., *Déqué*, 2007]. This method is used as a reference in the present study.

[7] Let *F*_{S} stand for the CDF of a climate random variable, the predictand, observed at a given weather station during the calibration time period, and *F*_{G} for the CDF of the predictor variable from GCM outputs or reanalyses bi-linearly interpolated at the station location during the same time period. For simplicity in this paper, we assume that the predictor and the predictand are the same climate variable *X* for both methods (e.g., temperature, amount of precipitation, or wind velocity). *F*_{S}(*x*) and *F*_{G}(*x*) are non-linear and give the probability that *X* is below or equal to a given value *x*, i.e., *F*(*x*) = *Pr*(*X* ≤ *x*), respectively in the real data and GCM spaces.

[8] For a given large-scale value *x*_{G}, the basic idea of this method is to select a local-scale value *x*_{S} based on the assumption that

$$F_S(x_S) = F_G(x_G), \tag{1}$$

which is equivalent to

$$x_S = F_S^{-1}\big(F_G(x_G)\big), \tag{2}$$

where *F*_{S}^{−1}, defined from [0,1], is the inverse function of *F*_{S}. Applying relationship (2) to large-scale simulated data for a new (e.g., validation or future) time period makes it possible to build a new local-scale time series. Although the Q-matching method directly provides local-scale values, it is considered as a PDM in the sense given in the introduction section, since the downscaled values are local-scale quantiles, i.e., statistical characteristics. However, this method does not take into account the information on the distribution of the future modeled dataset. To overcome this potential issue, we propose a new probabilistic downscaling approach extending Q-matching.
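As an illustration only, the quantile mapping *x*_{S} = *F*_{S}^{−1}(*F*_{G}(*x*_{G})) can be sketched in a few lines of Python using empirical CDFs. The function names (`ecdf`, `quantile_fn`, `q_matching`), the step-function estimators and the toy numbers are assumptions made for this sketch; they are not the authors' implementation.

```python
import math
from bisect import bisect_right


def ecdf(sample):
    """Empirical CDF of `sample`: x -> fraction of values <= x."""
    data = sorted(sample)
    n = len(data)
    return lambda x: bisect_right(data, x) / n


def quantile_fn(sample):
    """Generalized inverse of the empirical CDF: p in [0, 1] -> data value."""
    data = sorted(sample)
    n = len(data)

    def quantile(p):
        # Smallest order statistic whose ECDF value reaches p
        # (the small tolerance guards against floating-point round-off).
        k = max(0, min(n - 1, math.ceil(p * n - 1e-12) - 1))
        return data[k]

    return quantile


def q_matching(x_gcm, obs_cal, gcm_cal):
    """Map a large-scale value to a local-scale quantile: F_S^{-1}(F_G(x_G))."""
    f_g = ecdf(gcm_cal)             # large-scale CDF, calibration period
    f_s_inv = quantile_fn(obs_cal)  # inverse local-scale CDF, calibration period
    return f_s_inv(f_g(x_gcm))


# Hypothetical calibration samples (arbitrary units).
obs_cal = [1.0, 2.0, 3.0, 4.0]
gcm_cal = [10.0, 20.0, 30.0, 40.0]
local_values = [q_matching(x, obs_cal, gcm_cal) for x in (5.0, 20.0, 100.0)]
```

In this sketch the output is always one of the calibration observations, so large-scale values beyond the calibration range are mapped to the smallest or largest observed value.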

### 2.2. CDF-Transform Method

[9] This approach (hereafter “CDF-t”) can be perceived as an extension of Q-matching, directly dealing with and providing CDFs. It is based on the assumption that there exists a transformation *T* that makes it possible to “translate” the CDF of a GCM variable (such as temperature, precipitation or wind intensity), i.e., the predictor, into the CDF representing the local-scale climate variable, i.e., the predictand, at a given weather station.

[10] Let *F*_{Sh} stand for the CDF of observed local data at a weather station for the historical calibration period, and *F*_{Gh} for the CDF of GCM outputs bi-linearly interpolated at the station location for the same time period. *F*_{Sf} and *F*_{Gf} are the CDFs equivalent to *F*_{Sh} and *F*_{Gh} but for a future (or simply different) time period. Then, assuming that we know *F*_{Gf} (that can be modeled through future GCM outputs), and that there exists a transformation *T*: [0,1] → [0,1] such that

$$T\big(F_{Gh}(x)\big) = F_{Sh}(x), \tag{3}$$

how can we define *F*_{Sf} by applying *T* to *F*_{Gf}?

[11] First, we need to model *T*. A simple way to do so is to replace *x* by *F*_{Gh}^{−1}(*u*) in equation (3), where *u* belongs to [0,1]. We then obtain

$$T(u) = F_{Sh}\big(F_{Gh}^{-1}(u)\big), \tag{4}$$

which provides a simple definition of *T*. Hence, assuming that relationship (4) will remain valid in the future, i.e., that *F*_{Sf}(*x*) = *T*(*F*_{Gf}(*x*)), the researched CDF is provided by

$$F_{Sf}(x) = F_{Sh}\Big(F_{Gh}^{-1}\big(F_{Gf}(x)\big)\Big). \tag{5}$$

[12] From a technical/algorithmic point of view, the CDF transform approach is defined in two steps:

[13] 1. The estimates of *F*_{Sh}, *F*_{Gh}^{−1} and *F*_{Gf}, respectively *F̂*_{Sh}, *F̂*_{Gh}^{−1} and *F̂*_{Gf}, are empirically modeled respectively from the historical observations and the historical and future large-scale simulated data.

[14] 2. Then, by combining them according to equation (5), we obtain *F̂*_{Sf}, an estimate of *F*_{Sf}.
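The two steps above can be sketched as follows, composing the three empirical estimates as *F̂*_{Sh}(*F̂*_{Gh}^{−1}(*F̂*_{Gf}(*x*))). This is a minimal illustration under stated assumptions: the helper names (`ecdf`, `quantile_fn`, `cdf_t`), the step-function estimators and the toy samples are hypothetical, and no treatment of values outside the future simulated range is included.

```python
import math
from bisect import bisect_right


def ecdf(sample):
    """Empirical CDF of `sample`: x -> fraction of values <= x."""
    data = sorted(sample)
    n = len(data)
    return lambda x: bisect_right(data, x) / n


def quantile_fn(sample):
    """Generalized inverse of the empirical CDF: p in [0, 1] -> data value."""
    data = sorted(sample)
    n = len(data)

    def quantile(p):
        k = max(0, min(n - 1, math.ceil(p * n - 1e-12) - 1))
        return data[k]

    return quantile


def cdf_t(obs_hist, gcm_hist, gcm_fut):
    """Return an estimate of F_Sf as a function of x."""
    # Step 1: empirical estimates of F_Sh, F_Gh^{-1} and F_Gf.
    f_sh = ecdf(obs_hist)
    f_gh_inv = quantile_fn(gcm_hist)
    f_gf = ecdf(gcm_fut)
    # Step 2: combine them as F_Sh(F_Gh^{-1}(F_Gf(x))).
    return lambda x: f_sh(f_gh_inv(f_gf(x)))


# Hypothetical samples: the model is biased by +1 and shifts by +1 in the future.
obs_hist = [1.0, 2.0, 3.0, 4.0]
gcm_hist = [2.0, 3.0, 4.0, 5.0]
gcm_fut = [3.0, 4.0, 5.0, 6.0]
f_sf = cdf_t(obs_hist, gcm_hist, gcm_fut)
```

With these toy samples, `f_sf(4.0)` is 0.75 and `f_sf(6.0)` is 1.0, while for any *x* below the smallest future simulated value (3.0) the estimate stays at 0.5 instead of decreasing to 0.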

[15] However, *F*_{Sf}(*x*) defined through equation (5) is only valid for *x* in [*m*_{f}; *M*_{f}], where *m*_{f} and *M*_{f} are respectively the minimum and maximum values of the future simulation dataset. Indeed, let's take *x* lower (resp. higher) than *m*_{f} (resp. *M*_{f}). It leads to *F̂*_{Gf}(*x*) = 0 (resp. 1) and to *F̂*_{Gh}^{−1}(*F̂*_{Gf}(*x*)) = *m*_{h} (resp. = *M*_{h}), where *m*_{h} (resp. *M*_{h}) is the minimum (resp. maximum) of the historical simulated dataset. Hence, for all *x* ≤ *m*_{f}, *F̂*_{Sf}(*x*) is constant and equal to *F̂*_{Sh}(*m*_{h}), and for all *x* ≥ *M*_{f}, *F̂*_{Sf}(*x*) is constant and equal to *F̂*_{Sh}(*M*_{h}). Therefore, depending on the historical station dataset, *m*_{f} and *M*_{f}, *F̂*_{Sh}(*m*_{h}) and *F̂*_{Sh}(*M*_{h}) can be respectively different from 0 and 1. So, how to deal with *x* out of [*m*_{f}; *M*_{f}]?

[16] To answer this question, the method suggested by *Déqué* [2007] is retained: outside [*m*_{f}; *M*_{f}], a constant correction is applied. For example, if *F*_{Sf}(*m*_{f}) = *p*/100 (i.e., *m*_{f} is the *p*^{th} percentile), and this represents an increase of 2 m/s compared to the *p*^{th} percentile of the historical local-scale CDF, any wind anomaly value below *m*_{f} is corrected by +2 m/s for this station. An equivalent procedure is applied for *x* > *M*_{f}. *Déqué* [2007] assumed that “*more sophisticated methods would lack robustness and might introduce unphysical extreme values after correction*”. Consequently, although this method can be a bit restrictive for extremes, it should not produce aberrant extreme values.
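One possible reading of this constant correction, expressed on CDFs, is to shift the historical local-scale CDF by a fixed amount outside [*m*_{f}; *M*_{f}], the shift being the difference between *m*_{f} (or *M*_{f}) and the historical local-scale quantile at the same probability level. The sketch below follows that reading; it is an interpretation rather than the authors' code, and the names (`cdf_t_with_tails`, `shift_low`, `shift_high`) and toy samples are hypothetical.

```python
import math
from bisect import bisect_right


def ecdf(sample):
    """Empirical CDF of `sample`: x -> fraction of values <= x."""
    data = sorted(sample)
    n = len(data)
    return lambda x: bisect_right(data, x) / n


def quantile_fn(sample):
    """Generalized inverse of the empirical CDF: p in [0, 1] -> data value."""
    data = sorted(sample)
    n = len(data)

    def quantile(p):
        k = max(0, min(n - 1, math.ceil(p * n - 1e-12) - 1))
        return data[k]

    return quantile


def cdf_t_with_tails(obs_hist, gcm_hist, gcm_fut):
    """CDF-t estimate with a constant shift applied outside [m_f, M_f]."""
    f_sh, q_sh = ecdf(obs_hist), quantile_fn(obs_hist)
    q_gh, f_gf = quantile_fn(gcm_hist), ecdf(gcm_fut)
    m_f, m_f_max = min(gcm_fut), max(gcm_fut)

    def core(x):
        # Composition F_Sh(F_Gh^{-1}(F_Gf(x))), used inside [m_f, M_f].
        return f_sh(q_gh(f_gf(x)))

    # Assumed correction: future bound minus the historical local-scale
    # quantile taken at the same probability level.
    shift_low = m_f - q_sh(core(m_f))
    shift_high = m_f_max - q_sh(core(m_f_max))

    def f_sf(x):
        if x < m_f:
            return f_sh(x - shift_low)
        if x > m_f_max:
            return f_sh(x - shift_high)
        return core(x)

    return f_sf


# Hypothetical samples (same toy setting: model bias +1, future shift +1).
f_sf_tails = cdf_t_with_tails([1.0, 2.0, 3.0, 4.0],
                              [2.0, 3.0, 4.0, 5.0],
                              [3.0, 4.0, 5.0, 6.0])
```

In this toy case the lower shift is 1, so the estimate decreases to 0.25 at *x* = 2 and to 0 at *x* = 1.5 instead of remaining constant at 0.5 below *m*_{f} = 3.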

[17] We emphasize that the portion of the *F*_{Sf} domain such that *x* is outside [*m*_{f}; *M*_{f}] is very small in the application presented in section 3: the vast majority of the downscaled CDF will not come from the “constant correction” part but from equation (5).

[18] Although the two methods clearly have a similar philosophy, CDF-t takes into account the change in the large-scale CDF from the historical to the future time period, while Q-matching does not and only projects the simulated large-scale values onto the historical CDF to compute and match quantiles. Moreover, Q-matching cannot provide local-scale quantiles outside the range of the historical observations. This can be a clear restriction in a changing climate context, whereas CDF-t allows one to overcome this problem by taking advantage of the simulated future large-scale CDF.
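As a worked illustration (not part of the original study), suppose purely for the sake of the example that the three input CDFs are Gaussian, with means and standard deviations (μ_{Sh}, σ_{Sh}), (μ_{Gh}, σ_{Gh}) and (μ_{Gf}, σ_{Gf}), and let Φ denote the standard normal CDF. The CDF-t composition *F*_{Sf}(*x*) = *F*_{Sh}(*F*_{Gh}^{−1}(*F*_{Gf}(*x*))) then has a closed form:

$$
\begin{aligned}
F_{Gh}^{-1}\big(F_{Gf}(x)\big) &= \mu_{Gh} + \sigma_{Gh}\,\frac{x-\mu_{Gf}}{\sigma_{Gf}},\\
F_{Sf}(x) &= \Phi\!\left(\frac{\mu_{Gh} + \sigma_{Gh}\,(x-\mu_{Gf})/\sigma_{Gf} - \mu_{Sh}}{\sigma_{Sh}}\right) = \Phi\!\left(\frac{x-\mu_{Sf}}{\sigma_{Sf}}\right),\\
\mu_{Sf} &= \mu_{Gf} + \frac{\sigma_{Gf}}{\sigma_{Gh}}\,(\mu_{Sh}-\mu_{Gh}), \qquad \sigma_{Sf} = \sigma_{Sh}\,\frac{\sigma_{Gf}}{\sigma_{Gh}}.
\end{aligned}
$$

Under this Gaussian assumption, the downscaled future CDF is again Gaussian: its mean follows the future large-scale mean, corrected by the rescaled historical offset between station and model, and its spread is the historical station spread multiplied by the ratio of future to historical large-scale spreads. In particular, when the large-scale CDF does not change (μ_{Gf} = μ_{Gh} and σ_{Gf} = σ_{Gh}), the historical station CDF is recovered.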

## 3. Application to Wind Downscaling

[19] In order to test the two PDMs detailed above, CDFs of monthly mean 10m wind velocity (w10m hereafter) are downscaled at 26 stations spread across France.

### 3.1. Observed and Modeled Wind Data

[20] Three time series of monthly w10m are available for each station: the observed one (1958–2005), a second one extracted from NCEP/NCAR reanalyses (1958–2005), and a third one (1958–2100) extracted from an IPSL-CM4 GCM climate simulation [*Marti et al.*, 2005]. The model is forced by the historical 20c3m scenario and the SRESA2 greenhouse gas emission climate scenario [*Nakićenović et al.*, 2000] respectively for the 20th and 21st centuries. NCEP/NCAR reanalyses have a 1.875° × 1.9° spatial resolution whereas IPSL outputs have a 3.75° × 2.5° resolution. For each station, the NCEP/NCAR and GCM time series are obtained from bi-linear interpolations at the station location.
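For readers unfamiliar with bi-linear interpolation, the sketch below shows the standard formula for one time step and one grid cell. The function name `bilinear`, the corner-value argument names and the example coordinates and wind values are hypothetical; the text does not describe the interpolation beyond naming it.

```python
def bilinear(lon, lat, lon0, lon1, lat0, lat1, v00, v10, v01, v11):
    """Bi-linear interpolation inside one grid cell.

    v00 is the value at (lon0, lat0), v10 at (lon1, lat0),
    v01 at (lon0, lat1) and v11 at (lon1, lat1).
    """
    tx = (lon - lon0) / (lon1 - lon0)  # relative position in longitude
    ty = (lat - lat0) / (lat1 - lat0)  # relative position in latitude
    south = (1 - tx) * v00 + tx * v10  # interpolate along the southern edge
    north = (1 - tx) * v01 + tx * v11  # interpolate along the northern edge
    return (1 - ty) * south + ty * north


# Hypothetical 3.75 deg x 2.5 deg cell with made-up w10m values (m/s)
# and a station located at the cell centre.
w10m_station = bilinear(1.875, 48.75, 0.0, 3.75, 47.5, 50.0,
                        4.0, 6.0, 5.0, 7.0)
```

Applying the same weights to every monthly field yields the large-scale time series at the station location.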

[21] Although the Q-matching and CDF-t methods can be calibrated on the whole data signal, in applications below, data are detrended and deseasonalized. Moreover, the PDMs are not applied to a particular season but to the whole year. Indeed, preliminary analyses showed that results are slightly better when working on winter data only; and of slightly lower quality for summer only. Working on the whole year (i.e., without separating the seasons) provides a suitable intermediary for illustration purposes.
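The text does not specify how the series are detrended and deseasonalized. The sketch below shows one common choice, stated here as an assumption: a least-squares linear trend is removed first, then the mean of each calendar month is subtracted. The function name `detrend_deseasonalize` and the toy series are hypothetical.

```python
def detrend_deseasonalize(series, period=12):
    """Split a monthly series into anomalies, a linear trend and a mean cycle.

    Assumes len(series) >= period. Returns (anomalies, trend, cycle) such that
    series[i] == anomalies[i] + trend[i] + cycle[i % period] up to round-off.
    """
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    # Ordinary least-squares slope of the series against the time index.
    sxx = sum((i - t_mean) ** 2 for i in range(n))
    sxy = sum((i - t_mean) * (y - y_mean) for i, y in enumerate(series))
    slope = sxy / sxx
    trend = [y_mean + slope * (i - t_mean) for i in range(n)]
    detrended = [y - tr for y, tr in zip(series, trend)]
    # Mean seasonal cycle: average of the detrended values per calendar month.
    cycle = []
    for m in range(period):
        month_values = detrended[m::period]
        cycle.append(sum(month_values) / len(month_values))
    anomalies = [d - cycle[i % period] for i, d in enumerate(detrended)]
    return anomalies, trend, cycle


# Hypothetical two-year monthly series: pure linear trend, no seasonal cycle.
toy_series = [2.0 + 0.5 * i for i in range(24)]
toy_anomalies, toy_trend, toy_cycle = detrend_deseasonalize(toy_series)
```

Keeping the trend and the cycle alongside the anomalies makes it possible to add them back once the anomalies have been processed.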

### 3.2. Validation on Historical Statistical Characteristics

[22] The validation of the two PDMs is based on the three time series presented above, and is done on the so-called “historical” period (1958–2005), which is split into two consecutive time periods: 1958–1989 (calibration period) and 1990–2005 (validation period). For both methods, the evaluation is performed in three steps:

[23] 1. Calibration: the observed, NCEP/NCAR, and IPSL w10m CDFs are estimated from 75% randomly chosen data from the calibration period.

[24] 2. Downscaling: the downscaling process is applied to 75% randomly chosen data from the validation period.

[25] 3. Evaluation: the resulting local-scale CDFs are compared to the observed ones through the Kolmogorov-Smirnov statistics (KS hereafter) and the Cramér-von Mises statistics (hereafter CvM) [*Darling*, 1957].
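The three steps above can be organized as a resampling loop, sketched below. Several details are assumptions, because the text does not state them: the same randomly chosen months are used for the observed and large-scale calibration series, and the score compares the downscaled CDF with the observations of the sampled validation months. The names `repeated_validation`, `downscale` and `score` are hypothetical placeholders for a downscaling method and a CDF distance.

```python
import random


def repeated_validation(obs_cal, gcm_cal, obs_val, gcm_val,
                        downscale, score,
                        n_rep=100, fraction=0.75, seed=0):
    """Repeat calibration, downscaling and evaluation on random subsamples.

    `downscale(obs_cal, gcm_cal, gcm_val)` must return a CDF (a callable),
    and `score(cdf, obs_val)` must return a number.
    """
    rng = random.Random(seed)
    n_cal = round(fraction * len(obs_cal))
    n_val = round(fraction * len(obs_val))
    scores = []
    for _ in range(n_rep):
        # 1. Calibration: draw 75% of the calibration months.
        cal_idx = rng.sample(range(len(obs_cal)), n_cal)
        # 2. Downscaling: draw 75% of the validation months.
        val_idx = rng.sample(range(len(obs_val)), n_val)
        cdf_down = downscale([obs_cal[i] for i in cal_idx],
                             [gcm_cal[i] for i in cal_idx],
                             [gcm_val[i] for i in val_idx])
        # 3. Evaluation: compare with the observations of the sampled months.
        scores.append(score(cdf_down, [obs_val[i] for i in val_idx]))
    return scores


# Dummy plug-ins, only to show the calling convention.
demo_scores = repeated_validation(
    list(range(8)), list(range(8)), list(range(4)), list(range(4)),
    downscale=lambda obs, gcm, gcm_new: (lambda x: 0.0),
    score=lambda cdf, obs: len(obs),
)
```

Each call returns one score per repetition for one station, so running it for every station gives the values summarized in a boxplot.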

[26] KS provides the maximum difference between two CDFs, whereas CvM is a kind of “integrated” squared error. Hence, KS and CvM can be seen as “distances” between CDFs. These three steps are repeated a hundred times to produce confidence intervals. The boxplots of the obtained KS and CvM values for the validation period are presented in Figure 1. Figure 1 (top) gives KS results and Figure 1 (bottom) provides CvM results. In Figures 1 (top) and 1 (bottom), white boxplots correspond to large-scale (IPSL and NCEP/NCAR) CDF scores, grey boxplots to downscaled CDF scores. Each boxplot is made of 2600 values (26 stations × 100 repetitions), and the critical level below which two CDFs are considered statistically indistinguishable for KS and CvM is shown as a vertical line (significance at *α* = 0.05). In general, KS and CvM values indicate the same results. For both criteria, results are better for NCEP/NCAR (downscaled or not) than for IPSL, which one could expect since NCEP/NCAR data are reanalyses and have a higher spatial resolution than IPSL data. Referring to the critical levels, the two downscaling methods provide good results (i.e., clear improvements) for NCEP/NCAR and IPSL, even though the gain of the two PDMs is much more visible when working with IPSL outputs. However, although the downscaling results are equivalently good (in terms of KS and CvM statistics) for both PDMs applied to NCEP/NCAR data, the improvement is larger for CDF-t than for Q-matching when applied to IPSL simulations. For IPSL (Figure 1), about 78% of the downscaled CDFs can be considered as equal to the observed ones for CDF-t (i.e., below the critical level at *α* = 0.05 significance with KS) whereas we have 68% for Q-matching (respectively 83% and 76% with CvM). A potential explanation is that CDF-t uses the large-scale CDF of the validation (i.e., target) period whereas Q-matching does not. Boxplots displayed in Figure 1 show spreads that can be relatively large, mostly for the scores of the large-scale CDFs.
For the raw (i.e., not downscaled) CDFs, this is mainly due to differences between stations, whereas, for the downscaled CDFs, the spread comes from the random sampling of the validation procedure. This is illustrated in Figure S1 of the auxiliary material, which shows the 100 validation KS values for each station.
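To make the two “distances” concrete, the sketch below computes them between two samples through their empirical CDFs. The exact formulas used in the study are not given in the text, so the Cramér-von Mises form shown here (a common two-sample version, summing squared CDF differences over the pooled sample) is an assumption, as are the function names. Critical values are not computed.

```python
from bisect import bisect_right


def ecdf(sample):
    """Empirical CDF of `sample`: x -> fraction of values <= x."""
    data = sorted(sample)
    n = len(data)
    return lambda x: bisect_right(data, x) / n


def ks_distance(sample_a, sample_b):
    """Kolmogorov-Smirnov statistic: largest absolute gap between two ECDFs."""
    f_a, f_b = ecdf(sample_a), ecdf(sample_b)
    pooled = set(sample_a) | set(sample_b)
    return max(abs(f_a(z) - f_b(z)) for z in pooled)


def cvm_distance(sample_a, sample_b):
    """Two-sample Cramer-von Mises criterion (assumed form).

    Sum of squared ECDF differences over the pooled sample,
    scaled by n*m / (n + m)**2.
    """
    n, m = len(sample_a), len(sample_b)
    f_a, f_b = ecdf(sample_a), ecdf(sample_b)
    pooled = list(sample_a) + list(sample_b)
    total = sum((f_a(z) - f_b(z)) ** 2 for z in pooled)
    return n * m / (n + m) ** 2 * total


# Hypothetical samples: identical, then completely separated.
ks_same = ks_distance([1, 2, 3, 4], [1, 2, 3, 4])
ks_apart = ks_distance([1, 2], [3, 4])
cvm_apart = cvm_distance([1, 2], [3, 4])
```

Both statistics are zero for identical samples and grow as the CDFs move apart; KS reacts to the single largest gap, whereas CvM accumulates the squared gaps over the whole range.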

[27] Based on the KS and CvM statistics, we conclude that the CDF-t approach provides better downscaling results over the validation time period. Moreover, as explained previously, CDF-t takes advantage of the future CDF of the simulated data (whereas the Q-matching does not) and Q-matching cannot provide local-scale quantiles outside the range of the historical observations. Those are two prevailing reasons for preferring CDF-t to Q-matching when projecting future climate. Hence, only CDF-t is retained in the following.

### 3.3. Climate Projections for the 21st Century

[28] CDF-t is now applied to downscale w10m anomaly CDFs at the 26 stations for the 21st century based on the IPSL simulations for 2006–2100 under the SRESA2 IPCC greenhouse gas emission scenario. The downscaling is applied to detrended and deseasonalized anomalies. Time series are generated from the CDF by applying a Quantile-matching approach between future large-scale and future (downscaled) local-scale CDFs. To reconstruct the projected signal after the downscaling of anomalies, we add (1) the future large-scale seasonal cycle from which the historical bias (i.e., large- minus local-scale historical cycle) has been removed, and (2) the future large-scale trend (from bi-linear interpolations at each station location). Calibrations are performed on the 1958–2005 period and projected CDFs are estimated on three periods, 2006–2040, 2041–2070 and 2071–2100. Those projections are done for illustration purposes: the aim of the paper is not to fully investigate the impact of climate change on w10m over France. It would be hazardous to draw conclusions from only one climate model and a single scenario. The results are presented in Figure 2, where Figure 2a shows the 10m wind climatology for 1958–2005, and Figures 2b–2d show the evolution for the three future time periods relative to 1958–2005: colors correspond to the change in the mean 10m wind intensity relative to 1958–2005; the radius of the circles is proportional to the CvM value between the 1958–2005 and future time period wind anomaly CDFs; and bold-lined circles correspond to stations where the future anomaly CDFs are significantly different from 1958–2005 CDFs (*α* = 0.05 significance). For the 26 stations, the 10m wind intensity decreases during the 21st century. The relative change ranges from −0.5% to −9%. A separation is visible between Northwest and Southeast stations: the former see a larger decrease than the latter.
Changes in the anomaly CDFs (radius of the circles) are not as geographically divided, even if, for the 2071–2100 period, Northwest stations show larger changes. Note that only a few stations for each future period have a significantly different anomaly CDF relative to the historical period (bold circles: 1 in Brittany for 2006–2040, 2 in Brittany and 2 in Provence for 2041–2070, 1 in the North for 2071–2100). Diagnostics are not pushed further.
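The generation of local-scale series described in this section can be sketched in two steps: each future large-scale anomaly *x* is mapped to the downscaled quantile *F*_{Sf}^{−1}(*F*_{Gf}(*x*)), then the seasonal cycle and trend are added back. The grid-based inversion, the function names and the toy numbers are assumptions of this sketch, not the authors' implementation.

```python
from bisect import bisect_left, bisect_right


def ecdf(sample):
    """Empirical CDF of `sample`: x -> fraction of values <= x."""
    data = sorted(sample)
    n = len(data)
    return lambda x: bisect_right(data, x) / n


def invert_cdf_on_grid(cdf, grid):
    """Numerical inverse of a CDF tabulated on an increasing grid."""
    values = [cdf(x) for x in grid]  # non-decreasing by construction

    def inverse(p):
        # First grid point where the CDF reaches p (with a small tolerance).
        k = bisect_left(values, p - 1e-12)
        return grid[min(k, len(grid) - 1)]

    return inverse


def generate_local_anomalies(gcm_fut_anomalies, f_sf, grid):
    """Quantile-matching between the future large-scale and downscaled CDFs."""
    f_gf = ecdf(gcm_fut_anomalies)
    f_sf_inv = invert_cdf_on_grid(f_sf, grid)
    return [f_sf_inv(f_gf(x)) for x in gcm_fut_anomalies]


def reconstruct(anomalies, cycle_gcm_fut, cycle_gcm_hist, cycle_obs_hist,
                trend_gcm_fut, period=12):
    """Add back the bias-corrected seasonal cycle and the large-scale trend."""
    signal = []
    for i, a in enumerate(anomalies):
        m = i % period
        bias = cycle_gcm_hist[m] - cycle_obs_hist[m]  # large- minus local-scale
        signal.append(a + (cycle_gcm_fut[m] - bias) + trend_gcm_fut[i])
    return signal


# Hypothetical downscaled CDF and future large-scale anomalies.
toy_f_sf = ecdf([2.0, 3.0, 4.0, 5.0])
toy_grid = [0.5 * i for i in range(17)]  # 0.0, 0.5, ..., 8.0
toy_local = generate_local_anomalies([5.0, 3.0, 6.0, 4.0], toy_f_sf, toy_grid)
```

Because the mapping is monotone, the generated local series keeps the chronology (ranks) of the large-scale series while following the downscaled CDF.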

## 4. Conclusions and Perspectives

[29] A new probabilistic downscaling method, named CDF-transform (or CDF-t), has been developed in this paper, and can be seen as an extension of the Quantiles-matching method [*Déqué*, 2007]. Beyond its simplicity of use, CDF-t gives promising results. First, CDF-t was validated and compared to Q-matching on monthly mean 10m wind velocity anomaly data from reanalyses and historical GCM simulations for 1958–2005. While both methods provided equivalently good results when downscaling NCEP/NCAR data, CDF-t was found more efficient than Q-matching (in terms of Kolmogorov-Smirnov and Cramér-von Mises statistics) when applied to the IPSL GCM outputs. For this reason and the fact that CDF-t takes advantage of the CDF of the simulated future data whereas Q-matching does not, CDF-t was preferred to Q-matching. For illustration purposes, CDF-t was applied to the IPSL-CM4 climate simulation of the 21st century under the SRESA2 scenario. In this specific case (model and emission scenario), the results show a decrease of 0.5% to 9% in the 10m wind velocity, depending on the location, with a stronger signal for the Northwest than for the Southeast stations.

[30] This study brings out some perspectives and remarks. If the distribution family of the downscaled variable is (supposed to be) known, one could work with parametric CDFs, which would smooth the resulting CDFs. Moreover, predictor and predictand variables are identical in this paper but there is no *a priori* restriction against working with a large-scale variable different from the local-scale one to be predicted. Nevertheless, this point remains to be tested. This is also true for a potential extension to a multivariate framework (i.e., several predictors and/or predictands). However, that context is a more difficult challenge and will require adaptations of the proposed approach. Besides, other variables can lead to different problems. Although CDF-t is virtually directly applicable to any variable (e.g., temperature, pressure), it may require some care with specific variables such as precipitation intensity. Indeed, the “no-precipitation” events may require adjustments (e.g., Dirac mass) to fit into the proposed PDM. More developments are necessary to allow CDF-transform to deal with such variables, bringing new practical research prospects within the downscaling field.

## Acknowledgments

[31] This work was supported by the French Agency for the Environment and Energy Management (ADEME) under the contract 06 05 C 0050 and by the ANR-Assimilex and GIS-REGYNA projects. The authors thank Eric Periano for his support, Michael Stein for the comments and discussions at the very beginning of this work and the anonymous reviewers for their constructive comments. An R package will be available on the CRAN website (and is already available upon request to M. Vrac).