A stochastic daily weather generator for skewed data
Abstract
[1] To simulate multivariate daily time series (minimum and maximum temperatures, global radiation, wind speed, and precipitation intensity), we propose a weather state approach with a multivariate closed skew‐normal generator, WACS‐Gen, that is able to accurately reproduce the statistical properties of these five variables. Our weather generator construction takes advantage of two elements. We first extend the classical wet and dry days dichotomy used in most past weather generators to the definition of multiple weather states using clustering techniques. The transitions among weather states are modeled by a first‐order Markov chain. Second, the vector of our five daily variables of interest is sampled, conditionally on these weather states, from a closed skew‐normal distribution. This class of distribution allows us to handle nonsymmetric behaviors. Our method is applied to the 20 years of daily weather measurements from Colmar, France. This example illustrates the advantages of our approach, especially improving the simulation of radiation and wind distributions.
1. Introduction
[2] Stochastic weather generators [Katz, 1996; Semenov and Barrow, 1997; Qian et al., 2005] aim at reproducing the statistical distributional properties of meteorological variables. They have been applied to a wide range of hydrological, ecological, and agricultural studies. For example, agronomical models, and more specifically crop models, need a large variety of daily weather data as inputs [Wilks, 1997; Brisson et al., 2003, 2009] to model past, present, and future variability of yields. Such daily inputs have to be simulated quickly and easily for long time periods at a given station. In this paper we focus on five variables: minimum and maximum temperatures (Tn and Tx), precipitation P, wind speed at two meters V, and radiation R. The choice of these variables was motivated by the inputs required for the crop models used in a research project (the French CLIMATOR project) aimed at exploring the impact of climate change on agriculture in the 21st century. Most other variables that hydrological, ecological, and agronomical models may need, e.g., relative humidity and potential evapotranspiration, can be computed from these variables using physically based relations. Typical one‐year time series are presented for these variables in Figures 1 and 2.


[3] Conceptually, the majority of statistical weather generators [Richardson, 1981; Richardson and Wright, 1984; Semenov and Barrow, 1997; Rajagopalan et al., 1997] can be classified into two categories. The first one consists in drawing analog days from a database of past observations according to a given criterion, e.g., with a k‐nearest neighbors algorithm [Rajagopalan and Lall, 1999]. The main advantage of this nonparametric approach is that the statistical properties of the given database are adequately reproduced. An important drawback is its inability to create new time series, i.e., unobserved meteorological situations. To alleviate this undesirable feature, the second category of weather generators is based on stochastically drawing random realizations from a statistical model whose parameters have been estimated on a database of past observations. If such parametric or semiparametric models are well built, then most of the distributional characteristics of the studied variables can be reproduced. For example, WGen and LARS‐WG, introduced by Richardson [1981] and Semenov and Barrow [1997], respectively, belong to this class of weather generators. Apipattanavis et al. [2007] attempted to combine both categories in a single semiparametric approach. By construction, analog or nonparametric methods are not well adapted to the climate change context. We thus decided to opt for a parametric approach, in which climate change could be accounted for by letting the parameters vary over time. In this paper we present the weather generator for a stationary climate. By this we mean that, even though the parameters of the probability distributions depend upon the season, they do not change from year to year. Adaptation of the weather generator to climate change is left for future work.
[4] Most parametric weather generators work by defining two daily precipitation states: dry or wet days. The state transitions are classically modeled by a Markov chain [Semenov et al., 1998]. Conditionally on the precipitation state, the other meteorological variables are often assumed to be independently and identically distributed (iid) (e.g., CLIMGEN [Stockle et al., 1998]). More complex models have also been proposed. For example, Furrer and Katz [2007] studied a generalized linear model conditioned on rainfall occurrences in order to integrate the ENSO index as prior information. In contrast to these models, for which only two states were defined, we have chosen to extend the number of daily states. This strategy allows us to better capture the complexity of weather changes. This concept of daily states has been successfully applied in downscaling large‐scale information to local scales. Boé et al. [2006] and Boé and Terray [2008] used, for example, weather types defined in terms of large‐scale circulation similarities based upon the 500 hPa geopotential height resulting from the downscaled ARPEGE atmospheric model [Gibelin and Déqué, 2003]. Vrac and Naveau [2007] built precipitation‐related patterns from a set of observed local precipitation records. In order to differentiate our approach from these large‐scale weather types, we will use the term weather state.
[5] Concerning the distribution of the variables of interest, daily precipitation amounts have been fitted by either a gamma or a half‐normal distribution [e.g., Semenov et al., 1998]. Gaussian distributions generally model temperatures and radiation. Semenov et al. [1998] emphasized that some variables such as radiation can strongly depart from Gaussianity (see, e.g., Figures 8g and 9g). To overcome this problem, Young [1994] implemented a mixture of distributions. In this paper we also propose a mixture of distributions but with two major differences. First, each cluster of the mixture corresponds to one weather state and, second, the distribution within each cluster (i.e., within each weather state) belongs to the family of multivariate closed skew‐normal (CSN) distributions [Genton, 2004; Pewsey and González‐Farías, 2007]. This class of distributions offers a general framework to fit both non‐Gaussian and Gaussian variables. Conditionally on weather states, CSN distributions will be fitted to our five variables.
[6] The present paper describes in section 2 the general structure of our weather state approach with a multivariate closed skew‐normal generator (named WACS‐Gen) and briefly recalls the main properties of the closed skew‐normal distribution. In section 3 an algorithm is proposed to estimate the parameters of the model and then in section 4 a real meteorological series measured in Colmar (France) is compared to series simulated by WACS‐Gen with parameters estimated on a subset of this series.
2. WACS‐Gen: A Weather Generator Based on Weather States and Skew‐Normal Distributions
[7] We first explain how seasonality is accounted for. When a within‐year trend is detected on a variable, the median and the average absolute deviation (defined as the mean of the absolute difference between the variable and its median) are computed for each day and smoothed by a spline function [Green and Silverman, 1994]. This smoothed median is then subtracted from the studied variable and the difference is rescaled by the smoothed average absolute deviation. This normalization procedure is preferred to the classical technique based on the mean and standard deviation because rank statistics like the median are more robust in the presence of a departure from symmetry (see, e.g., the radiation). For the example of the Colmar series studied below, temperatures and radiation depend strongly upon the day of the year (Figure 1). Figures 1a, 1b, and 1c correspond to temperature minima, temperature maxima, and radiation, respectively. No significant trend could be detected for precipitation intensity and wind speed (Figure 2). After transformation, and given a season and a weather state (see below), these temperature and radiation residuals are assumed to be stationary. They are the main object of this study. They will be studied independently within the four following seasons: December‐January‐February (DJF), March‐April‐May (MAM), June‐July‐August (JJA) and September‐October‐November (SON) [Semenov et al., 1998].
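The standardization described in this paragraph can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the input layout (one list of values per calendar day) is an assumption, and a circular moving average is swapped in for the spline smoother of Green and Silverman [1994] to keep the example dependency‐free.

```python
import statistics

def circular_smooth(values, half_window):
    """Circular moving average (a simple stand-in for the spline smoother)."""
    n = len(values)
    width = 2 * half_window + 1
    return [
        sum(values[(d + k) % n] for k in range(-half_window, half_window + 1)) / width
        for d in range(n)
    ]

def seasonal_profile(series_by_day, half_window=15):
    """Smoothed daily median and average absolute deviation.

    series_by_day[d] holds the values observed on calendar day d over all years
    (hypothetical layout chosen for this sketch).
    """
    medians = [statistics.median(v) for v in series_by_day]
    aads = [
        statistics.fmean(abs(x - m) for x in v)
        for v, m in zip(series_by_day, medians)
    ]
    return circular_smooth(medians, half_window), circular_smooth(aads, half_window)

def standardize(value, day, med, aad):
    """Subtract the smoothed median, then rescale by the smoothed deviation."""
    return (value - med[day]) / aad[day]

def destandardize(residual, day, med, aad):
    """Inverse of standardize: add the seasonal signal back to a residual."""
    return residual * aad[day] + med[day]
```

The inverse function is kept next to the forward one because the generator needs it when trends and seasonal effects are added back to the simulated residuals.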
[8] In the last decade, weather types have been frequently used to analyze various physical and stochastic climate models outputs at large scale [Boé et al., 2006; Boé and Terray, 2008; Vrac and Naveau, 2007]. Weather types are classically defined for each season and their number varies from eight to ten types per season [Bubnova et al., 1995].


In equation (2), $\mu \in \mathbb{R}^k$ and $\nu \in \mathbb{R}^l$ are both location vectors, $\Sigma \in \mathbb{R}^{k \times k}$ and $\Delta \in \mathbb{R}^{l \times l}$ are both covariance matrices, $D \in \mathbb{R}^{k \times l}$, $\phi_k(y; \mu, \Sigma)$ and $\Phi_k(y; \mu, \Sigma)$ are the probability density function (pdf) and cumulative distribution function (cdf), respectively, of the k‐dimensional normal distribution with mean vector μ and covariance matrix Σ, and $D^t$ is the transpose of the matrix D. In the particular case D = 0, Y follows the usual k‐dimensional normal distribution with mean μ and variance‐covariance matrix Σ. The differences between Gaussian and skew‐normal densities are illustrated in Figures 3 and 4. Clearly, adding a skewness parameter through the skew‐normal distribution provides flexibility for modeling skewness on the margins but also in the bivariate density. González‐Farías et al. [2004] noticed that the CSN distributions defined by (2) are overparameterized and that without loss of generality ν can be set equal to 0. In practice, the normalizing constant $c_l^{-1}$ defined in (2) can be difficult to compute. To simplify its expression, we assume, without losing the skew‐normal flexibility, that $k = l$, $D = \Sigma^{-1/2} S$ and $\Delta = I_k - S^2$, where $I_k$ is the k‐dimensional identity matrix and S is a diagonal matrix with elements in [−1, 1]. With this parameterization, equation (2) becomes
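The following is a hedged reconstruction, not a quotation of the source. It assumes that equation (2) has the standard closed skew‐normal form of González‐Farías et al. [2004], $f(y) = c_l^{-1}\,\phi_k(y;\mu,\Sigma)\,\Phi_l(D^t(y-\mu);\nu,\Delta)$ with $c_l = \Phi_l(0;\nu,\Delta + D^t\Sigma D)$, that ν = 0, and that $D = \Sigma^{-1/2} S$ with a symmetric $\Sigma^{-1/2}$. Under these assumptions the simplified density can be worked out as:

```latex
\begin{align*}
D^{t}\Sigma D &= S\,\Sigma^{-1/2}\,\Sigma\,\Sigma^{-1/2}\,S = S^{2}, \\
c_{l} &= \Phi_{k}\!\left(0;\,0,\,\Delta + D^{t}\Sigma D\right)
       = \Phi_{k}\!\left(0;\,0,\,I_{k}\right) = 2^{-k}, \\
f(y) &= 2^{k}\,\phi_{k}(y;\,\mu,\,\Sigma)\,
        \Phi_{k}\!\left(S\,\Sigma^{-1/2}(y-\mu);\,0,\,I_{k}-S^{2}\right).
\end{align*}
```

The point of the parameterization is visible in the second line: the normalizing constant no longer requires evaluating a multivariate normal cdf with a general covariance matrix.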

The precipitation intensity P is transformed into $\Phi^{-1}(G(P))$, where G represents the fit by a Gamma cdf and $\Phi^{-1}$ corresponds to the inverse of the standardized Normal cdf. The transformed precipitation is thus modeled as a Gaussian random variable; for a given season and a given weather state, it will be considered as a CSN in order to account for possible asymmetries within clusters. This allows us to assume that, for a given season and a given weather state w, the vector formed by the transformed precipitation and (R, V, Tn, Tx) follows a $\mathrm{CSN}^*_5(\mu_w, \Sigma_w, S_w)$ distribution.
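A minimal sketch of the transformation $\Phi^{-1}(G(P))$ of wet‐day precipitation amounts is given here. Several choices are assumptions made for illustration only: the function names are hypothetical, the Gamma parameters are fitted by the method of moments (the source does not say how G is fitted), and the Gamma cdf is evaluated with a plain series expansion so that the example needs only the standard library.

```python
import math
from statistics import NormalDist, fmean, pvariance

def gamma_cdf(x, shape, scale, terms=500):
    """Gamma cdf, i.e. the regularized lower incomplete gamma function.

    Uses the series gamma(a, z) = z^a e^{-z} * sum_n z^n / (a (a+1) ... (a+n)).
    """
    if x <= 0:
        return 0.0
    z = x / scale
    term = 1.0 / shape
    total = term
    for n in range(1, terms):
        term *= z / (shape + n)
        total += term
        if term < 1e-15 * total:
            break
    return total * math.exp(-z + shape * math.log(z) - math.lgamma(shape))

def fit_gamma_moments(wet_amounts):
    """Method-of-moments fit (an assumption, not the authors' estimator)."""
    mean = fmean(wet_amounts)
    var = pvariance(wet_amounts, mean)
    return mean * mean / var, var / mean  # (shape, scale)

def to_gaussian(p, shape, scale, eps=1e-12):
    """Map a wet-day amount p to a standard normal score Phi^{-1}(G(p))."""
    # Clip away from 0 and 1 so that inv_cdf stays defined.
    u = min(max(gamma_cdf(p, shape, scale), eps), 1.0 - eps)
    return NormalDist().inv_cdf(u)
```

Because both G and $\Phi^{-1}$ are increasing, the transformation preserves the ranking of precipitation amounts, and simulated normal scores can be mapped back to amounts by inverting the two steps.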




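For a dry weather state w, the dimension of the vector is reduced to four since precipitation is always equal to zero. Equation (3) allows one to model the temporal evolution from the vector in weather state w to the vector in weather state w′ only through their marginals. As a second step, the pairwise structure between the ith components of these two vectors is assumed to follow a bivariate $\mathrm{CSN}_{2,2}$ distribution with zero location parameters and with matrices $D_{w,w'}^{(i)}$ and $\Delta_{w,w'}^{(i)}$.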


3. Parameter Estimation and Weather Generator Scheme

[13] Concerning the inference of the marginal CSN* parameters, Azzalini and Capitanio [1999] studied the classical maximum likelihood estimation (MLE) approach and Flecher et al. [2009] proposed a weighted moment method. Conditionally on the weather state w, an MLE approach ignoring temporal dependence is implemented to estimate the parameters of $\mathrm{CSN}^*_5(\mu_w, \Sigma_w, S_w)$. The estimates are only slightly changed if the temporal dependence is taken into account in the estimation procedure. Temporal dependence has a larger impact on the covariance matrix of the parameter estimators, but since this matrix is not used in the weather generator, this point is ignored for ease of use.
The parameter $\rho_i$ is estimated via a weighted moment approach [Flecher et al., 2009], i.e., by solving the following equation in $\rho_i$

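where the superscript (i) and the argument t denote the ith component of the vector in weather state w at time t, and the expectation of $\Phi_2$, evaluated at the pair formed by the ith components at time t (in state w) and at time t + 1 (in state w′) with mean 0 and covariance matrix $I_2$, is replaced by its empirical estimator.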
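Only the empirical side of this moment equation can be sketched from the text; the theoretical side, as a function of $\rho_i$, is not reproduced here. The sketch below assumes a hypothetical layout with one list of standardized values for component i and one list of weather state labels, and relies on the fact that a bivariate normal cdf with identity covariance factorizes into a product of univariate cdfs.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def empirical_phi2_moment(values, states, w, w_next):
    """Empirical mean of Phi2((x_t, x_{t+1}); 0, I2) over selected days.

    Only days t in state w that are followed by a day in state w_next are used.
    With identity covariance, Phi2((a, b); 0, I2) = Phi(a) * Phi(b).
    """
    products = [
        _STD_NORMAL.cdf(values[t]) * _STD_NORMAL.cdf(values[t + 1])
        for t in range(len(values) - 1)
        if states[t] == w and states[t + 1] == w_next
    ]
    if not products:
        raise ValueError("no transition from state w to state w_next in the record")
    return sum(products) / len(products)
```

In a full estimation procedure this quantity would be matched against its theoretical counterpart for each component i and each pair of states.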
1. The initial vector at time 0 is randomly chosen (e.g., with an analog method).
2. The transition probabilities estimated with (4) are used to generate a Markov chain sequence of weather states.
3. Given that the vector at time t equals $x_t$, and given two consecutive weather states, w and w′, a realization of the vector at time t + 1 defined by (3) is drawn (see Lemma 2).
4. The simulated vector is multiplied by $\Sigma_{w'}^{1/2}$ and $\mu_{w'}$ is added.
5. To add back trends and seasonal effects, we invert the steps of the standardization based on the median and the absolute deviation described in the first paragraph of section 2.
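The weather state part of this scheme (step 2) can be sketched as follows. Since equation (4) is not reproduced here, the counting estimator of the transition probabilities is an assumption, and the function names are hypothetical. The conditional draws of steps 3 and 4 are not covered by this sketch.

```python
import random

def estimate_transitions(states, n_states):
    """Transition probabilities as normalized counts of observed transitions."""
    counts = [[0] * n_states for _ in range(n_states)]
    for a, b in zip(states, states[1:]):
        counts[a][b] += 1
    probs = []
    for row in counts:
        total = sum(row)
        # States never left in the record fall back to a uniform row.
        probs.append([c / total for c in row] if total else [1.0 / n_states] * n_states)
    return probs

def simulate_states(probs, first_state, n_days, rng):
    """First-order Markov chain of weather states of length n_days."""
    states = [first_state]
    for _ in range(n_days - 1):
        row = probs[states[-1]]
        states.append(rng.choices(range(len(row)), weights=row)[0])
    return states
```

A typical use would estimate one transition matrix per season from the clustered observations, then call `simulate_states(probs, first_state, n_days, random.Random(seed))` to obtain the state sequence that conditions the draws of the five variables.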
4. Weather Data in Colmar, France
[16] Colmar, a city in the northeastern part of France, is located at 48°05′N latitude, 7°21′E longitude and has an altitude of 175 m. A 20 year series is available from 1973 to 1992 for the five daily variables under study. Annual precipitation amounts are about 530 mm and the frequency of rainy days is about 1/4. The climate is characterized by warm summers from June to September and cold winters (the annual temperature cycle is well marked with a 25°C mean in July and 2°C in January). Both oceanic and continental climate influences can affect this site. This produces substantial variability in the daily weather.
[17] For each season a maximum of eight different weather states is allowed. The BIC criterion provides a number of regime clusters that is equal to five for the JJA season and six for the other seasons. The distribution of weather states per season appears to be fairly homogeneous throughout the year (see Figure 5).

[18] The estimation of the parameters of the CSN* distributions is illustrated during dry days of the JJA season on the pair of variables (Tx, R) (Figure 6). The bivariate distributions in each cluster (i.e., weather state) are well modeled by their corresponding CSN* densities. The two marginal densities resulting from the mixture are fairly well reproduced. Although presented here with a pair of variables for the ease of representation, similar results are obtained for the full vector of 5 variables, and for all seasons.

[19] After estimating our CSN* model parameters, thirty runs of 20 years are simulated. As a first step to compare our simulated time series to Colmar measurements, Figure 7 shows the dry spell length distribution computed from the Colmar observations (solid lines), the average on thirty simulations (dashed line) and the values of the thirty simulations (crosses), for all seasons pooled together. Similar results were obtained when considering each season separately. The left plot focuses on dry spell lengths shorter than 15 days and the right plot zooms on longer dry spells. Both graphs indicate that this variable is fairly well reproduced by our Markov chain. Concerning the five variables of interest, their densities are shown in Figures 8 and 9. For each season, Figures 8a, 8b, 9a, and 9b display minimal temperatures, Figures 8c, 8d, 9c, and 9d display maximal temperatures, Figures 8e, 8f, 9e, and 9f display radiation, and Figures 8g, 8h, 9g, and 9h display wind speed densities. The black solid lines correspond to the measurements and the grey dashed lines represent the thirty realizations obtained with our weather generator. In Figure 8, the left and right plots correspond to DJF and MAM, respectively. Figure 9 shows the same information for JJA and SON, respectively. These plots exemplify the advantage of the CSN distribution, which is able to capture the skewness exhibited in radiations and wind speed distributions. Note that the probability density function of radiation is not very well fitted during JJA (Figure 9e). The fit can be significantly improved by defining two additional weather states (figure not shown). The difference between the BIC criterion for 6 and 8 clusters is positive, but small. In this paper we focus on the general presentation of the generator; the problem of finding other criteria than BIC for selecting the number of clusters will be the subject of further work. We therefore maintain the current fit with 6 clusters.
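The dry spell length statistic used in this comparison can be computed with a short sketch. The function names are hypothetical, and the definition of a dry day as one with precipitation at or below a threshold of 0 mm is an assumption made for illustration.

```python
from collections import Counter

def dry_spell_lengths(precip, wet_threshold=0.0):
    """Lengths of maximal runs of consecutive dry days in a daily series."""
    lengths, run = [], 0
    for p in precip:
        if p <= wet_threshold:
            run += 1
        elif run:
            lengths.append(run)
            run = 0
    if run:
        lengths.append(run)  # close a spell that reaches the end of the series
    return lengths

def spell_length_frequencies(precip, wet_threshold=0.0):
    """Relative frequency of each dry spell length."""
    lengths = dry_spell_lengths(precip, wet_threshold)
    counts = Counter(lengths)
    return {k: counts[k] / len(lengths) for k in sorted(counts)}
```

Applying the same function to the observed series and to each simulated series gives the quantities that are compared in a plot of this kind.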



[20] Concerning precipitation intensity, its distribution is represented by a quantile‐quantile plot (QQ plot) in Figure 10. This QQ plot is defined as the sorted simulated rainfalls versus the sorted historical record. A good fit corresponds to the first diagonal. This graph indicates that precipitation amounts are well reproduced by the generator. On this site, the highest observed precipitation is 70 mm, while the highest simulated precipitations are in the range 40–68 mm. Note however that the mean precipitation is well reproduced and that on other sites the opposite situation (higher simulated highest precipitations than measured ones) can be observed (results not reported here).

[21] In Figure 11, star plots represent correlations between our five variables for each season. Large positive correlations are near the star plot border whereas large negative ones are near the center. Such graphs provide a graphical way to view a correlation matrix. The correlations between precipitation and other variables are only computed for wet days. Figure 11 shows that our model is capable of reproducing the observed cross correlations. For each variable, the correlation between two consecutive days is represented with the same type of star plot in Figure 12. The persistence between two consecutive days is well reproduced except for the winter season (DJF), which shows the most severe discrepancy between observations and simulations, mainly for temperature variables.


[22] Wind speed and precipitation are variables known to be difficult to model. To study the improvement brought by the introduction of multiple weather states, thirty additional simulations of 20 year length are also obtained by forcing our generator to have only the two classical wet and dry weather states. In Figure 13, wind speed boxplots and densities are obtained with the two classical weather states (wet and dry) and with six weather states, as defined in Figure 5, respectively. Figure 14 shows quantile‐quantile plots of the amount of precipitation for the Colmar series and for each of the 30 simulations in the two cases: multiple weather states (Figure 14a) and classical wet/dry states (Figure 14b). For both variables, introducing multiple weather states significantly improves the fit of the distribution. Current stochastic weather generators are, for example, known to underestimate the probability of small precipitation amounts, as explained by Semenov et al. [1998]. In the case of Colmar, the precision of the data is 0.1 mm. The overall frequency of precipitation less than or equal to 2 mm is 6.1% in the data. In simulations with two wet/dry weather states it ranges from 1.6% to 2.3%. In simulations with a BIC optimized number of weather states, it ranges from 5.6% to 7.1%.


[23] Figure 15 displays the wind speed autocorrelation boxplot for each season computed with two or six weather states. The horizontal black lines correspond to the observed wind speed autocorrelation per season. Although neither generator is able to reproduce the wind speed autocorrelation in the MAM season, Figures 13 and 15 clearly show the improvement brought by the introduction of additional weather states for wind speed distribution and autocorrelation modeling. Concerning precipitation, Figure 16 indicates that generators with two or more weather states do provide decent but not excellent results.


5. Conclusion
[24] We have presented WACS‐Gen, a new weather generator, which presents several improvements compared to previous ones: (1) the number of weather states is no longer limited to the dry/wet states, but is fitted to the variability of the observed data using a model‐based clustering algorithm on detrended data and (2) conditionally on the season and the weather state, the multivariate data are modeled using CSN distributions, thus allowing for residual skewness; correlation between variables and along time is also modeled, including between successive days with different weather states.
[25] Allowing for multiple weather states is a major improvement, but it raises the question of defining a good criterion for selecting the correct number of clusters. Here, we have proposed to use BIC, a widely used criterion in model based clustering [Fraley and Raftery, 2003]. It provides most of the time a very good fit of the probability densities. In one situation (radiation during JJA), the fit could be improved by increasing the number of clusters, as compared to the BIC criterion. Finding better criteria than BIC will be the subject of future work.
[26] This generator has been tested on different weather series measured in contrasting climatic regimes across France. Although we have only reported results on the Colmar series due to space constraints, our results consistently showed that WACS‐Gen substantially improves the reproduction of histograms, cross correlations, and temporal correlations as compared to generators with only dry/wet weather states. Histograms are also very well reproduced thanks to the mixture of CSN distributions inherited from the multiple weather states. Of particular interest is the ability of our generator to model the correlation between the amount of precipitation and the other variables instead of only conditioning these variables on the precipitation event.
[27] We still have some difficulties in reproducing some statistics, in particular correlations with wind speeds and extreme events. Wind speed is known to be a difficult variable, with strongly nonlinear correlation to other variables. Although they can account for asymmetrical distributions, CSN distributions are not targeted at modeling extreme data. Future improvements on weather generators should be focused on integrating extreme value theory to better reproduce extreme events, and on modeling nonlinear relationships between variables. The impact of these improvements on crop models still needs to be assessed, which will be our very next task.
[28] In the framework of climate change studies, we not only need to consider that the parameters will change with time, but we also need to consider the change of support (i.e., downscaling) problem. General climate models provide output variables varying with time at very large scale, while crop models need weather variables at very small scale; we therefore need to provide models to estimate small‐scale parameters from large‐scale data. This should be treated in a forthcoming paper.
Acknowledgments
[30] This work was supported by the ANR‐CLIMATOR project and the ANR‐AssimilEx project. Part of Philippe Naveau's work has been supported by the EU‐FP7ACQWA project (http://www.acqwa.ch/) under contract 212250, by the PEPER‐GIS project (http://www.gisclimat.fr/projet/peper), by the ANR‐MOPERA project, and by the NICE RTN project. The authors wish to express their gratitude to the editor and referees for their very helpful comments and suggestions. They would also like to credit the contributors of the R project.
Appendix A
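Lemma 1. If Y follows a $\mathrm{CSN}^*_n(\mu, \Sigma, S)$ distribution, then the standardized vector $\Sigma^{-1/2}(Y - \mu)$ follows a $\mathrm{CSN}_{n,n}(0, I_n, S, 0, I_n - S^2)$ distribution.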
Lemma 1 indicates that the skewness parameter remains unchanged after standardization.
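One practical consequence of the standardized form $\mathrm{CSN}_{n,n}(0, I_n, S, 0, I_n - S^2)$ can be sketched in code. Because S is diagonal, the density factorizes over components, and each component is a univariate skew‐normal variable with the classical representation $s\,|U| + \sqrt{1 - s^2}\,V$, where U and V are independent standard normal variables. This sampler is an illustration under that reading, with a hypothetical function name, and is not taken from the source.

```python
import math
import random

def sample_standardized_csn(skew, rng):
    """One draw from CSN_{n,n}(0, I_n, S, 0, I_n - S^2) with S = diag(skew).

    Each entry s of skew must lie in [-1, 1]. Components are independent,
    and component i is s_i * |U| + sqrt(1 - s_i^2) * V with U, V ~ N(0, 1).
    """
    draw = []
    for s in skew:
        u = abs(rng.gauss(0.0, 1.0))  # half-normal part, carries the skewness
        v = rng.gauss(0.0, 1.0)       # symmetric Gaussian part
        draw.append(s * u + math.sqrt(1.0 - s * s) * v)
    return draw
```

Under the standardization of Lemma 1, a draw from the unstandardized distribution would then be obtained by multiplying such a vector by a symmetric square root of Σ and adding μ; that linear algebra step is left out to keep the sketch free of external libraries.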
Lemma 2. Let $X = (X_1, X_2)$ be a $\mathrm{CSN}_{2,2}(0, \Sigma, D, 0, \Delta)$ with


References





