Ensemble-based air quality forecasts: A multimodel approach applied to ozone
Abstract
[1] The potential of ensemble techniques to improve ozone forecasts is investigated. Ensembles with up to 48 members (models) are generated using the modeling system Polyphemus. Members differ in their physical parameterizations, their numerical approximations, and their input data. Each model is evaluated over a 4-month period (summer 2001) across Europe against observations from hundreds of stations in three ozone-monitoring networks. We found that several linear combinations of models have the potential to drastically improve the agreement between forecasts and observations. Optimal weights associated with each model are not robust in time or space. Forecasting these weights therefore requires appropriate methods, such as the selection of adequate learning data sets or specific learning algorithms. The resulting forecasted combinations achieve significant performance improvements. A decrease of about 10% in the root-mean-square error is obtained on ozone daily peaks. Ozone hourly concentrations show stronger improvements.
1. Introduction
[2] Though sparsely evaluated, the uncertainty in chemistry transport models is a major limitation of air quality forecasting. The source of this uncertainty lies in input fields (emissions, deposition velocities, land data, meteorological fields, etc.), as detailed by Hanna et al. [1998, 2001], and in the models themselves [Russell and Dennis, 2000; Mallet and Sportisse, 2006]. The uncertainty is so high that the reliability of model results should be carefully assessed, and ensemble forecast is relevant to address this issue. Straume et al. [1998], Dabberdt and Miller [2000], Galmarini et al. [2004], Straume [2001], and Warner et al. [2002] have estimated the uncertainty in dispersion modeling using ensemble forecasts. Dealing with ozone exposure, Hanna et al. [1998] and Beekmann and Derognat [2003] accounted for uncertainties in input fields with Monte Carlo simulations to check the efficiency of emission reductions. Hanna et al. [2001], Hanna and Davis [2002], and Mallet and Sportisse [2006] estimated the uncertainty in photochemical forecasts based on Monte Carlo simulations and a multimodel approach, respectively.
[3] With respect to day-to-day photochemical forecasts, only a few developments have been undertaken to associate uncertainties with the forecasts or to overcome the limitations of uncertain processes or data. Improvements in air quality forecasts have been sought in modeling developments, input data refinements and increasing computational resources. Unfortunately, the performances have only slightly increased [Russell and Dennis, 2000]. A reasonable explanation is that the high uncertainties mask the benefits of modeling efforts and that models are usually tuned to deliver satisfactory forecasts (the latter is also suggested by Russell and Dennis [2000]). Taking the uncertainty into account could help in enhancing the forecasts. A promising technique is to perform ensemble forecasts and to combine the ensemble members.
[4] A brute force approach is the use of the ensemble mean [Delle Monache and Stull, 2003; McKeen et al., 2005]. The underlying (and strong) assumptions are that the ensemble provides an accurate approximation of the probability density function of the output concentrations and that the mean of this probability density function is close to the true state. Because of the limited number of models and the unsatisfactory description of the uncertainty, it is hard to satisfy the first assumption. Moreover, there is no study supporting the second assumption. More sophisticated methods have been used, mainly in other fields, such as superensembles in meteorology [Krishnamurti et al., 2000] or for ozone forecasts [Pagowski et al., 2005], or Bayesian model averaging [Hoeting et al., 1999] (in many fields).
[5] In this paper, we investigate several methods to build optimal combinations of ensemble members. The objective is to increase day-to-day forecast performances (estimated through comparison against field measurements). The methods are applied to ozone hourly concentrations and daily peaks at European scale during summer 2001 and over hundreds of stations from three monitoring networks. The involved ensembles include up to 48 members, which allows us to study the characteristics of efficient ensembles. Section 2 gives further details about the ensemble members and the system used to generate these forecasts. In section 3 we introduce the methods that we investigated, and we review their potential, that is, the quality of their a posteriori (i.e., knowing all observations) combinations. In section 4, methods to forecast optimal ensemble combinations are described and tested. Selection of the best suited members is also addressed.
2. Ensemble Forecasts
2.1. Forecasting System Polyphemus
[6] Polyphemus [Mallet et al., 2005] is an air quality modeling system with ensemble-forecasting abilities based on multiple configurations. These configurations define almost all components of the modeling system so that each configuration should be viewed as a new model. Polyphemus is primarily composed of (1) a library for physical parameterizations (and data processing), AtmoData [Mallet and Sportisse, 2005], that includes several parameterizations for major processes; (2) a chemistry transport model, Polair3D [Boutahar et al., 2004], whose gas-phase version is basically a numerical solver for the reactive-dispersion equation; and (3) a set of programs that make calls to AtmoData in order to compute the input data to the chemistry transport model.
[7] Contrary to most modeling systems that rely on an “all-in-one chemistry transport model,” Polyphemus splits the numerical solver from physical parameterizations and data management. The programs that compute input data to the chemistry transport model provide flexibility. They offer several options supported by the multiple physical parameterizations available in AtmoData. Polair3D is also versatile enough to offer several chemical mechanisms and numerical approximations. In addition, the independence of the components eases work flow control, such as corrections in input fields to Polair3D or incorporation of new data sets. These features make it possible to build ensembles with a high number of members. Moreover, very different models can be built so as to produce an ensemble with a wide spread in output concentrations (see section 2.2).
[8] In this paper, the system is run at European scale ([40.25°N, 10.25° W] × [56.75°N, 22.25°E]) during summer 2001 (27 April 2001 to 31 August 2001). It aims at forecasting ozone concentrations (hourly concentrations and daily peaks). We define a reference configuration (the reference model or reference ensemble member, not necessarily the best model when compared to observations) in the following way: (1) meteorological data, European Centre for Medium-Range Weather Forecasts (ECMWF) fields (resolution of 0.36° × 0.36°, TL511 spectral resolution in the horizontal, 60 levels, time step of 3 hours, 12 hours forecast cycles starting from analyzed fields); (2) land use coverage, U.S. Geological Survey (USGS) land cover map (24 categories, 1 km Lambert); (3) chemical mechanism, RACM [Stockwell et al., 1997]; (4) emissions, the Co-operative Programme for Monitoring and Evaluation of the Long-range Transmission of Air Pollutants in Europe (EMEP) inventory, converted according to Middleton et al. [1990]; (5) biogenic emissions, computed as proposed by Simpson et al. [1999]; (6) deposition velocities, the revised parameterization from Zhang et al. [2003]; (7) vertical diffusion, within the boundary layer, the Troen and Mahrt parameterization described by Troen and Mahrt [1986], with the boundary layer height computed by ECMWF; above the boundary layer, the Louis parameterization of Louis [1979]; (8) boundary conditions, output of the global chemistry transport model Mozart 2 [Horowitz et al., 2003] run over a typical year; and (9) numerical schemes, a first-order operator splitting, the sequence being advection-diffusion-chemistry; a direct space-time third-order advection scheme with a Koren flux limiter; a second-order Rosenbrock method for diffusion and chemistry [Verwer et al., 2002].
[9] Since ensemble forecasting is computationally expensive, we kept a low vertical resolution. The first layer is located between 0 and 50 m. The thickness of the other layers is about 600 m with the top of the last layer at 3000 m.
2.2. Ensembles Description
[10] We introduce three ensembles:
[11] 1. Ensemble 1 is composed of the reference simulation and 21 similar simulations but for one change in the physical parameterizations, in the raw input data (to Polyphemus), in the numerical approximations or in uncertain input data computed in the system work flow. Table 1 lists all changes.
No. | Model | Reference | Alternative | Comment |
---|---|---|---|---|
Physical Parameterizations | ||||
1^{b} | chemistry | RACM | RADM 2 [Stockwell et al., 1990] | |
2 | vertical diffusion | Troen and Mahrt | Louis [Louis, 1979] | |
3 | vertical diffusion | Troen and Mahrt | Louis in stable conditions | Troen and Mahrt kept in unstable conditions |
4 | deposition velocities | Zhang [Zhang et al., 2003] | Wesely [Wesely, 1989] | |
5 | surface flux | heat flux^{c} | momentum flux^{c} | for the aerodynamic resistance (in deposition velocities) |
6 | cloud attenuation | RADM method [Chang et al., 1987; Madronich, 1987] | Esquif^{d} | |
7 | critical relative humidity | depends on σ | two layers | used in the RADM method to compute cloud attenuation |
Raw Input Data | ||||
8 | emissions vertical distribution | all in the first cell | all in the two first cells | |
9 | land use coverage | USGS | GLCF | for deposition velocities |
10 | land use coverage | USGS | GLCF | for biogenic emissions |
11 | exponent p in Troen and Mahrt | 2 | 3 | |
12 | photolysis constants | JPROC (from EPA Models 3) | dependent on the zenith angle (only) | |
Numerical Approximations | ||||
13 | time step | 600 s | 100 s | |
14 | time step | 600 s | 1800 s^{e} | |
15 | vertical resolution | 5 layers | 9 layers | first layer height remains 50 m |
16 | first layer height | 50 m | 40 m | top height of every other layer does not change |
17 | continuity equation | div(V) = 0 | div (ρV) = 0 | |
Perturbed Input Data | ||||
18 | boundary layer height | ECMWF | increased by 10% | |
19 | NO emissions | EMEP | increased by 25% | including biogenic emissions |
20 | biogenic emissions | [Simpson et al., 1999] | increased by 100% | excluding NO biogenic emissions |
21 | ozone boundary conditions | Mozart 2 | decreased by 10% |
- a Each model has the same configuration as the reference model but for one change (column “alternative”).
- b The reference model is referred to as model 0.
- c Computed using Louis formulae.
- d ESQUIF final report 2001 (available at http://climserv.lmd.polytechnique.fr/esquif).
- e The advection is integrated over submultiples of 1800 s so as to satisfy the Courant-Friedrichs-Lewy (CFL) condition.
[12] 2. Ensemble 2 is built with the changes involved in models 17, 8, 4, 2, 1 (numbers from Table 1). All possible combinations of these changes are included in the ensemble. There are therefore 32 members in ensemble 2.
[13] 3. Ensemble 3 collects all members from ensembles 1 and 2. Ensembles 1 and 2 have six common members (0, 1, 2, 4, 8, and 17); hence there are 48 members in ensemble 3.
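The member counts quoted above follow from simple combinatorics. The short Python sketch below enumerates ensemble 2 as every on/off combination of the five changes and checks the sizes of the three ensembles. Representing a member as the set of Table 1 change numbers applied to the reference is an illustrative assumption, not the way Polyphemus stores its configurations.

```python
from itertools import product

# Change numbers from Table 1 that are combined in ensemble 2.
CHANGES = (1, 2, 4, 8, 17)

# Ensemble 1: the reference (model 0, no change) plus 21 single-change models.
# A member is represented (hypothetically) by the set of changes applied.
ensemble_1 = {frozenset()} | {frozenset({n}) for n in range(1, 22)}

# Ensemble 2: every on/off combination of the five changes.
ensemble_2 = {
    frozenset(c for c, on in zip(CHANGES, flags) if on)
    for flags in product((False, True), repeat=len(CHANGES))
}

common = ensemble_1 & ensemble_2      # members shared by both ensembles
ensemble_3 = ensemble_1 | ensemble_2  # union of both ensembles

print(len(ensemble_1), len(ensemble_2), len(common), len(ensemble_3))
# prints: 22 32 6 48
```

The six common members are the reference and the five single-change models 1, 2, 4, 8 and 17, so that 22 + 32 − 6 = 48.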
[14] Ensembles similar to ensembles 1 and 2 were introduced by Mallet and Sportisse [2006] in order to estimate uncertainties in output ozone concentrations. One may refer to this paper for a detailed description of the ensembles and of their spread. A rough idea of the wide spread is given in Figure 1. The mean of hourly standard deviations of ensemble 3 profiles (shown in Figure 1) is 10.4 μg m^{−3}. To show the spatial distribution of ensemble spread, the standard deviation of ensemble 3 is computed in each cell and for each hour, then averaged relative standard deviations (time average) in each cell are plotted in Figure 2.
2.3. Comparisons With Observations
[15] We use ozone measurements from three monitoring networks (described below). All stations in these networks have observations for at least 30% of possible measurements during the 127 simulated days (for both hourly concentrations and peaks). Here is a description of the networks:
[16] 1. Network 1 is composed of 241 urban and regional stations over Europe. A large part of the stations are in France (116 stations) and in Germany (81 stations). It provides about 619,000 hourly concentrations and 27,500 peaks.
[17] 2. Network 2 includes 85 EMEP stations, that is, regional stations distributed over Europe, with about 240,000 hourly observations and 10,400 peaks.
[18] 3. Network 3 includes 356 urban and regional stations in France from BDQA (“Banque de Données sur la Qualité de l'Air”, managed by Agence de l'Environnement et de la Maîtrise de l'Énergie (ADEME) and gathering 40 approved associations for monitoring air quality). It provides 997,000 hourly measurements and 42,000 peaks. Note that it includes most French stations of network 1.
Ensemble | Hourly RMSE | Hourly Correlation | Hourly Bias | Peak RMSE | Peak Correlation | Peak Bias |
---|---|---|---|---|---|---|
Network 1 | ||||||
Ensemble 1 | ||||||
Best member | 27.0 | 66.1 | 1.8 | 22.7 | 73.8 | 0.1 |
Mean statistics | 29.0 | 63.8 | 11.3 | 24.2 | 71.1 | 2.9 |
Ensemble 2 | ||||||
Best member | 26.7 | 67.9 | 1.8 | 23.0 | 74.8 | 0.1 |
Mean statistics | 29.1 | 64.8 | 13.4 | 26.4 | 69.1 | 6.2 |
Ensemble 3 | ||||||
Mean statistics | 29.0 | 64.4 | 12.6 | 25.6 | 69.8 | 4.9 |
Worst member | 32.1 | 60.8 | 27.1 | 33.5 | 62.2 | 17.2 |
Network 2 | ||||||
Ensemble 1 | ||||||
Best member | 25.7 | 63.6 | 0.5 | 21.5 | 69.7 | 0.1 |
Mean statistics | 26.8 | 60.6 | 7.8 | 22.6 | 67.4 | 2.7 |
Ensemble 2 | ||||||
Best member | 26.3 | 63.9 | 0.2 | 21.6 | 70.2 | 0.4 |
Mean statistics | 28.9 | 59.9 | 12.9 | 25.4 | 64.4 | 6.6 |
Ensemble 3 | ||||||
Mean statistics | 28.1 | 60.1 | 11.0 | 24.4 | 65.5 | 5.1 |
Worst member | 35.1 | 54.4 | 28.7 | 32.1 | 56.7 | 17.3 |
Network 3 | ||||||
Ensemble 1 | ||||||
Best member | 29.4 | 65.5 | 3.2 | 24.9 | 72.2 | 0.2 |
Mean statistics | 32.5 | 61.6 | 15.3 | 26.5 | 67.8 | 2.9 |
Ensemble 2 | ||||||
Best member | 29.0 | 67.8 | 0.2 | 25.1 | 74.4 | 0.5 |
Mean statistics | 31.2 | 62.9 | 12.8 | 29.1 | 65.4 | 6.8 |
Ensemble 3 | ||||||
Mean statistics | 31.7 | 62.4 | 13.8 | 28.2 | 66.2 | 5.4 |
Worst member | 35.8 | 58.8 | 26.0 | 37.5 | 55.4 | 17.7 |
- a RMSE is in μg m^{−3}, correlation in %, and bias in %. Best results, for each network, are in bold. Mean statistics are averaged statistics of individual models.
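For readers who want to reproduce this kind of model-to-data comparison, the two main scores can be sketched as follows. This is a minimal illustration that assumes the correlation is the usual Pearson coefficient and that model values and observations have already been paired (missing observations removed); the function names and the sample values are made up for the example. The bias is left out because its exact definition is not given in this section.

```python
import math

def rmse(model, obs):
    """Root-mean-square error between paired model values and observations."""
    return math.sqrt(sum((m - o) ** 2 for m, o in zip(model, obs)) / len(obs))

def correlation(model, obs):
    """Pearson correlation coefficient, as a fraction (multiply by 100 for %)."""
    n = len(obs)
    mean_m = sum(model) / n
    mean_o = sum(obs) / n
    cov = sum((m - mean_m) * (o - mean_o) for m, o in zip(model, obs))
    var_m = sum((m - mean_m) ** 2 for m in model)
    var_o = sum((o - mean_o) ** 2 for o in obs)
    return cov / math.sqrt(var_m * var_o)

# Made-up ozone concentrations in μg m^-3, for illustration only.
obs = [80.0, 95.0, 110.0, 70.0]
model = [85.0, 90.0, 120.0, 75.0]

print(rmse(model, obs), 100.0 * correlation(model, obs))
```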
3. Combining Forecasts: Methods and Potentialities
3.1. Introduction
[20] For day-to-day forecasts, the modeler is usually able to choose a model with performances close to those of the best model. In particular, this means that the performances of the reference configuration (section 2.1) are similar to those of the best model. The objective is therefore to deliver a forecast with higher performances than the best available model. In other words, a method that merely identifies the best model would not be of great help. Hence ensemble members should be combined. Since ensemble forecasting is computationally expensive, a satisfactory model combination has to bring significant improvements. We consider that a 10% decrease in the root-mean-square error of the best model (that is, about 2–3 μg m^{−3}) is required for an ensemble method to be worthwhile. This threshold is arbitrary, but there is some background to support it. The best model is usually a tuned model, that is, a favorable configuration found by the modeler. Improving a well-tuned model so as to decrease the root-mean-square error by 10% is not an easy task, especially for day-to-day forecasts.
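The acceptance threshold stated above amounts to a one-line test. The sketch below is only an illustration; the function name and its `required_decrease` parameter are hypothetical, and the two checks use RMSE values on daily peaks taken from Table 4 (forecasted combination against best model).

```python
def meets_criterion(rmse_combined, rmse_best, required_decrease=0.10):
    """True if the combination lowers the best model RMSE by more than 10%."""
    return rmse_combined < (1.0 - required_decrease) * rmse_best

# Ensemble 1 over network 2 (Table 4): 19.2 against 21.9 μg m^-3.
print(meets_criterion(19.2, 21.9))   # True, since 0.9 * 21.9 = 19.71

# Ensemble 1 over network 1 (Table 4): 20.5 against 22.4 μg m^-3.
print(meets_criterion(20.5, 22.4))   # False, since 0.9 * 22.4 = 20.16
```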
3.2. Notations
[21] An ensemble is denoted ℰ or ℰ_{i}. For instance, ℰ_{3} = ℰ_{1} ∪ ℰ_{2}. A network is denoted 𝒩 or 𝒩_{i}. The cardinal of a network (number of stations) or of an ensemble (number of models) is denoted by ∣·∣. Output concentrations of a model are denoted M_{t,x} or M_{m,t,x} (if the model is indexed by m), where t is the time step and x denotes a station. Time and spatial averages are denoted ⟨·⟩_{x}^{t} and ⟨·⟩_{t}^{x}, respectively. The mean over all stations and during the whole simulation period is ⟨·⟩^{t,x}. Observations are denoted O_{t,x} and C_{t,x} are combined concentrations.
3.3. Introduction to Combining Methods
3.3.1. Ensemble Mean and Ensemble Median
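Reading EM as the ensemble mean and EMD as the ensemble median (the abbreviations used in Table 3), both combinations can be sketched in a few lines. The data layout, with `forecasts[m][k]` holding the concentration of model m for sample k (one date and station pair), and the sample values are assumptions made for this example.

```python
from statistics import mean, median

def ensemble_mean(forecasts):
    """Mean over the models, computed separately for each sample."""
    return [mean(values) for values in zip(*forecasts)]

def ensemble_median(forecasts):
    """Median over the models, computed separately for each sample."""
    return [median(values) for values in zip(*forecasts)]

# Three hypothetical models, three samples (μg m^-3).
forecasts = [
    [100.0, 80.0, 120.0],
    [110.0, 90.0, 100.0],
    [150.0, 85.0, 110.0],
]

em = ensemble_mean(forecasts)     # [120.0, 85.0, 110.0]
emd = ensemble_median(forecasts)  # [110.0, 85.0, 110.0]
```

In the first sample the median (110) is less affected than the mean (120) by the outlying member at 150, which is the usual motivation for considering both.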
3.3.2. Models Selection
[23] At each station, the best model is selected. The resulting model is denoted EB^{s} (‘B’ stands for “best” and ‘s’ stands for “station”). In the same way, selecting the best model for each date (but for all stations) defines the metamodel EB^{d} (‘d’ stands for “date”).
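The selection per station can be sketched as below, assuming that "best" means lowest RMSE against the observations of that station. The layout `forecasts[m][x][t]` and `obs[x][t]` and the sample values are hypothetical. Since the selection uses all observations, it is an a posteriori combination.

```python
import math

def rmse(model, obs):
    return math.sqrt(sum((m - o) ** 2 for m, o in zip(model, obs)) / len(obs))

def best_model_per_station(forecasts, obs):
    """For each station x, index of the model with the lowest RMSE at x."""
    n_models = len(forecasts)
    return [
        min(range(n_models), key=lambda m: rmse(forecasts[m][x], obs[x]))
        for x in range(len(obs))
    ]

def select(forecasts, best):
    """Combined series: at each station, the series of the selected model."""
    return [forecasts[m][x] for x, m in enumerate(best)]

# Two hypothetical models, two stations, two dates (μg m^-3).
obs = [[100.0, 110.0], [60.0, 70.0]]
forecasts = [
    [[101.0, 111.0], [80.0, 90.0]],   # model 0: good at station 0
    [[120.0, 130.0], [61.0, 69.0]],   # model 1: good at station 1
]

best = best_model_per_station(forecasts, obs)   # [0, 1]
eb_s = select(forecasts, best)
```

The same functions give a selection per date if the two inner indices are swapped, that is, if the arrays are laid out as `forecasts[m][t][x]` and `obs[t][x]`.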
3.3.3. Least Squares Methods
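As a minimal sketch, assume that ELS denotes the linear combination whose weights minimize the sum of squared differences between the combined concentrations and the observations. The weights then solve the normal equations, as in the pure-Python example below. The function names, the flat data layout `forecasts[m][k]` and the sample values are assumptions made for this example, and no bias correction (as the name EULS suggests) is included.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def least_squares_weights(forecasts, obs):
    """Weights w minimizing sum_k (obs[k] - sum_m w[m] * forecasts[m][k])**2."""
    n = len(forecasts)
    gram = [[sum(a * b for a, b in zip(forecasts[i], forecasts[j]))
             for j in range(n)] for i in range(n)]
    rhs = [sum(a * o for a, o in zip(forecasts[i], obs)) for i in range(n)]
    return solve(gram, rhs)

def combine(weights, forecasts):
    """Weighted sum of the models for each sample."""
    return [sum(w * f for w, f in zip(weights, column))
            for column in zip(*forecasts)]

# Two hypothetical models; the observations are exactly their average.
forecasts = [[100.0, 80.0, 120.0, 90.0], [110.0, 70.0, 100.0, 95.0]]
obs = [105.0, 75.0, 110.0, 92.5]

weights = least_squares_weights(forecasts, obs)   # close to [0.5, 0.5]
combined = combine(weights, forecasts)
```

Restricting the samples to one station, or to one date, before calling `least_squares_weights` gives one set of weights per station or per date, which is how the superscripts s and d are read here. The system is only well posed when there are at least as many samples as models.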
3.4. Potentials
[26] In previous formulae, weights are computed on the basis of all observations. In operational forecasts, these weights should be forecasted, that is, computed on the basis of past observations only. However, in this section, methods are assessed through their a posteriori (i.e., with all observations known) performances. This gives the potential of all methods. All statistical measures are provided in Table 3.
Ensemble | Hourly RMSE | Hourly Correlation | Hourly Bias | Peak RMSE | Peak Correlation | Peak Bias |
---|---|---|---|---|---|---|
Network 1 | ||||||
Ensemble 1 | ||||||
EULS^{d} | 16.7 | 87.3 | 2.6 | 13.5 | 91.6 | 1.4 |
Ensemble 2 | ||||||
EULS^{d} | 16.3 | 87.9 | 2.5 | 13.3 | 91.9 | 1.4 |
Ensemble 3 | ||||||
EULS^{s} | 16.5 | 87.7 | 2.0 | 10.9 | 94.5 | 1.0 |
EULS^{d} | 14.5 | 90.6 | 2.0 | 11.6 | 93.9 | 1.1 |
Network 2 | ||||||
Ensemble 1 | ||||||
EM | 25.9 | 61.9 | 6.3 | 22.0 | 68.7 | 0.7 |
EMD | 26.4 | 60.9 | 7.7 | 22.1 | 68.0 | 1.0 |
EB^{s} | 23.1 | 70.6 | 2.4 | 19.7 | 75.3 | 2.4 |
EB^{d} | 24.2 | 67.0 | 2.6 | 19.9 | 74.8 | 2.4 |
ELS | 23.7 | 68.0 | 0.8 | 18.7 | 78.2 | 2.5 |
EULS | 23.4 | 68.8 | 0.0 | 18.5 | 78.7 | 3.2 |
ELS^{s} | 16.4 | 86.3 | 0.7 | 12.9 | 90.3 | 1.2 |
EULS^{s} | 16.0 | 86.8 | 0.2 | 12.5 | 90.9 | 1.4 |
ELS^{d} | 17.1 | 84.8 | 0.5 | 12.5 | 90.9 | 1.3 |
EULS^{d} | 16.7 | 85.5 | 0.2 | 12.1 | 91.4 | 1.4 |
Ensemble 2 | ||||||
EM | 25.2 | 64.4 | 5.5 | 23.1 | 70.5 | 4.6 |
EMD | 25.3 | 64.0 | 4.9 | 23.3 | 69.6 | 4.6 |
EB^{s} | 22.4 | 72.5 | 0.9 | 19.1 | 77.1 | 2.0 |
EB^{d} | 24.0 | 67.3 | 1.3 | 19.6 | 75.6 | 2.1 |
ELS | 24.3 | 66.2 | 0.7 | 19.6 | 75.8 | 2.7 |
EULS | 24.0 | 66.9 | 0.4 | 19.4 | 76.2 | 3.4 |
ELS^{s} | 17.3 | 84.6 | 0.9 | 12.8 | 90.4 | 1.0 |
EULS^{s} | 16.9 | 85.3 | 0.1 | 12.3 | 91.2 | 1.4 |
ELS^{d} | 15.9 | 87.1 | 0.3 | 11.4 | 92.4 | 1.1 |
EULS^{d} | 15.4 | 87.9 | 0.1 | 11.0 | 93.1 | 1.2 |
Ensemble 3 | ||||||
EM | 24.9 | 64.2 | 1.3 | 22.3 | 70.7 | 2.7 |
EMD | 25.7 | 61.5 | 4.5 | 22.2 | 69.2 | 0.7 |
EB^{s} | 22.1 | 73.2 | 0.8 | 18.9 | 77.6 | 1.7 |
EB^{d} | 23.8 | 68.1 | 1.6 | 19.4 | 76.2 | 2.0 |
ELS | 23.5 | 68.8 | 0.9 | 18.3 | 79.3 | 2.4 |
EULS | 23.2 | 69.6 | 0.0 | 18.1 | 79.7 | 3.0 |
ELS^{s} | 15.5 | 87.8 | 0.6 | 10.5 | 93.7 | 0.8 |
EULS^{s} | 15.2 | 88.3 | 0.2 | 10.1 | 94.1 | 1.0 |
ELS^{d} | 11.9 | 93.0 | 0.1 | 8.3 | 96.1 | 0.6 |
EULS^{d} | 11.6 | 93.3 | 0.0 | 8.0 | 96.3 | 0.6 |
Network 3 | ||||||
Ensemble 1 | ||||||
EULS^{d} | 16.9 | 88.3 | 3.1 | 13.9 | 91.9 | 1.4 |
Ensemble 2 | ||||||
EULS^{d} | 16.4 | 89.0 | 3.0 | 13.3 | 92.6 | 1.3 |
Ensemble 3 | ||||||
EULS^{d} | 15.0 | 90.8 | 2.6 | 11.9 | 94.1 | 1.1 |
- a RMSE is in μg m^{−3}, correlation in %, and bias in %. For networks 1 and 3, only the best combinations with respect to RMSE are shown. Conclusions drawn from results over network 2 are very similar for networks 1 and 3.
3.4.1. Ensemble Mean and Ensemble Median
[27] For every ensemble, results of EM and EMD are better than the averaged statistics of the ensemble. However, they often have lower performances than the best member. No ensemble mean or ensemble median has a RMSE below 90% of the best RMSE of the same ensemble. Ensemble mean and ensemble median therefore show poor performances. This is in contradiction with the results from Delle Monache and Stull [2003] and McKeen et al. [2005]. Nonetheless, the former study involved only four models, over 6 days and with five stations, which limits the reliability of the conclusions, as also pointed out by the authors. The latter study is also limited, with seven models.
3.4.2. Models Selection
[28] Performances of EB^{s} and EB^{d} are satisfactory, especially on the peaks. RMSE are then below 90% of the best model RMSE.
3.4.3. Least Squares Methods
[29] All comments are made for both the regular least squares version and the corresponding unbiased version since their performances are very similar. Least squares method applied with a single combination, over the whole network and at all dates, brings significant improvement in results. RMSE are usually well below 90% of the best model RMSE.
[30] Meanwhile, the best performances are reached by far with the least squares methods per station and per date. Over network 2, EULS^{d} based on ensemble 3 even reaches a RMSE of 8 μg m^{−3} and a correlation of 96.3% for daily peaks.
[31] Combinations based on ensemble 3 logically show the best results since ensemble 3 includes all simulations. Least squares combinations based on ensemble 2 are slightly better than ensemble 1, which may be due to the number of members, the wider spread or a favorable configuration. Least squares combinations per date usually perform better than combinations per station. The ratio between the number of available stations per date and the number of measurements per station might be an explanation for hourly concentrations. However, it is likely that the spatiotemporal structure of the computed fields plays an important role. Ozone daily peaks at a representative station (with respect to the statistics) illustrate the improvements in Figure 3.
4. Forecasting Ensemble Combinations and Selecting Ensemble Members
[32] The previous results show a strong potential for least squares methods. The objective is to use them for forecasts, that is, to forecast the weights associated with every model based on the weights computed in the past days. This may be viewed as a data assimilation procedure constrained by the ensemble structure. Unless otherwise specified, the following tests are performed with ensemble 1, over network 2 and with ozone daily peaks.
4.1. Weights Stability
[33] Since ELS^{s} and ELS^{d} both show promising performances, the combinations may be forecasted at each station (and over a given period) or for each time step (and for all stations). In order to ease weights forecasting, combinations with a low time dependency are of high interest. It is also useful to have spatially robust weights, that is, weights that may be applied to another network or to other grid cells. With such weights, the whole ground field may be forecasted, which is a key feature of three-dimensional (3-D) chemistry transport models.
[34] It is noteworthy that (1) there exist constant weights over the whole period (127 days) for an efficient combination (ELS^{s}) and (2) there also exist uniform weights (over a network) associated with high performances (ELS^{d}). The question is primarily to know whether these coefficients can be forecasted.
4.1.1. Least Squares Method per Date
[35] Time evolution of weights for ELS^{d}, for three models, is shown in Figure 4. These weights are highly variable. Even the highest weights (in absolute value), which constitute the main part of the combination, are highly unstable. This makes the combination very hard to forecast.
[36] Another property is that it is not easy to use these weights over another network or in other cells. Applying the weights computed for network 3 (and ensemble 1, ozone peaks) to network 2 leads to a RMSE of 55.8 μg m^{−3}. This is the least favorable extension since the two networks contain stations of different nature and their spatial extents differ strongly. A more favorable experiment is to compute the weights over network 2 (Europe) and to apply them to network 3 (France). The resulting RMSE is 24.6 μg m^{−3} (correlation of 74.7%), which is reasonable but similar to the RMSE of the best model (24.9 μg m^{−3}, correlation of 72.2%). An even more favorable experiment is possible. Like network 2, network 1 has stations over the whole of Europe, but it has both regional and urban stations, including stations from network 3. Network 3 is therefore closer to network 1 than to network 2. Applying weights computed over network 1 to network 3 gives better performances, with a RMSE of 17.4 μg m^{−3} and a correlation of 87.1%. This promising result tends to show that it is possible to apply suitable weights to cells without observations.
4.1.2. Least Squares Method per Station
[37] Weights (for ELS^{s}) associated with each model are highly variable over the network, as shown in Figure 5. In addition, there is no subset of stations over which the weights are similar. This is not surprising because setting a single weight per model (for all stations and all dates) does not provide very strong improvements (Table 3, ELS).
4.2. Using Weights From the Previous Days
[38] An obvious method is to use weights computed in the previous days. In this section, the statistics are computed over the last 96 days so that up to 30 days may be used as a learning period. A learning period of n days includes the n days preceding the forecasted day. This is a “moving learning period”.
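The moving learning period can be illustrated with day indices only. In the sketch below, days are numbered from 0 (27 April 2001) to 126 (31 August 2001); this numbering and the function name are assumptions made for the example.

```python
def learning_days(forecast_day, n_learning):
    """Indices of the n_learning days that immediately precede forecast_day."""
    if forecast_day < n_learning:
        raise ValueError("not enough preceding days for this learning period")
    return range(forecast_day - n_learning, forecast_day)

N_DAYS = 127       # simulated days
N_EVALUATED = 96   # statistics are computed over the last 96 days

# 0-based index of the first day that enters the statistics.
first_evaluated = N_DAYS - N_EVALUATED           # 31

# Even for that first day, a full 30-day learning period is available.
window = learning_days(first_evaluated, 30)      # days 1 to 30
print(window[0], window[-1], len(window))        # prints: 1 30 30
```

The window slides by one day for each new forecast, so the weights applied on a given day never depend on observations from that day or later.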
4.2.1. Least Squares Method per Station
[39] Computing weights per station over a learning period of 22 to 30 days (22 is a minimum because there are 22 weights) fails to improve the forecasts. The best results are obtained with a 30-day learning period, with a RMSE of 40.7 μg m^{−3}. Extending the learning period should help (ELS^{s} performs well), but the test period would be too short to draw reliable conclusions. However, we report that a 60-day learning period yields, during the last 36 simulated days, a RMSE of 22.3 μg m^{−3} (best model 21.6 μg m^{−3}) and a correlation of 76.5% (best model 74.6%). In conclusion, the strategy is not satisfactory for this simulation. Nevertheless, further investigations, with a simulation over a longer period, are needed.
4.2.2. Least Squares Method per Date
[40] At each date, weights are the same for all stations. They are computed on the basis of a learning period ranging from 1 to 30 days. Figure 6 shows that this method performs well with a long learning period. Performances are close to the ones of ELS.
[41] With a 30-day learning period, RMSE of the forecasted combination is 19.2 μg m^{−3} (best model 21.9 μg m^{−3}) and correlation is 80.0% (best model 73.3%). The criterion on RMSE (below 90% of the RMSE of the best model) is therefore fulfilled (for ensemble 1 and network 2). This is not the case with all ensembles and networks, as shown in Table 4. However, there are always significant improvements.
Ensemble | ELS^{d} | ELS | Best model | Forecast |
---|---|---|---|---|
Network 1 | ||||
Ensemble 1 | 14.1/91.7 | 19.6/83.3 | 22.4/78.0 | 20.5/81.7 |
Ensemble 2 | 13.9/92.0 | 20.5/81.5 | 22.4/78.1 | 21.3/80.0 |
Ensemble 3 | 12.0/94.1 | 19.2/84.0 | 22.4/78.1 | 20.2/82.2 |
Network 2 | ||||
Ensemble 1 | 12.8/91.6 | 18.7/81.1 | 21.9/73.1 | 19.2/80.0 |
Ensemble 2 | 11.6/93.1 | 19.6/79.0 | 21.9/73.8 | 20.4/77.2 |
Ensemble 3 | 8.4/96.4 | 18.2/82.3 | 21.9/73.8 | 19.0/80.4 |
Network 3 | ||||
Ensemble 1 | 14.6/91.8 | 21.1/81.8 | 24.0/76.4 | 21.8/80.6 |
Ensemble 2 | 13.9/92.5 | 21.1/81.9 | 23.9/76.6 | 22.1/80.0 |
Ensemble 3 | 12.4/94.1 | 20.2/83.5 | 23.9/76.6 | 21.2/81.6 |
- a In each column, RMSE in μg m^{−3} is followed by correlation in %.
[42] A key point to explain these improvements is the time evolution of the weights. Figure 4 shows strong variations, which explains why a 1-day learning period is unlikely to be suitable. Indeed, as shown in Figure 6, a 1-day learning period gives poor performances (RMSE of 23.7 μg m^{−3} and correlation of 73%). Coefficients computed over a 30-day learning period are more stable (see Figure 7).
4.2.3. Hourly Concentrations
[43] Hourly forecasts may also be improved using weights learned in the previous 30 days and estimated per date as in section 4.2.2. In order to forecast the weights at a given hour h, only concentrations computed and observed at hour h during the learning period are included. Including all hourly concentrations lowers performances.
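Restricting the learning data to a single hour of the day is a matter of indexing. The sketch below assumes a flat hourly series that starts at hour 0 of the first learning day and has no gaps; the function name is hypothetical.

```python
def samples_at_hour(series, hour, hours_per_day=24):
    """Values of a flat hourly series taken at the given hour of each day."""
    return series[hour::hours_per_day]

# Three days of hourly values; each value is simply its own index.
series = list(range(3 * 24))

print(samples_at_hour(series, 15))   # prints: [15, 39, 63]
```

The weights for hour h are then fitted on `samples_at_hour(model_series, h)` for each model and on `samples_at_hour(observed_series, h)`, so that 24 separate sets of weights are learned.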
[44] All results are collected in Table 5. Performances are significantly improved, especially over networks 1 and 3. Note that these performances are similar to the performances of ELS (for which one can say that the learning period is the whole simulation).
Ensemble | ELS^{d} | ELS | Best Model | Forecast ELS^{d} |
---|---|---|---|---|
Network 1 | ||||
Ensemble 1 | 17.2/87.3 | 22.9/75.9 | 26.8/68.4 | 22.7/76.6 |
Ensemble 2 | 16.8/87.9 | 24.0/73.2 | 26.7/69.9 | 23.3/75.2 |
Ensemble 3 | 14.9/90.6 | 22.7/76.5 | 26.7/69.9 | 22.5/77.1 |
Network 2 | ||||
Ensemble 1 | 17.3/85.5 | 23.9/70.1 | 25.9/65.6 | 23.6/71.0 |
Ensemble 2 | 16.1/87.7 | 24.6/67.9 | 26.7/65.7 | 24.6/68.0 |
Ensemble 3 | 11.9/93.4 | 23.6/71.0 | 25.9/65.7 | 23.4/71.5 |
Network 3 | ||||
Ensemble 1 | 17.2/88.4 | 23.3/77.5 | 28.7/68.0 | 22.9/78.4 |
Ensemble 2 | 16.7/89.2 | 24.9/73.7 | 28.5/69.9 | 23.7/76.8 |
Ensemble 3 | 15.3/91.0 | 22.9/78.4 | 28.5/69.9 | 22.8/78.7 |
- a The weights associated with a given hour h are estimated with the computed and observed concentrations at hour h during the learning period. RMSE (in μg m^{−3})/correlation (in %) are given for each entry.
4.3. Learning Algorithms
[47] Table 6 shows results of the gradient descent algorithm. Performances (gradient descent column) are slightly better than performances of the least squares method with weights computed at each date (introduced in section 4.2.2, forecast ELS^{d} column). The learning algorithm succeeds while applying weights computed during the previous days fails (section 4.2.1). Knowing that there are many variants of learning algorithms (with updates that differ from equation (10)), this is certainly a promising direction for further improvements.
Ensemble | ELS^{d} | Forecast ELS^{d} | Best Model | Gradient descent |
---|---|---|---|---|
Network 1 | ||||
Ensemble 1 | 13.8/91.9 | 20.3/81.5 | 22.4/77.5 | 20.1/82.1 |
Ensemble 2 | 14.2/91.4 | 21.0/80.0 | 22.4/77.7 | 19.5/83.0 |
Ensemble 3 | 11.2/94.7 | 20.0/82.1 | 22.4/77.7 | 19.6/83.0 |
Network 2 | ||||
Ensemble 1 | 13.0/91.1 | 18.8/80.0 | 21.8/72.5 | 18.8/80.6 |
Ensemble 2 | 13.1/90.9 | 20.2/76.8 | 21.8/73.5 | 18.2/81.7 |
Ensemble 3 | 10.6/94.2 | 18.8/80.2 | 21.8/73.5 | 18.2/81.6 |
Network 3 | ||||
Ensemble 1 | 14.7/91.7 | 21.8/80.6 | 24.2/76.2 | 21.7/81.0 |
Ensemble 2 | 15.0/91.4 | 22.1/80.1 | 24.1/76.4 | 22.7/82.9 |
Ensemble 3 | 12.6/94.0 | 21.3/81.6 | 24.1/76.4 | 21.0/82.3 |
- a Forecasted weights for ELS^{d} are computed as in section 4.2.2 (same as Table 4). The first 30 days constitute a minimum learning period. All forecasted concentrations are preceded by at least 30 contiguous peak observations. This is the reason why the comparisons with observations slightly differ from Table 4 (whose forecast column corresponds to the forecast ELS^{d} column of this table). RMSE (in μg m^{−3})/correlation (in %) are given for each entry.
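To give an idea of what a gradient descent update of the weights looks like, the sketch below applies one generic step on the squared error of a single observation. It is not equation (10), which is not reproduced in this section: the update rule, the learning rate, the function names and the sample values are all assumptions chosen for illustration.

```python
def predict(weights, forecasts_t):
    """Combined concentration for one sample: weighted sum of the models."""
    return sum(w * f for w, f in zip(weights, forecasts_t))

def gradient_step(weights, forecasts_t, obs_t, learning_rate):
    """One gradient step on the squared error (predict(w) - obs_t)**2."""
    residual = predict(weights, forecasts_t) - obs_t
    return [w - 2.0 * learning_rate * residual * f
            for w, f in zip(weights, forecasts_t)]

# Two hypothetical models and one observed peak (μg m^-3).
weights = [0.5, 0.5]
forecasts_t = [100.0, 120.0]
obs_t = 105.0

error_before = abs(predict(weights, forecasts_t) - obs_t)
weights = gradient_step(weights, forecasts_t, obs_t, learning_rate=1e-5)
error_after = abs(predict(weights, forecasts_t) - obs_t)

print(error_before, error_after)   # the error shrinks after the update
```

In a sequential setting the step is repeated each day with the newly available observations, so that the weights are updated rather than recomputed from scratch. The learning rate must stay small relative to the squared magnitude of the concentrations, otherwise the update overshoots.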
4.4. Members Selection
[48] In Table 4, ensemble 1 shows better performances than ensemble 2 even though ensemble 1 has fewer members (22 against 32) and a smaller spread. Because of computational costs, it is useful to reduce the number of models to be included. Figure 8 shows performances of ELS^{d} against the number of models, where the models from ensemble 3 are included one by one in the optimization. Even if the impact of additional models decreases with the number of models, performances are still significantly improved.
[49] Another question is whether there are models that contribute more than others to performance improvement. In Figure 9, contributions of several models from ensemble 2 (to four subensembles based on ensemble 1 and of size 5, 10, 15 and 20) are shown. Contributions are primarily distinguishable in small ensembles. We also report that contributions of the models from ensemble 1 (to four subensembles based on ensemble 2 and of size 5, 10, 15 and 20) are less distinguishable. In addition, the correlation between the models' RMSE and their contribution to the combined model RMSE is below 30%. It seems that the best models do not necessarily bring the best contributions.
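One way to quantify such contributions is sketched here. It assumes that the contribution of a model is the decrease in RMSE of a least-squares combination when that model is added to a subensemble, which may differ from the definition used for Figure 9. The helper names and the synthetic data are hypothetical.

```python
import numpy as np


def ls_rmse(forecasts, observations):
    """RMSE of the least-squares combination of the columns of `forecasts`."""
    weights, *_ = np.linalg.lstsq(forecasts, observations, rcond=None)
    return float(np.sqrt(np.mean((forecasts @ weights - observations) ** 2)))


def contributions(base, candidates, observations):
    """RMSE decrease obtained by adding each candidate model to the base subensemble."""
    base_rmse = ls_rmse(base, observations)
    return np.array([
        base_rmse - ls_rmse(np.column_stack([base, candidates[:, j]]), observations)
        for j in range(candidates.shape[1])
    ])


# Synthetic data (not from the paper): 10 models with increasing noise levels.
rng = np.random.default_rng(1)
truth = 120.0 + 20.0 * rng.standard_normal(400)
noise_levels = np.linspace(4.0, 12.0, 10)
models = truth[:, None] + noise_levels * rng.standard_normal((400, 10))

base, candidates = models[:, :5], models[:, 5:]
gains = contributions(base, candidates, truth)
# RMSE of each candidate model taken alone, without any combination.
individual = np.sqrt(np.mean((candidates - truth[:, None]) ** 2, axis=0))
correlation = np.corrcoef(individual, gains)[0, 1]
print("contributions:", gains)
print("correlation with individual RMSE:", correlation)
```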
[50] There is no clear reason why forecasted combinations based on ensemble 2 show lower performances than those based on ensemble 1. Ensemble 2 includes simulations with multiple changes (see section 2.2) and only five choices are involved, which might be poorer than involving 21 different single changes.
5. Conclusion
[51] The forecasting system Polyphemus has the ability to generate ensemble forecasts with a wide spread in output concentrations and with a high number of different members. Combining the models in an optimal way has a strong potential. While ensemble mean and ensemble median barely improve the performances, results may be dramatically enhanced by linear combinations whose weights are optimal in some sense.
[52] It was shown that weights computed over a given network do not necessarily apply to another network and consequently to other grid cells. This low spatial robustness of the weights should be studied since gridded forecasts are an important feature of 3-D chemistry transport models.
[53] Daily forecasts also require forecasting the weights of an optimal combination. Weights appear to be highly unstable from one day to another or from one station to another. More stable weights are found in combinations constant over a 30-day period and over a whole network. These weights can be reasonably forecasted and the associated combinations provide significant improvements on hourly concentrations and on daily peaks. A decrease of about 10% of the RMSE is achieved on daily peaks. Hourly concentrations show even stronger improvements.
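A combination that is constant over a 30-day period and over a whole network can be forecasted along the following lines. This is a minimal sketch, assuming that the weights are fitted by ordinary least squares on the 30 previous days over all stations and then applied to the next day; the function name, the array layout, and the synthetic data are hypothetical.

```python
import numpy as np


def rolling_window_forecast(forecasts, observations, window=30):
    """Forecast with least-squares weights learned on the previous `window` days.

    forecasts: array (n_days, n_stations, n_models).
    observations: array (n_days, n_stations).
    Returns combined forecasts for days window..n_days-1,
    as an array of shape (n_days - window, n_stations).
    """
    n_days, n_stations, n_models = forecasts.shape
    combined = np.empty((n_days - window, n_stations))
    for day in range(window, n_days):
        # Learning set: all stations over the previous `window` days.
        past_forecasts = forecasts[day - window:day].reshape(-1, n_models)
        past_obs = observations[day - window:day].reshape(-1)
        weights, *_ = np.linalg.lstsq(past_forecasts, past_obs, rcond=None)
        # The same weights are applied to every station on the forecast day.
        combined[day - window] = forecasts[day] @ weights
    return combined


def rmse(forecast, observed):
    """Root-mean-square error between two arrays of the same shape."""
    return float(np.sqrt(np.mean((forecast - observed) ** 2)))


# Synthetic data (not from the paper): 60 days, 20 stations, 5 biased members.
rng = np.random.default_rng(0)
truth = 120.0 + 20.0 * rng.standard_normal((60, 20))
scales = np.array([1.10, 1.15, 1.20, 1.25, 1.30])
members = truth[..., None] * scales + 5.0 * rng.standard_normal((60, 20, 5))

combined = rolling_window_forecast(members, truth, window=30)
print("ensemble mean RMSE:", rmse(members[30:].mean(axis=2), truth[30:]))
print("rolling least-squares RMSE:", rmse(combined, truth[30:]))
```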
[54] In addition, there is a promising application of learning algorithms (machine learning) which do not need to introduce weights computed over the numerous stations of a monitoring network. The gradient descent algorithm shows good performances when applied to each station, while applying weights computed over a 30-day learning period fails.
[55] An ensemble with fewer members and less spread than another can lead to better combinations. Member selection was therefore discussed. Additional models always bring improvements, but these improvements are only weakly related to the models' individual performances.
[56] Future work should address this issue. Additional sources of uncertainty could be introduced. Meteorological ensemble forecasts and Monte Carlo simulation on other input data are necessary steps to account for all uncertainties. The computational costs will be a crucial point. Relevant strategies are needed for the introduction of Monte Carlo methods together with discrete changes (in the model formulation, as performed in this paper through changes in physical parameterizations and numerical approximations).
[57] An obvious direction for future work is forecasting the weights. As shown in this paper, the potential of model combination is very high, much higher than what is achieved with the forecasted combinations tested so far. Specific learning algorithms should be involved.
[58] Ensemble forecasting may also deliver probabilistic forecasts. This would improve the forecasts by providing additional information. It would help in assessing uncertainties and it would allow reasonable integrated uses of air quality models, e.g., for risk assessment.
[59] Finally, an open question is the relation between ensemble forecast and classical data assimilation. Would sequential or variational data assimilation perform better than ensemble-based forecast? How could both strategies be combined? Data assimilation may be performed on each member of an ensemble, or only on a reference member (to be determined) whose updates (from the data assimilation procedure) would be applied to other members (e.g., optimized emissions or corrected initial conditions). In addition, the ensemble spread may be valuable information in the data assimilation procedure.
Acknowledgments
[60] The first author is partially supported by the Île-de-France region. We thank the monitoring networks that provided the numerous observations. We also thank Gilles Stoltz (CNRS) for his introduction to machine learning algorithms.