Volume 111, Issue D18
Composition and Chemistry

Ensemble-based air quality forecasts: A multimodel approach applied to ozone

Vivien Mallet and Bruno Sportisse

Centre d'Enseignement et de Recherche en Environnement Atmosphérique, École Nationale des Ponts et Chaussées/Électricité de France Recherche et Développement, Marne la Vallée, France

Also at CLIME, Joint Team Inria/École Nationale des Ponts et Chaussées, Paris, France.
First published: 21 September 2006

Abstract

[1] The potential of ensemble techniques to improve ozone forecasts is investigated. Ensembles with up to 48 members (models) are generated with the modeling system Polyphemus. Members differ in their physical parameterizations, their numerical approximations, and their input data. Each model is evaluated over 4 months (summer 2001) over Europe with hundreds of stations from three ozone-monitoring networks. We find that several linear combinations of models have the potential to drastically improve the agreement between model and data. The optimal weights associated with each model are not robust in time or space. Forecasting these weights therefore requires relevant methods, such as the selection of adequate learning data sets or specific learning algorithms. Significant performance improvements are achieved by the resulting forecasted combinations: a decrease of about 10% in the root-mean-square error is obtained on ozone daily peaks, and ozone hourly concentrations show even stronger improvements.

1. Introduction

[2] Although it has seldom been assessed, the uncertainty in chemistry transport models is a major limitation of air quality forecasting. This uncertainty originates in input fields (emissions, deposition velocities, land data, meteorological fields, etc.), as detailed by Hanna et al. [1998, 2001], and in the models themselves [Russell and Dennis, 2000; Mallet and Sportisse, 2006]. The uncertainty is so high that the reliability of model results should be carefully assessed, and ensemble forecasting is a relevant way to address this issue. Straume et al. [1998], Dabberdt and Miller [2000], Galmarini et al. [2004], Straume [2001], and Warner et al. [2002] have estimated the uncertainty in dispersion modeling using ensemble forecasts. Dealing with ozone exposure, Hanna et al. [1998] and Beekmann and Derognat [2003] accounted for uncertainties in input fields with Monte Carlo simulations in order to check the efficiency of emission reductions. Hanna et al. [2001] and Hanna and Davis [2002] estimated the uncertainty in photochemical forecasts with Monte Carlo simulations, and Mallet and Sportisse [2006] did so with a multimodel approach.

[3] With respect to day-to-day photochemical forecasts, few developments have been undertaken to associate uncertainties with the forecasts or to overcome the limitations due to uncertain processes and data. Improvements in air quality forecasts have been sought in modeling developments, input data refinements, and increased computational resources. Unfortunately, the performances have only slightly increased [Russell and Dennis, 2000]. A reasonable explanation is that the high uncertainties mask the benefits of modeling efforts and that models are usually tuned to deliver satisfactory forecasts (the latter is also suggested by Russell and Dennis [2000]). Taking the uncertainty into account could help enhance the forecasts. A promising technique is to perform ensemble forecasts and to combine the ensemble members.

[4] A brute force approach is the use of the ensemble mean [Delle Monache and Stull, 2003; McKeen et al., 2005]. The underlying (and strong) assumptions are that the ensemble provides an accurate approximation of the probability density function of the output concentrations and that the mean of this probability density function is close to the true state. Because of the limited number of models and the unsatisfactory description of the uncertainty, the first assumption is hard to satisfy; moreover, no study supports the second assumption. More sophisticated methods have been used, mainly in other fields, such as superensembles in meteorology [Krishnamurti et al., 2000] and for ozone forecasts [Pagowski et al., 2005], or Bayesian model averaging [Hoeting et al., 1999] (in many fields).

[5] In this paper, we investigate several methods to build optimal combinations of ensemble members. The objective is to increase day-to-day forecast performances (estimated through comparisons against field measurements). The methods are applied to ozone hourly concentrations and daily peaks at European scale during summer 2001, over hundreds of stations from three monitoring networks. The ensembles involved include up to 48 members, which allows us to study the characteristics of efficient ensembles. Section 2 gives further details about the ensemble members and the system used to generate these forecasts. In section 3 we introduce the methods that we investigated and review their potential, that is, the quality of their a posteriori (i.e., knowing all observations) combinations. In section 4, methods to forecast optimal ensemble combinations are described and tested. The selection of the best suited members is also addressed.

2. Ensemble Forecasts

2.1. Forecasting System Polyphemus

[6] Polyphemus [Mallet et al., 2005] is an air quality modeling system with ensemble-forecasting abilities based on multiple configurations. These configurations define almost all components of the modeling system so that each configuration should be viewed as a new model. Polyphemus is primarily composed of (1) a library for physical parameterizations (and data processing), AtmoData [Mallet and Sportisse, 2005], that includes several parameterizations for major processes; (2) a chemistry transport model, Polair3D [Boutahar et al., 2004], whose gas-phase version is basically a numerical solver for the reactive-dispersion equation; and (3) a set of programs that make calls to AtmoData in order to compute the input data to the chemistry transport model.

[7] Contrary to most modeling systems, which rely on an "all-in-one" chemistry transport model, Polyphemus separates the numerical solver from the physical parameterizations and the data management. The programs that compute the input data to the chemistry transport model provide flexibility: they offer several options supported by the multiple physical parameterizations available in AtmoData. Polair3D is also versatile enough to offer several chemical mechanisms and numerical approximations. In addition, the independence of the components eases work flow control, such as corrections to the input fields of Polair3D or the incorporation of new data sets. These features make it possible to build ensembles with a high number of members. Moreover, very different models can be built, so as to produce an ensemble with a wide spread in output concentrations (see section 2.2).

[8] In this paper, the system is run at European scale ([40.25°N, 10.25°W] × [56.75°N, 22.25°E]) during summer 2001 (27 April 2001 to 31 August 2001). It aims at forecasting ozone concentrations (hourly concentrations and daily peaks). We define a reference configuration (the reference model or reference ensemble member, not necessarily the best model when compared with observations) in the following way: (1) meteorological data, European Centre for Medium-Range Weather Forecasts (ECMWF) fields (resolution of 0.36° × 0.36°, TL511 spectral resolution in the horizontal, 60 levels, time step of 3 hours, 12-hour forecast cycles starting from analyzed fields); (2) land use coverage, U.S. Geological Survey (USGS) land cover map (24 categories, 1 km Lambert); (3) chemical mechanism, RACM [Stockwell et al., 1997]; (4) emissions, the Co-operative Programme for Monitoring and Evaluation of the Long-range Transmission of Air Pollutants in Europe (EMEP) inventory, converted according to Middleton et al. [1990]; (5) biogenic emissions, computed as proposed by Simpson et al. [1999]; (6) deposition velocities, the revised parameterization from Zhang et al. [2003]; (7) vertical diffusion, within the boundary layer, the Troen and Mahrt parameterization described by Troen and Mahrt [1986], with the boundary layer height computed by ECMWF, and, above the boundary layer, the Louis parameterization of Louis [1979]; (8) boundary conditions, output of the global chemistry transport model Mozart 2 [Horowitz et al., 2003] run over a typical year; and (9) numerical schemes, a first-order operator splitting (the sequence being advection-diffusion-chemistry), a direct space-time third-order advection scheme with a Koren flux limiter, and a second-order Rosenbrock method for diffusion and chemistry [Verwer et al., 2002].

[9] Since ensemble forecasting is computationally demanding, we kept a low vertical resolution. The first layer is located between 0 and 50 m. The thickness of the other layers is about 600 m, with the top of the last layer at 3000 m.

2.2. Ensembles Description

[10] We introduce three ensembles:

[11] 1. Ensemble 1 is composed of the reference simulation and 21 similar simulations, each with one change in the physical parameterizations, in the raw input data (to Polyphemus), in the numerical approximations, or in uncertain input data computed in the system work flow. Table 1 lists all changes.

Table 1. Physical Parameterizations, Raw Input Data (to Polyphemus), Numerical Approximations, and Perturbed Input Data Involved in Ensemble 1^a

| No. | Model | Reference | Alternative | Comment |
|-----|-------|-----------|-------------|---------|
| | **Physical Parameterizations** | | | |
| 1^b | chemistry | RACM | RADM 2 [Stockwell et al., 1990] | |
| 2 | vertical diffusion | Troen and Mahrt | Louis [Louis, 1979] | |
| 3 | vertical diffusion | Troen and Mahrt | Louis in stable conditions | Troen and Mahrt kept in unstable conditions |
| 4 | deposition velocities | Zhang [Zhang et al., 2003] | Wesely [Wesely, 1989] | |
| 5 | surface flux | heat flux^c | momentum flux^c | for the aerodynamic resistance (in deposition velocities) |
| 6 | cloud attenuation | RADM method [Chang et al., 1987; Madronich, 1987] | Esquif^d | |
| 7 | critical relative humidity | depends on σ | two layers | used in the RADM method to compute cloud attenuation |
| | **Raw Input Data** | | | |
| 8 | emissions vertical distribution | all in the first cell | all in the two first cells | |
| 9 | land use coverage | USGS | GLCF | for deposition velocities |
| 10 | land use coverage | USGS | GLCF | for biogenic emissions |
| 11 | exponent p in Troen and Mahrt | 2 | 3 | |
| 12 | photolysis constants | JPROC (from EPA Models 3) | dependent on the zenith angle (only) | |
| | **Numerical Approximations** | | | |
| 13 | time step | 600 s | 100 s | |
| 14 | time step | 600 s | 1800 s^e | |
| 15 | vertical resolution | 5 layers | 9 layers | first layer height remains 50 m |
| 16 | first layer height | 50 m | 40 m | top height of every other layer does not change |
| 17 | continuity equation | div(V) = 0 | div(ρV) = 0 | |
| | **Perturbed Input Data** | | | |
| 18 | boundary layer height | ECMWF | increased by 10% | |
| 19 | NO emissions | EMEP | increased by 25% | including biogenic emissions |
| 20 | biogenic emissions | [Simpson et al., 1999] | increased by 100% | excluding NO biogenic emissions |
| 21 | ozone boundary conditions | Mozart 2 | decreased by 10% | |

  • a Each model has the same configuration as the reference model but for one change (column "Alternative").
  • b The reference model is referred to as model 0.
  • c Computed using the Louis formulae.
  • d ESQUIF final report, 2001 (available at http://climserv.lmd.polytechnique.fr/esquif).
  • e The advection is integrated over submultiples of 1800 s so as to satisfy the Courant-Friedrichs-Lewy (CFL) condition.

[12] 2. Ensemble 2 is built with the changes involved in models 17, 8, 4, 2, and 1 (numbers from Table 1). All possible combinations of these changes are included in the ensemble; there are therefore 2^5 = 32 members in ensemble 2 (see the sketch after this list).

[13] 3. Ensemble 3 collects all members from ensembles 1 and 2. Ensembles 1 and 2 have six common members (0, 1, 2, 4, 8, and 17); hence there are 48 members in ensemble 3.
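For concreteness, the construction of ensemble 2 is a full factorial design over five binary choices. The following Python snippet sketches the enumeration; the member representation (a set of applied changes) is our own illustration, not the actual Polyphemus configuration machinery:

```python
from itertools import product

# The five alternative changes used to build ensemble 2
# (model numbers from Table 1).
changes = [17, 8, 4, 2, 1]

# Each member applies a subset of these changes to the reference
# model, and all subsets are taken: 2**5 = 32 members.
members = [
    {change for change, on in zip(changes, flags) if on}
    for flags in product([False, True], repeat=len(changes))
]

print(len(members))  # 32
print(members[0])    # set(), i.e., the unchanged reference model
```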

[14] Ensembles similar to ensembles 1 and 2 were introduced by Mallet and Sportisse [2006] in order to estimate uncertainties in output ozone concentrations. One may refer to that paper for a detailed description of the ensembles and of their spread. A rough idea of the wide spread is given in Figure 1. The mean of the hourly standard deviations of the ensemble 3 profiles (shown in Figure 1) is 10.4 μg m−3. To show the spatial distribution of the ensemble spread, the standard deviation of ensemble 3 is computed in each cell and for each hour; the resulting relative standard deviations, averaged in time, are plotted in each cell in Figure 2.

Figure 1. Ozone daily profiles of the 48 models (ensemble 3). The dashed lines correspond to the models that are in ensemble 2 and not in ensemble 1. The concentrations are averaged over the whole domain (excluding a three-cell band around the domain borders) and over the 127 simulated days.
Figure 2. Spatial distribution of the ensemble 3 spread. The standard deviation of the ensemble is computed in each cell and for each hour. The resulting standard deviations are then averaged (time averages) in each cell and divided by the mean concentration of the cell, which gives a relative standard deviation.
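The spread diagnostic behind Figure 2 reduces, in each grid cell, the hourly ensemble standard deviation to a single time-averaged relative value. Below is a minimal sketch, assuming the ensemble concentrations are held in a NumPy array of shape (members, hours, y, x); the array layout and function name are illustrative, not Polyphemus code:

```python
import numpy as np

def relative_spread_map(conc):
    """Time-averaged relative ensemble spread per grid cell.

    conc: array of shape (n_members, n_hours, n_y, n_x) of ozone
    concentrations. Returns an (n_y, n_x) map: the hourly ensemble
    standard deviation, averaged over time and divided by the mean
    concentration of the cell (as in Figure 2).
    """
    std_per_hour = conc.std(axis=0)        # (n_hours, n_y, n_x)
    mean_std = std_per_hour.mean(axis=0)   # time average of the spread
    cell_mean = conc.mean(axis=(0, 1))     # mean concentration per cell
    return mean_std / cell_mean

# Example with synthetic data (48 members, 24 hours, 10 x 10 grid).
rng = np.random.default_rng(0)
conc = rng.gamma(shape=5.0, scale=20.0, size=(48, 24, 10, 10))
print(relative_spread_map(conc).shape)  # (10, 10)
```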

2.3. Comparisons With Observations

[15] We use ozone measurements from three monitoring networks, described below. All stations in these networks have observations for at least 30% of the possible measurements during the 127 simulated days (for both hourly concentrations and peaks).

[16] 1. Network 1 is composed of 241 urban and regional stations over Europe. A large part of the stations are in France (116 stations) and in Germany (81 stations). It provides about 619,000 hourly concentrations and 27,500 peaks.

[17] 2. Network 2 includes 85 EMEP stations, that is, regional stations distributed over Europe, with about 240,000 hourly observations and 10,400 peaks.

[18] 3. Network 3 includes 356 urban and regional stations in France from BDQA (“Banque de Données sur la Qualité de l'Air”, managed by Agence de l'Environnement et de la Maîtrise de l'Énergie (ADEME) and gathering 40 approved associations for monitoring air quality). It provides 997,000 hourly measurements and 42,000 peaks. Note that it includes most French stations of network 1.

[19] Networks 1 and 2 have a large spatial extent, while network 3 provides a large number of measurements in a single country. Network 2 allows us to test the combining methods with regional stations only. Three statistical measures are introduced in order to estimate model performances: the root-mean-square error (RMSE), the correlation, and a bias factor, defined as

\[ \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - o_i)^2}, \qquad (1) \]

\[ \mathrm{correlation} = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(o_i - \bar{o})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2} \sqrt{\sum_{i=1}^{n} (o_i - \bar{o})^2}}, \qquad (2) \]

\[ \mathrm{bias} = \frac{\left| \frac{1}{n^*} \sum_{i=1}^{n^*} (y_i^* - o_i^*) \right|}{\frac{1}{n^*} \sum_{i=1}^{n^*} o_i^*}, \qquad (3) \]

where y is the vector of model outputs and o is the vector of the corresponding observations. Both vectors have n components, and their means are \(\bar{y}\) and \(\bar{o}\). The vector o* contains the observations above 40 μg m−3, and y* contains the corresponding computed concentrations; both vectors have n* components. Table 2 shows the performances of the three ensembles against the measurements from the three networks.
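For concreteness, here is a minimal sketch of the three measures, assuming paired NumPy arrays with missing observations already removed; the bias factor follows the reconstruction given in equation (3) above:

```python
import numpy as np

def rmse(y, o):
    """Root-mean-square error, equation (1)."""
    return np.sqrt(np.mean((y - o) ** 2))

def correlation(y, o):
    """Correlation between model outputs and observations, equation (2)."""
    return np.corrcoef(y, o)[0, 1]

def bias_factor(y, o, threshold=40.0):
    """Bias factor as reconstructed in equation (3): relative mean bias
    restricted to observations above `threshold` (in micrograms per m3)."""
    mask = o > threshold
    return np.abs(np.mean(y[mask] - o[mask])) / np.mean(o[mask])
```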
Table 2. Performances of Ensembles Against Field Observations From the Three Networks^a

| | Hourly RMSE | Hourly Correlation | Hourly Bias | Peak RMSE | Peak Correlation | Peak Bias |
|---|---|---|---|---|---|---|
| **Network 1** | | | | | | |
| Ensemble 1: best member | 27.0 | 66.1 | 1.8 | 22.7 | 73.8 | 0.1 |
| Ensemble 1: mean statistics | 29.0 | 63.8 | 11.3 | 24.2 | 71.1 | 2.9 |
| Ensemble 2: best member | 26.7 | 67.9 | 1.8 | 23.0 | 74.8 | 0.1 |
| Ensemble 2: mean statistics | 29.1 | 64.8 | 13.4 | 26.4 | 69.1 | 6.2 |
| Ensemble 3: mean statistics | 29.0 | 64.4 | 12.6 | 25.6 | 69.8 | 4.9 |
| Ensemble 3: worst member | 32.1 | 60.8 | 27.1 | 33.5 | 62.2 | 17.2 |
| **Network 2** | | | | | | |
| Ensemble 1: best member | 25.7 | 63.6 | 0.5 | 21.5 | 69.7 | 0.1 |
| Ensemble 1: mean statistics | 26.8 | 60.6 | 7.8 | 22.6 | 67.4 | 2.7 |
| Ensemble 2: best member | 26.3 | 63.9 | 0.2 | 21.6 | 70.2 | 0.4 |
| Ensemble 2: mean statistics | 28.9 | 59.9 | 12.9 | 25.4 | 64.4 | 6.6 |
| Ensemble 3: mean statistics | 28.1 | 60.1 | 11.0 | 24.4 | 65.5 | 5.1 |
| Ensemble 3: worst member | 35.1 | 54.4 | 28.7 | 32.1 | 56.7 | 17.3 |
| **Network 3** | | | | | | |
| Ensemble 1: best member | 29.4 | 65.5 | 3.2 | 24.9 | 72.2 | 0.2 |
| Ensemble 1: mean statistics | 32.5 | 61.6 | 15.3 | 26.5 | 67.8 | 2.9 |
| Ensemble 2: best member | 29.0 | 67.8 | 0.2 | 25.1 | 74.4 | 0.5 |
| Ensemble 2: mean statistics | 31.2 | 62.9 | 12.8 | 29.1 | 65.4 | 6.8 |
| Ensemble 3: mean statistics | 31.7 | 62.4 | 13.8 | 28.2 | 66.2 | 5.4 |
| Ensemble 3: worst member | 35.8 | 58.8 | 26.0 | 37.5 | 55.4 | 17.7 |

  • a RMSE is in μg m−3, correlation in %, and bias in %. Mean statistics are the averaged statistics of the individual models.

3. Combining Forecasts: Methods and Potentialities

3.1. Introduction

[20] For day-to-day forecasts, the modeler is usually able to choose a model whose performances are close to those of the best model. In particular, this means that the performances of the reference configuration (section 2.1) are similar to those of the best model. The objective is therefore to deliver a forecast with higher performances than the best available model; in other words, a method that merely identifies the best model would not be of great help. Hence ensemble members should be combined. Knowing that ensemble forecasting is computationally demanding, a satisfactory model combination has to bring significant improvements. We consider that a decrease by 10% of the root-mean-square error of the best model (that is, about 2–3 μg m−3) is required for an ensemble method to be interesting. This threshold is arbitrary, but some context supports it: the best model is usually a tuned model, that is, a favorable configuration found by the modeler, and improving a well-tuned model enough to decrease the root-mean-square error by 10% is not an easy task, especially for day-to-day forecasts.

3.2. Notations

[21] An ensemble is denoted ℰ or ℰi. For instance, ℰ3 = ℰ1 ∪ ℰ2. A network is denoted 𝒩 or 𝒩i. The cardinal of a network (number of stations) or of an ensemble (number of models) is denoted by ∣·∣. The output concentrations of a model are denoted M_{t,x} or M_{m,t,x} (if the model is indexed by m), where t is the time step and x denotes a station. Time and spatial averages are denoted \(\overline{M_x}^{t}\) (average over time at station x) and \(\overline{M_t}^{x}\) (average over stations at time t), respectively. The mean over all stations and over the whole simulation period is \(\overline{M}^{t,x}\). The observations are denoted O_{t,x}, and C_{t,x} are the combined concentrations.

3.3. Introduction to Combining Methods

3.3.1. Ensemble Mean and Ensemble Median

[22] The ensemble mean (EM) is defined as

\[ C_{t,x} = \frac{1}{|\mathcal{E}|} \sum_{m=1}^{|\mathcal{E}|} M_{m,t,x}. \qquad (4) \]

The ensemble median (EMD) is defined as

\[ C_{t,x} = \operatorname{median} \left( M_{1,t,x}, M_{2,t,x}, \ldots, M_{|\mathcal{E}|,t,x} \right). \qquad (5) \]

If there is an even number of models, the mean of the two middle models is used.
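As a minimal illustration (array shapes and names are our own), both combinations are one-liners with NumPy; np.median already averages the two middle members when the ensemble size is even:

```python
import numpy as np

def ensemble_mean(M):
    """EM, equation (4). M: (n_members, n_times, n_stations)."""
    return M.mean(axis=0)

def ensemble_median(M):
    """EMD, equation (5); with an even number of members, np.median
    averages the two middle values, as required."""
    return np.median(M, axis=0)
```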

3.3.2. Model Selection

[23] At each station, the best model is selected. The resulting model is denoted EBs (‘B’ stands for “best” and ‘s’ stands for “station”). In the same way, selecting the best model for each date (but for all stations) defines the metamodel EBd (‘d’ stands for “date”).
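A sketch of both selections follows, assuming "best" means lowest RMSE and assuming complete data arrays (names and layout are illustrative):

```python
import numpy as np

def select_best_per_station(M, O):
    """EBs: at each station, keep the member with the lowest RMSE.

    M: (n_members, n_times, n_stations) model outputs.
    O: (n_times, n_stations) observations.
    Returns the (n_times, n_stations) combined concentrations.
    """
    errors = np.sqrt(np.mean((M - O) ** 2, axis=1))  # (n_members, n_stations)
    best = errors.argmin(axis=0)                     # best member per station
    return M[best, :, np.arange(M.shape[2])].T       # out[t, s] = M[best[s], t, s]

def select_best_per_date(M, O):
    """EBd: at each date, keep the member with the lowest RMSE over all stations."""
    errors = np.sqrt(np.mean((M - O) ** 2, axis=2))  # (n_members, n_times)
    best = errors.argmin(axis=0)                     # best member per date
    return M[best, np.arange(M.shape[1]), :]         # out[t, s] = M[best[t], t, s]
```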

3.3.3. Least Squares Methods

[24] The best linear combination, in the least squares (LS) sense, is

\[ C_{t,x} = \sum_{m=1}^{|\mathcal{E}|} \alpha_m M_{m,t,x}, \qquad (6) \]

where this combination is denoted ELS (ensemble LS) and α is a vector of unconstrained weights that minimizes

\[ \sum_{t,x} \left( O_{t,x} - \sum_{m=1}^{|\mathcal{E}|} \alpha_m M_{m,t,x} \right)^2. \qquad (7) \]

An unbiased version is

\[ C_{t,x} = \overline{O}^{t,x} + \sum_{m=1}^{|\mathcal{E}|} \alpha_m \left( M_{m,t,x} - \overline{M_m}^{t,x} \right), \qquad (8) \]

where this combination is denoted EULS (ensemble unbiased LS) and α (still unconstrained) minimizes

\[ \sum_{t,x} \left( O_{t,x} - \overline{O}^{t,x} - \sum_{m=1}^{|\mathcal{E}|} \alpha_m \left( M_{m,t,x} - \overline{M_m}^{t,x} \right) \right)^2. \qquad (9) \]

EULS may be referred to as a superensemble [following Krishnamurti et al., 2000].

[25] The weights (α) may be computed for each station or for each time step. The corresponding combinations are denoted with the superscripts "s" (station) and "d" (date), e.g., ELSs and ELSd. The averages are adapted to the new target; for instance,

\[ C_{t,x} = \sum_{m=1}^{|\mathcal{E}|} \alpha_{m,x}^{s} M_{m,t,x}, \qquad (10) \]

where the vector \(\alpha_x^s = (\alpha_{1,x}^s, \alpha_{2,x}^s, \alpha_{3,x}^s, \ldots)\) minimizes

\[ \sum_{t} \left( O_{t,x} - \sum_{m=1}^{|\mathcal{E}|} \alpha_{m,x}^{s} M_{m,t,x} \right)^2. \qquad (11) \]
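A minimal sketch of both fits, assuming the model outputs at the observation points are gathered in a NumPy array with missing data already removed; the per-station and per-date variants simply restrict the rows used in the fit:

```python
import numpy as np

def ls_weights(M, O):
    """ELS: unconstrained weights minimizing the squared error (equation (7)).

    M: (n_members, n_obs) model outputs at the observation points.
    O: (n_obs,) corresponding observations.
    The combined concentrations are then `ls_weights(M, O) @ M`.
    """
    alpha, *_ = np.linalg.lstsq(M.T, O, rcond=None)
    return alpha

def uls_combination(M, O):
    """EULS: debias each member before the fit (equations (8)-(9))."""
    M_centered = M - M.mean(axis=1, keepdims=True)
    alpha, *_ = np.linalg.lstsq(M_centered.T, O - O.mean(), rcond=None)
    return O.mean() + alpha @ M_centered
```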

3.4. Potentialities

[26] In the previous formulae, the weights are computed on the basis of all observations. In operational forecasts, these weights should be forecasted, that is, estimated on the basis of past observations only. In this section, however, the methods are assessed through their a posteriori (i.e., with all observations known) performances. This gives the potential of each method. All statistical measures are provided in Table 3.

Table 3. Potential Performances of Model Combinations Against Field Observations From the Three Networks^a

| | Hourly RMSE | Hourly Correlation | Hourly Bias | Peak RMSE | Peak Correlation | Peak Bias |
|---|---|---|---|---|---|---|
| **Network 1, ensemble 1** | | | | | | |
| EULSd | 16.7 | 87.3 | 2.6 | 13.5 | 91.6 | 1.4 |
| **Network 1, ensemble 2** | | | | | | |
| EULSd | 16.3 | 87.9 | 2.5 | 13.3 | 91.9 | 1.4 |
| **Network 1, ensemble 3** | | | | | | |
| EULSs | 16.5 | 87.7 | 2.0 | 10.9 | 94.5 | 1.0 |
| EULSd | 14.5 | 90.6 | 2.0 | 11.6 | 93.9 | 1.1 |
| **Network 2, ensemble 1** | | | | | | |
| EM | 25.9 | 61.9 | 6.3 | 22.0 | 68.7 | 0.7 |
| EMD | 26.4 | 60.9 | 7.7 | 22.1 | 68.0 | 1.0 |
| EBs | 23.1 | 70.6 | 2.4 | 19.7 | 75.3 | 2.4 |
| EBd | 24.2 | 67.0 | 2.6 | 19.9 | 74.8 | 2.4 |
| ELS | 23.7 | 68.0 | 0.8 | 18.7 | 78.2 | 2.5 |
| EULS | 23.4 | 68.8 | 0.0 | 18.5 | 78.7 | 3.2 |
| ELSs | 16.4 | 86.3 | 0.7 | 12.9 | 90.3 | 1.2 |
| EULSs | 16.0 | 86.8 | 0.2 | 12.5 | 90.9 | 1.4 |
| ELSd | 17.1 | 84.8 | 0.5 | 12.5 | 90.9 | 1.3 |
| EULSd | 16.7 | 85.5 | 0.2 | 12.1 | 91.4 | 1.4 |
| **Network 2, ensemble 2** | | | | | | |
| EM | 25.2 | 64.4 | 5.5 | 23.1 | 70.5 | 4.6 |
| EMD | 25.3 | 64.0 | 4.9 | 23.3 | 69.6 | 4.6 |
| EBs | 22.4 | 72.5 | 0.9 | 19.1 | 77.1 | 2.0 |
| EBd | 24.0 | 67.3 | 1.3 | 19.6 | 75.6 | 2.1 |
| ELS | 24.3 | 66.2 | 0.7 | 19.6 | 75.8 | 2.7 |
| EULS | 24.0 | 66.9 | 0.4 | 19.4 | 76.2 | 3.4 |
| ELSs | 17.3 | 84.6 | 0.9 | 12.8 | 90.4 | 1.0 |
| EULSs | 16.9 | 85.3 | 0.1 | 12.3 | 91.2 | 1.4 |
| ELSd | 15.9 | 87.1 | 0.3 | 11.4 | 92.4 | 1.1 |
| EULSd | 15.4 | 87.9 | 0.1 | 11.0 | 93.1 | 1.2 |
| **Network 2, ensemble 3** | | | | | | |
| EM | 24.9 | 64.2 | 1.3 | 22.3 | 70.7 | 2.7 |
| EMD | 25.7 | 61.5 | 4.5 | 22.2 | 69.2 | 0.7 |
| EBs | 22.1 | 73.2 | 0.8 | 18.9 | 77.6 | 1.7 |
| EBd | 23.8 | 68.1 | 1.6 | 19.4 | 76.2 | 2.0 |
| ELS | 23.5 | 68.8 | 0.9 | 18.3 | 79.3 | 2.4 |
| EULS | 23.2 | 69.6 | 0.0 | 18.1 | 79.7 | 3.0 |
| ELSs | 15.5 | 87.8 | 0.6 | 10.5 | 93.7 | 0.8 |
| EULSs | 15.2 | 88.3 | 0.2 | 10.1 | 94.1 | 1.0 |
| ELSd | 11.9 | 93.0 | 0.1 | 8.3 | 96.1 | 0.6 |
| EULSd | 11.6 | 93.3 | 0.0 | 8.0 | 96.3 | 0.6 |
| **Network 3, ensemble 1** | | | | | | |
| EULSd | 16.9 | 88.3 | 3.1 | 13.9 | 91.9 | 1.4 |
| **Network 3, ensemble 2** | | | | | | |
| EULSd | 16.4 | 89.0 | 3.0 | 13.3 | 92.6 | 1.3 |
| **Network 3, ensemble 3** | | | | | | |
| EULSd | 15.0 | 90.8 | 2.6 | 11.9 | 94.1 | 1.1 |

  • a RMSE is in μg m−3, correlation in %, and bias in %. For networks 1 and 3, only the best combinations with respect to RMSE are shown. Conclusions drawn from the results over network 2 also hold for networks 1 and 3.

3.4.1. Ensemble Mean and Ensemble Median

[27] For every ensemble, the results of EM and EMD are better than the averaged statistics of the ensemble. However, they often perform worse than the best member: no ensemble mean or ensemble median has a RMSE below 90% of the best RMSE of the same ensemble. The ensemble mean and ensemble median therefore show poor performances. This contradicts the results of Delle Monache and Stull [2003] and McKeen et al. [2005]. Nonetheless, the former study involved only four models, 6 days, and five stations, which limits the reliability of its conclusions, as also pointed out by the authors. The latter study is also limited, with seven models.

3.4.2. Models Selection

[28] The performances of EBs and EBd are satisfactory, especially on the peaks; their RMSEs are then below 90% of the best model's RMSE.

3.4.3. Least Squares Methods

[29] All comments apply to both the regular least squares version and the corresponding unbiased version, since their performances are very similar. The least squares method applied with a single combination, over the whole network and at all dates, brings significant improvements: the RMSEs are usually well below 90% of the best model's RMSE.

[30] The best performances are by far those of the least squares methods per station and per date. Over network 2, EULSd based on ensemble 3 even reaches a RMSE of 8 μg m−3 and a correlation of 96.3% for daily peaks.

[31] Combinations based on ensemble 3 logically show the best results since ensemble 3 includes all simulations. Least squares combinations based on ensemble 2 are slightly better than those based on ensemble 1, which may be due to the number of members, the wider spread, or a favorable configuration. Least squares combinations per date usually perform better than combinations per station. The ratio between the number of available stations per date and the number of measurements per station might be an explanation for hourly concentrations. However, it is likely that the spatiotemporal structure of the computed fields plays an important role. Ozone daily peaks at a representative station (with respect to the statistics) illustrate the improvements in Figure 3.

Figure 3. Ozone daily peaks at Harwell (station in network 2) over the 127 days (120 available measurements). The best model is extracted from ensemble 1. The combination EULSd is based on ensemble 1. The best model is associated with a RMSE of 22.0 μg m−3 and a correlation of 63.4%; EULSd is associated with a RMSE of 12.1 μg m−3 and a correlation of 90.6%.

4. Forecasting Ensemble Combinations and Selecting Ensemble Members

[32] The previous results show a strong potential for the least squares methods. The objective is to use them for forecasts, that is, to forecast the weights associated with every model on the basis of the weights computed in the past days. This may be viewed as a data assimilation procedure constrained by the ensemble structure. Unless specified otherwise, the following tests are performed with ensemble 1, over network 2, and with ozone daily peaks.

4.1. Weights Stability

[33] Since ELSs and ELSd both show promising performances, the combinations may be forecasted at each station (and over a given period) or for each time step (and for all stations). In order to ease the forecasting of the weights, combinations with a low time dependency are of high interest. It is also useful to have spatially robust weights, that is, weights that may be applied to another network or to other grid cells. With such weights, the whole ground field may be forecasted, which is a key feature of three-dimensional (3-D) chemistry transport models.

[34] It is noteworthy that (1) there exist constant weights over the whole period (127 days) for an efficient combination (ELSs) and (2) there also exist uniform weights (over a network) associated with high performances (ELSd). The question is primarily to know whether these coefficients can be forecasted.

4.1.1. Least Squares Method per Date

[35] The time evolution of the weights of ELSd is shown for three models in Figure 4. These weights are highly variable. Even the highest weights (in absolute value), which constitute the main part of the combination, are highly unstable. This makes the combination very hard to forecast.

Figure 4. Time evolution of the three least stable weights in ELSd (ensemble 1, network 2), that is, the weights associated with the highest standard deviations. The weights of the other models are also highly variable.

[36] Another issue is that these weights cannot easily be used over another network or in other cells. Applying the weights computed for network 3 (with ensemble 1, ozone peaks) to network 2 leads to a RMSE of 55.8 μg m−3. This is the least favorable extension since the two networks contain stations of different nature and their spatial extents strongly differ. A more favorable experiment is to compute the weights over network 2 (Europe) and to apply them to network 3 (France). The resulting RMSE is 24.6 μg m−3 (correlation of 74.7%), which is reasonable but similar to the RMSE of the best model (24.9 μg m−3, correlation of 72.2%). There is an even more favorable experiment: like network 2, network 1 has stations over the whole of Europe, but it has regional and urban stations, including stations from network 3. Network 3 is therefore closer to network 1 than to network 2. Applying the weights computed over network 1 to network 3 gives better performances, with a RMSE of 17.4 μg m−3 and a correlation of 87.1%. This promising result tends to show that it is possible to apply suitable weights to cells without observations.

4.1.2. Least Squares Method per Station

[37] The weights of ELSs associated with each model are highly variable over the network, as shown in Figure 5. In addition, there is no subset of stations over which the weights are similar. This is not surprising because setting a single weight per model (for all stations and all dates) does not provide very strong improvements (Table 3, ELS).

Figure 5. Distribution over the 85 stations of the three least stable weights in ELSs (ensemble 1, network 2), that is, the weights associated with the highest standard deviations. The weights of the other models are also highly variable.

4.2. Using Weights From the Previous Days

[38] An obvious method is to use the weights computed over the previous days. In this section, the statistics are computed over the last 96 days so that up to 30 days may be used as a learning period. A learning period of n days includes the n days preceding the forecasted day: it is a "moving learning period."

4.2.1. Least Squares Method per Station

[39] Computing weights per station over a learning period of 22 to 30 days (22 is a minimum because there are 22 weights) fails to improve the forecasts. The best results are obtained with a 30-day learning period, with a RMSE of 40.7 μg m−3. Extending the learning period should help (ELSs performs well), but the test period would then be too short to draw reliable conclusions. We nonetheless report that a 60-day learning period allows us to reach, during the last 36 simulated days, a RMSE of 22.3 μg m−3 (best model: 21.6 μg m−3) and a correlation of 76.5% (best model: 74.6%). In conclusion, this strategy is not satisfactory for this simulation. Further investigations, with a simulation over a longer period, are nevertheless needed.

4.2.2. Least Squares Method per Date

[40] At each date, the weights are the same for all stations. They are computed on the basis of a learning period ranging from 1 to 30 days. Figure 6 shows that this method performs well with a short learning period of about 5–7 days. Longer learning periods do not improve the results. The performances are close to those of ELS.

Figure 6. (top) RMSE and (bottom) correlation of a combination whose weights (the same at all stations) are computed at each step with a least squares optimization over a learning period of x days (abscissa). The dashed lines are the performances of the best model in the ensemble and of ELSd (ensemble 1). The dotted line is the performance of ELS.
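A sketch of this moving-window forecast (per date, as in this section) follows, assuming complete daily peak arrays; names and shapes are illustrative:

```python
import numpy as np

def forecast_with_learned_weights(M, O, n_learn=30):
    """ELSd-style forecast: for each day t, fit one set of weights
    (shared by all stations) on the previous `n_learn` days, then
    apply it to day t.

    M: (n_members, n_days, n_stations) daily peaks.
    O: (n_days, n_stations) observed peaks.
    Returns (n_days, n_stations) forecasts for days t >= n_learn.
    """
    n_members, n_days, n_stations = M.shape
    C = np.full((n_days, n_stations), np.nan)
    for t in range(n_learn, n_days):
        past = slice(t - n_learn, t)
        A = M[:, past, :].reshape(n_members, -1).T  # one row per (day, station)
        b = O[past, :].ravel()
        alpha, *_ = np.linalg.lstsq(A, b, rcond=None)
        C[t] = alpha @ M[:, t, :]                   # apply weights to day t
    return C
```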

[41] With a 30-day learning period, the RMSE of the forecasted combination is 19.2 μg m−3 (best model: 21.9 μg m−3) and the correlation is 80.0% (best model: 73.3%). The criterion on RMSE (below 90% of the RMSE of the best model) is therefore fulfilled (for ensemble 1 and network 2). This is not the case for all ensembles and networks, as shown in Table 4, but there are always significant improvements.

Table 4. Performances on Ozone Daily Peaks Over the Last 96 Simulated Days for ELSd, ELS, the Best Model (in the Ensemble), and the Combination With "Least Squares Weights" Computed With 30-Day Learning Periods Preceding Each Forecasted Day^a

| | ELSd | ELS | Best model | Forecast |
|---|---|---|---|---|
| **Network 1** | | | | |
| Ensemble 1 | 14.1/91.7 | 19.6/83.3 | 22.4/78.0 | 20.5/81.7 |
| Ensemble 2 | 13.9/92.0 | 20.5/81.5 | 22.4/78.1 | 21.3/80.0 |
| Ensemble 3 | 12.0/94.1 | 19.2/84.0 | 22.4/78.1 | 20.2/82.2 |
| **Network 2** | | | | |
| Ensemble 1 | 12.8/91.6 | 18.7/81.1 | 21.9/73.1 | 19.2/80.0 |
| Ensemble 2 | 11.6/93.1 | 19.6/79.0 | 21.9/73.8 | 20.4/77.2 |
| Ensemble 3 | 8.4/96.4 | 18.2/82.3 | 21.9/73.8 | 19.0/80.4 |
| **Network 3** | | | | |
| Ensemble 1 | 14.6/91.8 | 21.1/81.8 | 24.0/76.4 | 21.8/80.6 |
| Ensemble 2 | 13.9/92.5 | 21.1/81.9 | 23.9/76.6 | 22.1/80.0 |
| Ensemble 3 | 12.4/94.1 | 20.2/83.5 | 23.9/76.6 | 21.2/81.6 |

  • a In each column, RMSE (in μg m−3) is followed by correlation (in %).

[42] A key point to explain these improvements is the time evolution of the weights. Figure 4 shows strong variations, which explains why a 1-day learning period is unlikely to be adequate. Indeed, as shown in Figure 6, a 1-day learning period gives poor performances (RMSE of 23.7 μg m−3 and correlation of 73%). Coefficients computed over a 30-day learning period are more stable (see Figure 7).

Figure 7. Time evolution of three weights computed over a 30-day learning period (preceding each date), for ensemble 1 and network 2. This figure should be compared with Figure 4, which shows more variable weights. Both figures have the same range of values along the y axis.

4.2.3. Hourly Concentrations

[43] Hourly forecasts may also be improved using weights learned over the previous 30 days and estimated per date as in section 4.2.2. In order to forecast the weights at a given hour h, only the concentrations computed and observed at hour h during the learning period are included; including all hourly concentrations lowers the performances.
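A minimal sketch of this hour-by-hour fitting, under the same illustrative conventions as above:

```python
import numpy as np

def hourly_weights(M, O, hour, n_learn=30):
    """Fit weights for hour-of-day `hour`, using only concentrations
    computed and observed at that hour during the learning period.

    M: (n_members, n_days, 24, n_stations) hourly concentrations.
    O: (n_days, 24, n_stations) hourly observations.
    The weights are fitted on the last `n_learn` days.
    """
    A = M[:, -n_learn:, hour, :].reshape(M.shape[0], -1).T
    b = O[-n_learn:, hour, :].ravel()
    alpha, *_ = np.linalg.lstsq(A, b, rcond=None)
    return alpha
```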

[44] All results are collected in Table 5. The performances are significantly improved, especially over networks 1 and 3. Note that these performances are similar to those of ELS (for which, one can say, the learning period is the whole simulation).

Table 5. Performances on Ozone Hourly Concentrations Over the Last 96 Simulated Days for ELSd, ELS, the Best Model (in the Ensemble), and the Combination With "Least Squares Weights" Computed With 30-Day Learning Periods Preceding Each Forecasted Day^a

| | ELSd | ELS | Best model | Forecast ELSd |
|---|---|---|---|---|
| **Network 1** | | | | |
| Ensemble 1 | 17.2/87.3 | 22.9/75.9 | 26.8/68.4 | 22.7/76.6 |
| Ensemble 2 | 16.8/87.9 | 24.0/73.2 | 26.7/69.9 | 23.3/75.2 |
| Ensemble 3 | 14.9/90.6 | 22.7/76.5 | 26.7/69.9 | 22.5/77.1 |
| **Network 2** | | | | |
| Ensemble 1 | 17.3/85.5 | 23.9/70.1 | 25.9/65.6 | 23.6/71.0 |
| Ensemble 2 | 16.1/87.7 | 24.6/67.9 | 26.7/65.7 | 24.6/68.0 |
| Ensemble 3 | 11.9/93.4 | 23.6/71.0 | 25.9/65.7 | 23.4/71.5 |
| **Network 3** | | | | |
| Ensemble 1 | 17.2/88.4 | 23.3/77.5 | 28.7/68.0 | 22.9/78.4 |
| Ensemble 2 | 16.7/89.2 | 24.9/73.7 | 28.5/69.9 | 23.7/76.8 |
| Ensemble 3 | 15.3/91.0 | 22.9/78.4 | 28.5/69.9 | 22.8/78.7 |

  • a The weights associated with a given hour h are estimated with the computed and observed concentrations at hour h during the learning period. RMSE (in μg m−3)/correlation (in %) are given for each entry.

4.3. Learning Algorithms

[45] Applying an optimal combination computed over a learning period may be efficient (sections 4.2.2 and 4.2.3), but more sophisticated algorithms have been designed in machine learning. A classical algorithm is, for instance, the gradient descent algorithm for online regression [Cesa-Bianchi et al., 1996]. In our case, this method is applied independently at each station, and the objective is to minimize a loss function defined as
\[ L_t(\alpha) = \left( O_{t,x} - \sum_{m=1}^{|\mathcal{E}|} \alpha_m M_{m,t,x} \right)^2. \qquad (12) \]

[46] The weights \(\alpha_{t-1} = (\alpha_{1,t-1}, \alpha_{2,t-1}, \alpha_{3,t-1}, \ldots)\) are updated according to

\[ \alpha_t = \alpha_{t-1} - \eta \nabla L_t(\alpha_{t-1}) = \alpha_{t-1} + 2 \eta \left( O_{t,x} - \sum_{m=1}^{|\mathcal{E}|} \alpha_{m,t-1} M_{m,t,x} \right) M_{\cdot,t,x}, \qquad (13) \]
where η is the learning rate and \(M_{\cdot,t,x} = (M_{1,t,x}, M_{2,t,x}, \ldots)\) is the vector of the model outputs. Results are sensitive to the learning rate (tests not reported here). We chose η = 5 × 10−7 for ensembles 1 and 2, and η = 2.5 × 10−7 for ensemble 3. Results are stable in the vicinity of these values (e.g., ±50%). The initial weights are set to 1/N, where N is the number of models (this corresponds to the ensemble mean).
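A minimal sketch of this online update at a single station, following equations (12) and (13) (array names are illustrative):

```python
import numpy as np

def gradient_descent_weights(M, O, eta=5e-7):
    """Online gradient descent at one station [Cesa-Bianchi et al., 1996].

    M: (n_times, n_members) model outputs at the station.
    O: (n_times,) observations.
    At each step the weights are updated with the gradient of the
    squared loss, equation (13); `eta` is the learning rate.
    """
    n_times, n_members = M.shape
    alpha = np.full(n_members, 1.0 / n_members)  # start from the ensemble mean
    forecasts = np.empty(n_times)
    for t in range(n_times):
        forecasts[t] = alpha @ M[t]              # forecast before seeing O[t]
        alpha = alpha + 2.0 * eta * (O[t] - alpha @ M[t]) * M[t]
    return alpha, forecasts
```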

[47] Table 6 shows the results of the gradient descent algorithm. The performances (gradient descent column) are slightly better than those of the least squares method with weights computed at each date (introduced in section 4.2.2; forecast ELSd column). The learning algorithm succeeds at each station, while applying per-station weights computed during the previous days fails (section 4.2.1). Knowing that there are many variants of learning algorithms (with updates that differ from equation (13)), this is certainly a promising direction for further improvements.

Table 6. Performances on Ozone Daily Peaks Over the Last 96 Simulated Days for ELSd, ELSd With Forecasted Weights, the Best Model (in the Ensemble), and the Combination Computed With the Gradient Descent Algorithm^a

| | ELSd | Forecast ELSd | Best model | Gradient descent |
|---|---|---|---|---|
| **Network 1** | | | | |
| Ensemble 1 | 13.8/91.9 | 20.3/81.5 | 22.4/77.5 | 20.1/82.1 |
| Ensemble 2 | 14.2/91.4 | 21.0/80.0 | 22.4/77.7 | 19.5/83.0 |
| Ensemble 3 | 11.2/94.7 | 20.0/82.1 | 22.4/77.7 | 19.6/83.0 |
| **Network 2** | | | | |
| Ensemble 1 | 13.0/91.1 | 18.8/80.0 | 21.8/72.5 | 18.8/80.6 |
| Ensemble 2 | 13.1/90.9 | 20.2/76.8 | 21.8/73.5 | 18.2/81.7 |
| Ensemble 3 | 10.6/94.2 | 18.8/80.2 | 21.8/73.5 | 18.2/81.6 |
| **Network 3** | | | | |
| Ensemble 1 | 14.7/91.7 | 21.8/80.6 | 24.2/76.2 | 21.7/81.0 |
| Ensemble 2 | 15.0/91.4 | 22.1/80.1 | 24.1/76.4 | 22.7/82.9 |
| Ensemble 3 | 12.6/94.0 | 21.3/81.6 | 24.1/76.4 | 21.0/82.3 |

  • a Forecasted weights for ELSd are computed as in section 4.2.2 (same as in Table 4). The first 30 days constitute a minimum learning period, and all forecasted concentrations are preceded by at least 30 contiguous peak observations; this is why the comparisons with observations slightly differ from those of Table 4 (whose "Forecast" column corresponds to the "Forecast ELSd" column of this table). RMSE (in μg m−3)/correlation (in %) are given for each entry.

4.4. Member Selection

[48] In Table 4, ensemble 1 shows better performances than ensemble 2 even though ensemble 1 has fewer members (22 against 32) and less spread. Because of computational costs, it is useful to reduce the number of models to be included. Figure 8 shows the performances of ELSd against the number of models, where the models from ensemble 3 are included one by one in the optimization. Even if the impact of additional models decreases as the number of models grows, the performances are still significantly improved.

Figure 8. Performances ((top) RMSE and (bottom) correlation) of ELSd against the number of models in the ensemble. Models are taken from ensemble 3.

[49] Another question is whether some models contribute more than others to the performance improvement. In Figure 9, the contributions of several models from ensemble 2 (to four subensembles based on ensemble 1, of sizes 5, 10, 15, and 20) are shown. Contributions are primarily distinguishable in small ensembles. We also report that the contributions of the models from ensemble 1 (to four subensembles based on ensemble 2, of sizes 5, 10, 15, and 20) are less distinguishable. In addition, the correlation between the models' RMSEs and their contributions to the combined model's RMSE is below 30%. It seems that the best models do not necessarily bring the best contributions.

Figure 9. Four ensembles are built with the first 5, 10, 15, and 20 members of ensemble 1. A single model from ensemble 2 (abscissa) is added to these ensembles, and the RMSE of ELSd (ordinate) is computed. Twenty-six models from ensemble 2 (i.e., all models that are in ensemble 2 but not in ensemble 1) are included this way. The figure shows the contribution that each model can make to the overall performances.

[50] There is no clear reason why forecasted combinations based on ensemble 2 show lower performances than those based on ensemble 1. Ensemble 2 includes simulations with multiple simultaneous changes (see section 2.2) but involves only five distinct choices, which might be poorer than the 21 different single changes of ensemble 1.

5. Conclusion

[51] The forecasting system Polyphemus is able to generate ensemble forecasts with a wide spread in output concentrations and with a high number of different members. Combining the models in an optimal way has a strong potential: while the ensemble mean and the ensemble median barely improve the performances, the results may be dramatically enhanced by linear combinations with weights that are optimal in some sense.

[52] It was shown that weights computed over a given network do not necessarily apply to another network, and consequently to other grid cells. This low spatial robustness of the weights should be studied further since gridded forecasts are an important feature of 3-D chemistry transport models.

[53] Daily forecasts also require forecasting the weights of an optimal combination. The weights appear to be highly unstable from one day to the next and from one station to another. More stable weights are found in combinations kept constant over a 30-day period and over a whole network. These weights can reasonably be forecasted, and the associated combinations provide significant improvements for hourly concentrations and daily peaks. A decrease of about 10% of the RMSE is achieved on daily peaks. Hourly concentrations show even better improvements.

[54] In addition, learning algorithms (from machine learning) are a promising application since they do not need weights computed over the numerous stations of a monitoring network. The gradient descent algorithm shows good performances when applied at each station, whereas applying per-station weights computed over a 30-day learning period fails (section 4.2.1).

[55] An ensemble with fewer members and less spread than another can lead to better combinations. Member selection was therefore discussed. Additional models always bring improvements, but these improvements are only weakly related to the models' individual performances.

[56] Future work should address this issue. Additional sources of uncertainty could be introduced: meteorological ensemble forecasts and Monte Carlo simulations on other input data are necessary steps to account for all uncertainties. The computational cost will be a crucial point. Relevant strategies are needed to combine Monte Carlo methods with discrete changes in the model formulation (as performed in this paper through changes in physical parameterizations and numerical approximations).

[57] Another obvious line of future work lies in forecasting the weights. As shown in this paper, the potential of model combination is very high, much higher than what is achieved with the forecasted combinations tested so far. Specific learning algorithms should be involved.

[58] Ensemble forecasting may also deliver probabilistic forecasts, which would improve the forecasts through additional information. It would help in assessing uncertainties, and it would allow reasonable integrated uses of air quality models, e.g., for risk assessment.

[59] Finally, an open question is the relation between ensemble forecasting and classical data assimilation. Would sequential or variational data assimilation perform better than ensemble-based forecasting? How could both strategies be combined? Data assimilation may be performed on each member of an ensemble, or only on a reference member (to be determined) whose updates (from the data assimilation procedure) would be applied to the other members (e.g., optimized emissions or corrected initial conditions). In addition, the ensemble spread may be valuable information for the data assimilation procedure.

Acknowledgments

[60] The first author is partially supported by the Île-de-France region. We thank the monitoring networks that provided the numerous observations. We also thank Gilles Stoltz (CNRS) for his introduction to machine learning algorithms.