Forecasting ultrafine particle concentrations from satellite and in situ observations

Recent innovations in remote sensing technologies and retrievals offer the potential for predicting ultrafine particle (UFP) concentrations from space. However, the use of satellite observations to provide predictions of near‐surface UFP concentrations is limited by the high frequency of incomplete predictor values (due to missing observations), the lack of models that account for the temporal dependence of UFP concentrations, and the large uncertainty in satellite retrievals. Herein we present a novel statistical approach designed to address the first two limitations. We estimate UFP concentrations by using lagged estimates of UFP and concurrent satellite‐based observations of aerosol optical properties, ultraviolet solar radiation flux, and trace gas concentrations, wherein an expectation maximization algorithm is used to impute missing values in the satellite observations. The resulting model of UFP (derived by using an autoregressive moving average model with exogenous inputs) explains 51 and 28% of the day‐to‐day variability in concentrations at two sites in eastern North America.


Introduction
Atmospheric ultrafine particles (UFPs, i.e., particles with diameters < 100 nm) significantly impact human health [Penttinen et al., 2001;Utell and Frampton, 2000] and the Earth's climate by providing a potential source of cloud condensation nuclei [Pierce et al., 2014]. New particle formation (NPF) is a major source of UFP [Kirkby et al., 2011;Kulmala et al., 2013], often occurs on regional scales, and exhibits high temporal autocorrelation Hussein et al., 2009;Jeong et al., 2010]. However, the subsequent growth of those particles to climate-relevant sizes, their removal, the importance relative to direct emissions of UFP number concentrations, and their role in the current and future radiation budget are incompletely understood [Boucher et al., 2013;Merikanto et al., 2009;Myhre et al., 2013], partly because of the limited availability, in both space and time, of measurements of the particle size distribution (PSD) and composition [Kulmala et al., 2011b;Kulmala et al., 2012]. Thus, there is interest in finding alternative approaches to estimate UFP concentrations in order to better understand their current spatiotemporal scales and infer their role in a changing climate.
1. Kulmala et al. [2011a] were the first to propose that boundary layer nucleation mode number concentrations could be predicted based on transfer functions, wherein the predictors were drawn from remotely sensed properties. In their analysis the predictors were once-daily satellite-based observations of ultraviolet radiation (UV), sulfur dioxide (SO 2 ), and aerosol optical depth (AOD) and their predictand was 30 min mean particle number concentrations in the diameter range (Dp) of 3-25 nm. They obtained correlations (R) of 0.25 (R 2 = 0.06) between nucleation mode number concentrations observed at the Hyytiälä site in Finland and the satellite proxy. 2.  extended that initial work by developing a proxy for daily mean total UFP number concentrations for Dp 6-100 nm based on satellite observations of SO 2 , UV, and ammonia (NH 3 ) (as proxies of ternary nucleation); AOD; and the Ångström exponent (AE) (as proxies for the condensational sink). They developed and applied the proxy only to days where all remote sensing observations were available and also considered only days when NPF was observed. The transfer function was developed by using multiple linear regression (MLR), and all predictors were found to have coefficients significantly different to 0 (i.e., associated P < 0.05) except for SO 2 . Given the importance of sulfuric acid to nucleation, the latter finding was ascribed to the high retrieval uncertainty for SO 2 . For the two measurement sites included (Morgan Monroe State Forest (MMSF) and Egbert) the R 2 between log 10 number concentrations of UFP from the measurements and proxy were 0.359 (sample size n = 63) and 0.172 (n = 53), respectively. 3. Sundstrom et al. [2015] sought to overcome the issues with highly uncertain satellite observations of SO 2 by instead using remote sensing estimates of columnar nitrogen dioxide (NO 2 ) as a proxy for sulfuric acid. Their proxy was developed and applied over Southern Africa and for nucleation mode concentrations (Dp < 30 nm). A proxy based on NO 2 /AOD exhibited an R 2 with observations at two sites of 0.24 and 0.09 (n = 49 and 124), respectively. A key finding in this research was the substantial degradation in proxy skill when a direct measurement of the condensational sink at the site is replaced by the satellite-derived AOD in the denominator. 4.  sought to expand the suite of predictors drawn from remote sensing observations to include a measure of biological activity and thus potential for release of biogenic volatile organic compounds (the cross-product of leaf area index (LAI) and surface temperature), SO 2 , NO 2 , formaldehyde (HCHO) (as a proxy for production of semivolatile organic oxidation products), UV, the cross product of AE and AOD (as a proxy for the condensational sink), and the cross product of UV and SO 2 (as a proxy for sulfuric acid production). They also sought to quantify other descriptors of nucleation events: the occurrence (modeled by using logistic regression), formation rates, growth rates, and survival probabilities (modeled by using MLR) using data from the MMSF site. They found that the proxy performance was poorer during leaf-on than leaf-off conditions and that many predictors exhibited regression coefficients that are not significantly different from zero. Nevertheless, when the data are conditionally sampled to extract only NPF event days, the R 2 for proxy algorithms of new particle formation rates ranged from 0.13 to 0.67 (for sample sizes of 18 to 55). 5.  extended this earlier work by including in situ data from five sites distributed across North America. They found that, relative to more simple proxies developed in the prior studies (e.g., based on AOD, AE, SO 2 , and UV), use of additional predictors (i.e., NO 2 , as a proxy for primary emitted particles, and NH 3 , HCHO, LAI, T, and O 3 ) increases the explained temporal variance of UFP concentrations at all sites. The R 2 for observed and predicted UFP number concentrations ranged from 0.05 to 0.38 during the leaf-on period and from 0.01 to 0.32 in the leaf-off season. However, the regression coefficients and hence the magnitude and even sign of the dependencies of N n À 100 nm (where n is the lower limit of measured particle diameter) on the suite of predictors considered (AOD × AE, SO 2 , UV, NO 2 , NH 3 , HCHO, LAI × T, and O 3 ) imply great challenges to generating a single generalizable proxy.
This prior work indicates the following major challenges to development of robust proxies of nucleation metrics and total UFP number concentrations: 1. The large number of missing data in the remote sensing observations (e.g., due to cloud screening). As discussed in , typically 50% of each variable obtained from satellite-borne observations are missing, even when spatially averaged to a 0.5°× 0.5°grid. This reduces the statistical power (leads to biased estimates of variance explanation, R 2 ) and restricts the analysis to a biased sample (representing only cloud-free conditions). 2. Large uncertainties in the retrievals of some atmospheric properties from the radiometer measurements degrade the explanatory power of the proxies (this is particularly true for SO 2 ). 3. The satellite measurements are columnar estimates, and thus, there is an implicit assumption of strong coupling between concentrations in the boundary layer where the particle size distribution measurements are made and total column (as represented by the satellite observations). Typically, the proxies are considerably less skillful when satellite-borne data instead of in situ measurements are used as predictors.
In this work, we develop a novel statistical model that is able both to account for the temporal variability of UFP and to predict UFP concentrations near the surface based on prior (temporally lagged) UFP concentrations and concurrent satellite retrievals of aerosol optical properties, nucleation precursors, and solar radiation. While a proxy based solely on satellite-based predictors could potentially be used to generate temporal and spatial fields of UFP concentrations (as in  and Kulmala et al. [2011a]), an algorithm that also employs UFP concentration measurements from the previous day might have applications to go-no-go flight decisions conducted as part of field campaigns to examine the spatiotemporal variability in nucleation events [Pryor et al., 2011] and/or their impacts on cloud conditions [Clarke and Kapustin, 2010].

10.1002/2016JD026021
The proposed UFP proxy algorithm is innovative because it 1. exploits all the information from satellite retrievals by including days when only part of the predictor suite is available and is more sophisticated than past work that has "infilled" missing observations by using a simple running mean; 2. provides a unifying framework to describe the temporal variability of UFP by accounting for past UFP observations and concurrent satellite retrievals; 3. exhibits high skill in capturing the temporal variability in daily mean UFP concentrations on all days (either with or without NPF) and provides more robust estimates of UFP than previous studies that had smaller (and biased) data samples and focused solely on conditions during NPF events; and 4. can be used to forecast UFP from satellite data either with or without past UFP observations and provides a measure of their relative contribution to predictive skill.
The paper is structured as follows. In section 2, we introduce the ground-and satellite-based observations used in this study. In section 3, we describe the expectation maximization (EM) algorithm used to impute missing values and the statistical model developed to capture the temporal structure of data. In section 4, we quantify the model performance and examine its generalizability by using data from two sites. Section 5 concludes with remarks on the applicability of this methodology to real-time monitoring of UFP from space.

Ground-Based Observations of Particle Size Distribution
In this work we focus on PSD measurements from two sites in eastern MMSF is a deciduous forest located in a regionally polluted environment, but that is not directly exposed to local primary anthropogenic sources, whereas Egbert is classified as semirural and is located around 70 km north of Toronto. These sites were selected to examine the flexibility of the proposed method to different emission contexts, to provide continuity with previous work, and to select sites with a substantial overlap in the measurement period and because there are relative few long-term measurements of UFP PSD measurements available in eastern North America . PSD measurements were undertaken by using Scanning Mobility Particle Spectrometers at MMSF during January 2007 to March 2009 [Pryor et al., 2010] and at Egbert during May 2007 to May 2008 [Riipinen et al., 2011]. In this work we focus on the daily mean number concentration of UFP, where UFPs are defined as particles with diameters in the range of 6-100 nm at MMSF and 10-100 nm for Egbert.

Satellite Observations of Aerosol Optical Properties and NPF Drivers
Nucleation frequency, intensity, and likelihood of occurrence are dictated by emissions of gaseous precursors (e.g., SO 2 and NH 3 ), drivers of photochemical reactions (e.g., ultraviolet solar irradiance), and of the amount of preexisting particles [Kirkby et al., 2011]. Based on the prior research described in section 1, the statistical proxy is designed to include the main variables indicative of secondary formation, which are crucial in the presence of regional NPF phenomena : AOD and AE as indicators of preexisting particles and SO 2 , NH 3 , and solar radiation as key players in new particle formation and growth processes in the atmosphere (Table 1).
AOD and AE: Given the critical importance of the condensational sink [Dal Maso et al., 2002] in dictating the formation and growth of particles in the atmosphere, in our model, we include once-daily AOD and AE level 3 (1°× 1°) observations from the Moderate Resolution Imaging Spectroradiometer (MODIS; Collection 5), as indicators of aerosol atmospheric columnar loading (AOD at wavelength λ = 550 nm) and size distribution (AE, at λ = 470-660 nm) [Levy et al., 2010]. We focus on observations from the Sun-synchronous Terra satellite (local equatorial overpass time~10:30 local solar time (LST)), which are thus indicative of conditions prior to the occurrence of highest UFP concentrations (typically 10:00-12:00 LST at MMSF [Pryor et al., 2010]). Use of Collection 5 is dictated by the absence of AE in the Collection 6 released products. The uncertainty in the satellite retrieval is ±0.05 ± 0.15 × AOD for AOD and ±0.4 for AE [Levy et al., 2013].
SO 2 : Sulfuric acid (H 2 SO 4 ) has been identified as one of the major players in regulating the rate of NPF [Sipilä et al., 2010]; thus, herein, we include SO 2 as a proxy for H 2 SO 4 concentrations. Monitoring SO 2 from Journal of Geophysical Research: Atmospheres 10.1002/2016JD026021 space is challenging due to the high-ozone UV absorption in the UV wavelengths from which SO 2 abundance is derived and to the relatively low concentrations over most developed countries [Krotkov et al., 2006;Vehkamäki et al., 2002]. However, given the importance of H 2 SO 4 (and thus SO 2 ) as a precursor of NPF, we use level 3 daily planetary boundary layer SO 2 at a spatial resolution of 0.25°× 0.25°from the Ozone Monitoring Instrument on board NASA's Aura satellite (OMI, OMSO2e product [Krotkov et al., 2006;Lee et al., 2009]). OMI-retrieved [SO 2 ] correlates well with ground-based and aircraft measurements, although the measurement uncertainty is largest when small spatial and temporal scales are considered [Lee et al., 2011;Lee et al., 2009]. The accuracy of nonvolcanic SO 2 retrievals is~50% [Brinksma et al., 2003].
NH 3 : There is evidence that ammonia is the dominant stabilizing molecule in nucleation within eastern North America [Pryor et al., 2011]. Thus, here we use level 2 NH 3 measured by the Tropospheric Emission Spectrometer (TES) on board the Aura satellite. It must be acknowledged that TES exhibits highest sensitivity in the lower to middle troposphere (700-900 hPa, i.e., ≈1-3 km above sea level) and that the measurements are subject to relatively large and variable uncertainty (~10%), since no a priori averaging kernel is used in the retrieval [Pinder et al., 2011;Shephard et al., 2011]. Further, the measurements are sparsely distributed in both space and time; thus, [NH 3 ] at MMSF and Egbert are derived by using an inverse squared distance weighted interpolation technique. UV: We also include level 3 (resolution ≈ 1°× 1°) OMI-Aura local noontime surface UV irradiances at 310 nm (UVB) due to its role as a main driver of photochemical reactions to produce nucleation precursors [Leibensperger et al., 2012;Zhang et al., 2012]. The accuracy of UV retrieval is 10% [Brinksma et al., 2003].
Although in previous work over eastern North America [e.g.,  the remote sensing data are spatially averaged to 0.5°× 0.5°, in the current work, level 3 (gridded at 1°× 1°resolution) satellite retrievals are used in the proxy. Given the large scale of spatial coherence in nucleation and also the predictors (e.g., over 1000 km for mean AOD and over 150 km for extreme AOD over most of eastern North America [Sullivan et al., 2015]), this difference in the predictor spatial averaging is not expected to substantially impact the analysis.

Methodology
UFP measurements in eastern North America exhibit strong autocorrelation Jeong et al., 2010;; thus, we propose a functional form that allows use of only concurrent satellite observations as predictors and also one in which past particle number concentrations are also included in the proxy algorithm to allow an investigation of the dependence of particle concentration from its past values. More specifically, the proxy algorithm is designed to estimate UFP at time t based on UFP up to time t À 1 and all five satellite-retrieved properties at time t, separately for the two sites. We further assume a linear dependence of UFP on the predictors to facilitate an easy examination of their contribution to the UFP concentration and to ensure continuity with past research. We thus propose a functional form that allows to incorporate both satellite observations and past particle number concentrations and allows data imputation with the expectation maximization (EM) algorithm [Dempster et al., 1977] detailed below.
Since all variables are positive and display a highly skewed distribution, we denote with UFP t , AOD t , AE t , UV t , SO2 t , and NH3 t the logarithm (in base 10, but equivalent results would hold for a different base) of the six quantities measured at time t and with Y t = (UFP t À 1 , AOD t , AE t , UV t , SO2 t , and NH3 t ).
A primary limitation of previous studies on the use of satellite data for UFP estimation is the very small number of occurrences where the vector Y t is fully observed. Indeed, for the satellite-derived predictors used herein, data availability is 25% to 94% at MMSF (26% to 96% at Egbert) with [NH 3 ] and [SO 2 ] exhibiting lowest data availability (Table 1). When a data completeness criterion of 90% of hours is applied to represent a valid daily mean UFP concentration, 84% of days are available for MMSF and 90% for Egbert. Similarly, UV measurements are almost complete, with 94% data availability for MMSF and 96% for Egbert.  proposed a list-wise deletion approach to develop a satellitebased proxy for UFP in which the model was built by using only days when NPF was observed and all satellite-retrieved predictors were available. Such approach uses only 4% of measurement days at MMSF and 9% of days at Egbert. However, the total percentage of available data including vectors with partially available information is considerably higher: 58% for MMSF and 56% for Egbert. Our novel statistical model is able to fully exploit the partially available information from an incomplete suite of predictors. Also, it accounts for the daily temporal dependence of the data and thus naturally includes both nucleation and non-nucleation days.
The predictor suite is initially completed by filling in the missing observations under the assumption that Y t follows a vector autoregressive process of order 1: where A is a 6 × 6 matrix of unknown coefficients and ε t e iid N 0; Σ ð Þ. A model selection procedure has shown that higher order autoregressive processes lead to severe overparameterization, as each new order of autoregression requires 6 × 6 new parameters. We impute Y t via an EM algorithm [Dempster et al., 1977], consisting of the following steps: 1. All missing values are initially set to the corresponding mean. Denote withŶ t the imputed vector. 2. Plug inŶ t in equation (1) and estimate A and Σ with least squares. 3. For each time t, whenever at least one entry of Y t is missing, the one-step-ahead predictor is computed: 4. and the missing values are imputed from the predictor; i.e., if an element of Y t is missing, then it is a set equal to the corresponding element in Ŷ t . 5. Iterate from point 2 until the maximum absolute difference between consecutive missing values estimates is smaller than a threshold (in practice, after three iterations, this difference is smaller than the epsilon machine). Once the data imputation is performed, we assume the following ARMAX (autoregressive moving average model with exogenous inputs) [Shumway and Stoffer, 2011] for UFP t : where ε t e iid N 0; σ 2 ð Þ. The model assumes an autoregressive dependence of UFP concentrations from the previous 2 days, as well as a dependence on the pseudosimultaneous retrievals of the five satellite-based variables. Figures 2a-2c show the original time series of daily mean UFP concentrations at both sites with the corresponding fitted values. The fitted values are less variable as they seek to capture the mean and not the intrinsic variability of the process, which should be reproduced via conditional simulations.

Fit and Prediction
Forecasting UFP t + 1 , UFP t + 2 , … based on observations up to time t in model (3) requires knowledge of both the past trajectory of UFP and the concurrent values of satellite observations. It is, however, of practical interest to also consider the case when no ground-based measurement of UFP is available for the current period, i.e., when α 1 = α 2 = 0 in equation (3). Thus, we consider two models: a full version as in equation (3) which we denote as ALL and a model based only on satellite retrievals (SAT). Excluding the terms that represent the autocorrelation of UFP concentrations naturally degrades the model performance. The coefficient of determination (R 2 ) for ALL at MMSF is 51%, while for SAT it is 13%, and the R 2 for ALL is 28% and for SAT is 13% at Egbert (Figures 2b and 2d). The degraded fit is also reflected in performance measures that use some of the data to condition the model, and some data are withheld for model evaluation. In particular, we computed the mean square prediction error (MSPE) for PSD observations in January-December 2008 at MMSF. For each of the first 5 days at the beginning of each calendar month, the model detailed in the previous section is fit by using all data available up to that day and the 1 day ahead prediction UFP t is performed with both ALL and SAT. The mean square prediction error is then computed as where M is the first day of each month in 2008 (see Table 2). On 2 September 2008 a very high value of UFP was observed at MMSF which caused the 5 day MSPE to be much higher than all the other values for both SAT and ALL (Table 2). Removing this day produces results in line with the other months. ALL produces more accurate predictions for 9 (10 when considering the outlier day) out of 12 calendar months, and the error distribution (median and interquartile range) is considerably closer to zero, a result that highlights the crucial role of the back trajectory (i.e., the autoregressive terms) in the model.

Comparison of the Sites
At both MMSF and Egbert, the past values of UFP (at time t À 1 for Egbert and up to t À 2 for MMSF) appear to be the main determinants of current UFP concentrations, which highlights the strong temporal autocorrelation of UFP and supports the postulate of regional-scale NPF events . Besides the intercept, other significant predictors of UFP are UV and AOD, which are the predictors that are characterized by higher accuracy (Table 1). The proposed model is able to capture 51% and 28% of the variability in daily mean UFP concentrations at MMSF and Egbert, respectively, as indicated by the coefficient of determination (R 2 ) between the observed and predicted UFP ( Figure 2 and Table 3), against 36% and 17% of the model in . However, it is important to note that the previous work considered only days with NPF and had a much smaller sample size (for the analysis of data from MMSF  used 63 days), while this research is based on 821 consecutive days and considers both events with NPF and those without. Besides, the proposed new approach exploits all information available from satellite observations and considers the temporal structure of the data.
The statistical model is also physically consistent with our a priori expectations and prior research: daily mean UFP concentrations are positively associated with UV and past UFP at both sites, whereas a negative association is found for AE and AOD at Egbert (only AE has a negative coefficient at MMSF), thus indicating the role played by photochemical reactions in enhancing particle formation and of preexisting particles in acting as a condensational sink (and thus suppressing NPF). The significant positive association between AOD and UFP at MMSF is consistent with prior work showing that a high AOD × AE is not necessarily associated with nonevent days during the leaf-dormant season and that other parameters, such as the availability of nucleation precursors, may play a major role in controlling NPF and therefore UFP at that site .

10.1002/2016JD026021
A key question in the development of satellite-based UFP proxy algorithms is the degree to which they are generalizable. Thus, the parameter coefficients from models fitted independently by using data from MMSF and Egbert were compared (Table 3). The coefficients for SO 2 and NH 3 vary in sign between the sites but are not identified as significant predictors, likely as a result of the high retrieval uncertainty. We tested if the coefficient estimates differ between the sites and found that, among the significant predictors, only the coefficients for UV are not significantly different between MMSF and Egbert. Additionally, we computed error magnitude when the estimated parameters from MMSF are used to predict conditions at Egbert (denoted by UFP t EGjMMSF ), relative to the predictions made by using the coefficients from Egbert (denoted by UFP t EGjEG ).
The ratio between the sum of squares is 69%, which indicates a substantial decrease in fitting performance when using the parameters estimated with the MMSF data set. As in prior research, this reemphasizes the challenge in developing a generalizable satellite-based proxy that can be robustly applied at the continental scale.

Discussion and Concluding Remarks
We propose a new approach for estimating daily near-surface UFP concentrations based on satellite retrievals of aerosol optical properties, trace gas concentrations, and solar irradiance which explicitly accounts for the temporal associations between the predictors and allows use of partial information from satellite retrievals by estimating missing values by using an EM approach. This new methodology is not computationally demanding and enables more efficient use of sparse (i.e., discontinuous) satellite observations and can thus be used to make projections for UFP concentrations over a wider array of atmospheric conditions. When the proxy includes temporally lagged UFP number concentrations the proposed model is able to capture 51% and 28% of the variability in daily mean UFP concentrations at MMSF and Egbert, respectively, and is also shown to be able to forecast UFP concentrations. When applied by using only satellite-based predictors the variance explanation (R 2 ) drops to 0.13 (which is still higher than, or equivalent to, the values reported in previous studies).
The approach we propose opens up important new directions for research in reconstructing UFP spatial scales and their associated uncertainty since it could be extended to also consider spatial interpolation. A spatiotemporal model would require robust definition of spatial dependence across all satellite observations and could allow production of near real-time predictions of UFP over eastern North America. Specifically, equation (3) could be generalized to account for a spatially varying linear model as UFP t s ð Þ ¼ α 0 s ð Þ þ α 1 s ð ÞUFP tÀ1 s ð Þ þ α 2 s ð ÞUFP tÀ2 s ð Þ þ β 1 s ð ÞAOD t s ð Þ þβ 2 s ð ÞAE t s ð Þ þ β 3 s ð ÞUV t s ð Þ þ β 4 s ð ÞSO2 t s ð Þ þ β 5 s ð ÞNH3 t s ð Þ þ ε t s ð Þ; where s indicates a generic site. The linear coefficients could be modeled as spatially varying coefficients [Gelfand et al., 2003], while the uncertainty could be modeled with a Matérn covariance function [Stein, 1999] or some more sophisticated nonstationary constructions in the presence of heterogeneous geographical features.