Improving Wind Forecasts in the Lower Stratosphere by Distilling an Analog Ensemble Into a Deep Neural Network

We discuss improving forecasts of winds in the lower stratosphere using machine learning to postprocess the output of the European Centre for Medium-Range Weather Forecasts (ECMWF) Integrated Forecast System. We postprocess global three-dimensional predictions and demonstrate distilling the analog ensemble (AnEn) method into a deep neural network, which reduces postprocessing latency to near zero while maintaining the increased forecast skill. This approach reduces the error with respect to the ECMWF high-resolution deterministic prediction by 2–15% for wind speed and 15–25% for wind direction and is on par with ECMWF ensemble (ENS) forecast skill up to hour 60. Verifying with Loon data from stratospheric balloons, AnEn has 20% lower error than ENS for wind speed and 15% lower for wind direction, despite a significantly lower real-time computational cost than ENS. Similar performance patterns hold for probabilistic predictions, with larger improvements of AnEn with respect to ENS. We also demonstrate that AnEn generates a calibrated probabilistic forecast.


Introduction
This paper discusses forecasting stratospheric winds by postprocessing numerical weather prediction models using machine learning techniques. Specifically, a new variant of the analog ensemble (AnEn; Delle Monache et al., 2013) algorithm that heavily leverages deep neural networks is proposed. The methodology is tested against analysis data and a data set of observations of stratospheric winds from Loon (https://www.loon.com) high-altitude balloons (Candido, 2020).
Our focus on winds in the lower stratosphere is driven by Loon's need to predict the trajectory of high-altitude balloons drifting through the stratosphere. Loon is a company that provides connectivity to people in underserved (often remote and rural) locations by placing telecommunications equipment on these balloons. These high-altitude platforms can change altitude and are navigated using a machine learning approach that synthesizes in situ observations of winds (from the balloon movements) and wind forecasts. Improved forecast accuracy and reliable uncertainty quantification of the forecasts, which are both key results of the approach we present, determine the navigation efficiency of the balloons. Because this navigation system is a real-time operational system (one that has navigated balloons for over 1 million hours of flight through the stratosphere), the amount of data to be downloaded from operational forecast centers, the compute needed to use those data in real-time operations, and the length of time required to do the postprocessing are also important factors that drive the quality of the system's operation. These concerns led us to develop a less expensive postprocessing model by distilling the computational burden of the postprocessing into a neural network. We expect that the approach proposed here, which allows the real-time execution of postprocessing methods such as the analog ensemble across millions of grid points and several lead times for global predictions, can be applied to several other atmospheric variables and parameters. (See below for the range of applications for which the analog ensemble method has already been implemented.) Recently, with the availability of increased computational resources suitable for the execution of neural networks (e.g., on graphics processing units) and access to large training data sets, machine learning algorithms have been successfully explored to generate weather predictions and to postprocess numerical weather predictions (e.g., Burke et al., 2020; Chapman et al., 2019; Gagne et al., 2017; Lagerquist et al., 2019; Rasp & Lerch, 2018; Scher, 2018; Tao et al., 2016). It has also been shown that machine learning can support the decision-making process associated with high-impact weather phenomena and can be leveraged to enhance our physical understanding of atmospheric processes (McGovern et al., 2019).
Analog-based methods, which are a type of machine learning, have been explored for decades (Lorenz, 1969) to develop predictions for a range of weather parameters. The basic idea is to find situations from the past similar to the current one and use what unfolded in these situations to estimate the future evolution of a parameter (Klausner et al., 2009; Panziera et al., 2011) or to infer the errors of today's prediction from a dynamical model's past performance (Delle Monache et al., 2013), an ensemble of model runs (Hamill & Whitaker, 2006), or other methods (Cervone et al., 2017; Mahoney et al., 2012).
One of the challenges of finding these similar situations is the size of the historical data set available to the algorithm. Van den Dool (1994) estimated that when matching fields over large spatial domains (e.g., the Northern Hemisphere), a training data set on the order of $10^{30}$ years long would be needed to find matches with a degree of analogy below observational errors. However, Van den Dool (1994) also indicated that if the matching problem can be reduced to a few degrees of freedom, a much shorter historical data set can be sufficient.
A common issue with real-world use of an AnEn-based system is achieving the postprocessing speed that is needed in an operational environment. We outline how a distributed computing system can apply the conventional AnEn globally using the past 2 years of forecasts in around 20 min. We demonstrate that this can be made even more efficient by distilling the entire AnEn into a deep neural network (DNN). Distilling, in the machine learning community, refers to training a DNN to memorize and thus mimic another model. It has been used in reinforcement learning, to compress an ensemble of predictions into a single model (Bucilu et al., 2006; Hinton et al., 2015), and to approximate a more complex neural network with a simpler one (Ba & Caruana, 2014). In all cases, the idea is to achieve a more computationally efficient version of a skillful, but perhaps inconvenient, model.
Since the distilling process is performed off-line (in advance), it does not impact real-time operations regardless of the size of the historical data set. This is a key factor given that the skill of the AnEn tends to improve with a larger historical data set.

Methods
In this section, we review the AnEn algorithm and discuss how it can be implemented at a global scale using distributed computing. We then discuss distilling the AnEn into a DNN. We use the former method to demonstrate that the much more efficient latter method achieves equivalent performance while being significantly more desirable for use in a production system.

Conventional Analog Ensemble Algorithm
The AnEn estimates a probability distribution over a forecast parameter, such as wind speed or direction, given a forecast, previous forecasts made by the same model, and corresponding ground truth for those previous forecasts. A search for analogous situations, that is, previous forecasts we consider to be similar to the current forecast, is performed and ground truth corresponding to these analogous forecasts is used to construct an ensemble (Delle Monache et al., 2013). We report (below) the skill of the analog ensemble and its mean, which we use to generate probabilistic and deterministic predictions, respectively.
Let $f(y \mid \mathbf{x}_f)$ be the probability distribution of the observed value $y$ of some predicted quantity given a model forecast. The vector $\mathbf{x}_f = (x_{f_1}, \ldots, x_{f_k})$ contains $k$ predictors from the model forecast, typically including a forecast value for $y$ and other fields considered to be related or to provide context on similarity. In the results reported below, $\mathbf{x}_f$ includes wind speed and direction.
AnEn is a nearest-neighbor algorithm using a learned distance function. The closest analogs to $\mathbf{x}_f$ from previous forecasts are selected, typically restricted to $\mathbf{x}_i$ at the same grid point, that is, forecasts for the same latitude, longitude, and pressure and made for the same lead time. Each forecast has a corresponding ground truth referred to as $y_i$. We denote the set of forecast and observation tuples at a grid point as $P$. We rank every $\mathbf{x}_i \in P$ by a distance function

$$d(\mathbf{x}_f, \mathbf{x}_i) = \sum_{j=1}^{k} \frac{w_{P_j}}{\sigma_{P_j}} \left| x_{f_j} - x_{i_j} \right|,$$

where $\sigma_{P_j}$ is a normalization factor, for example, the standard deviation, to bring all elements of $\mathbf{x}$ into a uniform numeric range and $w_{P_j}$ is a per-feature weight. The weight and normalization factors are chosen independently for every grid point to optimize the root-mean-square error (RMSE) of the ensemble mean on the training data set using leave-one-out cross-validation, with the removed $(\mathbf{x}_i, y_i)$ used as $(\mathbf{x}_f, y)$.
The $N$ analogs with the smallest distance to $\mathbf{x}_f$ form an ensemble forecast. We use 25 analogs in the results below. The weighted ensemble mean can be used as a deterministic prediction (Delle Monache et al., 2011). We sort the candidate analogs by $d(\mathbf{x}_f, \mathbf{x}_i)$ and compute the weighted mean over the first $N$ analogs,

$$\hat{y} = \alpha \sum_{i=1}^{N} \frac{y_i}{d(\mathbf{x}_f, \mathbf{x}_i) + \epsilon}, \qquad \alpha = \left( \sum_{i=1}^{N} \frac{1}{d(\mathbf{x}_f, \mathbf{x}_i) + \epsilon} \right)^{-1},$$

where $\alpha$ is one over the sum of the weights and $\epsilon$ is a very small constant that guards against almost exact matches producing larger weights than can be represented numerically.
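To make the procedure concrete, the following is a minimal sketch, in Python/NumPy, of the AnEn computation at a single grid point. This is not the operational implementation: the array layout, the absolute-difference form of the distance, and the handling of wind direction are illustrative assumptions.

```python
import numpy as np

def anen_mean(x_f, X, Y, w, sigma, n_analogs=25, eps=1e-9):
    """Illustrative per-grid-point AnEn (a sketch, not the authors' code).

    x_f   : (k,) predictors of the current forecast (e.g., speed, direction)
    X     : (m, k) predictors of m historical forecasts at this grid point
    Y     : (m,) ground truth paired with each historical forecast
    w     : (k,) per-feature weights, tuned per grid point
    sigma : (k,) per-feature normalization, e.g., training standard deviation
    """
    # Weighted, normalized distance from the current forecast to each
    # candidate. (A real implementation would treat the circular wind
    # direction feature specially; that is omitted here for brevity.)
    d = np.sum(w * np.abs(X - x_f) / sigma, axis=1)

    # The N candidates closest to x_f form the analog ensemble.
    idx = np.argsort(d)[:n_analogs]
    ensemble, dist = Y[idx], d[idx]

    # Inverse-distance weighted mean; eps guards against near-exact matches
    # producing weights too large to represent numerically.
    weights = 1.0 / (dist + eps)
    return np.sum(weights * ensemble) / np.sum(weights)
```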
This procedure is designed for cases where there is a plurality of analogous situations; in the case of a rare forecast that is, for example, larger than most samples in the training data, the AnEn will predict a reversion to the mean and likely not produce a skillful forecast. Similar to Alessandrini et al. (2019), we apply a bias correction term to our forecast of wind speed,

$$\hat{y}_{bc} = \hat{y} + m,$$

where $m$ is a learned parameter to correct for systematic forecast bias.

Global Scale With Distributed Computing
While the calculation described above at a particular grid point is tractable, a barrier to operationalizing a global AnEn system is processing the corpus of analogs, which can easily grow to hundreds of terabytes of data for three-dimensional global predictions over several years. The AnEn algorithm provides a natural partitioning as execution is independent for each grid point and lead time. However, the data are not natively partitioned as both every historical forecast and the current prediction contain a piece of data needed to postprocess every grid point. The challenge is to organize the data so that the calculations can be efficiently executed across many data center computers. We use the MapReduce paradigm (Dean & Ghemawat, 2004), which allows the computation to run on a distributed computing (cloud) infrastructure like Google's Flume (Chambers et al., 2010). We describe the mechanics of this technique and provide pseudo-code in the supporting information.
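As a rough illustration of this partitioning (the pseudo-code in the supporting information is authoritative; the record layout and function names here are our assumptions), a single-process stand-in for the map and reduce phases might look like the following:

```python
from collections import defaultdict

def mapreduce_anen(records, anen_fn):
    """Toy, single-process stand-in for the distributed AnEn partitioning.

    records : dicts carrying grid coordinates, a lead time, a 'kind' field
              ('historical' or 'current'), predictors, and ground truth.
    anen_fn : the per-grid-point AnEn routine, e.g., a variant of anen_mean.
    """
    # Map: key every record by (grid point, lead time) so that everything
    # one AnEn invocation needs ends up in the same shard.
    shards = defaultdict(list)
    for rec in records:
        key = (rec["lat"], rec["lon"], rec["pressure"], rec["lead_time"])
        shards[key].append(rec)

    # Reduce: shards are independent, so a real system (e.g., Flume) can
    # farm them out to many machines; here we just loop. We assume exactly
    # one current-forecast record per shard.
    output = {}
    for key, shard in shards.items():
        history = [r for r in shard if r["kind"] == "historical"]
        (current,) = [r for r in shard if r["kind"] == "current"]
        output[key] = anen_fn(current, history)
    return output
```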
Using this technique at appropriate scale, one can postprocess a stratospheric wind forecast in 10-20 min. In our case, we use hundreds to thousands of data center machines. We create a 3-D forecast with 20 pressure levels and 0.5° resolution in latitude and longitude over 20 lead times. This adds up to the AnEn being applied at around 100 million grid points with analogs drawn from around 3 years of prior forecasts, that is, around 2,196 candidate analogs per grid point. A rough estimate of the single-machine cost, obtained by multiplying the number of workers by the 10-20 min compute time (and ignoring interprocess overhead), highlights why a single-machine implementation would be far too slow for an operational postprocessing system.
Despite being able to achieve appropriate scale, this is an expensive computation that grows proportional to corpus size. Postprocessing would take significantly longer in the case of a much larger historical corpus. In the next section we discuss distilling this computation into a DNN to address this issue.

Distilling the Analog Ensemble Into a Deep Neural Network
Every value of the analog ensemble mean, $\hat{y}_{bc}$, corresponds to an HRES prediction of wind speed and direction, $\mathbf{x}_f$, which has been used to generate the analogs included in the set $P$. A DNN can be used to learn, that is, to memorize or distill, the function mapping the wind speed and direction of the HRES forecast to the resulting analog ensemble mean. An example function for a particular grid point is shown in Figure 1. The AnEn mean wind speed values ($\hat{y}_{bc}$; color shading on the isosurface) are shown for each HRES forecast $\mathbf{x}_f$ wind speed (distance from the origin) and direction forecast (rotation around the z axis). The plot is roughly conical and would be exactly conical if AnEn postprocessing had no effect. Some deformation from the perfect cone is introduced by the AnEn algorithm, which we denote by $h_P$, that is, $\hat{y}_{bc} = h_P(\mathbf{x}_f)$. Generating this figure does not require actual new, unseen HRES forecasts. We instead plot the response of the AnEn to potential HRES forecasts. In much the same way, the response curve can be learned by the DNN in advance of receiving a forecast, and when the time comes to postprocess the operational forecast, we do not require access to the corpus of potential analogs. This makes the distilled AnEn significantly faster than the AnEn and more efficient when handling a new forecast in real time.
In this study we distill $h_P$ by training over all grid points. We train the DNN to learn the function $\hat{y}_{\mathrm{distilled}} = \hat{h}(k, \mathbf{x}_f)$, where $k$ is a specific grid point. Because $h_P$ varies from grid point to grid point, we add the grid point parameters (latitude, longitude, pressure altitude, and lead time) as arguments to $\hat{h}$ so that the DNN can learn different postprocessing transformations at different grid points.
While we discuss results on the ensemble mean below, this procedure is not specific to the mean. For example, we have distilled both the ensemble mean of speed and direction (analyzed below), and ensemble forecasts for both quantities, into a single DNN with multiple outputs (not shown). The results presented below use a DNN with 10 trainable fully connected ReLU layers, 50 units wide, trained with stochastic gradient descent in TensorFlow (Abadi et al., 2015). Full details on the DNN architecture, training parameters, and nonstandard data flow, which is conceptually similar to a replay buffer in deep reinforcement learning (Lin, 1992; Mnih et al., 2015), can be found in the supporting information.
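A minimal TensorFlow sketch consistent with the architecture just described is given below; the layer count and width follow the text, while the input/output layout, optimizer settings, and loss are our assumptions.

```python
import tensorflow as tf

def build_distilled_anen_model():
    """Sketch of the distillation network: grid-point parameters plus the
    HRES forecast in, distilled AnEn mean out. Details beyond the 10x50
    ReLU stack are illustrative assumptions."""
    # lat, lon, pressure altitude, lead time, HRES speed, HRES direction
    inputs = tf.keras.Input(shape=(6,))
    h = inputs
    for _ in range(10):  # 10 fully connected ReLU layers, 50 units wide
        h = tf.keras.layers.Dense(50, activation="relu")(h)
    # Distilled AnEn mean wind speed and direction. (A production model
    # might encode direction as sin/cos to respect its circularity.)
    outputs = tf.keras.layers.Dense(2)(h)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),
                  loss="mse")
    return model
```

Training pairs (grid point and forecast in, AnEn mean out) are generated off-line by sweeping the conventional AnEn over the corpus, so fitting the model can draw on effectively unlimited examples.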
We are able to demonstrate a good approximation (next section) for forecast speed and direction within a few billion training examples. Because the training procedure can be performed once (or perhaps periodically, but infrequently) prior to using the network in the operation pipeline, the training time is not particularly important to optimize. Our unoptimized implementation was able to train the network used in our results within a few days on a single CPU (being fed from a distributed data flow). The specific DNN architecture and the data flow to supply training with examples are outlined in greater detail in the supporting information.
Figure 1. The output of the AnEn is a function mapping $\mathbf{x}_f$ to $\hat{y}_{bc}$. The plot in (a) shows the speed prediction for a particular $P$ swept over speed (distance from origin) and direction (rotation around the z axis). For speed, this function will typically look like a cone. Taking a top-down view of this plot, we see in (b) the identity operator, that is, no AnEn postprocessing. In (d) we see the same cone as (a) from the top down. Finally, (f) shows the output of the distilled AnEn. Note that it resembles the transformation applied by the conventional AnEn but is not expected to be identical, as the function is generalized across multiple $P$. The plots in (c) and (e) show the difference (m s$^{-1}$) between the adjacent plots.
Once a network is trained, it can be applied pointwise, that is, at a particular place and time, with the HRES forecast as input. It adds only a few milliseconds to the real-time computational cost needed to look up forecast data, because the computation performed is a forward (inference) pass through a deep network, that is, a simple mathematical expression. More study is required to optimize the balance between generalization across different grid points and fitting the particular nuances of a given data set.
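Continuing the sketch above, pointwise application then reduces to a single forward pass (the input layout remains the illustrative one assumed earlier, and the query values are made up):

```python
import numpy as np

model = build_distilled_anen_model()  # in practice, load trained weights

# One (grid point, lead time) query: lat, lon, pressure (hPa), lead time
# (hr), HRES wind speed (m/s), HRES wind direction (deg).
x = np.array([[12.5, -77.0, 70.0, 48.0, 14.2, 251.0]], dtype="float32")
speed_bc, direction_bc = model.predict(x, verbose=0)[0]
```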

Results
We present the forecast skill of the AnEn and distilled AnEn aggregated over a half year of forecasts from July to December 2019, compared against the ECMWF Integrated Forecast System's high-resolution forecast (HRES) and the ensemble forecast (ENS). We also provide, in the supporting information, a yearlong comparison over a different time period (October 2017 to September 2018) against the HRES and a persistence ensemble, which yields equivalent results (see Figure S8).
Comprehensive ground truth measurements of winds throughout the stratosphere are not currently available, so to evaluate the quality of the various forecasts, we use two proxies for ground truth. The first proxy is the HRES analysis that provides an "observation" comprehensively across all grid points. The second proxy is true observations from Loon high-altitude balloons. This data set of 10.5 million observations, largely concentrated in the lower latitudes, is significantly more sparse as it only allows us to compare forecasts at places and times where a Loon balloon was present. Taken together, these two comparisons characterize the quality of our method.
To summarize the detailed results that follow, the AnEn and distilled AnEn improve upon the ECMWF Integrated Forecast System's high-resolution forecast (HRES) of winds in the lower stratosphere. The AnEn methods also produce a skillful probabilistic forecast that is able to quantify the forecast uncertainty, which is an advantage over using the raw deterministic HRES forecast. The ENS ensemble mean outperforms the AnEn methods when evaluated using the HRES analysis as ground truth but underperforms them on the sparser observations from real Loon flights. The AnEn method also avoids the significant computational cost of creating or using a 51-member ensemble forecast. Overall, the results that follow indicate the AnEn methods are very competitive, both when considering practical implications and on the merits of forecast skill alone.
Our region of interest is the lower stratosphere, from around 48 to 145 hPa. We apply the technique globally and consider the lead times forecast in the HRES, which range from 12 hr to 10 days in the future. The results reported in this section are for latitudes below 70°. Results at higher latitudes are similar, but not shown. Our training data set is the HRES forecasts produced from July 2016 to June 2019. We use this to choose the weights used in the analog matching process. The validation period covers the HRES forecasts produced from July 2019 to December 2019. The data available for the AnEn matching include all the forecasts in the training data set plus any additional forecasts from the beginning of the validation time period up to the current forecast. This simulates operational use of an AnEn system. To evaluate the distilled AnEn, we only use a DNN distilled from the training data set. In practice, one would distill the AnEn into a new DNN from time to time to incorporate additional forecasts into the training corpus, but that has not been attempted in this study.

Figure 2 shows a comparison of the aggregated deterministic forecast error of the HRES, ENS, AnEn, and distilled AnEn grouped by lead time. Note that 90% bootstrap confidence intervals are omitted because they are very small: for each metric computed, and for each lead time, we have almost 2 billion and more than 10.5 million ground truth/prediction pairs when using HRES and Loon data, respectively. The reader can find a view of these confidence intervals in Figures S4 and S5. Figure 2a shows the evaluation performed using the HRES operational analysis as the ground truth field. The centered root-mean-square error (CRMSE) is the portion of the RMSE measuring the random (or anomaly) differences between two fields (Taylor, 2001). The AnEn methods have a lower CRMSE than HRES across all lead times for wind direction, and after hour 84 for wind speed. The AnEn methods have the same skill as ENS up to hour 60 and are competitive for longer lead times, which is remarkable considering that the AnEn real-time computation cost, given that it is based on HRES, is significantly lower than that of ENS. The correlation between the fields and the ground truth is either preserved or improved with the analog-based methods when compared to HRES. The remaining portion of the RMSE is the bias, which in this study is significantly lower
than CRMSE for all the prediction systems analyzed (not shown). The large reductions of CRMSE for both wind speed and direction obtained with the AnEn confirm the ability to tackle conditional biases, which is a result of the algorithm being designed to learn the error of the current prediction from the errors of analogous past forecasts. The ability of the distilled approach to reproduce the AnEn's deterministic skill is remarkable, as shown by the minimal differences between the two AnEn versions across the different metrics and cases considered.

Figure 2b shows the results when the measurements from Loon stratospheric balloons are used as ground truth. This is a much smaller data set and lacks global coverage, but it consists of real in situ observations from the stratosphere (see Figure S2 for the geographical distribution of Loon's measurements). For the convenience of the reader, we provide basic statistical breakdowns and ranges of the observations in the data set overlapping with our validation period in Figure S1. The AnEn methods exhibit lower CRMSE than HRES, and significantly lower CRMSE than ENS, for both wind speed and direction. The AnEn correlation is significantly higher than that of ENS for wind speed and better than that of HRES for wind direction. The better performance of AnEn compared to ENS when using Loon data can be explained by the fact that AnEn, by design, is an excellent downscaling method. This is more evident when comparing against data with high spatial and temporal resolution, like Loon in situ observations. On the other hand, that is a disadvantage for the coarser ENS.
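For reference, the bias/CRMSE decomposition used above follows Taylor (2001): with forecast $f$, observation $o$, and overbars denoting averages over the verification sample,

$$\mathrm{RMSE}^2 = \underbrace{\left(\bar{f} - \bar{o}\right)^2}_{\mathrm{bias}^2} + \underbrace{\overline{\left[(f - \bar{f}) - (o - \bar{o})\right]^2}}_{\mathrm{CRMSE}^2}.$$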
We turn our attention to probabilistic forecasts. We compare the ensemble forecast generated by the AnEn on stratospheric winds to the ENS. Figure 3 shows the continuous ranked probability score (CRPS), rank histogram, and binned-spread/skill plot across different lead times for the AnEn and ENS. We show these metrics for wind direction forecasts using the HRES analysis (left) and Loon data (right) as ground truth. Results for wind speed are qualitatively similar and are shown in Figure S3.
The CRPS provides an assessment of the quality of a probabilistic forecast of an event that is not necessarily binary (Hersbach, 2000). It is the probabilistic equivalent of the mean absolute error for deterministic predictions, and a value of zero indicates a perfect forecast. Similarly to the deterministic results, with the HRES analysis as ground truth, the AnEn is competitive with ENS up to hour 60 and better than HRES at all lead times. However, when this performance metric is calculated against the Loon data, the AnEn is significantly better than even ENS, reducing the latter's CRPS by between 7% and 70%.
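For reference, for a probabilistic forecast expressed as a cumulative distribution function $F$ and a verifying observation $y$, the CRPS is defined (Hersbach, 2000) as

$$\mathrm{CRPS}(F, y) = \int_{-\infty}^{\infty} \left[ F(z) - \mathbf{1}\{z \ge y\} \right]^2 \, dz,$$

where $\mathbf{1}\{\cdot\}$ is the indicator function; for an ensemble such as the AnEn or ENS, $F$ is taken to be the empirical distribution of the members.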
The rank histogram estimates the statistical consistency of an ensemble (Anderson, 1996). For a perfect ensemble, the observation appears to be drawn from the same distribution as the ensemble members, and the rank histogram is flat. The ENS has a U-shaped rank histogram with both ground-truth data sets, which indicates a lack of spread. With HRES as the ground truth, the AnEn rank histogram is instead closer to the ideal flat shape, though for the first few lead times it exhibits a dome shape indicating excess spread. This may reflect that the AnEn includes a few analogs with a larger match distance at early lead times. Against Loon data, the AnEn has a rank histogram significantly closer to the ideal shape, being U-shaped but less so than ENS.
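As a minimal illustration of how such a histogram is tallied (a sketch with names of our choosing; ties between the observation and members are ignored for brevity):

```python
import numpy as np

def rank_histogram(ensembles, observations):
    """Tally the rank of each observation within its ensemble.
    A flat histogram indicates a statistically consistent ensemble."""
    ensembles = np.asarray(ensembles)        # (n_cases, n_members)
    observations = np.asarray(observations)  # (n_cases,)
    counts = np.zeros(ensembles.shape[1] + 1, dtype=int)
    for ens, obs in zip(ensembles, observations):
        counts[np.sum(ens < obs)] += 1       # rank 0..n_members
    return counts
```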
The binned-spread/skill plot (van den Dool, 1989; Wang & Bishop, 2003), which is only applicable to probabilistic predictions, characterizes perhaps the most important attribute of an ensemble system: the ability to quantify uncertainty while accounting for the flow-dependent error characteristics. This is approximated by analyzing the spread-skill relationship across different spread bins. A perfect ensemble results in a diagonal line. Against the HRES analysis, the AnEn is closer to the diagonal than ENS, although both systems exhibit a good spread-skill relationship. However, when Loon measurements are used as ground truth, the AnEn exhibits a significantly better ability to characterize the prediction uncertainty. The ENS diagram is horizontal for most bins and lead times, which reflects a lack of a spread-skill relationship for the ECMWF ensemble system when predicting wind direction.

Figure 4a shows an example of the difference in forecast wind speed between the (distilled) AnEn-based forecast and the HRES across a constant-pressure slice of the stratosphere. Figure 4b shows the percent change. In this particular example, which was chosen at random, the largest percent changes are in the tropics. This tends to be a common pattern. Most regions we have analyzed see forecast improvements with the AnEn when compared to HRES, and the largest improvements are at latitudes below 23°. The arrows in Figure 4b indicate the wind direction vector field at this pressure level.

Discussion
The analog ensemble (AnEn) and distilled AnEn improve the European Centre for Medium-Range Weather Forecasts (ECMWF) high-resolution (HRES) deterministic forecast of winds in the lower stratosphere in our evaluation over half a year of forecasts, using both global ECMWF analyses and a smaller set of observations from Loon high-altitude balloons as ground truth. The AnEn is also competitive with the ECMWF ensemble (ENS) system up to hour 60 for deterministic and probabilistic forecasts when the HRES analysis is used as ground truth, and significantly better when the performance metrics are computed against Loon's data set of true in situ observations. In particular, the AnEn is able to quantify the prediction uncertainty, as is evident from the analysis of the probabilistic systems' spread-skill relationship, while ENS lacks this attribute, particularly for wind direction predictions. This is true despite the AnEn being computationally cheaper in real time.
Physics-based numerical weather models, such as the ECMWF's HRES, are marvels of engineering and science and produce high-quality forecasts of many meteorological fields in a coupled and principled manner. However, improvements can sometimes come at great cost, both in research time and in computation and power. Pure machine learning techniques, that is, end-to-end learned model-free forecasting, hold promise but are limited due to training on a small number of observations and a limited ability to extrapolate beyond that training data.
For example, a weakness of an analog-based approach is handling new situations. If not handled properly, postprocessing can reduce forecast skill. We found a specific example of this in our experiments, which covered a period of vortex breakdown over North America during February 2018. Because there was only a single Northern Hemisphere winter in our training corpus and it did not exhibit a large vortex breakdown over North America, the algorithm was not able to find analogs with sufficiently high wind speeds. When testing the method without the bias correction term of Equation 3, the method decreased forecast skill. While the bias correction acts as a stopgap in this scenario, the preferred approach would be to extend the historical corpus to be long enough to find analogous vortex breakdown scenarios.
Recently, there have been several contributions exploring the potential of machine learning for weather and climate predictions (e.g., Burke et al., 2020; Chapman et al., 2019; Gagne et al., 2017, 2019; Lagerquist et al., 2019; McGovern et al., 2017, 2019; Rasp & Lerch, 2018; Scher, 2018; Tao et al., 2016). However, although there have been encouraging attempts to develop pure machine learning weather forecasting methods (e.g., Weyn et al., 2019), those may still be out of reach given the relatively low number of available learning examples compared to the number of degrees of freedom in the atmosphere. Currently, successful attempts have been reported only in replacing individual physical processes (e.g., O'Gorman & Dwyer, 2018).
The AnEn distilling procedure can be seen through two lenses. One can consider the distilled AnEn as an approximation of the conventional AnEn, that is, a highly efficient implementation of the conventional technique. A second lens is that the DNN is the learning technique and the process of distilling the AnEn is a data augmentation method that increases the number of examples used to train the network. One may prefer to distill an AnEn over directly training a DNN to improve forecasts because DNNs have a high capacity (the complexity of the function the model can encode) and, unfortunately, there are limited numbers of forecast/ground truth pairs available for training. The lack of training data is exacerbated by growing the number of outputs we want the DNN to produce, for example, a probability distribution over our forecast field. The AnEn has been shown to generalize well as a machine learning algorithm, that is, to provide an improved forecast when deployed over long validation periods on unseen meteorological forecasts. The distilled AnEn bootstraps training a DNN off the AnEn, effectively combining the AnEn's strength of being able to generate forecasts with a relatively small corpus of training examples with the DNN's ability to memorize this complex correction function in a significantly more compact form.
This may be a pragmatic compromise. It seems there is a large opportunity for machine learning by relying on the extremely high-quality numerical weather models and making improvements in postprocessing. The authors believe there is potential in this fused approach. This paper provides an example of how machine learning can contribute to increasing forecast skill and uncertainty quantification. As forecasts are asked to be simultaneously faster, more granular, and more accurate, the physics-based models can continue to do the heavy lifting and machine learning postprocessing can improve forecast quality to alleviate some issues of scale.

Data Availability Statement
For more information on Loon stratospheric wind observations, contact Loon (at stratospheric-data@loon.com). The data used in this paper are a portion of the publicly available Loon observations data set (Candido, 2020).