A Machine Learning Augmented Data Assimilation Method for High-Resolution Observations
Abstract
The accuracy of initial conditions is an important driver of the forecast skill of numerical weather prediction models. Increases in the quantity of available measurements, particularly high-resolution remote sensing observational data products from satellites, are valuable inputs for improving those initial condition estimates. However, the traditional data assimilation methods for integrating observations into forecast models are computationally expensive. This makes incorporating dense observations into operational forecast systems challenging, and it is often prohibitively time-consuming. Additionally, high-resolution observations often have correlated observation errors which are difficult to estimate and create problems for assimilation systems. As a result, large quantities of data are discarded and not used for state initialization. Using the Lorenz-96 system for testing, we demonstrate that a simple machine learning method can be trained to assimilate high-resolution data, improving both initial conditions and forecast accuracy. Compared to using the Ensemble Kalman Filter with high-resolution observations ignored, our augmented method reduces the average root-mean-squared error by 37%. Ensemble forecasts using initial conditions generated by the augmented method are more accurate and reliable at up to 10 days of forecast lead time.
Key Points

Machine Learning (ML) augmented data assimilation (DA) of high-resolution observations improves the analysis in a nonlinear dynamical model

Explainable Artificial Intelligence identifies system covariances to guide neural network training for analysis state reproduction

Short-term forecasts from the analysis generated by the ML augmented DA are more accurate and more reliable
Plain Language Summary
Weather forecasts are highly sensitive to the estimate of the current state of the atmosphere, known as initial conditions. The atmosphere is chaotic, meaning that small errors in this estimate can grow quickly as the forecast model predicts events further into the future. The satellite era has contributed to large improvements in weather forecasts by providing additional data that allow for more accurate estimates of initial conditions. However, current methods for generating initial conditions are computationally time-consuming, and as a result, large fractions of available measurements are not used for this purpose. In a proof-of-concept study using a simplified representation of the atmosphere for testing, we train a machine learning (ML) method to replicate the results of a traditional method. Once trained, ML models are usually very fast. Applying the trained model exclusively to measurements that would otherwise be too time-consuming to use produces better initial conditions and more accurate forecasts.
1 Introduction
The accuracy of operational weather forecasts is highly dependent on the quality of the initial conditions provided to the model (Bauer et al., 2015). To correct drift and maintain the robustness of forecasts, model initial conditions are regularly updated based on measurements (Bannister, 2017; Edwards et al., 2015). These measurements include both in-situ data, such as weather station measurements, and remotely sensed data (Choi et al., 2017; Zhang & Tian, 2021). Observations are noisy and may not be aligned with the model grid or state variables; the task of identifying optimal initial conditions consistent with all available information is therefore challenging and computationally expensive. Data assimilation (DA) methods are employed to do this. With the proliferation of high-resolution data sets, often at resolutions higher than that of the forecast models, otherwise useful data are regularly ignored and not assimilated into operational models due to time or computational constraints (Eyre et al., 2022). Correlated observation errors also often require thinning of observations before they are assimilated, limiting the quantity of satellite data that is processed for state initialization (Eyre et al., 2022). However, assimilation of a subset of available satellite data has improved forecasts, making it likely that leveraging currently unused data could generate further improvements (Eyre et al., 2022).
DA techniques can be broadly categorized as variational methods or sequential methods (Bannister, 2017; Edwards et al., 2015). Variational methods use numerical optimization, finding the initial condition that minimizes an error metric or cost function. Sequential methods in current use are generally variants of the Ensemble Kalman Filter (EnKF), in which a set of model realizations is simulated to quantify covariance structures before assimilating observations (Evensen, 2003; Hoteit et al., 2018). Both classes of methods require running multiple simulations of the full forecast model, a step that is computationally expensive and time-consuming. To capture the information in a high-resolution measurement, the model itself must at least match the resolution of the measurement, further increasing the cost of this step as the model run time scales with resolution. In addition to the necessity of using a higher-resolution model (run multiple times), the physics of smaller-scale dynamics creates more complex correlation structures, and a larger ensemble size is required to actually improve the forecast (Miyoshi et al., 2015).
An additional cost associated with assimilating high-resolution observations is the observation operator. Both variational and sequential methods assess the error of a model forecast for a given initial condition in observation space. This requires, for each observation, explicitly mapping between forecast model space and observation space. For some observations, this is straightforward. For others, particularly for remotely sensed data such as high-resolution satellite measurements, this calculation is itself a physics-based model that can be a computational bottleneck (Eyre et al., 2022). In some operational models, the trade-off between the speed and accuracy of these observation operators is already an important avenue of research for improving the performance of their DA systems, even before currently unusable high-resolution data are considered (Shahabadi & Buehner, 2021). The observation operator calculation must be performed for each data point and so also scales with the number of discrete observations, again increasing its cost.
Remotely sensed observations are integral to the performance of numerical weather forecasts, especially since in-situ observations of large portions of the atmosphere and surface are sparse (Bannister, 2017; Cardinali, 2009). Efforts to incorporate satellite and other remotely sensed observations into assimilation systems have been effective at improving model initialization and forecast accuracy (Buehner et al., 2018; Shahabadi & Buehner, 2021). However, because of the expense associated with assimilation, much of the potential of these high-resolution measurements for improving state initialization in forecast models has not been realized. Currently employed DA methods are simply not efficient enough to ingest these data quickly enough to be useful in an operational setting. Machine Learning (ML) methods may provide a solution.
ML techniques have been increasingly used in earth science applications, including DA (Abarbanel et al., 2018; Bonavita et al., 2021; Penny et al., 2022; Sonnewald et al., 2021). They are appealing for this particular problem mostly due to their speed. While the training process is often expensive, once trained ML methods are very fast compared to weather forecast models. Since many of the bottlenecks in traditional DA methods are related to computational efficiency, as described above, much of the effort in employing ML to improve DA has been targeted at the most computationally expensive parts of the process.
One obvious place to look is at the model simulations themselves. Attempts to use deep learning, in which neural networks composed of many layers capture complex structures, have proven successful. The basic approach is to train the ML model on the output of a traditional physics-based model (Kim et al., 2019; Pathak et al., 2017). The result is more computationally efficient but at the cost of accuracy. In the context of DA, it has been demonstrated that model surrogates can be successfully trained iteratively using DA state estimates (Brajard et al., 2020; Gottwald & Reich, 2021).
Extending this approach beyond demonstrating that ML can capture the dynamics of complex and chaotic systems, augmented approaches that use model surrogates only to represent scales unresolved by the physical model (Brajard et al., 2021; Gottwald & Reich, 2021) have shown that the integration of ML as a model surrogate can generate improvements over traditional DA methods. Other work has demonstrated the utility of using ML model surrogates to increase the ensemble size beyond what would be otherwise practical (Wu et al., 2021; Yang & Grooms, 2021). Related to the issue of unresolved scale and model resolution, ML has been employed to successfully generate parameterizations used to capture unresolved physics, making the generation of the tangent linear models needed in many variational DA methods more efficient (Hatfield et al., 2021).
The observation operator, which can be another computational bottleneck, has also been targeted using ML (Geer, 2021; Jung et al., 2010; X. Liang et al., 2022; J. Liang et al., 2023; Stegmann et al., 2022; Wang et al., 2022). Other efforts have used ML methods to identify regions of tropical cyclone activity to target high-resolution modeling in a subdomain (Lee et al., 2019), to perform bias correction of model forecasts before they are fed into the assimilation algorithm (Arcucci et al., 2021; Chen et al., 2022), and to apply time-varying localization to the covariance structure of the system (Lacerda et al., 2021).
In contrast, relatively limited effort has been directed toward using ML to perform the assimilation step directly. Rather than using ML to replace pieces of the DA process, we propose an augmented DA method in which an ML model is trained offline to assimilate high-resolution measurements. Convolutional neural networks (CNNs) are particularly good candidates for assimilating spatial data and learning the spatial correlation structure of the system of interest, as their design and main demonstrated use cases rely on their ability to identify spatial patterns and localize input data (Dong et al., 2021; Mallat, 2016).
In a real-world scenario, computational resource bottlenecks require some high-resolution observations to be either thinned before being assimilated or discarded entirely. As a proof-of-concept demonstration of our proposed method, we use a synthetic system with observations available at regular time intervals. The observations are alternately high-resolution or low-resolution. Low-resolution observations are always assimilated using the EnKF, and in the augmented method, high-resolution observations are assimilated using a CNN trained on analyses produced in an offline setting where computational constraints are less pressing.

We test two hypotheses:

A shallow CNN can be successfully trained to reproduce the analysis of the EnKF offline

When used online to assimilate otherwise ignored high-resolution data, with the traditional EnKF used for low-resolution data, assimilation performance will be improved
2 Methods
2.1 The Lorenz-96 System
The Lorenz-96 system consists of N variables X_{i} on a cyclic domain governed by

dX_{i}/dt = (X_{i+1} − X_{i−2})X_{i−1} − X_{i} + F, i = 1, …, N, (1)

where indices are cyclic (X_{i+N} = X_{i}); we use N = 40 and forcing F = 8. We numerically integrated Equation 1 forward to generate a reference trajectory for our experiments. A fifth-order Runge-Kutta method was used, with a variable time step to control error assuming fourth-order accuracy (Dormand & Prince, 1980), as implemented in the SciPy package (Virtanen et al., 2020). The maximum allowable relative error was set to 0.001 and the maximum allowable absolute error to 10^{−6}. The system is in an unstable equilibrium when all variables are equal to F; initial conditions were set by perturbing one of the variables to a value of 8.01. The system was integrated out to t = 2,000 and output was generated at time intervals of Δt = 0.05, yielding a 40-variable state vector at 40,000 time steps, or 1,600,000 data points representing the true time evolution of the system. Synthetic observations were generated by adding normally distributed random noise with a standard deviation equal to 30% of the standard deviation of the reference state. This level of observation noise is consistent with other work done using the Lorenz-96 system for testing DA methods (Brajard et al., 2020; Hatfield et al., 2018; Hoteit et al., 2008).
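As an illustration, the integration and synthetic-observation steps above can be sketched with SciPy's solve_ivp (the RK45 method is the Dormand-Prince 5(4) pair, and rtol and atol match the tolerances quoted above). The horizon is shortened here for speed, and all variable names are ours:

```python
import numpy as np
from scipy.integrate import solve_ivp

N, F = 40, 8.0  # 40 variables; unstable equilibrium at X_i = F

def lorenz96(t, x):
    # dX_i/dt = (X_{i+1} - X_{i-2}) X_{i-1} - X_i + F, with cyclic indices
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

x0 = np.full(N, F)
x0[0] = 8.01  # perturb one variable off the unstable equilibrium

t_end, dt = 20.0, 0.05  # shortened horizon for illustration (t = 2,000 in the text)
t_eval = np.linspace(0.0, t_end, int(round(t_end / dt)) + 1)
sol = solve_ivp(lorenz96, (0.0, t_end), x0, method="RK45",
                rtol=1e-3, atol=1e-6, t_eval=t_eval)
truth = sol.y.T  # shape (time steps, variables)

# Synthetic observations: additive Gaussian noise, sigma = 30% of reference std
rng = np.random.default_rng(0)
sigma_obs = 0.3 * truth.std()
obs = truth + rng.normal(0.0, sigma_obs, truth.shape)
```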
2.2 The Ensemble Kalman Filter
The solution to the filtering problem is referred to as the analysis. When forecast and observation errors are unbiased, normally distributed, and independent, and the forecast model and observation operator are linear, the Kalman filter (KF) provides the closedform optimal solution to the filtering problem (Kalman, 1960).
The stochastic EnKF update is

X^{a} = X^{f} + CH^{T}(HCH^{T} + R)^{−1}(Y − HX^{f}), (4)

where H is the observation operator mapping model states to observation space. Here, X^{a} is a matrix whose column vectors are analysis ensemble members. X^{f} is a matrix whose column vectors are individual forecasts. C is the sample covariance of the ensemble forecast, X^{f}, used as an approximate representation of the true covariance, R is the observation error covariance matrix, and Y is a matrix whose columns are the observation vector y_{i}. To ensure that the covariance does not systematically underrepresent the true error, random Gaussian noise with covariance R is added to the observation matrix Y (Burgers et al., 1998; Evensen, 2003).
The EnKF assumes normality for the forecast and observation errors, $$\vec{\mu }$$ and $$\vec{\nu }$$, although it has been demonstrated to be somewhat robust to non-Gaussian distributions (Reichle et al., 2002). Importantly for earth system models in which the state space is very large, the EnKF can be effective even when the number of ensemble members is much smaller than the dimension of the state space. This contrasts with more exact methods, such as particle filters, which often exhibit stability problems in such situations (Farchi & Bocquet, 2018; Hoteit et al., 2008).
The EnKF is known to be vulnerable to issues arising from its approximation of a probability density function with a finite number of ensemble members (Ehrendorfer, 2007). These issues include spurious correlations, in which deviations of the sample covariance from the true covariance degrade filter performance, and covariance collapse, in which the ensemble becomes sharply clustered at a point in state space far from the true state. To address spurious correlations, localization is often used; this technique has been shown to improve performance by reducing the impact of spurious correlations between the system state at distant grid points (Evensen, 2003). For this application, we used a step function to localize the covariance, setting the covariance between any variables more than five grid points apart equal to zero (Houtekamer & Mitchell, 1998).
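A minimal sketch of a stochastic EnKF analysis step with perturbed observations and the step-function localization described above, assuming a direct (identity) observation of every variable; function and variable names are ours:

```python
import numpy as np

def enkf_update(Xf, y, sigma_obs, loc_radius=5, inflation=1.0, rng=None):
    """Stochastic EnKF analysis step (perturbed observations), H = identity.

    Xf : (N, m) forecast ensemble (columns are members)
    y  : (N,) observation of every state variable
    """
    rng = np.random.default_rng() if rng is None else rng
    N, m = Xf.shape
    R = sigma_obs**2 * np.eye(N)

    # Sample forecast covariance, optionally inflated
    C = inflation * np.cov(Xf)

    # Step-function localization: zero the covariance beyond loc_radius
    # grid points, using the cyclic (ring) distance between variables
    i = np.arange(N)
    dist = np.abs(i[:, None] - i[None, :])
    dist = np.minimum(dist, N - dist)
    C = np.where(dist <= loc_radius, C, 0.0)

    # Perturbed observations so analysis spread is not underestimated
    Y = y[:, None] + rng.normal(0.0, sigma_obs, (N, m))

    K = C @ np.linalg.inv(C + R)  # Kalman gain with H = I
    return Xf + K @ (Y - Xf)
```

The identity observation operator and the diagonal R are simplifications appropriate to this synthetic setting, not a general implementation.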
Covariance inflation, in which all covariance values are multiplied by a factor greater than one before computing Equation 4, is another technique used for improving the stability and performance of the EnKF by reducing the risk of covariance collapse. Our base settings did not include covariance inflation as our initial experiments did not show significant improvements employing it. Both approaches can improve performance in some circumstances by preventing covariance collapse (Evensen, 2003). However, since both address issues created by the finite size of the ensemble, they become less necessary with larger ensemble sizes and must be tuned (Miyoshi et al., 2015). As such, they are appropriate parameters to vary in our sensitivity analysis in order to identify optimal values.
We use the EnKF here for two purposes: to generate training data for a CNN and as a benchmark to evaluate the performance of our augmented method. After assimilating the synthetic observations with the EnKF using the settings described above, the following data are available at all 40,000 time steps: the true model state, the model forecast, the synthetic observation, and the EnKF analysis. Other combinations of settings were also used to assess sensitivity. These are outlined in Table 1. Observation error is specified as the standard deviation of the added noise used to generate the synthetic observations as a fraction of the system standard deviation.
Run name  Observation error StDev (fraction)  Ensemble size  Inflation factor  Localization distance 

Base  0.3  100  1  5 
s1  0.4  100  1  5 
s2  0.2  100  1  5 
s3  0.3  33  1  5 
s4  0.3  1,000  1  5 
s5  0.3  100  1.01  5 
s6  0.3  100  1.05  5 
s7  0.3  100  1.1  5 
s8  0.3  100  1  3 
s9  0.3  100  1  7 
 Note. Observation error is presented as observation noise standard deviation as a fraction of the total system standard deviation.
2.3 CNN Architecture and Training
The ML model is a CNN with several hidden layers. Its architecture is shown in Figure 1. The input layer has three channels: the model ensemble forecast mean, the difference between the forecast mean and the observation (the innovation), and the ensemble forecast standard deviation. The output layer contains two channels: the ensemble analysis mean and standard deviation, generating a probabilistic prediction of the analysis state. All input and output variables are drawn from the same time step.
A CNN is defined by the following parameters for each layer: the filter size, the number of feature maps, and the activation function (Alzubaidi et al., 2021). The weights of each convolutional filter are independent of space and are applied uniformly across the domain. Each feature map in a hidden layer is assigned different filter weights and a constant bias weight. The ReLU activation function is used for hidden layers, and no activation function is applied to the output, which is then a linear combination of the activation values in the final hidden layer plus the value of the bias weight.
Output from a convolution has a smaller dimension than its input, since locations on the edge of the domain do not have a neighboring point on one side. In image recognition and other similar tasks, zero-padding addresses this issue by artificially adding zeros to the input on the edges of the domain. Here, with a cyclically symmetric system, zero-padding is not an appropriate solution. Instead, we implemented cyclic padding such that the neighboring spatial nodes for i = 1 are i = N and i = 2. Similarly, the neighboring spatial nodes for i = N are i = 1 and i = N − 1. The data from spatial locations on one end of the domain are therefore concatenated to the other before being passed to the input layer. This maintains the cyclic nature of the Lorenz-96 system and ensures that the size of the CNN output matches the dimensions of the system.
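The cyclic padding described above can be sketched in a few lines of NumPy (the helper name is ours):

```python
import numpy as np

def cyclic_pad(x, pad):
    """Concatenate `pad` points from each end of the ring onto the opposite end.

    x : (N, channels) spatial field on a cyclic domain
    Returns an array of shape (N + 2*pad, channels).
    """
    return np.concatenate([x[-pad:], x, x[:pad]], axis=0)

x = np.arange(40, dtype=float).reshape(40, 1)  # N = 40, one channel
xp = cyclic_pad(x, 5)                          # 50 x 1, matching the CNN input
```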
The training data comprise the first half of the EnKF analysis states, for times t = 0 to 1,000, covering 20,000 individual time steps. The data set from the best-performing EnKF sensitivity run settings described in Table 1 was used for training. The Adam optimizer (Kingma & Ba, 2015) was used to train the model, using 25 training epochs and 100 batches per epoch. We treated the filter size, number of hidden layers, and number of feature maps per hidden layer as tunable hyperparameters and tested various combinations to identify a set that performed well. We trained CNNs with 96 combinations of hyperparameters: filter sizes of 1, 3, 5, or 7; 1 to 4 hidden layers; and 2 to 7 feature maps per hidden layer. Based on the results of this tuning, the chosen CNN has a filter size of 3, 4 hidden layers, 7 feature maps per hidden layer, and 576 trainable weights, with an input shape of 50 × 3 and an output shape of 40 × 2.
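As a check on the arithmetic above, a plain-NumPy sketch of one consistent configuration (filter size 3, four hidden layers of 7 feature maps, plus a linear output convolution) reproduces both the 576-weight count and the 50 × 3 → 40 × 2 shape mapping; an operational implementation would of course use a deep learning framework rather than this loop:

```python
import numpy as np

def conv1d_valid(x, W, b):
    """Valid (unpadded) 1D convolution. x: (L, c_in), W: (k, c_in, c_out)."""
    k = W.shape[0]
    L_out = x.shape[0] - k + 1
    out = np.empty((L_out, W.shape[2]))
    for i in range(L_out):
        out[i] = np.tensordot(x[i:i + k], W, axes=([0, 1], [0, 1])) + b
    return out

rng = np.random.default_rng(0)
k, maps, c_in = 3, 7, 3  # filter size 3; 7 feature maps; 3 input channels
layers = []
for _ in range(4):  # four hidden layers (ReLU)
    layers.append((rng.normal(size=(k, c_in, maps)), np.zeros(maps)))
    c_in = maps
layers.append((rng.normal(size=(k, maps, 2)), np.zeros(2)))  # linear output

n_params = sum(W.size + b.size for W, b in layers)  # 70 + 3*154 + 44 = 576

x = rng.normal(size=(50, 3))  # cyclically padded input: 40 + 2*5 points
for j, (W, b) in enumerate(layers):
    x = conv1d_valid(x, W, b)
    if j < len(layers) - 1:
        x = np.maximum(x, 0.0)  # ReLU on hidden layers only
# five filter-3 convolutions shrink 50 -> 40, matching the 40 x 2 output
```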
2.4 Augmented Method and Experimental Setup

Three sets of thinned observations were generated:

Data temporally thinned by discarding every other time step and spatially thinned by observing a randomly selected 25% of the variables at retained time steps.

No temporal thinning is used. Observations are spatially thinned by randomly selecting 12.5% of the variables at each time step.

No temporal thinning is used. Observations are spatially thinned by randomly selecting 25% of the variables at each time step.
The EnKF using base settings was then used to assimilate each set of observations, which will subsequently be referred to as EnKF SparseObs, EnKF SparseObs2, and EnKF SparseObs3, respectively. The method used to generate the training data, in which no thinning was applied to observations before being assimilated by the EnKF, will be referred to as “EnKF AllObs.”
The augmented method uses the EnKF to assimilate low-resolution spatially thinned observations. In the EnKF SparseObs configuration, before temporal thinning is applied, a "high-resolution" observation covering 100% of the variables is available on alternating time steps. For EnKF SparseObs these observations are assumed to be prohibitively computationally expensive to assimilate and are therefore discarded during thinning. The forecast continues to the next time step, where the next thinned observation is available and assimilated by the EnKF. In the augmented method, the CNN assimilates the high-resolution observations not assimilated by the EnKF. A flow chart demonstrating the operation of the augmented method is shown in Figure 2.
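The alternating assimilation cycle in Figure 2 can be expressed as a short driver loop; the components below are placeholders standing in for the forecast model, the EnKF update, and the trained CNN, and all names are ours:

```python
import numpy as np

def augmented_cycle(X, observations, forecast, enkf_step, cnn_assimilate):
    """Alternate EnKF and CNN assimilation over a sequence of observations.

    X              : (N, m) initial ensemble (columns are members)
    observations   : sequence of (kind, y) pairs, kind in {"low", "high"}
    forecast       : callable advancing the ensemble one time step
    enkf_step      : callable assimilating a thinned low-resolution observation
    cnn_assimilate : callable assimilating a full high-resolution observation
    """
    analyses = []
    for kind, y in observations:
        X = forecast(X)
        X = enkf_step(X, y) if kind == "low" else cnn_assimilate(X, y)
        analyses.append(X.mean(axis=1))  # record the analysis ensemble mean
    return np.array(analyses)

# Illustrative driver with trivial placeholder components
obs_seq = [("high" if t % 2 else "low", np.zeros(40)) for t in range(6)]
analyses = augmented_cycle(np.zeros((40, 10)), obs_seq,
                           forecast=lambda X: X,
                           enkf_step=lambda X, y: X,
                           cnn_assimilate=lambda X, y: X)
```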
The CNN takes the ensemble forecast mean, ensemble forecast standard deviation, and the innovation as input and returns the analysis ensemble mean and standard deviation as output, creating a probabilistic estimate of the analysis state. At this stage, an ensemble must be recreated to generate an ensemble forecast for the next time step (where the EnKF will be used to assimilate the low-resolution observation). One potential method would be to assume a diagonal analysis covariance matrix and resample the ensemble using the CNN-predicted means and standard deviations. Initial testing showed this approach to be ineffective, likely because ignoring the nonzero off-diagonal terms of the covariance degrades the subsequent ensemble forecast.
2.5 Explainable AI
ML models are fast to run and accurate when sufficient training data are available. In many earth system science applications, the computational efficiency of traditional tools is a significant bottleneck and available training data are voluminous. These models have a major drawback, however: they are black boxes, and it is therefore often not clear how they generate their predictions (Gevaert, 2022). Using testing and validation data sets can provide some level of confidence in the models by demonstrating their accuracy on data not used for training. If they are to be deployed in something like an operational weather forecast system, however, such demonstrations may not provide a sufficient level of confidence. Out-of-sample input data cannot be guaranteed never to occur, and no guarantees can be made about the behavior of the ML model when presented with such data.
A variety of tools are available to make otherwise opaque data-driven models more transparent, collectively referred to as Explainable AI (XAI) methods (Linardatos et al., 2021). Shapley Additive Explanations (SHAP) is one such tool. SHAP quantifies the impact of a specific input variable on the output generated by a model. The method is model-agnostic and is equally applicable to a simple linear regression model or a deep neural network with millions of trainable parameters. Full details and a formal definition can be found in Lundberg and Lee (2017). Heuristically, a SHAP value is the anomaly in an output variable attributable to the anomaly in an input variable. It provides a way of apportioning the deviation from the mean in the output to each input variable. This information can increase confidence that the trained model is reliable and provides insights into the structure of the underlying system.
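As a concrete illustration of the heuristic above: for a linear model with independent inputs, the SHAP value of each input reduces exactly to its weight times its deviation from the background mean, so the attributions sum to the output's deviation from its baseline. A sketch (all names ours; a real analysis would use the shap package):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))              # background data set
w, b = np.array([2.0, -1.0, 0.5, 0.0]), 3.0
f = lambda Z: Z @ w + b                     # a linear "model" to explain

x = np.array([1.0, 2.0, -1.0, 5.0])         # instance to explain
phi = w * (x - X.mean(axis=0))              # exact SHAP values (linear case)

# phi apportions f(x) - E[f] across inputs; an input with zero weight
# receives zero credit no matter how anomalous its value is
baseline = f(X).mean()
total_attribution = phi.sum()               # equals f(x) - baseline
```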
We apply it here to analyze how the trained CNN assimilates observations and ensemble forecasts. In a DA context, the behavior of the CNN should be predictable and consistent with our understanding of the dynamics of the Lorenz-96 system; it should not, for example, heavily weight forecasts from spatially distant nodes. Evaluating the CNN in this way can provide confidence in its ability to perform well when presented with new data.
3 Results
3.1 Sensitivity and Training
The results of the base run and 9 sensitivity runs using the EnKF are outlined in Table 2. These runs are intended to identify optimal settings for generating training data, with the EnKF assimilating all observations at every time step (i.e., the high-resolution observation at every time step). The performance of each run is evaluated as the mean analysis root-mean-squared error (RMSE) divided by the standard deviation of the observation error. For all 10 cases, the EnKF analysis has a lower error than the observation error, as expected, with all runs achieving errors below 30% of the observation standard deviation. As the EnKF approximates the optimal solution by using the first two moments of the forecast ensemble to represent a normal distribution, errors caused by the finite size of the ensemble are expected to decrease with ensemble size. This is evident in our results, which show larger ensemble sizes producing lower errors.
Run name  Observation error (StDev fraction)  Ensemble size  Inflation factor  Localization distance  AllObs RMSE (% of obs StDev)  Augmented RMSE  SparseObs RMSE 

Base  0.3  100  1  5  20.3059  0.568  0.877 
s1  0.4  100  1  5  20.6152  n/a  n/a 
s2  0.2  100  1  5  20.1084  n/a  n/a 
s3  0.3  33  1  5  29.5916  0.593  1.453 
s4  0.3  1,000  1  5  19.6315  0.564  0.824 
s5  0.3  100  1.01  5  20.3089  0.568  0.883 
s6  0.3  100  1.05  5  20.3404  0.570  0.876 
s7  0.3  100  1.1  5  20.4216  0.573  0.882 
s8  0.3  100  1  3  22.6087  0.617  0.989 
s9  0.3  100  1  7  19.4958  0.564  0.899 
 Note. Rootmeansquared error (RMSE) is presented as a fraction of the observation standard deviation for EnKF AllObs results to allow for comparison between different observation error settings. For SparseObs and augmented, raw RMSE values are used.
Localization and covariance inflation can improve EnKF performance by mitigating errors related to finite ensemble size but can be detrimental for larger ensemble sizes as such errors become less important. We expect the performance to be dependent on localization and inflation settings but it is not clear a priori which values will be optimal. The bestperforming combination of settings was run s9 with localization of 7 grid spaces and covariance inflation factor of 1 (i.e., no inflation). These are the EnKF settings used for training the CNN and used in the augmented method. A run using no localization was also tested, which resulted in filter divergence (not shown).
The results from the training process are shown in Figure 3. The training targets were the EnKF ensemble analyses and standard deviations produced using observations of all variables, as described in Section 2.2. CNN error can be considered in terms of how well the CNN output matches the EnKF results and the deviation of its predictions from the true state. Errors with respect to true states in the second half of the time series, that is, the set not used for training, are treated as validation metrics. As the CNN produces probabilistic predictions, we evaluated several error metrics to assess performance, shown over 25 training epochs in Figure 3. Figures 3a and 3b show the mean absolute error and RMSE of the state predictions, with the latter shown with respect to both truth and training data. Overfitting would be indicated by an increase in the validation error in Figure 3b even as the training error remained flat or decreased. This is not evident here, demonstrating that our trained CNN is not overfitting and produces reliable predictions when presented with data from outside its training set (Ying, 2019). Low mean absolute error is also a desirable property, a feature evident in Figure 3a.
To evaluate the quality of the ensemble standard deviation predictions as well as the state predictions, we additionally normalize the errors by the predicted standard deviation. The normalized mean absolute errors and the standard deviation of the normalized errors are shown in Figures 3c and 3d, respectively. The normalized mean absolute error should be small compared to 1 (with 1 indicating that the mean absolute error is 1 standard deviation from the true value). The standard deviation of the normalized errors would be 1 if the predicted standard deviation perfectly characterized the error. Therefore, a welltrained model will produce a value close to 1 for this metric. Both of these conditions are met. All metrics plotted show a small decrease in performance from epoch 24 to 25, and the weights from epoch 24 were therefore used for the final trained model.
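These diagnostics are simple to compute; the sketch below applies them to synthetic, perfectly calibrated predictions, for which the normalized mean absolute error should approach sqrt(2/pi) ≈ 0.8 and the standard deviation of the normalized errors should approach 1 (all names ours):

```python
import numpy as np

rng = np.random.default_rng(0)
truth = rng.normal(size=(1000, 40))
pred_std = np.full_like(truth, 0.5)            # predicted ensemble spread
pred_mean = truth + rng.normal(0.0, pred_std)  # errors drawn to match it

z = (pred_mean - truth) / pred_std  # errors normalized by predicted std
nmae = np.abs(z).mean()             # ~ sqrt(2/pi) ~ 0.80 when calibrated
z_std = z.std()                     # ~ 1 when the spread matches the error
```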
Another check on the trained CNN is to compare its RMSE on the validation data set to the observation error. To improve the state estimate, the CNN must perform better than observation error statistics alone. Using EnKF AllObs forecasts and residuals, the final validation RMSE with respect to the true state in Figure 3 is 23% of the observation standard deviation, compared to 21% for the EnKF AllObs analyses.
The sensitivity of the augmented method to different settings was tested. These results are shown in Table 2. The CNN trained on results using the s9 run settings was used for all runs. The s1 and s2 run settings were not tested; the CNN was trained on data with the observation error specified by s9 settings and the s1 and s2 settings are therefore not applicable for the augmented method. For those cases that are applicable, the augmented method performance was compared to the EnKF assimilating only the lowresolution observations (EnKF SparseObs). For all sensitivity settings, the augmented method outperforms EnKF SparseObs (Table 2).
3.2 Performance Comparison
Having shown that the trained CNN does not overfit and that its error is 23% of the observation error, we can now assess the performance of the augmented method using this CNN. Base settings from Table 1 were used for all runs presented in this section. In addition to the augmented method and EnKF SparseObs, the performance of EnKF AllObs assimilating observations of all variables at every time step is included for comparison. The analysis RMSE for all three methods is shown in Figure 4. Time is shown in earth-years, with 1 model time unit equivalent to 5 earth days (Lorenz, 2005). EnKF AllObs performs best, with an average RMSE of 0.22. Considering only the density and frequency of assimilated observations, this comparative overperformance is unsurprising, as EnKF AllObs assimilates more data than the other methods. More interestingly, the augmented method performs better than EnKF SparseObs, with an average RMSE of 0.57 compared to 0.88, representing an improvement of 35%. F-tests and Levene's tests performed on the errors are significant at the p < 0.01 level, and a Mann-Whitney U test performed on the RMSEs is also significant at this level.
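The significance tests named above are available in SciPy (there is no built-in two-sample F-test, so the variance-ratio p-value is computed directly from the F distribution); the arrays below are synthetic stand-ins for the actual error series:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
err_aug = np.abs(rng.normal(0.0, 0.57, 2000))     # stand-in augmented errors
err_sparse = np.abs(rng.normal(0.0, 0.88, 2000))  # stand-in SparseObs errors

# Levene's test for unequal variances (robust to non-normality)
lev_stat, lev_p = stats.levene(err_aug, err_sparse)

# Mann-Whitney U test: are augmented errors stochastically smaller?
u_stat, u_p = stats.mannwhitneyu(err_aug, err_sparse, alternative="less")

# Two-sample F-test via the variance ratio and the F distribution
F_ratio = err_sparse.var(ddof=1) / err_aug.var(ddof=1)
f_p = stats.f.sf(F_ratio, len(err_sparse) - 1, len(err_aug) - 1)
```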
Beyond the time-averaged RMSE, the variability of errors for EnKF SparseObs and the augmented method is also notable. The time series on the left is smoothed, and in this rolling average the augmented method consistently outperforms EnKF SparseObs at nearly all time steps. The histogram on the right shows the distribution of errors at all time steps for EnKF AllObs, EnKF SparseObs, and the augmented method. There is substantial overlap in the distributions of errors; using unsmoothed data, the augmented method outperforms in 45% of coincident time steps. However, the EnKF SparseObs errors have a notably fatter tail in the histogram. These spikes are periods where the state estimate diverged from reality, generating instabilities in the EnKF that resulted in large errors.
Despite having a similar error distribution to EnKF SparseObs, the augmented method does not share this fat tail. It is better at keeping the state estimate close to the true state, preventing instabilities and periodic spikes in the analysis error, which accounts for its consistent overperformance in the smoothed time series. This improved stability is an important factor in evaluating relative performance and suggests that the augmented method is more reliable than the 45% head-to-head win rate alone would indicate. The improved stability and reduced mean RMSE are clear benefits of exploiting all available data in assimilation using an efficient but possibly suboptimal technique (the CNN) compared to using only the EnKF on spatially thinned observations.
The results of applying the EnKF to the alternatively thinned observations described in Section 2 are omitted from Figure 4 for visual clarity but are summarized here; they are included in the supplemental information (Figures S2 and S3 in Supporting Information S1). The average RMSE for EnKF SparseObs2 is 0.95, and the average RMSE for EnKF SparseObs3 is 0.56. Notably, the latter is comparable to the results for the augmented method. To the extent that the number of observations assimilated by a traditional method drives computational cost, EnKF SparseObs3 assimilates twice as many observations with the EnKF as the other methods per unit time. Despite this, the augmented method performs similarly at potentially lower computational cost.
Another way of assessing performance is to generate forecasts using analyses as initial conditions. Forecast skill over time can then be compared. These results are shown in Figure 5. The mean error of ensemble forecasts from a sample of initial conditions is plotted out to 10 days of lead time. As with the results in Figure 4, EnKF AllObs performs best, generating better forecasts for all lead times. The augmented method again outperforms EnKF SparseObs. Out to greater than 10 days of forecast lead time, the RMSE of forecasts generated using initial conditions from the augmented method is statistically significantly better at the p < 0.05 level. Graphically, this is immediately evident as the 95% confidence interval bands do not overlap.
Figure 5 also shows another metric for evaluating the reliability of the ensemble forecasts. By definition, when errors are unbiased, the standard deviation of the error and the RMSE are equivalent. If the ensemble spread reflects the true error, then the actual RMSE should equal the ensemble standard deviation (Gneiting & Katzfuss, 2014; Leutbecher & Palmer, 2008). If the ensemble is overprecise, with estimated errors smaller than the actual errors, it is considered underdispersive. Conversely, if the ensemble spread overestimates the actual errors, the ensemble is said to be overdispersive. Here, the ensemble standard deviation for all three methods is less than the RMSE at all times shown, indicating underdispersive ensemble forecasts that do not adequately represent the true forecast error.
For a more detailed examination of the reliability of the ensemble state estimates, we examine the distribution of actual versus expected analysis errors. These results (as opposed to the distributions of RMSE) can indicate whether forecasts are biased or poorly distributed. If the errors are assumed to be Gaussian, then the error distribution, scaled by the ensemble standard deviation, should follow a unit normal with zero mean. Conversely, if a normal distribution does not represent the errors well, or the ensemble standard deviations do not reflect the true analysis error statistics, the scaled distribution will diverge from the reference unit normal. This comparison is shown in Figure 6.
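Both reliability diagnostics can be sketched on a synthetic ensemble. The sizes and noise levels below are illustrative assumptions, not the paper's configuration: the truth is drawn with a larger spread than the ensemble members, producing underdispersive behavior by construction.

```python
import numpy as np

rng = np.random.default_rng(1)
n_time, n_ens, n_var = 500, 20, 40

# Synthetic ensemble analyses: members scatter around a common center with
# unit spread, while the truth is drawn with a wider spread (1.2), making
# the ensemble underdispersive by construction.
center = rng.normal(0.0, 1.0, (n_time, n_var))
ensemble = center[:, None, :] + rng.normal(0.0, 1.0, (n_time, n_ens, n_var))
truth = center + rng.normal(0.0, 1.2, (n_time, n_var))

ens_mean = ensemble.mean(axis=1)
ens_spread = ensemble.std(axis=1, ddof=1)

# Spread-skill check: for a reliable ensemble, the RMSE of the mean should
# match the average spread; here RMSE exceeds it (underdispersive).
rmse = np.sqrt(((ens_mean - truth) ** 2).mean())
avg_spread = ens_spread.mean()

# Scaled errors: should follow a unit normal if the spread reflects the
# true error statistics; a standard deviation above 1 indicates
# underdispersion, as in Figure 6.
scaled = (ens_mean - truth) / ens_spread
print(f"RMSE = {rmse:.2f}, mean spread = {avg_spread:.2f}")
print(f"scaled-error std = {scaled.std():.2f} (1.0 if perfectly dispersive)")
```

The scaled-error standard deviation computed this way is the same quantity reported below as 1.08 for the augmented method and 1.18 for EnKF SparseObs.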
Figure 6a suggests that both EnKF SparseObs and the augmented method produce slightly underdispersive forecasts, with the PDF peak lower and the tails higher than the reference unit normal PDF. These features are also apparent in Figure 6b, which displays the same data as a CDF. For negative errors, the CDF is higher than the reference distribution; for positive errors, it is lower. The augmented method better represents the true statistics of its error than EnKF SparseObs, with a standard deviation of 1.08 compared to 1.18; a perfectly dispersive ensemble would have a standard deviation of 1.
Rank histograms are an alternative way of visualizing the dispersion of ensemble forecasts or analyses (Candille & Talagrand, 2005; Hamill, 2001; Hatfield et al., 2018; Weigel, 2011). For each forecast (or analysis), the percentile of the true value within the ensemble is calculated. When the distribution of these percentile values is plotted, a uniform distribution indicates a well-dispersed forecast: errors of a given size occur as frequently as expected if the ensemble spread represents the true error statistics. A U-shape is underdispersive, with errors outside the range of the ensemble overrepresented. A tilt indicates a biased ensemble forecast, with positive errors more or less likely to occur than negative errors. Figure 7 includes rank histograms for both methods. It is visually apparent here that both methods produce underdispersive forecasts: frequencies at small and large percentiles are clearly larger than frequencies at or around the 50th percentile.
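Computing a rank histogram is straightforward. The sketch below uses synthetic scalar forecasts (sizes and spreads are illustrative assumptions) and contrasts a reliable ensemble, in which the truth is statistically exchangeable with the members, against an underdispersive one.

```python
import numpy as np

rng = np.random.default_rng(2)
n_time, n_ens = 2000, 20

center = rng.normal(0.0, 1.0, n_time)
members = center[:, None] + rng.normal(0.0, 1.0, (n_time, n_ens))
# Reliable: truth drawn like a member. Underdispersive: truth spread is wider.
truth_reliable = center + rng.normal(0.0, 1.0, n_time)
truth_underdisp = center + rng.normal(0.0, 1.5, n_time)

def rank_histogram(truth, members):
    # Rank of the truth within each ensemble (0..n_ens), then bin counts.
    ranks = (members < truth[:, None]).sum(axis=1)
    return np.bincount(ranks, minlength=members.shape[1] + 1)

h_reliable = rank_histogram(truth_reliable, members)
h_underdisp = rank_histogram(truth_underdisp, members)

# The underdispersive case piles counts into the extreme bins (U-shape).
print("extreme-bin counts:", h_reliable[0] + h_reliable[-1],
      "vs", h_underdisp[0] + h_underdisp[-1])
```

The reliable ensemble yields a roughly flat histogram over the n_ens + 1 bins, while the underdispersive one shows the U-shape described above.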
3.3 Explainable AI: SHAP Values
We now return to the behavior of the CNN in producing state estimates from forecasts and innovations. In an operational setting, allowing black-box operators to produce new initial conditions is not tenable. There must be some confidence that the system will not generate unrealistic results when presented with out-of-sample data, and some understanding of how it produces its state estimates. Here we use SHAP values to estimate the impact of input variables on the outputs generated by the CNN (Figure 8).
Figures 8a–8f show the mean absolute SHAP values of each input variable (vertical axis) on each output variable (horizontal axis). The first column shows the impact of each of the three input variables on the analysis, and these plots show that the largest contributors to the state estimate of a variable are the forecast and the innovation of that variable. This is an excellent first sanity check on the CNN. In estimating a state variable, it weights the forecast and observation of that variable more heavily than forecasts or observations of nearby variables.
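The aggregation underlying panels like Figures 8a–8f reduces, for each input-output pair, the per-sample attributions to a mean absolute SHAP value. A minimal sketch over a hypothetical attribution array (the shape and values are placeholders, not output of our CNN):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical SHAP attributions with shape (n_samples, n_inputs, n_outputs).
# Real values would come from an explainer run on the trained CNN.
shap_values = rng.normal(size=(200, 40, 40))

# Mean absolute SHAP value of each input variable on each output variable:
# the quantity mapped to the axes of the Figure 8 panels.
mean_abs_shap = np.abs(shap_values).mean(axis=0)
print(mean_abs_shap.shape)
```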
Figures 8g and 8h are identical to the plots from Figures 8a and 8f but with the first input variable not shown. The long-term spatial correlation structure of the Lorenz-96 system is important to note at this point. Since the system is symmetric, without loss of generality we can consider the dynamics and correlation between locations only in terms of absolute spatial lag. The dynamics of a state variable depend nonlinearly on the state variables at spatial lags of 1 and 2, as described in Equation 1.
This dependence defines the temporal derivative. In non-differential terms, considering the variable values rather than rates of change, the correlation extends further than two grid points. At spatial lags of 1–4, the long-term absolute correlation coefficients are 0.05, 0.33, 0.11, and 0.03: variables 2 and 3 grid points removed from one another are more highly correlated than immediately adjacent variables one grid point removed (Lorenz, 2005). It is also important to note that the convolutional layers will tend to create a binomial distribution here: larger lags have fewer paths of influence. We should therefore expect the SHAP results at increasing spatial lags to combine a binomial distribution with the Lorenz-96 correlations. Figure 8 demonstrates monotonically decreasing SHAP values with increasing spatial lag, consistent with the binomial influence of the CNN, while Figure 8d shows a peak SHAP value at a spatial lag of 2, consistent with the correlation structure of Lorenz-96.
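The lagged correlation structure can be reproduced directly from the model. The sketch below integrates the single-level Lorenz-96 system with fourth-order Runge-Kutta (F = 8 and 40 variables are the common configuration, assumed here rather than taken from our experimental settings) and estimates long-term correlations by spatial lag:

```python
import numpy as np

F = 8.0  # standard forcing; an assumed configuration for this sketch

def tendency(x):
    # Equation 1: dX_i/dt = (X_{i+1} - X_{i-2}) X_{i-1} - X_i + F, cyclic.
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

def step_rk4(x, dt=0.01):
    # Fourth-order Runge-Kutta time step.
    k1 = tendency(x)
    k2 = tendency(x + 0.5 * dt * k1)
    k3 = tendency(x + 0.5 * dt * k2)
    k4 = tendency(x + dt * k3)
    return x + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0

rng = np.random.default_rng(3)
x = F + 0.01 * rng.normal(size=40)   # perturbed rest state
for _ in range(2000):                # spin up past transients (20 time units)
    x = step_rk4(x)

traj = np.empty((8000, 40))
for t in range(8000):                # sample every 0.05 time units
    for _ in range(5):
        x = step_rk4(x)
    traj[t] = x

# Long-term correlation between variables separated by spatial lag k.
anom = traj - traj.mean(axis=0)
var = (anom * anom).mean()
lag_corr = [(anom * np.roll(anom, -k, axis=1)).mean() / var for k in range(5)]
print([f"{c:+.2f}" for c in lag_corr])
```

The lag-2 correlation magnitude exceeds the lag-1 magnitude, matching the structure the CNN appears to have learned in Figure 8d.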
4 Discussion
The augmented method's results are encouraging, showing that it outperforms a traditional approach using only the EnKF. However, considering the training process of the CNN makes some limitations apparent. First, the network is trained using only the ensemble mean and standard deviation, rather than the entire ensemble, as input. As a result, it can learn only the diagonal elements of the covariance structure, and only to the extent that they depend on location in the state space. Other factors, most obviously the time since the last observation and analysis, will affect the forecast covariance; the trained CNN cannot account for such factors.
Even within the confines of the experiment we have set up, the training data's limitations impact performance. When employed online in the augmented method, the CNN receives as input a forecast initialized by assimilating only 25% of the variables. In comparison, all variables are observed in the EnKF configuration used to generate the training data. In the online setting, therefore, the initial condition error is larger and the forecast precision lower than in the training set. In the experimental setup, the augmented method has an RMSE of 0.56, more than double the RMSE obtained when the CNN is simply applied to forecasts from the validation time period generated offline. This partly reflects that the augmented method assimilates less data: on a time-averaged basis, it observes 62.5% of the observations assimilated in the training set. That does not fully explain the greater than 2-fold increase in RMSE, however; the remaining decrease in accuracy is attributable to the smaller forecast errors in the training data compared to forecast errors in the online setting.
While the reduced performance of the CNN in an online setting, as opposed to on input data generated offline, is to some extent unavoidable, future improvement may come from allowing the CNN to better approximate forecast accuracy. Providing additional input to the network, such as other portions of the forecast covariance or the entire forecast ensemble itself, may improve performance. Our results provide no indication of whether a neural network would be able to learn effectively from such input data, or how complex the network would have to be, but it is a potential avenue for further exploration.
The results from the SHAP analysis provide additional insight into possible extensions of the approach. Localization is widely used to improve the performance of many assimilation systems, and the SHAP values demonstrate that the trained CNN has applied localization to the forecast. The CNN has also learned the long-term correlation structure (teleconnections) of the system, applying a localization structure to the innovations consistent with that of the Lorenz-96 system. Both findings suggest that, in future extensions, convolutional layers may be able to generate spatial estimates that blend forecasts and observations in a way that is both reliable and skillfully reflects the underlying system behavior and dynamics.
5 Conclusion
This study demonstrated a proof-of-concept augmented assimilation methodology in which ML was used to directly assimilate high-resolution observations to improve the performance of an assimilation scheme. Significant quantities of observational data, particularly from remote-sensing platforms, go unused in operational forecast models due to the computational cost and time required to incorporate them into the model. Training an ML model offline to assimilate these data could therefore have a significant impact: improved state initialization has real and notable effects on forecast quality, and the ability to use the vast amounts of newly available observational data products to that end is of clear benefit.
To demonstrate the potential feasibility of such an approach, we trained a small CNN to replicate the results generated by the EnKF on synthetic observations. Using the EnKF on low-resolution observations and the trained CNN on the high-resolution observations outperformed an EnKF assimilating only spatially and temporally thinned data. More specifically, in an experimental setting using the Lorenz-96 model, the analyses generated by the augmented method have a mean RMSE 37% lower than those from the EnKF using only low-resolution observations. Forecasts using analyses generated by the augmented method as initial conditions produce lower RMSE up to a forecast lead time of 10 days.
Additionally, using an explainable AI method, we demonstrate that the trained CNN effectively applies localization and learns the underlying system's correlation structure (teleconnections) through training: distant observations do not impact its estimates. The natural tendency of convolutional layers to exploit local spatial correlations in this way is encouraging for potential extensions to more realistic applications. It also builds confidence that such a method would be reliable and would generate physically realistic results when presented with new data.
Further studies are needed to demonstrate the ability of this approach to work in more complex systems and at scale. Testing with a quasi-geostrophic model and more realistic observational data would be a logical next step. The demonstrated feasibility of the general approach in this proof-of-concept study will hopefully encourage additional efforts to use ML approaches to address the large quantity of data that currently goes unused in operational forecast settings.
Acknowledgments
Support for this work was provided by KAUST CRG Grant ORA2021CRG104649.2. This study was also supported by the U.S. Army Corps of Engineers (USACE) as part of Forecast Informed Reservoir Operations (FIRO) under Grant W912HZ1520019 and by the California Department of Water Resources AR Program (Grant: 4600013361). This project was additionally funded by the Climate Change AI Innovation Grants program, hosted by Climate Change AI with the support of the Quadrature Climate Foundation, Schmidt Futures, and the Canada Hub of Future Earth. This work utilized the Alpine high performance computing resource at the University of Colorado Boulder. Alpine is jointly funded by the University of Colorado Boulder, the University of Colorado Anschutz, Colorado State University, and the National Science Foundation (award 2201538). The authors also wish to sincerely thank the anonymous reviewers who provided valuable feedback and suggestions, all of which were incisive and constructive and significantly improved the quality of both the experiments and the final manuscript.
Open Research
Data Availability Statement
Code for generating the data used in this study, raw data files, and code for generating the plots in this paper are publicly available (Howard, 2023).