Applying Machine Learning to Improve Simulations of a Chaotic Dynamical System Using Empirical Error Correction

Abstract Dynamical weather and climate prediction models underpin many studies of the Earth system and hold the promise of being able to make robust projections of future climate change based on physical laws. However, simulations from these models still show many differences compared with observations. Machine learning has been applied to solve certain prediction problems with great success, and recently, it has been proposed that this could replace the role of physically‐derived dynamical weather and climate models to give better quality simulations. Here, instead, a framework using machine learning together with physically‐derived models is tested, in which it is learnt how to correct the errors of the latter from time step to time step. This maintains the physical understanding built into the models, while allowing performance improvements, and also requires much simpler algorithms and less training data. This is tested in the context of simulating the chaotic Lorenz '96 system, and it is shown that the approach yields models that are stable and that give both improved skill in initialized predictions and better long‐term climate statistics. Improvements in long‐term statistics are smaller than for single time step tendencies, however, indicating that it would be valuable to develop methods that target improvements on longer time scales. Future strategies for the development of this approach and possible applications to making progress on important scientific problems are discussed.


Introduction
Numerical weather prediction and climate models attempt to predict and simulate components of the Earth system, including the atmosphere and perhaps also the oceans, land surface, and biosphere. While the fundamental physical equations governing the system are known, they cannot be solved accurately with available computational resources. Instead, approximations are made in the models' equations, and this gives rise to errors in their output. Methods to reduce these errors are highly valuable for giving better warning of major meteorological and climatic events.
Recently, great advances in machine learning have taken place, for example, in the domains of image recognition and game playing (e.g., He et al., 2016;Silver et al., 2017). The algorithms developed have been found to excel at certain problems that involve predicting an unknown value given values of predictor variables (e.g., predicting what objects a photograph contains given its pixel values)-this is similar to the problem of predicting future behavior of the Earth system given knowledge of its past and present state, and so there has been high interest in applying machine learning to improve such predictions. This has included predicting future weather events directly from observations and postprocessing dynamical models' output (e.g., Krasnopolsky & Lin, 2012;McGovern et al., 2017;Rasp & Lerch, 2018;Scher & Messori, 2018).
Another emerging application is applying machine learning to improve components of the dynamical Earth system models themselves, particularly the parameterizations of unresolved small-scale processes such as radiative interactions and cloud processes. This could allow larger improvements in prediction skill than is achieved by postprocessing models' output, by better representing the physical interactions between variables.
Previous work has primarily used artificial neural networks (ANNs), which are functions constructed from "neurons." Neurons are simply functions that linearly combine their inputs and then apply a given (generally nonlinear) transformation to produce the output value. ANNs pass input data into neurons, whose output may then be used as inputs to more neurons, and so on until a final output value (or vector of values) is produced. ANNs can relatively efficiently encode complex functional relationships. Indeed, an ANN One promising approach has been to use algorithms such as ANNs to reproduce the behavior of atmospheric parameterization schemes at a reduced computational cost. For example, Chevallier et al. (1998Chevallier et al. ( , 2000 found that ANNs could cheaply reproduce the behavior of the radiative transfer scheme in the European Centre for Medium-Range Weather Forecasts model, although Morcrette et al. (2008) note that this approach did not work sufficiently well after the model's vertical resolution was increased. Krasnopolsky et al. (2005Krasnopolsky et al. ( , 2010 also found that an ANN could be used to cheaply reproduce the output of the radiation scheme in the National Center for Atmospheric Research Community Atmosphere Model. More recent work has focused on replacing atmospheric models' convection schemes with ANNs, in order not just to reduce the cost of presently used schemes but to allow more expensive, higher-quality schemes to be used, such as superparameterization Rasp et al., 2018) or emulations of convection-resolving models (Brenowitz & Bretherton, 2018). O'Gorman and Dwyer (2018) also showed that a model's convection scheme could be replaced by a random forest algorithm, and the model could run stably and reasonably reproduce precipitation extremes.
Machine learning also holds promise of being able to reduce errors in dynamical models' predictions. Schneider et al. (2017) describe how better values of parameters of models could be learnt by algorithms being fed data from observations and high-resolution models. Dueben and Bauer (2018) examine whether ANNs could be trained to simulate atmospheric dynamics and provide prediction skill exceeding that of existing models, using a time stepping scheme where ANNs predict the tendency of the system in a similar approach to that used in existing dynamical models, and they conclude that it is possible. However, their simulations became unstable after about 2 weeks. Additionally, Bolton and Zanna (2019) showed that a convolutional ANN could skillfully predict small-scale momentum forcing in a quasi-geostrophic ocean model given spatially smoothed output from a simulation, although this was not implemented into a freely evolving model. Significant advances have also been made in learning symbolic equations and invariant quantities from data (Rudy et al., 2017;Schmidt & Lipson, 2009;Wu & Tegmark, 2018;Zhang & Lin, 2018), which could be applied in Earth system modeling (Gaitan et al., 2016).
The above-mentioned property of ANNs that they can represent any continuous function means that, in principle, given sufficient data of high enough quality to learn from and adequate computational resources for training, an ANN's representation of the equations of motion of the Earth system could reach the maximum skill possible for given inputs. So it seems that ANNs could potentially learn to reduce systematic model errors without the need for human ingenuity, greatly speeding-up model development and helping us to address challenges like predicting extreme weather and the impacts of climate change.
However, the use of ANNs as discussed in Schneider et al. (2017) and Dueben and Bauer (2018) has been presented as being in competition with improving the conventional physically derived aspects of Earth system models. Schneider et al. (2017) argue that improving physically derived parameterization schemes is preferable to using ANNs because they will obey conservation laws and symmetries. Dueben and Bauer (2018) ask whether models based entirely on ANNs can compete with physically derived models.
One purpose of the work presented here is to explore whether it is actually possible to use such algorithms to complement physically-derived model components, thereby preserving the benefits of using the latter, such as having better physical interpretability of the model behavior and better trust that the model will perform reasonably well in an unseen physical situation. Karpatne et al. (2017a) and Reichstein et al. (2019) provide overviews of the ways in which statistical algorithms can be combined with physical modeling and the associated challenges. The specific proposal tested here is to use algorithms to perform empirical error correction in dynamical models. Rather than predict the whole tendency of a system, as in the models considered by Dueben and Bauer (2018), the algorithms would predict the difference between the measured and the observed tendencies. Then the total tendency would be (x) + (x), where x is the system state at, and potentially before, the start of the time step, (x) is the tendency predicted by the physically-derived model, and (x) is the correction output by the algorithm. If (x) is close to the optimum tendency, (x) should be small, and so concerns about (x) not obeying conservation laws and symmetries are consequently less important than in the case where whole model components are replaced by algorithms (note, though, 10.1029/2018MS001597 that it may also be possible to constrain (x) to obey these physical principles more strictly; e.g., Jia et al., 2018). (x) should only have a large effect on the simulations when the prediction by the physically-derived model is poor, in which case the value of improving the total simulated tendency is larger compared to concerns about whether physical principles are strictly abided by. The physically-derived model maintains a key role, and it is desirable to continue improving it to strengthen the link between the simulation results and our physical understanding. This study focuses on the use of ANNs as model error correctors, but other algorithms could also be applied in a similar way.
A further advantage of using algorithms to correct models' errors rather than replace physically-derived models entirely is that it greatly simplifies the process of incorporating ANNs into dynamical models. Dueben and Bauer (2018) detail the numerous challenges in replacing physically derived models with ANNs (or other algorithms), such as obtaining the required data for training a full-complexity model and learning to use algorithms with the required complexity. A lot of development effort would be required before a model with better performance than current models would be produced. By contrast, development of error-correcting algorithms can start just by improving a small number of outputs as much as possible given a small number of inputs, which is achievable with a smaller research program, and progress can build from there. A disadvantage of this approach is that the computational cost of the models cannot easily be reduced this way if the resolution and parameterization schemes are kept the same-the focus is on improving the simulation quality. However, it may turn out to be more cost effective than using more expensive parameterization schemes or increasing the model resolution, and it could reduce costs if it allows the same or greater skill to be obtained using cheaper physically-derived parameterizations. It would also be very informative about the problems that would need to be overcome to get dynamical models based entirely on algorithms like ANNs to perform well.
An empirical error-correcting approach using an ANN was applied by Forssell and Lindskog (1997) to predict the water level in a tank. Previous environmental applications include the prediction of groundwater flow by Xu and Valocchi (2015) and the prediction of lake temperatures by Karpatne et al. (2017b) and Jia et al. (2018). These systems exhibit variability that is strongly influenced by external drivers, while Earth's atmosphere and oceans have a large component of unforced variability due to chaotic dynamics. A demonstration that using an algorithm to correct model errors could improve short-range forecasts of simple chaotic systems was given by Pathak et al. (2018), who used reservoir computing, but it is unclear if this would also produce stable simulations with improved long-term statistics. Cooper and Zanna (2015) applied this approach to improve a chaotic shallow water model but did so using a linear system of equations, which seems unlikely to be able to represent errors in complex dynamical systems as well as a more flexible algorithm such as an ANN in general-for example, atmospheric convection is a highly nonlinear process (Arakawa, 2004), and errors in its representation seem unlikely to be captured well by linear equations. Their work also focused on improving long-term statistics of the simulation, and it is not clear whether there was a simultaneous improvement in short-range forecast skill that would indicate that the dynamics were being better represented.
It is a key problem in Earth system prediction to improve our models in such a way that they are stable and give both improved skill in initialized predictions and better long-term climate statistics. In the remainder of this paper, it is tested whether this can be achieved by using an error-correcting ANN to better simulate a chaotic system, namely, the Lorenz '96 dynamical system (Lorenz, 1996; sometimes also referred to as the Lorenz '95 system; e.g., by Dueben & Bauer, 2018). This system, or variants of it, has been used in many previous studies to test concepts for how to improve dynamical Earth system models (e.g., Arnold et al., 2013;Dueben & Bauer, 2018;Schneider et al., 2017;Wilks, 2005). The results are informative about the potential for machine learning approaches to improve skill at simulating dynamical systems such as the Earth's climate, albeit in a much simpler setting.

The Lorenz '96 Equations and Coarse-Resolution Models
The Lorenz '96 dynamical equations describe the evolution of variables arranged in a ring, intended to be analogous to a latitude circle. The variables are divided into two types: slowly varying X k and quickly varying Y j,k , defined for k = 1, … , K, and j = 0, … , J + 2. Lorenz (1996) suggested that the Y j,k be considered analogous to a convective-scale quantity in the real atmosphere and X k analogous to an environmental vari-able that favors convective activity. Here one of the systems used by Arnold et al. (2013) is simulated as the "Truth" system-there is no particular strong reason to choose this system over other variants, but its prior use in the parameterization development work of Arnold et al. (2013) makes it seem like a good choice for exploring how the design of models can be further advanced. This has K = 8 and J = 32, in which with cyclic boundary conditions X k = X k + K and Y j,k = Y j,k + K and parameter values h = 1, F = 20, b = 10, and c = 4. The Y variables are connected in a ring, such that Y 0,k = Y J,k − 1 , Y J + 1,k = Y 1,k + 1 , and Y J + 2,k = Y 2,k + 1 , so there are J unique Y j,k variables associated with each X k variable. The time units are arbitrary and denoted as model time units (MTUs). These equations were integrated in time with a time step of 0.001 MTU using a fourth-order Runge-Kutta time stepping scheme.
A "training" simulation of this system of length 3,000 MTUs (not including 10 MTUs discarded as "spin up" at the beginning) was produced to provide a sample of "true" statistics to use in constructing coarse-resolution models below. The simulation was then extended for 10 MTUs so that memory of the training data set was effectively lost, and then a further 3,000 MTUs of data were generated to use as a validation data set, which is used only to evaluate and not to develop the coarse-resolution models below.
(Note that evaluation against a separate "test" data set is not done because the aim is not to select a single best-performing ANN structure and estimate the model's true skill. This means that an ANN that is selected on the basis of having an especially good performance on the validation data set would not be expected to perform as well relative to other ANNs on separate test data, since sampling variability would be expected to have contributed to its diagnosed skill in the validation data. It is shown below that performance improvements on the validation data are obtained for a wide range of ANN structures, and so the conclusion that ANNs can improve prediction skill for this system is robust to sampling variability.)

Coarse-Resolution Model
Suppose that a much computationally cheaper model of equation (1) is desired for making short-term forecasts and simulating the long-run statistics of this system. Following Wilks (2005) and Arnold et al. (2013), inspection of equation (1) suggests that it may be reasonable to forego simulating the Y variables explicitly and parameterize their effect on the X variables with a function U(X), analogous to how the effect of unresolvable physical processes on resolved scales is parameterized in Earth system models. This yields the coarse-resolution system with X k = X k + K . The time step is also increased to 0.005 MTU, so that the system has a coarsened time resolution as well, analogous to how Earth system models cannot resolve real Earth processes that happen on very fast time scales.
The function U(X * k ) is derived using the same method as Arnold et al. (2013), using essentially a coarse-graining approach. It is defined as a cubic function, a n X n , and its parameters are chosen using tendencies of the X variables over intervals of length 0.005 MTU, derived from the run of the truth system sampled every 0.005 MTU. Its parameters were fit to minimize the root-mean-square error (RMSE) of predictions of these tendencies made using equation (2), taking values a 0 = −0.207, a 1 = 0.577, a 2 = −0.00553, and a 3 = −0.000220. This is a statistical procedure, but note that here U(X) is not considered part of the machine learning algorithm used to correct the model errors. Following Wilks (2005) and Arnold et al. (2013), U(X) is thought of as being analogous to parameterizations of unresolved processes in an Earth system model-note that development of physically-derived Earth system model parameterizations can also involve fitting parameters to data (e.g., Hourdin et al., 2016). Including U(X) in equation (2) helps to test whether complex algorithms such as ANNs are able to give skill improve-10.1029/2018MS001597 ments after much of the possible progress has been made with physical reasoning and simpler statistical methods, using the deterministic model of Arnold et al. (2013) as a benchmark. It would also be possible to do a similar study with U(X) = 0, which would probably increase the potential improvement made by using error-correcting algorithms as they learnt to represent the improvements made by including U(X).
Hereafter, the model given by equation (2) is referred to as "No-ANN."

Models With ANNs
To produce coarse-resolution models with error-correcting ANNs, ANNs with a multilayer perceptron architecture (Goodfellow et al., 2016;Nielsen, 2015) were trained to predict the difference between the true system tendency and that predicted by the coarse-resolution model for one X variable at a time: The inputs to the ANNs are X variables up to two points away from the location where the prediction is being made, so that five X values in total are used as input. This is one more than used by the No-ANN model and takes advantage of the ability of ANNs to use inputs that are difficult to know how to include by physical reasoning-this may be helpful in Earth system modeling to account for subgrid phenomena that propagate between grid boxes but are difficult to include in parameterization schemes, such as horizontally propagating gravity waves (Alexander et al., 2010). Results for ANNs using the same inputs as the No-ANN model are discussed at the end of section 2.2 to quantify the impact made by using X k + 2 as an additional input-it does not qualitatively affect the findings. Not all of the X variables were used as input in order that the ANNs are only using information from nearby grid points. This is desirable in Earth system models so they can be run much more quickly in parallel computing environments (Dueben & Bauer, 2018). Also, since on Earth phenomena at one location could not be meaningfully influenced by phenomena on the other side of the world within one model time step, this structure constrains the ANNs to be more faithful to the true equations, so they are more likely to work well in novel situations. The results presented here are not expected to depend qualitatively on the number of X variables used, and they were found to not be sensitive to using X variables up to three points away instead.
The X variables are transformed by subtracting their mean and dividing by their standard deviation before being used as ANN inputs, since this tends to speed up training of ANNs (LeCun et al., 2012). The values of the mean and standard deviation are derived during training and are not changed when the model is tested on the validation data.
It is also interesting to compare using ANNs in this way against using ANNs that are trained to replace dynamical models or components of them, as done by Dueben and Bauer (2018). Therefore, multilayer perceptron ANNs were also trained to predict the full tendency of the Truth system dX k ∕dt, given the same inputs as the ANN error correctors. Again, these predict the tendency for one X variable at a time. In an Earth system model, individual parameterization schemes could also be replaced by ANNs. However, the Lorenz '96 system lacks the complexity to do an interesting experiment where something closely analogous to replacing a parameterization scheme is carried out, given the simplicity of U(X).
ANNs with different arrangements of neurons were tested, with one or more hidden layers (the "depth") and with an equal number of neurons in each hidden layer (the "width"). As a shorthand notation, an ANN with depth D and width W will be referred to as "dDwW" (e.g., d2w32 refers to an ANN with depth-2 and width-32). The ANNs all use a linear output activation function and rectified linear unit activation functions on the hidden layers, which were found to work more robustly than hyperbolic tangent functions for the case of training ANNs to predict the full system tendencies. For this case, the outputs of ANNs with hyperbolic tangent activation functions were found to be prone to saturating, so that the largest tendencies could not be simulated. This may have happened because the magnitude of the output is limited to be the sum of the magnitudes of the weights connecting the final hidden layer to the output, which were not made large enough in training. This suggests that for predicting values that can take any size, using activation functions whose output values are not typically limited in magnitude is more likely to give good results.

Training ANNs
Results are presented for models using ANNs trained on 1,000 MTUs of truth model data, using the tendency over every 0.005 MTU interval, in order to test their potential skill when data availability is not a limitation. Using all 3,000 MTUs of the training, data were not found to increase the skill of ANNs substantially when tested on a few chosen ANN structures. It is also shown in section 2.2.1 that the performance of the ANNs is similar at predicting tendencies in the training and validation truth data sets, indicating that the ANNs are not substantially overfitting the training data, so increasing the amount of training data would not be expected to improve the ANNs' performances much.
ANNs are trained to minimize the sum of the squared prediction error and an L 2 regularisation term for the weights with coefficient 10 −4 . This was done using stochastic gradient descent with the Adam algorithm (Kingma & Ba, 2014). Minibatches of size 200 sets of input and output were used together with a learning rate of 0.001. Training stopped when the squared prediction error failed to decrease by at least 10 −4 twice consecutively after iterating over the whole training data set.

Results
Diagnostics comparing simulations from the Truth, No-ANN model, and models using ANNs are shown below. All results are very robust to sampling variability, as determined by checking that they are very similar when only half of the data is used, except for the difference in the mean biases (section 2.2.2) for which the use of ANNs was not found to give a statistically significant difference in most cases. Figure 1 shows the RMSE of tendency predictions over a single coarse time step (0.005 MTUs) for coarse-resolution models with error-correcting ANNs . Results are shown for models with ANNs with depths up to 3 and widths that are integer powers of 2 between 2 and 64 (width-1 ANNs did not generally perform well, as would be expected). The RMSE is calculated for 10,000 randomly chosen time steps in each of the training and validation data sets, using the same time steps for each ANN. The error is reduced compared to that for the No-ANN model for every ANN structure, showing that even very simple ANNs (e.g., with two neurons in a single layer) can improve the skill of predicting the tendencies. The errors do generally decrease as the ANN width and depth each increase, indicating that the optimal function relating the coarse-resolution model's tendency errors to the X variables may be quite complex. The maximum error reduction on the validation data set is 42% for the largest (d3w64) ANN, showing that ANNs can greatly reduce the error. The RMSEs on the validation data set are not more than 3% above those on the training data set, indicating that no substantial overfitting is occurring. The errors decrease as ANN size increases, and it seems likely that futher error reductions are possible. The aim here is to explore how it might be made easier to get some improvement using ANNs, rather than to find the optimum performance, so results for larger ANNs and varied training hyperparameters are not shown.

One Time Step Tendency Forecast Errors
For comparison, Figure 2 shows RMSEs of tendency predictions of coarse-resolution models using ANNs to predict the full tendency. Errors are generally higher than for the models using error-correcting ANNs for a given width and depth and are only better than the No-ANN model once the ANNs become sufficiently large (with width at least 32 or width at least 16 with a depth of 2 or more). This illustrates how more complex ANNs are generally required to replace the model components rather than just to correct their errors, making it harder to achieve better performance using this approach. More parameters are also needed, increasing the risk of overfitting the data. Note that it could still be the case that the optimum attainable performance, using ANNs larger than those tested here, is better than in models with error-correcting ANNs. Root-mean-square errors of tendency predictions, as in Figure 1, but for artificial neural networks (ANNs) predicting the full X tendencies (section 2.1.2). Note that the color scale is saturated for the smallest ANNs. More complex ANNs are required to outperform the No-ANN model than for models using error-correcting ANNs.
Improvements upon the No-ANN model can also be obtained with much smaller amounts of training data using the error-correcting approach. Models with ANN correctors trained on just 2 MTUs of training data predicted tendencies for the validation data set with RMSEs that were robustly less than those predicted from the No-ANN model in most cases (not shown). Pure ANN models require at least about 20 MTUs to achieve this. For comparison with the typical time scales of the true system, the autocorrelation of the X variables falls to about 0.05 after a lag of 0.4 MTUs.
In order to determine if there are any situations in which the errors of tendencies predicted by the models using error-correcting ANNs are large, Figure 3 shows scatter plots of predicted tendencies and their errors versus the true tendencies. The sets of tendencies shown are from the No-ANN model and two example coarse-resolution models with error-correcting ANNs (with d1w16 and d2w32 structures, the latter performing particularly well at improving short-term forecast and climate skill scores [section 2.2.2]). One hundred thousand scatter points are shown for tendencies predicted in each of the training and validation data sets.  The models with error-correcting ANNs predict tendencies that are close to the true tendencies in both the training and validation data sets, including for extreme positive and negative tendencies. The predicted tendencies are generally closer to the true tendencies than those made by the No-ANN model throughout the whole range of true tendency values, including for extreme cases, although the No-ANN model also does not make any particularly large errors. This is evidence that the ANNs have learnt how to actually improve the representation of the dynamics, so that they can improve most predictions and not degrade predictions of extreme values in the validation data set even when there are few examples of the latter in the training data. This is generally the case for all of the different ANN structures, even for the smallest ANN that was tested (d1w2; not shown). It is important to show that ANNs do not simply fit the training data and perform poorly at extrapolating to make predictions for rare, extreme situations, since it is essential in Earth system modeling applications that models' performance does not severely degrade in these cases.

Forecast and Climate Simulation Skill
Metrics of forecast skill and the quality of the simulated climate of the X variables are shown in Figure 4 for coarse-resolution models with error-correcting ANNs of different depths and widths, evaluated using the validation data set only (this is the case for all model quality metrics shown from now on).
Forecast diagnostics were computed from 10-member ensembles of simulations initialized from each of 3,000 states of the X variables sampled from the Truth validation run, each separated by 1 MTU, giving effectively independent initial conditions. To form the initial conditions for each ensemble member for each Truth initial condition, random perturbations were sampled for each X variable independently (noting that correlations between X variables in the Truth system are small). Firstly, a sample ( ) was taken from a Gaussian distribution with a mean of 0 and a standard deviation of 0.05. Then 10 samples were taken from a Gaussian distribution with a mean and a standard deviation of 0.05 and added to the Truth state. This ensured that the population standard deviation of the initial conditions equalled the standard deviation of the differences between their means and the Truth states, as would be expected if the perturbations came from a well-calibrated error distribution in the estimate of the initial state in a forecasting system.
The forecast anomaly correlation coefficient (ACC) and RMSE at lead time 1 MTU are better than in the No-ANN model for all models with ANNs except those with width-2 and depth-2 or 3 (Figure 4, top; squares are shaded red where the metric is better than that for the No-ANN model). (Forecasts at a lead time of 1 MTU are roughly analogous to a "medium-range" forecast of the Earth's atmosphere, given the autocorrelation time scale of the system.) Therefore, in most cases the improvement in representing the single time step tendencies (section 2.2.1) has brought about an improvement of longer range forecast skill relative to the No-ANN model. The improvement seems to be quite modest, however, raising the ACC from about 0.46 to 0.49 and decreasing the RMSE from 5.89 to 5.73 at best. However, note that the ACC for the Truth model initialized with the same initial conditions perturbations is only 0.52 and its RMSE is 5.59. This is the maximum potential skill. Therefore, the best improvements in the ACC and RMSE are slightly over 60% of the difference between the maximum possible skill and that of the No-ANN model. For the median case, they are 49% and 48% of the difference, respectively. This suggests that in a case where forecast skill were much lower than the maximum possible skill than it is here, the absolute skill improvements gained by using ANNs could be much more substantial. (Indeed, ANNs that are trained to simulate the full tendency dX k ∕dt can have skill similar to the models with error-correcting ANNs tested here [not shown]. This suggests than ANNs could learn to correct the errors of a No-ANN model that were degraded to have a much lower skill level, so that the gap between the No-ANN model and the models with ANNs were much larger, though this is not tested here.) The biases of the time mean of the X variables diagnosed from 3,000 MTU climate runs are shown in the bottom left panel of Figure 4. The diagnosed biases are mostly similar to those of the No-ANN model, except those for the models with d1w2 and d2w2 ANNs, which have much larger biases. This is the one diagnostic for which sampling variability is substantial. The biases are not statistically significantly different from that of the No-ANN model at the 95% level, except in the cases of the models with d1w2 and d2w2 ANNs. Therefore, it is difficult to be confident about how many of the models with error-correcting ANNs have smaller mean biases without using much longer climate runs, but it seems clear that the changes in the bias are quite small overall. (The statistical significance was calculated according to a Monte Carlo permutation test; Efron & Tibshirani, 1994. Each time series was divided into blocks of length 100 MTU, which is much larger than the autocorrelation time scale of the data. For each model with an ANN, surrogate time series of length 3,000 MTU were created by selecting blocks randomly without replacement from the simulation by this model and the simulation by the No-ANN model. The probability of the absolute difference between the means of two of these time series being smaller than that between the actual time series was calculated to quantify the statistical significance.) In order to evaluate improvements in the shape as well as the mean of the simulated climatological distribution of X values, the two-sample Kolmogorov-Smirnov (KS) statistic was calculated between the simulated distribution and the distribution in the truth model validation run. This is simply the maximum difference between the cumulative density functions of the two distributions as a function of X. The KS statistic is improved in all but the d2w2 case, by up to ∼15% (Figure 4, bottom right). Part of the reason that this happens even though the bias in the mean of the distribution is not always improved is that the variance of the X values is increased relative to that in the No-ANN model (not shown), bringing the X distribution closer to that of the truth model by this measure, except in the d2w2 case.
Formal statistical significance tests were not carried out for diagnostics other than the mean bias because it seems very unlikely to get the result that models with all but the smallest ANNs seem to have improved diagnostics (Figure 4) if it were not the case that ANNs were truly producing improvements in most cases. Detailed consideration of the sampling uncertainty would be required to assess the relative skill for different ANN structures, but it is not the aim here to do this. Altogether this indicates that the use of error-correcting ANNs in this system is able to robustly give improvements in forecast skill and the shape of the climate distribution relative to that of the No-ANN model. However, comparing Figure 4 with Figure 1 shows that improving the error of the predicted tendency does not guarantee that the quality of longer simulations will also improve. The improvements in the climate diagnostics are also smaller than might be anticipated, given the large reduction in the tendency errors that was shown in section 2.2.1. Figure 5 shows the forecast ACC and RMSE as a function of lead time for the No-ANN model and the models with the d1w16 and d2w32 error-correcting ANNs, which are the same models that were used for Figure 3. The forecast skill is very similar for the different models up to a lead time of about 0.75 MTUs, after which the models with error-correcting ANNs begin to have higher skill than the No-ANN model. After a lead time of ∼1 MTU, their skill is approximately half way between that of the No-ANN model and the Truth model using the same initial condition perturbations. The maximum skill differences between the models with the error-correcting ANNs and the No-ANN model are about 0.04 in the ACC and 0.2 in the RMSE.
To understand better how the improvements in climate statistics shown in Figure 4 are manifested in the frequency distribution of the X variables, Figure 6 shows their distribution in the Truth validation data set, in the No-ANN model and in the previously discussed models with the error-correcting ANNs. The simulations produced by the latter have smaller frequencies near the center of the distribution, so that the bias here is smaller, with the frequencies at moderate negative values between about −7.5 and −5 beneficially increased. All of the coarse-resolution models have too low frequencies of large positive and negative X values, however. This may indicate that it is not possible to simulate the correct frequencies of these extremes without explicitly representing the Y variables, though it is also possible that it could be improved by applying better machine learning approaches or including stochasticity in the coarse-resolution models (Arnold et al., 2013).
On top of considering statistical summary measures of simulation skill, it is also important to verify that the temporal evolution of the system state is realistically simulated in the models with ANNs. Figure 7 shows a time series of the first X variable of length 5 MTUs at the start of the validation data set (Truth; Figure 7, top left), in the No-ANN model (top right) and in the models with the d1w16 and d2w32 error-correcting ANNs (bottom). All time series begin with the initial condition of the validation data at time zero, so that the models capture the features of the initial evolution up to about time 1 MTU, and then the model simulations diverge, likely primarily due to chaotic variability. After this point, the coarse-resolution models produce variability that appears qualitatively similar to that in the Truth system. and models with error-correcting artificial neural networks (ANNs) with depth-1 width-16 (bottom left) and depth-2 and width-32 (bottom right). The models with ANNs capture the initial evolution of the true system, and then beyond the predictability limit they exhibit similar variability to the true system. MTU = model time unit.
To quantify the impact of using X k + 2 as an input to the ANNs, which is not used as an input to the No-ANN model, results for error-correcting ANNs using the same inputs as the No-ANN model were analyzed. These were also found to robustly improve the metrics discussed above (again, with the exception of the mean climate bias). The improvements in short-range prediction errors relative to the No-ANN model were smaller than found above, by a median of 18% for one time step tendency errors across ANNs with the same structures as those considered above and by about 40% for the ACC and RMSE at a lead time of 1 MTU. The improvements in the climate KS statistic tended to be slightly greater, however, by a median of 5%-this is perhaps because optimizing short-range skill does not always optimize long-range skill. Overall, this illustrates that using inputs that are not presently incorporated in all Earth system model components could be a substantial source of skill added by machine learning algorithms (e.g., data in horizontally adjacent grid columns that are not typically used in subgrid parameterization schemes).

Additional Considerations for Earth System Modeling
The approach used here of training ANNs to reduce single time step tendency errors could not be applied exactly analogously to learn to better represent the dynamics of the Earth system because observations at a given location are typically spaced 6 hr or more apart, and state-of-the-art dynamical Earth system models use time steps that are much shorter. Maintaining a short time step is desirable so that the model equations can better approximate the true equations, which are continuous in time. It may also be necessary to ensure numerical stability. Therefore, an approach is required that could update the learning algorithm's parame-ters based on what would improve forecast skill over multiple time steps. Brenowitz and Bretherton (2018) achieve this when emulating convection-resolving simulations in a single column model by optimizing a cost function that takes into account errors in a prediction over multiple time steps-this is in order to make their system stable, but it may also help to improve longer range prediction skill. In a free-running system, the impact of parameter perturbations on output from previous time steps would need to be taken into account, on top of their impact in the time step corresponding to the observation. The "backpropagation through time" algorithm (Werbos, 1990) that is used in recurrent ANNs (Funahashi & Nakamura, 1993) could be used. In a model with multiple grid points, if the Earth system learning algorithm is "local," then the effect of varying the algorithm's parameters on predictions at nearby grid points probably needs to be taken into account as well as the effect on the predictions through multiple time steps-the backpropagation needs to be done "backward through time and sideways through space." This is because tendencies at a given grid point depend on the system state at nearby grid points, and so prediction errors at those points at earlier time steps need to be accounted for. (It seems desirable for the algorithm to take inputs only from local grid points in order to be easier to implement in parallel computing environments, Dueben & Bauer, 2018, and to respect symmetry of the physical equations with respect to spatial translation.) For error-correcting algorithms, this approach requires the tangent linear approximation of the remainder of the model, which is related to the adjoint models that are often used in data assimilation (Errico, 1997) and have been developed for some models of Earth system components (e.g., Janisková & Lopez, 2013;Lea et al., 2015).
The data used for training algorithms also need to be considered. Dueben and Bauer (2018) suggest using reanalysis data. Although reanalysis data is imperfect, it is likely to have smaller climate biases than existing dynamical models, enabling the algorithms to yield performance improvements. A possible next step would be to recalculate the reanalysis using the improved model, combining information from this model and observations to get a yet better estimate of climate statistics. This could then be used to train better algorithms, and so on, yielding further upward steps in performance, as well as an optimal estimate of past weather given our observations.

Application to Problems Beyond Increasing Prediction Skill With a Stationary Climate
Error-correcting algorithms in dynamical models may be useful for addressing problems besides improving simulation skill. For example, if they do a good job at correcting large model errors, then it may be possible to understand from them how model components like conventional parameterization schemes can be improved, making use of advances in interpreting the workings of algorithms like ANNs (e.g., Ribeiro et al., 2016). They could also help to constrain stochastic parameterizations (Palmer, 2001;Watson et al., 2015) by placing an upper bound on the size of the component of the tendencies that is not predictable given the variables on the coarse grid, an irreducible error for a given model resolution, which can be modeled stochastically. Generative-adversarial algorithms (Goodfellow et al., 2014) could also find better ways to model the effects of unresolved flow stochastically (Xie et al., 2018;Gagne II, DJ et al., "Machine Learning for Stochastic Parameterization: Generative Adversarial Networks in the Lorenz '96 Model," in preparation).
The ability to vary the complexity of algorithms like ANNs in a systematic way to create a model ensemble also allows for testing of the seamless prediction paradigm-the idea that models that have better short-range prediction skill also have better long-range skill, which would mean that metrics of weather forecast skill would be informative about models' abilities to simulate the climate response to anthropogenic forcing (Matsueda et al., 2016;Palmer et al., 2008). Alternative methods of creating an ensemble of models such as by perturbing model parameters may generally struggle to give any skill improvements, so it cannot be seen if climate simulation skill improves as short-range prediction skill gets better. In the Lorenz '96 system studied here, correlations between the single-tendency prediction error in the validation data set (Figure 1, right) and the forecast and climate skill diagnostics shown in Figure 4 have magnitudes between 0.63 and 0.70. Correlations between the forecast RMSE at lead time 1 MTU (Figure 4, top right) and the climate mean and KS statistic (Figure 4, bottom panels) are 0.57 and 0.81, respectively. This quantifies the relationship between short-range and long-range skills in this system when using error-correcting ANNs, showing that improvements in predictions at shorter lead times do indeed tend to be associated with improvements in long-range predictions. However, as noted earlier, the correspondence is not perfect, and the improvements made to long-term climate diagnostics by using ANNs are considerably smaller than what might be expected given the improvements made to single time step tendency predictions. Therefore, the seamless prediction paradigm does not apply fully. It would be very interesting to see how well it applies 10.1029/2018MS001597 in Earth system models, given the correspondence that has been identified between biases in short-range forecasts and simulated climate (Ma et al., 2013;Sexton et al., 2019).
Another interesting question is whether using statistical learning algorithms within Earth system models could help to give more accurate simulations of the impacts of anthropogenic climate change. This is challenging because this requires making predictions about conditions that are dissimilar from those we have observed, so that a good representation of the underlying dynamics of the system is necessary. O'Gorman and Dwyer (2018) found that their emulation of a convection parameterization could not reproduce the effect of climate change well when it was trained only in a stationary "control" climate. However, statistical approaches such as optimal fingerprinting are well established in work on detection and attribution of climate change and can be used to estimate the extent to which a given model is overestimating or underestimating the response to a particular forcing (Bindoff et al., 2013). The climate change signal in individual weather events also appears clearer when dynamical variability is controlled for, which has been done previously using weather analogues (e.g., Cattiaux et al., 2010;Yiou et al., 2007). This suggests that there is scope for learning the effects of anthropogenic emissions more precisely within a model that can also accurately take into account all of the other influences on individual weather events. Even if such a model would not be trusted for projecting the impacts of large climatic changes without people being able to understand the calculations behind its predictions, it may still be useful for problems such as the attribution of observed extreme weather events (Allen, 2003; National Academies of Sciences Engineering and Medicine, 2016), for which extrapolation beyond observed conditions is not so much of a concern.

Conclusions
It has been shown that ANNs can learn to correct errors of a coarse-resolution model of a chaotic dynamical system (the Lorenz '96 system), resulting in stable simulations that have both improved skill in initialized forecasts and better long-term climate statistics. Improvements are found for a wide range of ANN structures, showing that they are quite robust.
The ANNs used here could reduce errors in single time step predictions by up to about 40%, and it seems that the errors could be reduced yet further if the ANNs were increased in size (Figure 1), though it is not the aim here to find the best possible performance. Errors of predicted single-time step tendencies become gradually smaller as the ANN complexity increases (Figure 1), and there does not appear to be a substantial problem due to the training getting stuck in poor local minima. The models with ANNs also give good predictions of extreme tendencies that were not seen in the model training stage (Figure 3).
In initialized medium-range forecasts, the improvement in the absolute ACC and RMSE was only a few percent at lead times of ∼1 MTU. However, this was ∼50% of the maximum possible improvement in the median case, determined by comparison with forecasts made by the Truth model with the same initial condition perturbations. The improvement of climate statistics was modest, with improvements up to ∼15% in the climate KS statistic and no discernible improvement in the time-mean state. This may be because the model without an ANN was already actually quite skilful at predicting the Truth system's behavior-for example, Figure 3 shows that its predicted tendencies are always quite close to the true tendencies. For models of Earth's atmosphere, coarse-graining studies find much worse agreement between tendencies predicted by models and estimated true tendencies (e.g., Shutts & Pallarės, 2014;Shutts & Palmer, 2007), suggesting that there may be much more room for improvement using machine learning. However, as discussed in section 3.1, getting large benefits in longer range skill may require training algorithms to target improvements on time scales longer than single time steps.
These results support the idea that ANNs (or other machine learning algorithms) could help to reduce errors in dynamical Earth system simulations by learning a better representation of the physical equations from observations or from more realistic models that are too expensive to use generally in weather and climate prediction (Dueben & Bauer, 2018). However, it can be far easier to use ANNs to correct the output of an existing model than to train ANNs to simulate the entire system, because far smaller ANNs can be used and much less data is required for training the ANNs (it was also shown by Jia et al., 2018, that error-correcting ANNs require less data, in the context of simulating lake temperatures). Also, Earth system models typically relate dozens of inputs and outputs at every grid point, but an error-correcting system can produce performance improvements while only considering a subset of the models' inputs and outputs, meaning it is possible to begin demonstrating improvements without reproducing the complexity of the full model. This is valuable because the more complex the ANN that is required, the harder it is generally to find a training method that produces good results. This method also utilizes the physical understanding embedded in the existing parameterization schemes, and the error-corrections should only become large in situations when the schemes do not perform well, reducing concerns about their reliability. This makes this approach more appropriate for use in a research program to investigate the potential for ANNs to reduce model errors and to begin producing operational improvements. The next step is to test whether the method works as well in models of components of the Earth system.
The main drawback of this approach compared to training an algorithm to simulate the full system is that the computational cost of the model cannot be reduced. Using algorithms like ANNs to learn to represent the full system's dynamics may therefore be the approach adopted in the long run, but developing systems to learn to correct model errors will give invaluable insights about how to achieve this in the medium term and help to demonstrate whether attempting to learn a better representation of the full dynamics from observations or expensive models is likely to give a substantial improvement in forecast skill. (There is also nothing to preclude an error-correcting algorithm being used in conjunction with emulators of an existing model's parameterization schemes or high-resolution simulations that do reduce the computational cost; e.g., Brenowitz & Bretherton, 2018;Chevallier et al., 1998;Gentine et al., 2018;Krasnopolsky et al., 2010;O'Gorman & Dwyer, 2018;Rasp et al., 2018). Models of the Lorenz '96 system using ANNs to predict the full tendency were found to achieve similar performance to the models with error-correcting ANNs, just requiring larger ANNs to do so (not shown). Therefore, there is nothing in the results presented here to preclude using ANNs in place of physically derived models eventually. The two methods can be used in a complementary way in a research program.