A Hybrid Approach to Atmospheric Modeling That Combines Machine Learning With a Physics-Based Numerical Model
This article was corrected on 23 MAR 2023. See the end of the full text for details.
Abstract
This paper describes an implementation of the combined hybrid-parallel prediction (CHyPP) approach of Wikner et al. (2020), https://doi.org/10.1063/5.0005541 on a low-resolution atmospheric global circulation model (AGCM). The CHyPP approach combines a physics-based numerical model of a dynamical system (e.g., the atmosphere) with a computationally efficient type of machine learning (ML) called reservoir computing to construct a hybrid model. This hybrid atmospheric model produces more accurate forecasts of most atmospheric state variables than the host AGCM for the first 7–8 forecast days, and for even longer times for the temperature and humidity near the earth's surface. It also produces more accurate forecasts than a model based only on ML, or a model that combines linear regression, rather than ML, with the AGCM. The potential of the CHyPP approach for climate research is demonstrated by a 10-year long hybrid model simulation of the atmospheric general circulation, which shows that the hybrid model can simulate the general circulation with substantially smaller systematic errors and more realistic variability than the host AGCM.
Key Points
- A hybrid model incorporating machine learning produces more accurate forecasts and more realistic climate than the host physics-based model
- The hybrid model states are more realistically balanced and have substantially lower biases than the host model
- The hybrid model produces more realistic atmospheric variability than the host model at time scales shorter than about a week
Plain Language Summary
This paper presents a computationally efficient novel approach to construct a hybrid model of the atmosphere by combining a physics-based model of the global atmospheric circulation with a machine learning component. The primary purpose of the hybrid model is to produce quantitative weather forecasts on the same grid as the physics-based model. It is found that the hybrid model produces more accurate forecasts than the host physics-based model for the first 7–8 forecast days for most forecast variables, and for even longer times for the temperature and humidity near the Earth's surface. Furthermore, the hybrid model is found to simulate the climate with substantially smaller systematic errors and more realistic temporal variability than the host model.
1 Introduction
Numerical weather prediction (NWP) models have been the backbone of operational weather prediction for several decades now (e.g., Harper, 2008; Lynch, 2006). A particular model implements a numerical solution algorithm for the physics-based set of coupled partial differential equations that govern atmospheric motion (e.g., Szunyogh, 2014). The resulting numerical equations form the dynamical core of the model. The effects of processes not resolved explicitly by the dynamical core are taken into account by parameterization schemes that contribute to the forcing terms of the equations. These schemes are based on some combination of theoretical and empirical considerations (e.g., Stensrud, 2007). The initial conditions of the numerical model solutions are observation-based estimates (analyses) of the state of the atmosphere, and the process that produces these estimates is called data assimilation (e.g., Szunyogh, 2014). The advances in modeling and data assimilation techniques, alongside the increase in computing power and in the number of observations available for assimilation, led to a “quiet revolution of NWP” (Bauer et al., 2015). The incorporation of machine learning (ML) techniques into the NWP process promises to lead to further forecast accuracy gains by extracting additional information from the observations.
The earliest applications of ML to atmospheric modeling focused on improving the computational efficiency of the physics-based numerical models (e.g., V. Krasnopolsky et al., 2005; V. Krasnopolsky & Fox-Rabinovitz, 2006; V. M. Krasnopolsky, 2013). These applications employed neural networks to emulate the computationally most expensive physics-based parameterization schemes at a reduced computational cost. The term hybrid model was first used in reference to models using this technique. One approach employed by this type of hybrid model is to use a single neural network to emulate the combined effect of multiple parameterized processes, such as cumulus convection, radiation, boundary layer transport, etc. (e.g., V. Krasnopolsky et al., 2010; V. M. Krasnopolsky, 2013; Brenowitz & Bretherton, 2018, 2019; Rasp et al., 2018). For this purpose, the ML systems are often trained on data produced by model simulations at higher resolutions, or with more sophisticated physical parameterization schemes.
Another type of ML-based parameterization scheme (e.g., Chattopadhyay et al., 2020; Gentine et al., 2018; Rasp et al., 2018) is trained on observations or observation-based reanalyses. Such a scheme has the potential to learn about the effects of processes that the higher resolution and more sophisticated model simulations are still unable to capture. ML techniques have also been considered for the estimation of the free parameters of physics-based parameterization schemes (Schneider et al., 2017). This approach takes advantage of the knowledge built into the parameterization schemes, but may suffer from the assumptions and approximations made by the schemes.
The hybrid approach we propose belongs to a class of techniques that are different from those mentioned thus far. Techniques of this class use ML for the frequent periodic interactive correction of the spatiotemporally evolving physics-based numerical model solution after training on observational analyses. The specific approach we propose was originally developed by Pathak, Wikner, et al. (2018) and later adapted to large dynamical systems by Wikner et al. (2020), who named it Combined Hybrid-Parallel Prediction (CHyPP). It evolves the hybrid forecasts iteratively, combining a short-term (e.g., 6 hr) numerical forecast with a state-dependent ML correction in each “time step” of the “hybrid model integration”. CHyPP is not a postprocessing technique, because each “time step” of the evolving hybrid model solution starts from the ML-corrected state of the preceding step, whereas a postprocessing technique does not interact with the evolving model solution. The ML component of CHyPP uses the computationally highly efficient parallel reservoir computing (RC) algorithm of Pathak, Hunt, et al. (2018). The other hybrid approaches of the same class use either a random forest (Watt-Meyer et al., 2021) or a deep learning ML component (Farchi et al., 2021), rather than one based on RC.
Wikner et al. (2020) demonstrated the potential of CHyPP for predicting the evolution of a spatiotemporally chaotic system by experiments with the Kuramoto-Sivashinsky (KS) model (Sivashinsky, 1977), a model that has a single state variable that depends only on a single space dimension in addition to time. We implement CHyPP on the Simplified Parameterization, primitive-Equation Dynamics (SPEEDY) (Kucharski et al., 2006; Molteni, 2003) atmospheric global circulation model (AGCM). Ours is the first implementation of the approach on a model that has multiple state variables that span a wide range of values and depend on all three spatial dimensions. Because SPEEDY has a substantially lower resolution than a state-of-the-art NWP or climate model, our primary goal is to demonstrate the feasibility and potential of CHyPP for an atmospheric application, rather than to propose our current model as a potential replacement for a state-of-the-art numerical model. The results of our forecast experiments show that the performance of the hybrid model is superior to that of SPEEDY, of a model based only on ML, and of a model that uses linear regression rather than ML for the correction of the short-term (“one time step”) numerical forecasts.
In what follows, we first describe the hybrid approach and its implementation on SPEEDY in detail (Section 2). Then, we discuss the results of the forecast experiments (Section 3), and then the climate simulation (Section 4). Finally, we summarize our key findings and draw our conclusions (Section 5).
2 The Hybrid Model
In CHyPP, the physics-based numerical model state is evolved globally, while the ML correction is done in parallel, in small local domains (Pathak, Hunt, et al., 2018). The model state of a local domain is represented by a local state vector composed of the relevant components of the global state vector. The global hybrid prediction is obtained by piecing together the local hybrid predictions at the end of each Δt-long “time step” of the “hybrid model integration”. This approach can be implemented on any numerical model by adjusting the definition of the local state vectors to the spatial discretization strategy of the model. We note that the localization strategy of CHyPP is similar to that employed by the Local Ensemble Transform Kalman Filter (LETKF) data assimilation scheme (Hunt et al., 2007; Ott et al., 2004; Szunyogh et al., 2008), which has been found to scale efficiently even for very high (kilometer) resolution operational weather prediction models (e.g., Schraff et al., 2016).
2.1 The Global State Vector
SPEEDY is a spectral transform AGCM that was developed to produce rapid climate simulations, using simplified, but modern physical parameterization schemes (Molteni, 2003). We implement CHyPP on the standard configuration of Version 41 of the model: the spectral horizontal resolution is T30, while the grid used for the computation of the nonlinear terms and parameterizations has a nominal horizontal spatial resolution of 3.75° × 3.75° with state variables defined at eight vertical σ-levels (0.025, 0.095, 0.20, 0.34, 0.51, 0.685, 0.835, and 0.95), where σ is the ratio of pressure to the surface pressure. The three-dimensionally varying state variables of the model are the two components of the horizontal wind vector, temperature, and specific humidity, while the single two-dimensionally varying state variable is the natural logarithm of surface pressure. The global computational grid and the state variables of the hybrid model are the same as those of SPEEDY.
2.2 The Local State Vectors
In our implementation of CHyPP on SPEEDY, each local state vector represents the atmospheric state in a three-dimensional local domain that has the shape of a rectangular box with a 7.5° × 7.5° (2 × 2 horizontal grid points) base and extends vertically from ground level to σ = 0.025. (The boundaries of the horizontal footprint of a local domain are marked by a blue rectangle in Figure 1.) In what follows, we describe the computations carried out in parallel for each of the L = 1,152 local domains to evolve the hybrid model state from time t to t + Δt.

Figure 1. Illustration of the localization strategy. The black dots indicate the horizontal locations of the grid points of the model. The blue rectangle marks the horizontal boundaries of a particular local domain. The red rectangle indicates the horizontal boundaries of the associated extended local domain.
Let v(t) be the local state vector for an arbitrary local domain at time t. The dimension of this state vector is 4 × (8 × 4 + 1) = 132 (resulting from the 4 grid points of a local domain, the 8 σ-levels, the 4 volume distributed state variables, and the natural logarithm of surface pressure state variable). Because the different state variables have different units and ranges of values, where the ranges also depend on the geographical location and vertical level, each state variable is standardized to have, for each vertical level of each extended local domain (represented by a red rectangle in Figure 1), a mean of 0 and a standard deviation of 1 before forming v(t). The standardization is done by using ERA5 reanalysis data (Hersbach et al., 2020) for the computation of the climatological mean and standard deviation (across both time and the 16 horizontal grid points in the extended local domain) of each state variable at each vertical level. We introduce the notation vp(t), vh(t), and va(t) for the local state vector of SPEEDY, the hybrid model, and the reanalysis, respectively. We also introduce the notations vgp(t), vgh(t), and vga(t) for the related global state vectors. For instance, the components of vga(t) in an arbitrary local domain are the components of va(t). In what follows, we explain the steps of the computation of vgh (t + Δt) from vgh(t). A flowchart of these steps is shown in Figure 2a.
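The counting and the standardization described above can be illustrated with a minimal sketch. It assumes a 96 × 48 horizontal grid (an assumption consistent with the 3.75° spacing, not stated in the text), and it uses synthetic random numbers in place of ERA5 data; the function name `standardize` is hypothetical.

```python
import numpy as np

# Counts stated in the text.
N_LEVELS = 8             # sigma levels
N_3D_VARS = 4            # two wind components, temperature, specific humidity
N_2D_VARS = 1            # natural logarithm of surface pressure
POINTS_LOCAL = 2 * 2     # horizontal grid points in a local domain
POINTS_EXTENDED = 4 * 4  # horizontal grid points in an extended local domain

# Assumed global grid size (not given in the text): 96 longitudes x 48 latitudes.
N_LON, N_LAT = 96, 48

local_dim = POINTS_LOCAL * (N_LEVELS * N_3D_VARS + N_2D_VARS)
n_local_domains = (N_LON // 2) * (N_LAT // 2)


def standardize(field, clim_sample):
    """Standardize one variable at one level of one extended local domain.

    clim_sample has shape (n_times, n_points); the mean and standard deviation
    are taken across both time and the horizontal grid points.
    """
    return (field - clim_sample.mean()) / clim_sample.std()


rng = np.random.default_rng(0)
# Synthetic stand-in for a temperature climatology (kelvin), 1,000 times x 16 points.
clim_sample = 280.0 + 5.0 * rng.standard_normal((1000, POINTS_EXTENDED))
standardized = standardize(clim_sample, clim_sample)
```

Under these assumptions, `local_dim` evaluates to 132 and `n_local_domains` to 1,152, matching the values quoted in the text, and the standardized sample has zero mean and unit standard deviation.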

Figure 2. A flow chart of (a) the hybrid model and (b) the training operation of the hybrid model. The notation is defined in Sections 2.2 and 2.3. The steps inside the red boxes are carried out in parallel for each of the L = 1,152 local domains. The training finds the W that minimizes the cost function of Equation 4 by solving Equation 5.
2.3 Reservoir Dynamics

This dynamical system is the reservoir, r(t) is the reservoir state vector, and uh(t) is the local input state.
During the training, the input term uh(t) in Equation 1 is replaced by ua(t). The local input uh(t) in our case is an m-dimensional extended local state vector, composed of the components of the local state vector vh(t) plus additional components of the global state vector vgh(t) from the neighboring local domains (see Figure 1 for illustration), plus the prescribed incoming solar radiation at the top of the atmosphere for the extended local domain. The latter component is included to help the hybrid model learn the diurnal cycle from the input data. (SPEEDY uses the daily average value of the incoming solar radiation at the top of the atmosphere at all times of the day.) For all of the local domains, m = 16 × (8 × 4 + 1 + 1), except at the local domains adjacent to the poles, where m = 12 × (8 × 4 + 1 + 1).
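For reference, the two input dimensions can be evaluated explicitly. Each horizontal grid point contributes 8 × 4 values of the three-dimensional variables, one value of the logarithm of surface pressure, and one value of the incoming solar radiation, that is, 34 components. The subscripts below are labels introduced here for illustration only.

```latex
\begin{align*}
m_{\text{interior}} &= 16 \times (8 \times 4 + 1 + 1) = 16 \times 34 = 544, \\
m_{\text{polar}}    &= 12 \times (8 \times 4 + 1 + 1) = 12 \times 34 = 408.
\end{align*}
```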
Referring to Equation 1, the dimension Dr of the vector r(t) is much higher than that of a local state vector vh(t) (e.g., 6,000 vs. 132 in the present article). The activation function with a vector argument, tanh(.), is a vector of the same dimension (Dr) as its argument, and a component of this vector is the hyperbolic tangent of the corresponding component of the argument vector. The matrix A is a sparse Dr × Dr weighted adjacency matrix that represents a low-degree, directed, random graph (Gilbert, 1959). Each entry of A is randomly chosen with a probability κ/Dr of being nonzero, where κ is the degree of the graph (the average number of incoming connections per node), and the nonzero entries of A are randomly drawn from a zero-mean uniform distribution. (The ratio κ/Dr is a measure of the sparsity of A.) After randomization, the entries of A are scaled such that the largest eigenvalue magnitude of A is a prescribed number ρ (0 < ρ < 1), which is called the spectral radius. The spectral radius controls the length of the memory of the ML reservoir, and a value ρ < 1 typically makes the reservoir state r(t) depend only on the past states of the modeled system (the atmosphere in our case), and not on the initial reservoir state, when t is sufficiently large. This property of the reservoir is called the echo state property (Jaeger, 2001).
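The construction of A can be sketched as follows. This is a minimal illustration rather than the authors' code: it stores the matrix densely and uses a small dimension so that the eigenvalues can be computed directly, and the interval (−1, 1) for the uniform draws before rescaling is an assumption. The function name `make_reservoir_matrix` is hypothetical.

```python
import numpy as np


def make_reservoir_matrix(dim, degree, rho, rng):
    """Random sparse matrix rescaled to spectral radius rho (dense storage for brevity)."""
    mask = rng.random((dim, dim)) < degree / dim        # nonzero with probability kappa / Dr
    weights = rng.uniform(-1.0, 1.0, size=(dim, dim))   # zero-mean uniform entries (assumed range)
    a = np.where(mask, weights, 0.0)
    current_radius = np.max(np.abs(np.linalg.eigvals(a)))
    return a * (rho / current_radius)


rng = np.random.default_rng(42)
A = make_reservoir_matrix(dim=300, degree=6, rho=0.5, rng=rng)
spectral_radius = np.max(np.abs(np.linalg.eigvals(A)))
mean_degree = np.count_nonzero(A) / A.shape[0]   # average number of incoming connections
```

For a reservoir of the size used in the article (Dr = 6,000), a sparse storage format and an iterative eigenvalue solver would likely be preferable to the dense approach shown here.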
The matrix-vector product Buh(t) is called the input layer in RC. In our model, B is a Dr × m sparse random matrix with an equal number of nonzero entries in each row. These nonzero entries, which are chosen randomly from a uniform distribution on the interval (−α, α), couple the components of uh(t) to the reservoir nodes. The input strength α is an adjustable parameter that controls the degree of nonlinearity experienced by the input signal uh(t) from the activation function.
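The following sketch illustrates the input layer and the echo state property on a toy problem. It assumes the standard reservoir update r(t + Δt) = tanh[A r(t) + B u(t)], which is consistent with the description of Equation 1 above but is not copied from it. B is stored with shape (Dr, m) so that the product B u is defined; the number of nonzero entries per row and all sizes are hypothetical choices.

```python
import numpy as np


def make_input_matrix(dim, m, nonzeros_per_row, alpha, rng):
    """(dim x m) matrix with the same number of nonzero entries in every row."""
    b = np.zeros((dim, m))
    for row in range(dim):
        cols = rng.choice(m, size=nonzeros_per_row, replace=False)
        b[row, cols] = rng.uniform(-alpha, alpha, size=nonzeros_per_row)
    return b


def reservoir_step(r, u, a, b):
    """One update of the assumed form r(t + dt) = tanh(A r(t) + B u(t))."""
    return np.tanh(a @ r + b @ u)


rng = np.random.default_rng(0)
dim, m = 200, 12
# Small random sparse reservoir matrix, rescaled to spectral radius 0.5.
a = rng.uniform(-1.0, 1.0, (dim, dim)) * (rng.random((dim, dim)) < 6 / dim)
a *= 0.5 / np.max(np.abs(np.linalg.eigvals(a)))
b = make_input_matrix(dim, m, nonzeros_per_row=3, alpha=0.5, rng=rng)

# Two different initial reservoir states driven by the same input sequence.
r1 = rng.uniform(-1.0, 1.0, dim)
r2 = rng.uniform(-1.0, 1.0, dim)
for _ in range(200):
    u = rng.standard_normal(m)
    r1 = reservoir_step(r1, u, a, b)
    r2 = reservoir_step(r2, u, a, b)
gap = np.max(np.abs(r1 - r2))   # difference between the two reservoir states
```

In this toy setting the two reservoir states become practically indistinguishable, which is the behavior the echo state property describes: the reservoir state is determined by the input history rather than by its own initial condition.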
2.4 The Hybrid Model






2.4.1 Training
Figure 2b shows the flow of operations during training. First, we generate a sequence of perturbed global analyses vga (kΔt)(1 + ɛg (kΔt)), k = −K − Kt, −K − Kt + 1, …, −1, where ɛg (kΔt) is a small-magnitude, zero-mean, normally distributed random noise vector, uncorrelated in time and uncorrelated between components of the noise vector. The role of this noise is to help the ML model learn to return to the bounded set of realistic atmospheric states (the “attractor”) in the presence of perturbations that may arise in future forecasts (e.g., Jaeger, 2001; Wikner et al., 2020). The addition of noise to the global analyses during training is essential for the hybrid model to produce stable, realistic predictions; predictions rapidly become unstable without it. Similar behavior has been observed in RC applications involving the prediction of other spatio-temporal systems (e.g., Patel et al., 2021).
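The multiplicative perturbation of the analyses can be sketched as below. The array shapes and values are synthetic placeholders, and the function name `perturb_analysis` is hypothetical; only the form vga(1 + ɛg), with independent zero-mean normal noise, is taken from the text.

```python
import numpy as np


def perturb_analysis(analysis, noise_std, rng):
    """Multiply each component by (1 + eps), with eps ~ N(0, noise_std^2) drawn
    independently for every time and every component."""
    eps = rng.normal(0.0, noise_std, size=analysis.shape)
    return analysis * (1.0 + eps)


rng = np.random.default_rng(1)
analyses = 1.0 + rng.random((500, 20))   # synthetic sequence: 500 times x 20 components
perturbed = perturb_analysis(analyses, noise_std=0.2, rng=rng)
relative_change = perturbed / analyses - 1.0
```

Because the noise multiplies the state, the size of the perturbation scales with the magnitude of each component; the relative change recovered here has approximately zero mean and a standard deviation close to the prescribed value.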
The local input state ua(kΔt) is the extended local state vector associated with vga(kΔt)(1 + ɛg(kΔt)), for k = −K − Kt, −K − Kt + 1, …, −1 for the particular local domain. The initial state r((−K − Kt)Δt) of the reservoir can be chosen arbitrarily, because only the evolved reservoir states r[(k + 1)Δt], k = −K, −K + 1, …, −1, are used for training. The purpose of discarding the reservoir state of the first Kt (Kt ≪ K) iterations is to ensure that the reservoir state r(t) has sufficient time to settle on its attractor. The unperturbed global analyses vga (kΔt) are also used as the initial conditions for SPEEDY to obtain vgp ((k + 1)Δt) for k = −K, −K + 1, …, −1.

The local hybrid states vh(kΔt, W), k = −K + 1, −K + 2, …, 0, represent the results of Equation 2 at those times for a particular W, and va(kΔt) is the local state vector for the unperturbed global analysis vga(kΔt). (Notice that we use the notation W for both the variable and the solution of the minimization problem.) The last two terms of the cost function, in which ‖ · ‖² denotes the sum of the squares of the entries of a matrix (the squared Frobenius norm), are regularization terms meant to prevent overfitting, with βmod and βres being the regularization parameters for the numerical model and reservoir component, respectively. With these terms, the direct solution of the least-squares problem is a ridge regression (Tikhonov & Arsenin, 1977). The inclusion of the prior matrix Wprior, which was not part of Wikner et al. (2020), allows for a choice like Wprior = I, which dictates that in the absence of training data that demonstrates imperfections in the numerical model, the hybrid model should be equivalent to the numerical model. In our experiments, we tried both Wprior = I and Wprior = 0, and found that the latter yielded better stability. Thus, we report results with Wprior = 0, but think that other choices for nonzero Wprior merit further study.
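A generic ridge regression with a prior can be sketched as follows. This is an illustration of the technique under stated assumptions, not a transcription of Equations 4 and 5: the output weights are assumed to act linearly on a stacked feature vector made of a numerical model part and a reservoir part, each with its own regularization parameter. All variable names and the toy data are hypothetical.

```python
import numpy as np


def ridge_with_prior(x, y, beta, w_prior):
    """Minimize ||W x - y||^2 + sum_j beta[j] * ||W[:, j] - w_prior[:, j]||^2.

    x: (n_features, n_samples), y: (n_targets, n_samples),
    beta: (n_features,) regularization per feature column of W,
    w_prior: (n_targets, n_features).
    Setting the gradient to zero gives W (X X^T + L) = Y X^T + W_prior L, L = diag(beta).
    """
    lam = np.diag(beta)
    lhs = x @ x.T + lam                      # symmetric positive definite
    rhs = y @ x.T + w_prior @ lam
    return np.linalg.solve(lhs, rhs.T).T     # equals rhs @ inv(lhs) because lhs is symmetric


rng = np.random.default_rng(5)
n_mod, n_res, n_samples = 3, 5, 200
x = rng.standard_normal((n_mod + n_res, n_samples))             # toy stacked features
y = x[:n_mod] + 0.1 * rng.standard_normal((n_mod, n_samples))   # toy targets
beta = np.concatenate([np.full(n_mod, 100.0), np.full(n_res, 1e-4)])
w_prior = np.zeros((n_mod, n_mod + n_res))
w = ridge_with_prior(x, y, beta, w_prior)

# With very strong regularization toward the prior [I, 0], the solution stays at the prior.
w_identity_prior = np.hstack([np.eye(n_mod), np.zeros((n_mod, n_res))])
w_strong = ridge_with_prior(x, y, np.full(n_mod + n_res, 1e9), w_identity_prior)
```

The second call illustrates the role of the prior: when the regularization dominates the data term, the weights revert to Wprior, so a prior of the form [I, 0] would reproduce the numerical model forecast unchanged.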






2.4.2 Synchronization and Prediction
Let KfΔt be the forecast start time. Starting the hybrid forecast requires the availability of the global analysis vga(KfΔt) and the reservoir state r(KfΔt) for each local domain. Because according to the “echo state property” r(KfΔt) is determined by the past states of the atmosphere, it can be obtained by synchronizing the evolution of the reservoir states with the analyses for a sufficiently long time period that ends at KfΔt. Let KsΔt be the start time of the synchronization. Synchronization is achieved by evolving the reservoir equation using uh (kΔt) = ua (kΔt) in Equation 1 for k = Ks, Ks+1, …, Kf.
Piecing together the local hybrid forecasts for all local domains yields the global “one-step” hybrid forecast vgh[(Kf + 1)Δt] (Figure 2a). The forecast can be extended arbitrarily far into the future by using an iterative process for k = Kf + 1, Kf + 2, …, in which the extended local state vector extracted from vgh(kΔt) is used as uh(kΔt) in Equation 1 to compute r[(k + 1)Δt]. The global “one-step” hybrid forecast vgh(kΔt) is also used as the initial condition of the SPEEDY forecast that provides the numerical model component of vgh[(k + 1)Δt]. In a cycled forecast system of an operational NWP center, in which analyses are prepared and forecasts are started with a regular frequency (e.g., every 6 hr), the reservoir state can be kept continuously synchronized with the real-time evolution of the atmosphere.
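The synchronization and the iterative prediction can be sketched for a single toy domain as below. The sketch makes several simplifying assumptions that are not from the text: the input to the reservoir is the state itself (no extended local domain), the hybrid state is a linear combination of the numerical forecast and the reservoir state with untrained placeholder weights, and a simple damping map stands in for SPEEDY.

```python
import numpy as np


def synchronize(r, analysis_inputs, a, b):
    """Drive the reservoir with a sequence of analysis inputs; return the final state."""
    for u in analysis_inputs:
        r = np.tanh(a @ r + b @ u)
    return r


def hybrid_forecast(v_h, r, n_steps, numerical_model, w_mod, w_res, a, b):
    """Iterate the hybrid step, feeding each hybrid state back as the next input."""
    trajectory = []
    for _ in range(n_steps):
        r = np.tanh(a @ r + b @ v_h)     # reservoir driven by the current hybrid state
        v_p = numerical_model(v_h)       # short numerical forecast started from the hybrid state
        v_h = w_mod @ v_p + w_res @ r    # assumed linear combination of the two components
        trajectory.append(v_h)
    return np.array(trajectory), r


rng = np.random.default_rng(3)
n_state, dim = 4, 50
a = rng.uniform(-1.0, 1.0, (dim, dim)) * (rng.random((dim, dim)) < 0.1)
a *= 0.5 / np.max(np.abs(np.linalg.eigvals(a)))
b = rng.uniform(-0.5, 0.5, (dim, n_state))
w_mod = np.eye(n_state)                              # placeholder weights, not trained
w_res = 0.01 * rng.standard_normal((n_state, dim))   # placeholder weights, not trained


def toy_model(v):
    """Hypothetical stand-in for a 6 hr numerical forecast."""
    return 0.9 * v


analyses = rng.standard_normal((40, n_state))
r_start = synchronize(np.zeros(dim), analyses[:-1], a, b)   # reservoir state at the start time
forecast, r_end = hybrid_forecast(analyses[-1], r_start, 10, toy_model, w_mod, w_res, a, b)
```

The key point the sketch is meant to convey is the feedback: each step starts both the reservoir update and the numerical forecast from the ML-corrected state of the preceding step, which is what distinguishes this approach from postprocessing.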
2.5 Implementation With ERA5 Reanalysis Data
We use interpolated hourly global ERA5 reanalyses to train and synchronize the hybrid model. We do the horizontal interpolation of the reanalysis fields onto the computational grid of SPEEDY by a 2-dimensional quadratic B-spline interpolation. We then compute the value of σ at each horizontal grid point and use a 1-dimensional cubic B-spline for the vertical interpolation of the model state variables to the eight prescribed constant σ levels of SPEEDY. The training starts at 0000 UTC on 1 January 1990 and ends at 2300 UTC on 26 June 2011 (K ≈ 3.14 × 10⁴), with the data of the first 6.25 days discarded (K = 31,355 and Kt = 25).
2.6 Selection of the Hyperparameters
Hyperparameters are adjustable parameters (e.g., κ, ρ, α, Dr, βres, βmod, ɛ, and Δt) that control overall characteristics of the hybrid model and require “tuning” to produce desirable results. There exist practical “tricks of the trade” rules for the selection of the hyperparameters of an RC model (Lukoševičius, 2012). These general rules also work for the hyperparameters of the hybrid model. First, the hybrid model is only weakly sensitive to κ and ρ. While we use κ = 6, other small values of κ (e.g., κ = 3) work similarly well. We use a value of ρ that monotonically increases toward the poles from 0.3 at the equator to 0.7 at 45°, so that the reservoir mimics the general property of the atmospheric dynamics that its memory is shorter in the tropics than in the extratropics. Changing these values by ±0.1–0.2 has little effect on the model performance. We choose Dr = 6,000, because we find that further increasing the reservoir size does not lead to substantial further improvement of the model performance. We find the hybrid model performance to be somewhat sensitive to the value of α, which controls the amount of nonlinearity of the reservoir dynamics. Setting α ≤ 0.3 or α ≥ 0.7 yields noticeably larger errors than the value we use, α = 0.5. For each of the options Wprior = I and Wprior = 0, we tried various powers of 10 for the regularization parameters βres and βmod; we found that Wprior = 0 yielded better stability, and found that βres = 10⁻⁴ and βmod = 100 led to good model performance. Among the several values we tried, in increments of 0.05, for the standard deviation of the components of the random noise 1 + ɛg that multiplies the training data, we chose the smallest value (0.20) for which all hybrid forecasts were stable. The time step Δt is another important hyperparameter to tune; we chose Δt = 6 hr, because using Δt = 1 hr or Δt = 3 hr (with other hyperparameters tuned accordingly) led to clearly poorer model performance.
Moreover, we use a time step of Δt/24 = 0.25 hr for the numerical integration of SPEEDY, because longer time steps degraded the 6 hr forecast performance of SPEEDY. Since the temporal resolution of the ERA5 reanalyses is 1 hr (Δta = 1 hr), the training is done on Δt/Δta = 6 time series of data.
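The time-step arithmetic can be illustrated with a short sketch. The interpretation that the six training time series are obtained by subsampling the hourly analyses every 6 hr with offsets of 0 to 5 hr is an assumption made here for illustration; the variable names are hypothetical.

```python
import numpy as np

DT_HYBRID_HR = 6.0                   # hybrid model time step
DT_ANALYSIS_HR = 1.0                 # temporal resolution of the reanalysis
DT_SPEEDY_HR = DT_HYBRID_HR / 24     # internal SPEEDY time step

n_series = int(DT_HYBRID_HR / DT_ANALYSIS_HR)
speedy_steps_per_hybrid_step = int(DT_HYBRID_HR / DT_SPEEDY_HR)

# Two days of hourly analysis times, split into series with a 6 hr stride.
hourly_index = np.arange(48)
series = [hourly_index[offset::n_series] for offset in range(n_series)]
```

Under this reading, every hourly analysis is used exactly once, and each series advances in steps of Δt = 6 hr, the same step taken by the hybrid model.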
3 Forecast Experiments
We compute forecast error statistics based on 100 21-day forecasts, with start times equally spaced every 4 days between 0000 UTC, 27 June 2011 and 0000 UTC, 28 July 2012. We evaluate the forecast performance of the hybrid model by comparing it to that of a variety of benchmark forecasts started from interpolated ERA5 reanalyses.
3.1 Benchmark Forecasts
The set of benchmark forecasts includes numerical forecasts produced by SPEEDY, a model based only on ML, and a model in which the 6 hr SPEEDY forecasts are corrected by linear regression rather than by ML. We call the latter benchmark SPEEDY-LLR, where LLR stands for local linear regression.
Comparing the performance of the hybrid model to that of a model based only on ML is important, because ML-only models (e.g., Arcomano et al., 2020; Rasp & Thuerey, 2021; Weyn et al., 2020) are considered a potential alternative to the hybrid approaches for the utilization of ML in Earth system modeling. Our ML model is formally the same as our hybrid model except that we use the constraint Wmod = 0 in Equation 3, with Equations 4 and 5 modified accordingly, and the hyperparameters are different: Dr = 9,000, βres = 10⁻⁶, Δt = 3 hr, and ɛ has a standard deviation of 0.28. (The smaller reservoir size necessary to obtain good results from the hybrid as compared to the ML-only model is an important advantage of the hybrid model.) While this ML-only model is formally identical to the one described by Arcomano et al. (2020), its forecast performance is better, thanks mainly to using a time step of Δt = 3 hr rather than Δt = 1 hr and the addition of the incoming solar radiation to the input of the reservoir.
SPEEDY-LLR is the same as the hybrid model except that Wres = 0. In this model, a larger regularization parameter is necessary to produce stable forecasts for at least 10 days. We use βmod = 1,600, which provides the most accurate short and medium range (1–5 days) forecasts that also remain stable for at least 10 days. The stability of the SPEEDY-LLR forecasts can be improved by further increasing βmod, but only at the price of degrading the short and medium range forecast accuracy. (For βmod → ∞, SPEEDY-LLR becomes SPEEDY, which produces stable forecasts for indefinitely long lead times.) Since SPEEDY-LLR does not include the nonlinear ML correction of the hybrid model (the second term on the right side of Equation 3), training is a simple linear regression of the numerical model forecast. With the help of this benchmark, we can assess the relative importance of making periodic corrections to the numerical forecasts based on linear regression of the model state alone versus making those corrections by the proposed hybrid technique.
To assess whether a model forecast has skill, the figures also include comparisons to forecasts based on persistence and daily climatology. The persistence forecasts are based on the assumption that the state of the atmosphere at the beginning of the forecast persists for the entire duration of the forecast, while the climatological forecasts are based on the daily climatological mean for the calendar day at the particular geographical location and pressure level for years 1990–2010.
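The two reference forecasts can be sketched on toy data as below. The function names, the data layout, and the synthetic values are hypothetical; the sketch only illustrates the definitions given above (a repeated initial state for persistence, and an average over all training years for each calendar day for climatology).

```python
import numpy as np


def persistence_forecast(initial_state, n_leads):
    """Repeat the initial state for every forecast lead time."""
    return np.repeat(initial_state[np.newaxis, ...], n_leads, axis=0)


def daily_climatology(values, day_of_year):
    """Average all samples that share a calendar day.

    values: (n_samples, ...) data from the climatological period,
    day_of_year: (n_samples,) integer calendar-day labels.
    Returns a dict that maps each calendar day to its mean field.
    """
    return {int(day): values[day_of_year == day].mean(axis=0)
            for day in np.unique(day_of_year)}


# Toy data: 3 "years" of 5 "calendar days" on a 2-point grid.
days = np.tile(np.arange(1, 6), 3)
year_offset = np.repeat([-1.0, 0.0, 1.0], 5)[:, np.newaxis]
values = days[:, np.newaxis] + np.array([[0.0, 10.0]]) + year_offset

clim = daily_climatology(values, days)
persist = persistence_forecast(values[0], n_leads=4)
```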
3.2 The Measure of the Forecast Error


Here the subscript i, j refers to the value of a scalar state variable V for a specific forecast lead time at a particular pressure level at grid point i, j of the verification region defined by Nlon discrete longitudes and Nlat discrete latitudes. The RMSE is averaged over the 100 forecasts to obtain a single scalar measure of the forecast error for each state variable, pressure level, and forecast lead time. In what follows, the term forecast error refers to this scalar measure. We call a forecast more accurate than another if the forecast error is lower for the former than for the latter forecast. In addition, we say that a model forecast has forecast value if its forecast error is lower than that of both persistence and climatology (the latter two are available without the substantial cost of preparing model forecasts). The qualitative behavior of the errors of the model forecasts with respect to the errors of these two references is well understood. In particular, if the model has realistic climatology, in the sense that it represents the atmospheric variability (the variability of the atmospheric state) correctly, the error of the model forecasts and the error of persistence saturate at the same level. While the error is initially lower for persistence than climatology, its saturation value is higher by a factor of √2 (e.g., Section 3.8 of Szunyogh, 2014).
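An area-weighted RMSE and the relation between the two saturation levels can be illustrated with the sketch below. The cosine-of-latitude weighting is an assumption about how the area weighting is done, and the function name is hypothetical. The saturated regime is mimicked by treating the forecast and the verifying state as independent draws from the same distribution, in which case the expected squared difference is twice the climatological variance.

```python
import numpy as np


def area_weighted_rmse(forecast, verification, lats_deg):
    """RMSE over an (n_lat, n_lon) region with cos(latitude) weights (assumed weighting)."""
    weights = np.cos(np.deg2rad(lats_deg))[:, np.newaxis]
    sq_err = (forecast - verification) ** 2
    mean_sq = np.sum(weights * sq_err) / (np.sum(weights) * sq_err.shape[1])
    return np.sqrt(mean_sq)


lats = np.linspace(30.0, 70.0, 11)
rng = np.random.default_rng(7)
shape = (lats.size, 20000)

truth = rng.standard_normal(shape)             # verifying states (anomalies, unit variance)
unrelated_state = rng.standard_normal(shape)   # fully decorrelated forecast or persisted state
clim_mean = np.zeros(shape)                    # climatological forecast of the anomalies

err_saturated = area_weighted_rmse(unrelated_state, truth, lats)
err_climatology = area_weighted_rmse(clim_mean, truth, lats)
ratio = err_saturated / err_climatology        # close to sqrt(2) in this idealized setting
```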
3.3 Comparisons of the Forecast Accuracy
3.3.1 Synopsis of the Forecast Verification Results
Figures 3 and 4 illustrate the temporal evolution of the forecast errors for the first five forecast days in the NH midlatitudes and Tropics, respectively. The errors are shown for the temperature (top row), meridional component of the wind vector (middle row) and specific humidity (bottom row) at forecast lead times day 1 (left column), day 3 (middle column), and day 5 (right column). In general, the hybrid forecasts (blue curves) have forecast value, except for the specific humidity at day 5 in the NH midlatitudes, for which they are only about as accurate as the forecasts based on climatology. In addition, the hybrid forecasts are either more accurate than all benchmark forecasts, or similarly accurate to the most accurate benchmark forecast. The hybrid model performance in the SH midlatitudes (not shown) is similar to that in the NH midlatitudes. The advantage of the hybrid model compared to the different benchmarks, however, strongly depends on the forecast variable and lead time. Next, we discuss this dependence, as it provides important insight into the mechanisms by which CHyPP improves the numerical forecasts.

Figure 3. Northern Hemisphere midlatitudes (between 30°N and 70°N) forecast verification results. Results are shown for the (blue) hybrid model, (green) simplified parameterization, primitive-equation dynamics (SPEEDY), (orange) machine learning-only model, (purple) SPEEDY-LLR model, (red) persistence, and (black) climatology. Shown is the area-weighted root-mean-square error at the different atmospheric levels for (top row) the temperature, (middle row) meridional wind, and (bottom row) specific humidity at (left column) day 1, (middle column) day 3, and (right column) day 5 forecast time.

Figure 4. As in Figure 3 for the tropics (between 30°S and 30°N).
3.3.2 Hybrid Versus SPEEDY Forecasts
Compared to SPEEDY, the advantage of the hybrid model is the largest for the temperature. While all hybrid temperature forecasts have substantial forecast value for the first five forecast days, the SPEEDY day five temperature forecasts have no forecast value in the Tropics and in the stratosphere in the NH midlatitudes. In addition, the SPEEDY forecasts have little forecast value at day five in the midlatitudes. The benefit of the ML correction is particularly striking in the tropical upper troposphere, where the SPEEDY forecasts have a large error with a maximum of 6 K at 200 hPa, while the error of the hybrid forecasts remains below 1 K.
In addition to the temperature, the hybrid forecasts are also substantially more accurate than the SPEEDY forecasts for the specific humidity, especially, in the lower troposphere, where parameterizations play an important role in modeling the effects of moist atmospheric processes. While in the NH midlatitudes the hybrid forecasts degrade only to the level of the forecasts based on climatology by day five, the error of the SPEEDY forecasts reaches saturation by that time.
In the two midlatitudes, the state variable for which the advantage of the hybrid model is the smallest compared to SPEEDY is the meridional component of the wind vector. This result is not surprising, as numerical models are known to capture synoptic-scale Rossby wave dynamics, which dominate the variability of weather in the midlatitudes. In contrast, in the Tropics, where wave dynamics is coupled to the parameterized process of deep convection, the advantage of the hybrid model for the meridional wind component is more substantial.
To explore the scale-dependence of the performance of the hybrid and benchmark forecasts, we examine the spectrum of the errors for the meridional component of the wind at 500 hPa with respect to the zonal wave number (Figure 5). (This figure also shows results for day 10, in addition to the results for forecast days one, three, and five.) The left panel shows the results for the hybrid and the SPEEDY model. Because SPEEDY is a spectral transform model with cut-off wave number 30, the spectrum for SPEEDY has no power at all beyond that wave number, and it is heavily damped at wave numbers larger than about 20. Therefore, the errors of the hybrid forecasts, which have realistic power at all wave numbers, are expected to saturate at a level that is higher than that for SPEEDY at the tail end of the spectrum. At day one, the hybrid forecasts have a clear advantage over the SPEEDY forecasts at the synoptic and large scales (zonal wave numbers lower than about 20). A smaller, but spectrally similar advantage still exists at day three, while the advantage of the hybrid forecasts disappears, except at wave numbers five and six, by about day five.
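A zonal wave number spectrum of an error field can be computed along the lines of the sketch below. It is a generic illustration based on a discrete Fourier transform along longitude; the unweighted average over latitude circles, the normalization, and the function name are assumptions, and the input is a synthetic field rather than forecast data.

```python
import numpy as np


def zonal_power_spectrum(error):
    """Mean power per zonal wave number for an error field of shape (n_lat, n_lon)."""
    n_lon = error.shape[1]
    coeffs = np.fft.rfft(error, axis=1) / n_lon
    power = np.abs(coeffs) ** 2
    # Fold in the negative wave numbers. The zero wave number and, for an even
    # number of longitudes, the Nyquist wave number have no mirror image.
    last = -1 if n_lon % 2 == 0 else None
    power[:, 1:last] *= 2.0
    return power.mean(axis=0)   # unweighted average over the latitude circles


n_lat, n_lon = 11, 96
lon = 2.0 * np.pi * np.arange(n_lon) / n_lon
error = np.tile(np.cos(5 * lon), (n_lat, 1))   # synthetic error: pure zonal wave number 5
spectrum = zonal_power_spectrum(error)
```

With this normalization the spectrum sums to the mean squared error of the field, so the contribution of each zonal wave number to the total error can be read off directly.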

Spectral distribution of the 500 hPa meridional wind forecast error in the NH midlatitudes (between 30°N and 70°N) with respect to the zonal wave number. The power spectra of the forecast errors are shown for (left) the hybrid model (blue) versus simplified parameterization, primitive-equation dynamics (SPEEDY) (green), (middle) the hybrid model (blue) versus the ML-only model (orange), and (right) the hybrid model (blue) versus SPEEDY-LLR (purple) at day 1 (solid square), day 3 (open circle), day 5 (solid triangle), and day 10 (open diamond).
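For illustration, an error spectrum of this kind can be sketched as follows. This is a minimal sketch under stated assumptions, not the code used to produce Figure 5: the function name `zonal_error_spectrum`, the array layout, and the unweighted average over latitude are all hypothetical choices, and the exact normalization used for the figure is not given in the text.

```python
import numpy as np

def zonal_error_spectrum(forecast, truth):
    """Power of the forecast error per zonal wave number.

    forecast, truth: arrays of shape (n_lat, n_lon) on a regular
    longitude grid (hypothetical layout). Returns an array of length
    n_lon // 2 + 1 indexed by zonal wave number, averaged over latitude
    without area weighting (an assumption made for simplicity).
    """
    error = np.asarray(forecast, dtype=float) - np.asarray(truth, dtype=float)
    n_lon = error.shape[-1]
    # Real FFT along longitude, scaled so that the spectrum sums to the
    # mean squared error along each latitude circle (Parseval's relation).
    coeffs = np.fft.rfft(error, axis=-1) / n_lon
    power = np.abs(coeffs) ** 2
    # One-sided spectrum: double every wave number except zero ...
    power[..., 1:] *= 2.0
    # ... and undo the doubling for the Nyquist wave number when n_lon is even.
    if n_lon % 2 == 0:
        power[..., -1] /= 2.0
    return power.mean(axis=0)
```

With this scaling, the sum of the returned spectrum over all wave numbers equals the mean squared error over the latitude band, which makes spectra from different models directly comparable.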
3.3.3 Hybrid Versus ML-Only Forecasts
While the errors of the ML-only forecasts (orange curves in Figures 3–5) are only slightly larger than those of the hybrid forecasts at day one, they grow much faster in the next 4 days and the ML forecasts typically have no value by day three. This result suggests that while the RC-based ML technique can produce accurate forecasts in the short range (day one to two), it is more effective in assisting SPEEDY than directly predicting the weather beyond that range. A comparison of the left and middle panels of Figure 5 suggests that the information provided by SPEEDY to the hybrid is particularly beneficial at the large scales (wave numbers lower than about six).
3.3.4 Hybrid Versus SPEEDY-LLR Forecasts
Next to the hybrid model, the benchmark that performs the best in the medium (day two to five) forecast range is the SPEEDY-LLR (purple curves). While the hybrid forecasts are more accurate than the SPEEDY-LLR forecasts, the forecast error differences between the two models are modest, except for those in the stratosphere. The fact that the forecast error differences are smaller for the hybrid model versus SPEEDY-LLR than for the hybrid model versus SPEEDY indicates that the periodic interactive correction of the SPEEDY forecasts itself makes an important contribution to the good performance of the hybrid model. The additional forecast improvement, however, is not the only benefit of using ML rather than local linear regression for the forecast correction: while the hybrid forecasts remain stable indefinitely (see Section 4), some of the SPEEDY-LLR forecasts fail as early as day 11 lead time, with about 60% of the forecasts reaching the intended 21 days.
It should be noted that the fact that local linear regression can efficiently correct the errors of a 6 hr forecast is not completely surprising, considering that linear regression can be used to model the short-term forecast error dynamics for even a state-of-the-art NWP model (Bishop et al., 2017), in which nonlinear effects are expected to play a more important role even at short lead times. It is a nontrivial result, however, that the information provided by such a linear approach can be used for the periodic, interactive correction of an evolving numerical forecast. It is also a nontrivial result that an RC-based ML technique stabilizes the resulting hybrid model indefinitely, and leads to further forecast improvement in the short and medium (day 1–5) range.
3.4 Global Mean and Spatially Varying Errors
To gain further insight into the ways the hybrid approach improves forecast performance, we decompose the global RMSE into a bias and a standard deviation component. (The sum of the squares of the two components is equal to the square of the root-mean-square error.) The bias measures the global mean error, while the standard deviation measures the spatially varying part of the forecast error. The time evolution of the two error components, averaged over the 100 forecasts, is shown for three representative state variables in Figure 6.

The time evolution of the (dashed) standard deviation and (solid) mean of the forecast errors. Each color indicates forecasts by a particular model: (blue) hybrid model, (green) simplified parameterization, primitive-equation dynamics (SPEEDY), (purple) SPEEDY-LLR model, (orange) machine learning model, and (red) persistence. Results are not shown for SPEEDY-LLR beyond day 11, at which time some of the forecasts for that model fail.
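The decomposition of the root-mean-square error into a bias and a standard deviation component can be sketched as follows. This is a minimal sketch, not the code used for the paper: the function name `decompose_rmse` and the optional `weights` argument (for example, cosine-of-latitude area weights) are assumptions introduced for illustration, since the text does not specify how the global averages are weighted.

```python
import numpy as np

def decompose_rmse(forecast, truth, weights=None):
    """Split the global RMSE into a bias and a standard deviation part.

    forecast, truth: arrays of identical shape, e.g. (n_lat, n_lon).
    weights: optional non-negative weights broadcastable to that shape
             (hypothetical; uniform weights are used if omitted).
    Returns (bias, std, rmse) with rmse**2 == bias**2 + std**2.
    """
    error = np.asarray(forecast, dtype=float) - np.asarray(truth, dtype=float)
    if weights is None:
        weights = np.ones_like(error)
    weights = np.broadcast_to(weights, error.shape)
    w = weights / weights.sum()                      # normalized weights
    bias = np.sum(w * error)                         # global mean error
    std = np.sqrt(np.sum(w * (error - bias) ** 2))   # spatially varying part
    rmse = np.sqrt(np.sum(w * error ** 2))
    return bias, std, rmse
```

The identity between the three quantities holds for any normalized set of weights, so the choice of weighting affects the values but not the decomposition itself.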
For the temperature near the surface (at 950 hPa, top panel), SPEEDY rapidly develops a warm bias that oscillates around a mean of 0.75 K with the diurnal cycle. This bias is the result of SPEEDY using a single daily average value of the incoming solar radiation at the top of the atmosphere at all times of the day. The hybrid model greatly reduces the magnitude of the bias and also removes its diurnal oscillation. The biases of the ML model and SPEEDY-LLR are comparable to that of the hybrid model in magnitude, but the SPEEDY-LLR bias exhibits diurnal variability.
The spatially variable component of the low-level temperature error remains lower for the hybrid model than for SPEEDY throughout the 14-day period shown in the figure. The same component is initially similarly low for the hybrid and ML-only model, but it increases much more rapidly for the ML-only model. (Even with this rapid increase, the ML-only forecasts remain more accurate than the SPEEDY forecasts until about day 4.) This component is initially lower for the hybrid model than for SPEEDY-LLR, but their accuracies are essentially the same after about day 8. Also, while the curves for SPEEDY and the hybrid model saturate at the same level as persistence, the curve for the ML-only model saturates at a higher level, indicating that the ML-only model overestimates the spatial variability of the low-level temperature at the longer forecast times.
SPEEDY rapidly develops a positive specific humidity bias near the surface (950 hPa, middle panel) that saturates at about 1 g/kg at day 7 lead time. Both the hybrid model and the other two benchmarks eliminate most of this bias. The spatially varying component of the error behaves similarly to that for the low-level temperature, with the hybrid model outperforming the benchmarks for lead times from 1 to 7 days.
For the meridional wind component in the upper troposphere (200 hPa, bottom panel), none of the models develop a noteworthy bias. Thus, the differences in forecast performance are solely due to differences in the spatially varying component of the forecast error. This error component is still smaller for the hybrid model than for SPEEDY for the first 9 forecast days, and than for the other benchmarks for the first 6 forecast days.
3.5 Atmospheric Balance
Maintaining the delicate balance between the wind (momentum) and mass field in a numerical model, especially at short forecast lead times, has been one of the biggest challenges of atmospheric modeling since the dawn of NWP (e.g., Lynch, 2006). In a modern NWP model, a weakened balance is a short-lived transient property and the magnitude of the initial transient can be greatly reduced by initialization techniques (e.g., section 8 of Lynch (2006)). In the hybrid model and SPEEDY-LLR, however, no initialization is done before a corrected 6 hr forecast is used as the initial condition of the next 6 hr numerical forecast. Hence, the corrections inevitably upset the balance in the numerical component of the hybrid forecasts every 6 hr. The forecast verification results discussed thus far suggest that these imbalances do not outweigh the positive effects of the corrections on the accuracy of the hybrid forecasts. But, can the hybrid model produce realistic surface pressure tendencies by also correcting the surface pressure field for the effects of gravity waves excited by the imbalances? We investigate this possibility by examining the global root-mean-square of the surface pressure tendency in the forecasts for the hybrid and the benchmark models (Figure 7). We assume that the value computed for ERA5 (red curve), which is about 0.4 hPa/h, provides a realistic estimate of the global root-mean-square of surface pressure tendency in the atmosphere.

Atmospheric balance in the model forecasts. Shown is the global root-mean-square of the approximate surface pressure tendency computed by finite differences based on 6-hourly data for the (blue) hybrid model, (green) simplified parameterization, primitive-equation dynamics (SPEEDY), (orange) machine learning-only model, and (purple) SPEEDY-LLR model. The (red) value computed for 2011–2012 based on the ERA5 reanalyses is also shown for reference.
As can be expected from a numerical model started from an uninitialized initial condition, the initial tendency for SPEEDY (about 1 hPa/h) is higher than desired. As forecast time increases, the magnitude of the mean tendency drops, first rapidly, and then at a decreasing rate until it settles below the natural level, at about 0.28 hPa/h. The latter behavior suggests that the diffusion built into the model to combat imbalances over-smooths the temporal variability of the forecasts beyond day 1. While the magnitude of the mean tendency for the hybrid forecasts (about 0.38 hPa/h) is initially slightly smaller than the natural value, and further decreases in the first 72–84 hr (to about 0.36 hPa/h), it is closer to the natural value than those for the benchmark forecasts. The SPEEDY-LLR is less effective than the hybrid model in eliminating the initial transient and it also produces an average tendency at the later forecast times (about 0.30 hPa/h) that is further below the natural level. The ML-only model behaves similarly to the hybrid model for the first two forecast days, but the saturation value is clearly lower (about 0.33 hPa/h) than for the hybrid model.
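The balance diagnostic discussed in this section, the global root-mean-square of the surface pressure tendency approximated by finite differences of 6-hourly fields, can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `rms_pressure_tendency`, the array layout, and the optional cosine-of-latitude weighting are hypothetical and are not taken from the paper.

```python
import numpy as np

def rms_pressure_tendency(ps, dt_hours=6.0, lat_deg=None):
    """Global RMS of the finite-difference surface pressure tendency.

    ps: array of shape (n_time, n_lat, n_lon) with surface pressure in hPa,
        sampled every dt_hours (hypothetical layout).
    lat_deg: optional latitudes in degrees; if given, cos(latitude) area
        weights are used (an assumption, the paper does not state this).
    Returns an array of length n_time - 1 in hPa/h.
    """
    ps = np.asarray(ps, dtype=float)
    # Forward difference between consecutive output times.
    tendency = np.diff(ps, axis=0) / dt_hours
    if lat_deg is None:
        w = np.ones(ps.shape[1])
    else:
        w = np.cos(np.deg2rad(np.asarray(lat_deg, dtype=float)))
    w = w / w.sum()
    # Mean square along longitude, then weighted mean over latitude.
    zonal_mean_sq = (tendency ** 2).mean(axis=2)
    return np.sqrt(zonal_mean_sq @ w)
```

Averaging the returned series over many forecasts started from different initial times gives one curve per model as a function of lead time.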
3.6 Sensitivity to Training Length
To test the sensitivity of the performance and stability of the hybrid model to the training length, we carry out a series of experiments with the same hyperparameters as before, but for shorter training periods. In particular, we train the model on 2 years, 5 years, or 10 years of reanalysis data, with the training always ending at 2300 UTC, 26 June 2011, as for the original forecast experiments. (We recall that the length of the training for the original experiments is 20.5 years.) The results of these experiments for the usual 100 21-day forecast cases for select variables are summarized in Figure 8.

Time evolution of the global root-mean-square forecast error for different lengths of the training of the hybrid model. Results are shown for a (purple) 2-year, (green) 5-year, (red) 10-year, and (blue) 20.5-year training period. For reference, the forecast errors are also shown for (brown dashes) simplified parameterization, primitive-equation dynamics and (black dashes) climatology.
While training the hybrid model for only 2 years already significantly improves the forecast performance for the near-surface temperature and specific humidity compared to that of SPEEDY, extending the training length further improves the forecasts. The hybrid model trained for 2 years does not improve the meridional wind component in the upper troposphere, and actually degrades the forecasts beyond 3 days. A longer training makes the hybrid model perform better initially than SPEEDY. The length of the superior performance of the hybrid model becomes longer as the length of the training period increases. The results shown in Figure 8 also suggest that further modest improvements of the forecast performance could be achieved by using a training period even longer than 20.5 years.
4 Climate Simulation Experiment
To evaluate the long-term stability of the hybrid model and its ability to simulate the climate, we compute an 11-year-long free run with the model. For this simulation experiment, the hybrid model is trained on ERA5 reanalyses for the 19-year period from 1 January 1981 to 27 December 1999. The simulation starts from the ERA5 reanalysis valid at 0000 UTC, 1 January 2000. To suppress the effects of initial transients and the initial condition on the model diagnostics, we discard the data from the first year of the simulations before computing the diagnostics. To compare the performance of the hybrid model and SPEEDY in simulating the climate, we assume that the two simulations attempt to simulate the climate of the 10-year period from 2001 to 2010 as represented by ERA5.
4.1 Zonal Mean Biases
Figures 9 and 10 show the zonal mean biases of the simulations by SPEEDY (left panels) and the hybrid (right panels) for the boreal winter (December, January, and February) and boreal summer (June, July, and August), respectively. These figures can be used, not only to compare the quality of the two simulations, but also to assess the average magnitude of the corrections made by the ML component of the hybrid model. In particular, the difference between a left panel and the corresponding right panel is the zonal mean of the ML correction for a particular state variable.

Comparison of the zonal mean biases of the simplified parameterization, primitive-equation dynamics (SPEEDY) and hybrid simulations for the boreal winter (December, January, February). Results are shown for (top) the temperature, (middle) zonal wind, and (bottom) specific humidity for (left) SPEEDY and (right) the hybrid model.

Same as Figure 9, except for the boreal summer (June, July, August).
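The seasonal zonal mean biases shown in Figures 9 and 10 can be illustrated with the following sketch. It is a minimal sketch under stated assumptions, not the code used for the paper: the function name `seasonal_zonal_mean_bias`, the array layout, and the `months` argument are hypothetical, and the simulation and the reference are assumed to share the same time axis and grid.

```python
import numpy as np

def seasonal_zonal_mean_bias(simulation, reference, months, season=(12, 1, 2)):
    """Zonal mean of the seasonal mean difference, simulation minus reference.

    simulation, reference: arrays of shape (n_time, n_level, n_lat, n_lon)
        on the same grid and time axis (hypothetical layout).
    months: integer array of length n_time with the calendar month (1-12)
        of each time sample.
    season: months that make up the season, e.g. (12, 1, 2) for boreal
        winter or (6, 7, 8) for boreal summer.
    Returns an array of shape (n_level, n_lat).
    """
    in_season = np.isin(np.asarray(months), season)
    # Seasonal climatologies of the simulation and of the reference.
    sim_clim = np.asarray(simulation, dtype=float)[in_season].mean(axis=0)
    ref_clim = np.asarray(reference, dtype=float)[in_season].mean(axis=0)
    # Average the bias around each latitude circle.
    return (sim_clim - ref_clim).mean(axis=-1)
```

Applying the function to the SPEEDY and hybrid simulations with the same reference and subtracting the two results gives the zonal mean of the ML correction discussed above.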
The top left panels show that SPEEDY has a large upper tropospheric warm bias for the tropical regions, during both the boreal winter and summer. In both polar regions SPEEDY has a cold bias for the upper troposphere and stratosphere during the boreal winter and a warm (cold) bias in the southern (northern) polar region during the boreal summer. The magnitude of the bias is not surprising given the coarse resolution and simplified parameterizations used in SPEEDY (Molteni, 2003). The top right panels show that the hybrid model greatly reduces, but does not completely eliminate, these biases when the model is cycled over a long period of time. The bias reduction is particularly notable in the tropics and the midlatitudes. The largest remaining biases are in the polar regions.
The hybrid model reduces the zonal component of the wind bias, especially in the stratosphere and upper troposphere, and in the lower troposphere in the SH midlatitudes in the boreal summer. The only exception is the introduction of a positive zonal component of the wind bias in the stratosphere in the tropics. The hybrid model also greatly reduces the large positive humidity bias of SPEEDY with maxima in the tropics.
Figure 11 shows the mean surface pressure biases for the simulations by SPEEDY (left panels) and the hybrid model (right panels) for the boreal winter (top row) and boreal summer (bottom row). The mottled short-scale patterning seen in the two left panels of the figure is due to the spectrally truncated topography of SPEEDY, which is much smoother than the topography determining the interpolated ERA5 reanalyses used for the evaluation of the simulations, and for the training of the hybrid model. In combination with the artifacts caused by the spectral truncation in SPEEDY, the large local differences in the mountainous regions lead to substantial surface pressure biases in the SPEEDY simulations. The hybrid model corrects the large local biases, but still has smaller-magnitude large-scale biases. The wave-number-two structure of the large-scale hybrid model bias in the NH suggests that these biases are related to the low-resolution representation of the topography and the land-sea contrasts in the numerical model. The remaining biases are also relatively large in the polar regions, especially in the boreal summer. We speculate that the bias of the hybrid model in the polar regions might be related to our particular strategy to do the localization on a cylindrical (Mercator) map projection. On the other hand, the bias is not concentrated at the poles for the variables shown in Figures 9 and 10.

The mean surface pressure bias in the simplified parameterization, primitive-equation dynamics (SPEEDY) and hybrid climate simulations. Shown is the bias for (top) the boreal winter (December, January, February) and (bottom) boreal summer (June, July, August) for (left) SPEEDY and (right) the hybrid model.
4.2 Temporal Variability
To investigate the temporal variability of the atmosphere in the SPEEDY and hybrid climate simulations, we examine the temporal dependence of the 950 hPa temperature at the four model grid points that fall in the Sahara Desert. The top two panels of Figure 12 show the power spectra of the temporal variability for the two models. These power spectra are computed by applying a Hamming filter first, and then a discrete Fourier transform to the 10 years of 6-hourly simulation data, and finally computing the square of the absolute value of the Fourier coefficients. The results show that both simulations correctly capture the variability at time scales longer than about a week. At the shorter time scales, however, SPEEDY increasingly underestimates the variability. The ML correction greatly reduces, but does not completely eliminate, this problem: the hybrid model underestimates the variability at the scales between 1 week and 1 day only slightly, and reduces the underestimation by SPEEDY at the even shorter scales. Most importantly, unlike SPEEDY, the hybrid model has a strong diurnal cycle. It should be noted that an earlier version of the hybrid model, which did not include the incoming solar radiation at the top of the atmosphere as an input to the reservoir, lost the diurnal cycle at around the end of year 4. This motivated us to add the incoming solar radiation as an input parameter, even though it had no significant effect on the forecast accuracy. We find it a noteworthy, nontrivial result that the earlier version of the hybrid model was able to learn the diurnal cycle strictly from the training data.

Temporal variability of the 950 hPa temperature in the Sahara Desert for the 10 years of simulations. Shown are the power spectra for (top) the hybrid model and ERA5 and (middle) simplified parameterization, primitive-equation dynamics and ERA5. The bottom panel shows the time series of simulated temperatures for the last full year of the simulations. The gray shading represents the range of plus/minus two standard deviations from the mean in the ERA5 reanalyses for 2001–2010.
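The three steps used to compute the power spectra (a Hamming window, a discrete Fourier transform, and the squared absolute value of the Fourier coefficients) can be sketched as follows. This is a minimal sketch, not the code used for the paper: the function name `windowed_power_spectrum` is hypothetical, and the time mean is not removed before windowing because the text does not mention such a step.

```python
import numpy as np

def windowed_power_spectrum(series, dt_days=0.25):
    """Power spectrum of a time series after applying a Hamming window.

    series: one-dimensional array of evenly spaced samples.
    dt_days: sampling interval in days (0.25 for 6-hourly data).
    Returns (freqs, power) with frequencies in cycles per day.
    """
    x = np.asarray(series, dtype=float)
    window = np.hamming(x.size)               # taper to reduce spectral leakage
    coeffs = np.fft.rfft(x * window)          # DFT of the real-valued series
    power = np.abs(coeffs) ** 2               # squared modulus of coefficients
    freqs = np.fft.rfftfreq(x.size, d=dt_days)
    return freqs, power
```

For 6-hourly data, the diurnal cycle appears as a peak at one cycle per day, and the highest resolved frequency is two cycles per day.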
The fact that a simulation correctly captures the variability at a number of frequencies does not guarantee that the phases of the temporal changes (e.g., the timing of the seasons) are also correct. To exclude the possibility of such a flaw of the simulations, we plot (bottom panel of Figure 12) the time series of the average 950 hPa temperature for the same four Saharan grid points for the last full year of the simulations. The points along these curves should fall within two standard deviations from the mean for the given date and time (the interval marked by gray shading) with a 95% observed frequency. Based on the full 10 years of data, the observed frequency is 88.2% for SPEEDY and 98.0% for the hybrid model.
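The observed frequency quoted above can be illustrated with the following sketch, which counts how often a simulated value falls within two standard deviations of the reference mean for the same date and time. It is a minimal sketch under stated assumptions: the function name `coverage_frequency` and its arguments are hypothetical, and the reference mean and standard deviation are assumed to be already aligned with the simulated series.

```python
import numpy as np

def coverage_frequency(simulated, ref_mean, ref_std, n_std=2.0):
    """Percentage of simulated values within n_std standard deviations.

    simulated: simulated values, e.g. 6-hourly 950 hPa temperatures.
    ref_mean, ref_std: reference mean and standard deviation for the
        matching dates and times (scalars or arrays of the same shape).
    """
    simulated = np.asarray(simulated, dtype=float)
    inside = np.abs(simulated - ref_mean) <= n_std * np.asarray(ref_std, dtype=float)
    return 100.0 * inside.mean()
```

A value close to 95% indicates that the simulated variability and its timing are consistent with the reference, while a markedly lower value points to a bias or a phase error.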
5 Conclusions
In this paper, we described results from the first implementation of the hybrid modeling approach CHyPP of Wikner et al. (2020) on a realistic atmospheric model. We used a low-resolution AGCM based on the full set of primitive equations, along with ERA5 reanalysis data for training and verification, to demonstrate the potential of CHyPP for both NWP and climate modeling. The spatio-temporal structure of the improvements of the forecasts and simulations suggests that the ML component of the model primarily corrects for errors caused by the limitations of the parameterization schemes of the AGCM. While state-of-the-art numerical models have much higher resolutions and more advanced parameterization schemes than SPEEDY, the weather forecasts and climate simulations they provide still have substantial biases. We expect the hybrid approach to effectively reduce these biases.
Because the ML component of the hybrid model is based on RC, training the model is computationally highly efficient. Specifically, the training described in this paper requires only 30 min wall-clock time using 1,152 Intel Xeon E5-2670 v2 processors on a supercomputer that is much less powerful than those at the operational NWP centers. Using the same computational resources, preparing a 21-day forecast takes about 52 s, while carrying out a one-year simulation takes about 15 min. These numbers are only 25% higher than those for SPEEDY, and the extra time is mainly due to the overhead associated with the frequent restart of SPEEDY.
Due to the parallel nature of the computational algorithm, we expect it to scale well for higher model resolutions and larger number of processors. A modification of the current implementation of our method that might be helpful for scaling is vertical localization. By “vertical localization” we mean the use of local domains that, as well as being limited in horizontal extent as shown in Figure 1, are also of limited height and are stacked vertically with overlap from ground-level to the top of the atmosphere. Though we do not use vertical localization in this article, we plan to test it soon for potential improvements with SPEEDY.
The ideal size of a local domain still needs to be determined through additional experimentation, both for SPEEDY and for higher-resolution models. Thus, it is hard to make a precise quantitative projection for scaling, but here is a comparison that indicates feasibility for operational models. The current computer of ECMWF has 129,960 processors (about 100 times more than what we used), and their operational model has 6.5 × 10^6 horizontal grid points (about 180 times more than SPEEDY) (“IFS Documentation CY47R1–Part III: Dynamics and Numerical Procedures”, 2020). If the local regions for the ECMWF model were defined by four horizontal and all vertical grid points, as in our paper, each processor would have to handle less than twice as many local regions at ECMWF as in our model. Also, there is no obvious reason to believe that the computational overhead of the hybrid model would be substantially higher than the 25% we found for SPEEDY. The high computational efficiency of the approach would allow for a large number of experiments to find the optimal configuration of a future operational hybrid model. Developing an efficient systematic approach to find a near optimal combination of the hyperparameters, nevertheless, would be highly desirable and is one of the subjects of our ongoing research efforts. An unknown factor that could have a very favorable impact on future scaling considerations is the ongoing rapid technological development of alternative, fast, cheap physical implementations of reservoir computing, for example, implementations based on photonics or on Field Programmable Gate Arrays.
We emphasize that while the ML component of the hybrid model is highly efficient in correcting the biases of the forecasts and simulations prepared by the host model, it is not an ML-based postprocessing technique. While a technique of the latter type corrects the numerical-model-based forecasts of a specific forecast variable or phenomenon (e.g., Chapman et al., 2019; Kim et al., 2021; Rasp & Lerch, 2018) without interacting with the numerical model, the ML component of the hybrid model makes frequent periodic interactive corrections to the numerical model solution. Hence, it also greatly improves the representation of the spatiotemporal variability of the atmospheric state by the model.
We expect that the performance of the hybrid model can be further improved by investigating the relationship between the parameters of the ML model and the representation of basic atmospheric processes. Such an investigation could lead to further improvements of the model, similar to the way studies of the interactions between numerics and dynamics (e.g., Arakawa & Lamb, 1977) led to much improved physics-based numerical models. For instance, one potentially important fundamental question is the optimal relationship between the size of the local domains, the overlap between the local domains in the input of the reservoir, and the length of the time step Δt. The fact that the ML component is more effective in correcting localized errors than errors at the larger scales in the current version of our hybrid model may be partly the result of using local domains and an overlap that are less than optimal for the selected time step. In our experiments, the size of the overlap was primarily dictated by the structure of our code and the available computer resources, but larger local domains and a larger overlap could be used in the future.
An intriguing possibility is to use the hybrid model for data assimilation in addition to forecasting, as data assimilation could greatly benefit from the higher accuracy and smaller biases of the short-term hybrid forecasts used as background. Furthermore, integrating ML and data assimilation may in the future allow online training of the ML component of the hybrid model on real-time observations rather than canned reanalysis data. The availability of such a training procedure would make it possible to extend the hybrid modeling approach to numerical models for which high-quality reanalysis data are not available (e.g., an AGCM that also includes a sophisticated model of the upper atmosphere well beyond the lower stratosphere). It could also allow the ML component of the model to adjust to variability and changes of the climate. We have made a first step toward this ambitious goal, in which we iteratively use the hybrid model to prepare an updated set of analyses, which is then used to train the next iteration of the hybrid model (Wikner et al., 2021). Our plan is to test this approach with the hybrid model of the current paper.
Acknowledgments
This work was supported by DARPA contract DARPA-PA-18-01 (HR111890044). The work of T. Arcomano and I. Szunyogh was also supported by ONR award N00014-18-2509. The work of Alexander Wikner was supported in part by the National Science Foundation (NSF) (Award No. DGE-1632976). Portions of this research were conducted with the advanced computing resources provided by Texas A&M High Performance Research Computing. This paper greatly benefitted from stimulating discussions with Sarthak Chandra, Michelle Girvan, Garrett Katz, and Andrew Pomerance. The constructive comments of the three anonymous reviewers helped us to greatly improve the presentation of our ideas and results.
Open Research
Data Availability Statement
The new data generated for the paper are available online at http://doi.org/10.5281/zenodo.5103176.
References
Erratum
Sections 2.2 and 2.4.1 of the originally published version of this article misstated two details of the method used to obtain the reported results. The results and conclusions of the paper were not affected. The errors have been corrected, and this may be considered the official version of record.