A Hybrid Approach to Atmospheric Modeling That Combines Machine Learning With a Physics-Based Numerical Model
This article was corrected on 23 MAR 2023. See the end of the full text for details.
Abstract
This paper describes an implementation of the combined hybrid-parallel prediction (CHyPP) approach of Wikner et al. (2020), https://doi.org/10.1063/5.0005541, on a low-resolution atmospheric global circulation model (AGCM). The CHyPP approach combines a physics-based numerical model of a dynamical system (e.g., the atmosphere) with a computationally efficient type of machine learning (ML) called reservoir computing to construct a hybrid model. This hybrid atmospheric model produces more accurate forecasts of most atmospheric state variables than the host AGCM for the first 7–8 forecast days, and for even longer times for the temperature and humidity near the Earth's surface. It also produces more accurate forecasts than a model based only on ML, or a model that combines linear regression, rather than ML, with the AGCM. The potential of the CHyPP approach for climate research is demonstrated by a 10-year-long hybrid model simulation of the atmospheric general circulation, which shows that the hybrid model can simulate the general circulation with substantially smaller systematic errors and more realistic variability than the host AGCM.
Key Points

A hybrid model incorporating machine learning produces more accurate forecasts and more realistic climate than the host physics-based model

The hybrid model states are more realistically balanced and have substantially lower biases than the host model

The hybrid model produces more realistic atmospheric variability than the host model at time scales shorter than about a week
Plain Language Summary
This paper presents a computationally efficient novel approach to construct a hybrid model of the atmosphere by combining a physics-based model of the global atmospheric circulation with a machine learning component. The primary purpose of the hybrid model is to produce quantitative weather forecasts on the same grid as the physics-based model. It is found that the hybrid model produces more accurate forecasts than the host physics-based model for the first 7–8 forecast days for most forecast variables, and for even longer times for the temperature and humidity near the Earth's surface. Furthermore, the hybrid model is found to simulate the climate with substantially smaller systematic errors and more realistic temporal variability than the host model.
1 Introduction
Numerical weather prediction (NWP) models have been the backbone of operational weather prediction for several decades (e.g., Harper, 2008; Lynch, 2006). A particular model implements a numerical solution algorithm for the physics-based set of coupled partial differential equations that govern atmospheric motion (e.g., Szunyogh, 2014). The resulting numerical equations form the dynamical core of the model. The effects of processes not resolved explicitly by the dynamical core are taken into account by parameterization schemes that contribute to the forcing terms of the equations. These schemes are based on some combination of theoretical and empirical considerations (e.g., Stensrud, 2007). The initial conditions of the numerical model solutions are observation-based estimates (analyses) of the state of the atmosphere, and the process that produces these estimates is called data assimilation (e.g., Szunyogh, 2014). Advances in modeling and data assimilation techniques, along with the increase of computing power and of the number of observations available for assimilation, led to a “quiet revolution of NWP” (Bauer et al., 2015). The incorporation of machine learning (ML) techniques into the NWP process promises further forecast accuracy gains by extracting additional information from the observations.
The earliest applications of ML to atmospheric modeling focused on improving the computational efficiency of the physics-based numerical models (e.g., V. Krasnopolsky et al., 2005; V. Krasnopolsky & Fox-Rabinovitz, 2006; V. M. Krasnopolsky, 2013). These applications employed neural networks to emulate the computationally most expensive physics-based parameterization schemes at a reduced computational cost. The term hybrid model was first used in reference to models using this technique. One approach employed by this type of hybrid model is to use a single neural network to emulate the combined effect of multiple parameterized processes, such as cumulus convection, radiation, and boundary layer transport (e.g., V. Krasnopolsky et al., 2010; V. M. Krasnopolsky, 2013; Brenowitz & Bretherton, 2018, 2019; Rasp et al., 2018). For this purpose, the ML systems are often trained on data produced by model simulations at higher resolution, or with more sophisticated physical parameterization schemes.
Another type of ML-based parameterization scheme (e.g., Chattopadhyay et al., 2020; Gentine et al., 2018; Rasp et al., 2018) is trained on observations or observation-based reanalyses. Such a scheme has the potential to learn the effects of processes that even higher-resolution and more sophisticated model simulations are still unable to capture. ML techniques have also been considered for the estimation of the free parameters of physics-based parameterization schemes (Schneider et al., 2017). This approach takes advantage of the knowledge built into the parameterization schemes, but may suffer from the assumptions and approximations made by the schemes.
The hybrid approach we propose belongs to a class of techniques that differ from those mentioned thus far. Techniques of this class use ML, after training on observational analyses, for the frequent, periodic, interactive correction of the spatiotemporally evolving physics-based numerical model solution. The specific approach we propose was originally developed by Pathak, Wikner, et al. (2018) and later adapted to large dynamical systems by Wikner et al. (2020), who named it Combined Hybrid-Parallel Prediction (CHyPP). It evolves the hybrid forecasts iteratively, combining a short-term (e.g., 6 hr) numerical forecast with a state-dependent ML correction in each “time step” of the “hybrid model integration”. CHyPP is not a post-processing technique, because each “time step” of the evolving hybrid model solution starts from the ML-corrected state of the preceding step, whereas a post-processing technique does not interact with the evolving model solution. The ML component of CHyPP uses the computationally highly efficient parallel reservoir computing (RC) algorithm of Pathak, Hunt, et al. (2018). The other hybrid approaches of the same class use either a random forest (Watt-Meyer et al., 2021) or a deep learning ML component (Farchi et al., 2021), rather than one based on RC.
Wikner et al. (2020) demonstrated the potential of CHyPP for predicting the evolution of a spatiotemporally chaotic system by experiments with the Kuramoto-Sivashinsky (KS) model (Sivashinsky, 1977), a model that has a single state variable that depends on a single space dimension in addition to time. We implement CHyPP on the Simplified Parameterization, primitive-Equation Dynamics (SPEEDY) (Kucharski et al., 2006; Molteni, 2003) atmospheric global circulation model (AGCM). Ours is the first implementation of the approach on a model that has multiple state variables that span a wide range of values and depend on all three spatial dimensions. Because SPEEDY has a substantially lower resolution than a state-of-the-art NWP or climate model, our primary goal is to demonstrate the feasibility and potential of CHyPP for an atmospheric application, rather than to propose our current model as a replacement for a state-of-the-art numerical model. The results of our forecast experiments show that the performance of the hybrid model is superior to that of SPEEDY, of a model based only on ML, and of a model that uses linear regression rather than ML for the correction of the short-term (“one time step”) numerical forecasts.
In what follows, we first describe the hybrid approach and its implementation on SPEEDY in detail (Section 2). Next, we discuss the results of the forecast experiments (Section 3) and the climate simulation (Section 4). Finally, we summarize our key findings and draw our conclusions (Section 5).
2 The Hybrid Model
In CHyPP, the physics-based numerical model state is evolved globally, while the ML correction is done in parallel, in small local domains (Pathak, Hunt, et al., 2018). The model state of a local domain is represented by a local state vector composed of the relevant components of the global state vector. The global hybrid prediction is obtained by piecing together the local hybrid predictions at the end of each Δt-long “time step” of the “hybrid model integration”. This approach can be implemented on any numerical model by adjusting the definition of the local state vectors to the spatial discretization strategy of the model. We note that the localization strategy of CHyPP is similar to that employed by the Local Ensemble Transform Kalman Filter (LETKF) data assimilation scheme (Hunt et al., 2007; Ott et al., 2004; Szunyogh et al., 2008), which has been found to scale efficiently even for very high (kilometer) resolution operational weather prediction models (e.g., Schraff et al., 2016).
2.1 The Global State Vector
SPEEDY is a spectral transform AGCM that was developed to produce rapid climate simulations, using simplified but modern physical parameterization schemes (Molteni, 2003). We implement CHyPP on the standard configuration of Version 41 of the model: the spectral horizontal resolution is T30, while the grid used for the computation of the nonlinear terms and parameterizations has a nominal horizontal resolution of 3.75° × 3.75°, with state variables defined at eight vertical σ-levels (0.025, 0.095, 0.20, 0.34, 0.51, 0.685, 0.835, and 0.95), where σ is the ratio of pressure to surface pressure. The three-dimensionally varying state variables of the model are the two components of the horizontal wind vector, temperature, and specific humidity, while the single two-dimensionally varying state variable is the natural logarithm of surface pressure. The global computational grid and the state variables of the hybrid model are the same as those of SPEEDY.
2.2 The Local State Vectors
In our implementation of CHyPP on SPEEDY, each local state vector represents the atmospheric state in a three-dimensional local domain that has the shape of a rectangular box with a 7.5° × 7.5° (2 × 2 horizontal grid points) base and extends vertically from ground level to σ = 0.025. (The boundaries of the horizontal footprint of a local domain are marked by a blue rectangle in Figure 1.) In what follows, we describe the computations carried out in parallel for each of the L = 1,152 local domains to evolve the hybrid model state from time t to t + Δt.
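To make the domain decomposition concrete, here is a minimal Python sketch that tiles the global grid into 2 × 2 local domains and grows each one by a one-point halo to form its extended local domain. It assumes a 96 × 48 longitude-latitude grid (360°/3.75° = 96 longitudes; SPEEDY's latitudes are Gaussian rather than equally spaced, which does not affect the counting). The halo width of one point is inferred from the 16-point (4 × 4) extended domains used in this section, and all names are hypothetical. Longitude is treated as periodic, while latitude is clipped at the poles.

```python
N_LON, N_LAT = 96, 48  # assumed 3.75-degree SPEEDY grid: 96 longitudes x 48 latitudes
BOX = 2                # each local domain is 2 x 2 horizontal grid points
HALO = 1               # extended domain adds one grid point on every side (inferred)


def local_domains():
    """Return (lat_indices, lon_indices) for every 2 x 2 local domain."""
    domains = []
    for j0 in range(0, N_LAT, BOX):
        for i0 in range(0, N_LON, BOX):
            domains.append((list(range(j0, j0 + BOX)), list(range(i0, i0 + BOX))))
    return domains


def extended_domain(lat_idx, lon_idx, halo=HALO):
    """Grow a local domain by `halo` points; longitude wraps, latitude stops at the poles."""
    lats = [j for j in range(lat_idx[0] - halo, lat_idx[-1] + halo + 1) if 0 <= j < N_LAT]
    lons = [i % N_LON for i in range(lon_idx[0] - halo, lon_idx[-1] + halo + 1)]
    return lats, lons


domains = local_domains()
sizes = [len(la) * len(lo) for la, lo in (extended_domain(*d) for d in domains)]
print(len(domains))            # 1152 local domains
print(min(sizes), max(sizes))  # 12 16: polar-adjacent vs. all other extended domains
```

The clipping at the poles is what yields the smaller 12-point extended domains for the local domains adjacent to the poles.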
Let v(t) be the local state vector for an arbitrary local domain at time t. The dimension of this state vector is 4 × (8 × 4 + 1) = 132, resulting from the 4 grid points of a local domain, the 8 σ-levels, the 4 volume-distributed state variables, and the natural logarithm of surface pressure. Because the different state variables have different units and ranges of values, and these ranges also depend on geographical location and vertical level, each state variable is standardized before forming v(t) to have a mean of 0 and a standard deviation of 1 for each vertical level of each extended local domain (represented by a red rectangle in Figure 1). The standardization uses ERA5 reanalysis data (Hersbach et al., 2020) to compute the climatological mean and standard deviation (across both time and the 16 horizontal grid points in the extended local domain) of each state variable at each vertical level. We introduce the notation v^{p}(t), v^{h}(t), and v^{a}(t) for the local state vector of SPEEDY, the hybrid model, and the reanalysis, respectively. We also introduce the notations v^{gp}(t), v^{gh}(t), and v^{ga}(t) for the related global state vectors. For instance, the components of v^{ga}(t) in an arbitrary local domain are the components of v^{a}(t). In what follows, we explain the steps of the computation of v^{gh}(t + Δt) from v^{gh}(t). A flowchart of these steps is shown in Figure 2a.
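The per-level standardization can be sketched in a few lines of NumPy. This assumes the ERA5 training data for one state variable in one extended local domain are stored as an array of shape (time, σ level, horizontal point); the function names and the synthetic data are hypothetical.

```python
import numpy as np


def standardize_by_level(field):
    """Standardize one state variable in one extended local domain.

    field: array of shape (n_times, n_levels, n_points), e.g. n_points = 16.
    Mean and standard deviation are taken jointly over time and the horizontal
    points, separately for each vertical level.
    """
    mean = field.mean(axis=(0, 2), keepdims=True)
    std = field.std(axis=(0, 2), keepdims=True)
    return (field - mean) / std, mean, std


def unstandardize(z, mean, std):
    """Map standardized values back to physical units (e.g. for verification)."""
    return z * std + mean


# Synthetic "temperature" whose mean differs by level (made-up numbers).
rng = np.random.default_rng(0)
levels_mean = np.linspace(220.0, 290.0, 8)[None, :, None]
temperature = levels_mean + 5.0 * rng.standard_normal((1000, 8, 16))
z, mean, std = standardize_by_level(temperature)
```

Keeping the returned mean and standard deviation lets the hybrid output be mapped back to physical units before it is passed to the numerical model or verified.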
2.3 Reservoir Dynamics
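The reservoir state of each local domain evolves according to

r(t + Δt) = tanh[A r(t) + B u^{h}(t)].     (1)

This dynamical system is the reservoir, r(t) is the reservoir state vector, and u^{h}(t) is the local input state.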
During training, the input term u^{h}(t) in Equation 1 is replaced by u^{a}(t). The local input u^{h}(t) in our case is an m-dimensional extended local state vector, composed of the components of the local state vector v^{h}(t), the additional components of the global state vector v^{gh}(t) from the neighboring local domains (see Figure 1 for illustration), and the prescribed incoming solar radiation at the top of the atmosphere for the extended local domain. The latter component is included to help the hybrid model learn the diurnal cycle from the input data (SPEEDY uses the daily average value of the incoming solar radiation at the top of the atmosphere at all times of the day). For all local domains, m = 16 × (8 × 4 + 1 + 1) = 544, except for the local domains adjacent to the poles, where m = 12 × (8 × 4 + 1 + 1) = 408.
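The dimension counts used for v(t) and u^{h}(t) follow from simple bookkeeping. The small Python sketch below makes that arithmetic explicit; the constant and function names are ours, not SPEEDY's.

```python
N_LEVELS = 8  # sigma levels
N_3D = 4      # u, v, T, q at every level
N_2D = 1      # ln(surface pressure)
N_TOA = 1     # prescribed top-of-atmosphere incoming solar radiation


def local_state_dim(n_points):
    """Dimension of a local state vector v(t) over n_points grid columns."""
    return n_points * (N_LEVELS * N_3D + N_2D)


def input_dim(n_points):
    """Dimension m of the reservoir input u(t) over an extended local domain."""
    return n_points * (N_LEVELS * N_3D + N_2D + N_TOA)


print(local_state_dim(4))  # 132
print(input_dim(16))       # 544
print(input_dim(12))       # 408
```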
Referring to Equation 1, the dimension D_{r} of the vector r(t) is much higher than that of a local state vector v^{h}(t) (e.g., 6,000 vs. 132 in the present article). The activation function tanh(·), applied to a vector argument, returns a vector of the same dimension (D_{r}) whose components are the hyperbolic tangents of the corresponding components of the argument. The matrix A is a sparse D_{r} × D_{r} weighted adjacency matrix that represents a low-degree, directed, random graph (Gilbert, 1959). Each entry of A is randomly chosen to be nonzero with probability κ/D_{r}, where κ is the degree of the graph (the average number of incoming connections per node), and the nonzero entries of A are randomly drawn from a zero-mean uniform distribution. (The ratio κ/D_{r} is a measure of the sparsity of A.) After randomization, the entries of A are scaled such that the largest absolute value of the eigenvalues of A equals a prescribed number ρ (0 < ρ < 1), which is called the spectral radius. The spectral radius controls the length of the memory of the reservoir, and a value ρ < 1 typically makes the reservoir state r(t) depend only on the past states of the modeled system (the atmosphere in our case), and not on the initial reservoir state, when t is sufficiently large. This property of the reservoir is called the echo state property (Jaeger, 2001).
The matrix-vector product Bu^{h}(t) is called the input layer in RC. In our model, B is a D_{r} × m sparse random matrix (so that Bu^{h}(t) has the dimension D_{r} of r(t)) with an equal number of nonzero entries in each row. These nonzero entries, which are chosen randomly from a uniform distribution on the interval (−α, α), couple the components of u^{h}(t) to the reservoir nodes. The input strength α is an adjustable parameter that controls the degree of nonlinearity that the activation function imposes on the input signal u^{h}(t).
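The construction of A and B and one step of Equation 1 can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the authors' code. The nonzero weights of A are drawn from U(−1, 1) before rescaling, and B has exactly one nonzero entry per row; the text fixes neither choice. Dense arrays are used for readability, whereas a reservoir with D_{r} = 6,000 would normally use sparse storage and an iterative eigenvalue solver. The final loop checks the echo state property numerically: two different initial reservoir states driven by the same inputs converge.

```python
import numpy as np


def make_adjacency(d_r, kappa, rho, rng):
    """Sparse random D_r x D_r matrix rescaled to spectral radius rho (dense storage here)."""
    mask = rng.random((d_r, d_r)) < kappa / d_r        # Gilbert random graph, mean degree kappa
    weights = rng.uniform(-1.0, 1.0, size=(d_r, d_r))  # zero-mean uniform (range assumed)
    A = np.where(mask, weights, 0.0)
    radius = np.max(np.abs(np.linalg.eigvals(A)))      # current spectral radius
    return A * (rho / radius)


def make_input_matrix(d_r, m, alpha, rng):
    """D_r x m input matrix, one nonzero per row (assumed count), drawn from U(-alpha, alpha)."""
    B = np.zeros((d_r, m))
    cols = rng.integers(0, m, size=d_r)
    B[np.arange(d_r), cols] = rng.uniform(-alpha, alpha, size=d_r)
    return B


def reservoir_step(r, u, A, B):
    """Equation 1: r(t + dt) = tanh(A r(t) + B u(t))."""
    return np.tanh(A @ r + B @ u)


# Echo state check: different initial states, same input sequence.
rng = np.random.default_rng(1)
d_r, m = 300, 20
A = make_adjacency(d_r, kappa=6, rho=0.5, rng=rng)
B = make_input_matrix(d_r, m, alpha=0.5, rng=rng)
r1 = rng.uniform(-1.0, 1.0, d_r)
r2 = rng.uniform(-1.0, 1.0, d_r)
for _ in range(300):
    u = rng.standard_normal(m)
    r1, r2 = reservoir_step(r1, u, A, B), reservoir_step(r2, u, A, B)
print(np.max(np.abs(r1 - r2)))  # should be very small once the transient has died out
```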
2.4 The Hybrid Model
2.4.1 Training
Figure 2b shows the flow of operations during training. First, we generate a sequence of perturbed global analyses v^{ga}(kΔt)(1 + ɛ^{g}(kΔt)), k = −K − K_{t}, −K − K_{t} + 1, …, −1, where ɛ^{g}(kΔt) is a small-magnitude, zero-mean, normally distributed random noise vector, uncorrelated in time and between its components. The role of this noise is to help the ML model learn to return to the bounded set of realistic atmospheric states (the “attractor”) in the presence of perturbations that may arise in future forecasts (e.g., Jaeger, 2001; Wikner et al., 2020). Adding noise to the global analyses during training is essential for the hybrid model to produce stable, realistic predictions; without it, predictions rapidly become unstable. Similar behavior has been observed in RC applications involving the prediction of other spatiotemporal systems (e.g., Patel et al., 2021).
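The multiplicative perturbation v^{ga}(kΔt)(1 + ɛ^{g}(kΔt)) can be sketched as follows, assuming the training analyses are stacked in a (time, state component) array. The noise is drawn independently for every time and component, which makes it uncorrelated in both. The function name and the synthetic data are hypothetical; 0.20 is the noise standard deviation reported in Section 2.6.

```python
import numpy as np


def perturb_analyses(v_ga, noise_std, rng):
    """Multiply each entry by (1 + eps), eps ~ N(0, noise_std**2), i.i.d. in time and component."""
    eps = rng.normal(0.0, noise_std, size=v_ga.shape)
    return v_ga * (1.0 + eps)


rng = np.random.default_rng(2)
v_ga = rng.uniform(1.0, 2.0, size=(2000, 50))              # synthetic stand-in for analyses
v_noisy = perturb_analyses(v_ga, noise_std=0.20, rng=rng)
```

Only the reservoir inputs are perturbed. As the next paragraph notes, SPEEDY is initialized from the unperturbed analyses.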
For the particular local domain, the local input state u^{a}(kΔt), k = −K − K_{t}, −K − K_{t} + 1, …, −1, is the extended local state vector associated with v^{ga}(kΔt)(1 + ɛ^{g}(kΔt)). The initial state r((−K − K_{t})Δt) of the reservoir can be chosen arbitrarily, because only the evolved reservoir states r[(k + 1)Δt], k = −K, −K + 1, …, −1, are used for training. The purpose of discarding the reservoir states of the first K_{t} (K_{t} ≪ K) iterations is to give the reservoir state r(t) sufficient time to settle on its attractor. The unperturbed global analyses v^{ga}(kΔt) are also used as the initial conditions for SPEEDY to obtain v^{gp}((k + 1)Δt) for k = −K, −K + 1, …, −1.
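The spin-up and collection of training states can be sketched as a loop over Equation 1 that keeps only the states produced after the first K_{t} inputs. The toy reservoir below uses a dense random B and made-up sizes purely to exercise the function, and the function name is hypothetical. Starting from two different initial states shows why r((−K − K_{t})Δt) can be chosen arbitrarily.

```python
import numpy as np


def collect_training_states(u_train, A, B, k_t, r0=None):
    """Drive the reservoir with analysis inputs and drop the first k_t states.

    u_train: (K_t + K, m) inputs u^a(k dt) for k = -K - K_t, ..., -1.
    Returns a (K, D_r) array of r((k + 1) dt) for k = -K, ..., -1.
    """
    r = np.zeros(A.shape[0]) if r0 is None else np.asarray(r0, dtype=float)
    kept = []
    for j, u in enumerate(u_train):
        r = np.tanh(A @ r + B @ u)  # Equation 1 with u^h replaced by u^a
        if j >= k_t:                # discard the spin-up states
            kept.append(r)
    return np.array(kept)


# Toy reservoir (dense, made-up sizes) just to exercise the function.
rng = np.random.default_rng(3)
d_r, m, K, k_t = 50, 8, 100, 25
A = rng.uniform(-1.0, 1.0, (d_r, d_r)) * (rng.random((d_r, d_r)) < 0.1)
A *= 0.5 / np.max(np.abs(np.linalg.eigvals(A)))
B = rng.uniform(-0.5, 0.5, (d_r, m))
u_train = rng.standard_normal((K + k_t, m))
R_zero = collect_training_states(u_train, A, B, k_t)
R_rand = collect_training_states(u_train, A, B, k_t, r0=rng.uniform(-1.0, 1.0, d_r))
```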
The local hybrid states v^{h}(kΔt, W), k = −K + 1, −K + 2, …, 0, represent the results of Equation 2 at those times for a particular W, and v^{a}(kΔt) is the local state vector for the unperturbed global analysis v^{ga}(kΔt). (Notice that we use the notation W for both the variable and the solution of the minimization problem.) The last two terms of the cost function, in which ‖ · ‖^{2} denotes the sum of the squares of the entries of a matrix (the squared Frobenius norm), are regularization terms meant to prevent overfitting, with β_{mod} and β_{res} being the regularization parameters for the numerical model and reservoir components, respectively. With these terms, the direct solution of the least-squares problem is a ridge regression (Tikhonov & Arsenin, 1977). The inclusion of the prior matrix W_{prior}, which was not part of Wikner et al. (2020), allows for a choice like W_{prior} = I, which dictates that in the absence of training data that demonstrate imperfections in the numerical model, the hybrid model should be equivalent to the numerical model. In our experiments, we tried both W_{prior} = I and W_{prior} = 0, and found that the latter yielded better stability. Thus, we report results with W_{prior} = 0, but think that other nonzero choices of W_{prior} merit further study.
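The ridge regression with a prior has a closed-form solution. The sketch below assumes that the hybrid output has the form v^{h} = W_{mod} v^{p} + W_{res} r. That form is our reading of the equations referenced here, which are not reproduced in this text. The sketch also assumes the cost function sums ‖W_{mod} v^{p}(kΔt) + W_{res} r(kΔt) − v^{a}(kΔt)‖^{2} over the training times and adds β_{mod}‖W_{mod} − W_{prior}‖^{2} + β_{res}‖W_{res}‖^{2}. Setting the gradient to zero gives the normal equations solved below; the function name is hypothetical.

```python
import numpy as np


def fit_output_weights(vp, r, va, beta_mod, beta_res, w_prior=None):
    """Ridge regression with a prior on W_mod (assumed form of the cost function).

    vp: (K, n) numerical-model forecasts v^p, r: (K, D_r) reservoir states,
    va: (K, n) verifying analyses v^a. Returns (W_mod, W_res) minimizing
      sum_k ||W_mod vp_k + W_res r_k - va_k||^2
      + beta_mod ||W_mod - W_prior||_F^2 + beta_res ||W_res||_F^2.
    """
    n, d_r = vp.shape[1], r.shape[1]
    if w_prior is None:
        w_prior = np.zeros((n, n))                    # W_prior = 0, as reported
    X = np.hstack([vp, r])                            # one feature row per training time
    lam = np.concatenate([np.full(n, beta_mod), np.full(d_r, beta_res)])
    prior = np.hstack([w_prior, np.zeros((n, d_r))])  # no prior on W_res
    # Zero gradient: (X^T X + diag(lam)) W^T = X^T va + diag(lam) prior^T
    lhs = X.T @ X + np.diag(lam)
    rhs = X.T @ va + lam[:, None] * prior.T
    W = np.linalg.solve(lhs, rhs).T
    return W[:, :n], W[:, n:]


rng = np.random.default_rng(4)
K, n, d_r = 500, 3, 10
vp = rng.standard_normal((K, n))
r = rng.standard_normal((K, d_r))
W_mod_true = rng.standard_normal((n, n))
W_res_true = rng.standard_normal((n, d_r))
va = vp @ W_mod_true.T + r @ W_res_true.T  # noise-free synthetic targets
W_mod, W_res = fit_output_weights(vp, r, va, beta_mod=1e-8, beta_res=1e-8)
```

With very large β_{mod} and W_{prior} = I, the fitted W_{mod} collapses to the identity, matching the "equivalent to the numerical model" interpretation above. Passing a reservoir array with zero columns (np.empty((K, 0))) gives a W_{res} = 0 linear-regression-only fit.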
2.4.2 Synchronization and Prediction
Let K_{f}Δt be the forecast start time. Starting the hybrid forecast requires the global analysis v^{ga}(K_{f}Δt) and the reservoir state r(K_{f}Δt) for each local domain. Because, according to the echo state property, r(K_{f}Δt) is determined by the past states of the atmosphere, it can be obtained by synchronizing the evolution of the reservoir states with the analyses over a sufficiently long period that ends at K_{f}Δt. Let K_{s}Δt be the start time of the synchronization. Synchronization is achieved by evolving the reservoir equation with u^{h}(kΔt) = u^{a}(kΔt) in Equation 1 for k = K_{s}, K_{s} + 1, …, K_{f}.
Piecing together the local hybrid forecasts for all local domains yields the global “one-step” hybrid forecast v^{gh}[(K_{f} + 1)Δt] (Figure 2a). The forecast can be extended arbitrarily far into the future by an iterative process for k = K_{f} + 1, K_{f} + 2, …, in which the extended local state vector u^{h}(kΔt) extracted from v^{gh}(kΔt) is used as the input of Equation 1 to compute r[(k + 1)Δt]. The global hybrid forecast v^{gh}(kΔt) is also used as the initial condition of the SPEEDY forecast v^{gp}[(k + 1)Δt], the numerical model component of the next hybrid step. In a cycled forecast system of an operational NWP center, in which analyses are prepared and forecasts are started with a regular frequency (e.g., every 6 hr), the reservoir state can be kept continuously synchronized with the real-time evolution of the atmosphere.
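The synchronization and forecast cycle can be sketched as two loops. This is a deliberately simplified, hypothetical single-domain version: there is no localization or halo exchange, and the hybrid state itself is the reservoir input. The model_step callable stands in for one Δt-long SPEEDY forecast. As in the previous sketch, the output is assumed to take the form W_{mod} v^{p} + W_{res} r.

```python
import numpy as np


def synchronize(u_analyses, A, B, r0):
    """Drive the reservoir with u^a(k dt), k = K_s, ..., K_f; returns r((K_f + 1) dt)."""
    r = np.asarray(r0, dtype=float)
    for u in u_analyses:
        r = np.tanh(A @ r + B @ u)
    return r


def hybrid_forecast(v0, r, A, B, W_mod, W_res, model_step, n_steps):
    """Iterate the hybrid model from the analysis v0 = v^a(K_f dt).

    Single-domain simplification: the hybrid state is also the reservoir input.
    model_step: hypothetical stand-in for one dt-long numerical forecast.
    """
    v = np.asarray(v0, dtype=float)
    trajectory = []
    for _ in range(n_steps):
        vp = model_step(v)          # numerical forecast from the current hybrid state
        v = W_mod @ vp + W_res @ r  # ML-corrected hybrid state at the next time
        trajectory.append(v)
        r = np.tanh(A @ r + B @ v)  # advance the reservoir with the hybrid state
    return np.array(trajectory)


# Sanity check: with W_mod = I and W_res = 0, the hybrid model reduces to the
# numerical model alone (here a toy linear damping).
rng = np.random.default_rng(5)
n, d_r = 4, 30
A = 0.1 * rng.standard_normal((d_r, d_r))
B = 0.1 * rng.standard_normal((d_r, n))
r = synchronize(rng.standard_normal((20, n)), A, B, np.zeros(d_r))
traj = hybrid_forecast(np.ones(n), r, A, B, np.eye(n), np.zeros((n, d_r)),
                       model_step=lambda v: 0.9 * v, n_steps=3)
```

Because each step starts from the corrected state of the previous step, this is an interactive correction rather than post-processing.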
2.5 Implementation With ERA5 Reanalysis Data
We use interpolated hourly global ERA5 reanalyses to train and synchronize the hybrid model. We interpolate the reanalysis fields horizontally onto the computational grid of SPEEDY by a two-dimensional quadratic B-spline interpolation. We then compute the value of σ at each horizontal grid point and use a one-dimensional cubic B-spline for the vertical interpolation of the model state variables to the eight prescribed constant σ levels of SPEEDY. The training starts at 0000 UTC on 1 January 1990 and ends at 2300 UTC on 26 June 2011 (K ≈ 3.14 × 10^{4}), with the data of the first 6.25 days discarded (K = 31,355 and K_{t} = 25).
2.6 Selection of the Hyperparameters
Hyperparameters are adjustable parameters (e.g., κ, ρ, α, D_{r}, β_{res}, β_{mod}, ɛ, and Δt) that control overall characteristics of the hybrid model and require “tuning” to produce desirable results. There exist practical rules, “tricks of the trade,” for selecting the hyperparameters of an RC model (Lukoševičius, 2012), and these general rules also work for the hyperparameters of the hybrid model. First, the hybrid model is only weakly sensitive to κ and ρ. While we use κ = 6, other small values of κ (e.g., κ = 3) work similarly well. We use a value of ρ that increases monotonically toward the poles, from 0.3 at the equator to 0.7 at 45° latitude, so that the reservoir mimics the general property of atmospheric dynamics that memory is shorter in the tropics than in the extratropics. Changing these values by ±0.1–0.2 has little effect on the model performance. We choose D_{r} = 6,000, because further increasing the reservoir size does not lead to substantial further improvement of the model performance. The hybrid model performance is somewhat sensitive to the value of α, which controls the amount of nonlinearity of the reservoir dynamics: setting α ≤ 0.3 or α ≥ 0.7 noticeably increases the errors compared to the value we use, α = 0.5. For each of the options W_{prior} = I and W_{prior} = 0, we tried various powers of 10 for the regularization parameters β_{res} and β_{mod}; we found that W_{prior} = 0 yielded better stability, and that β_{res} = 10^{−4} and β_{mod} = 10^{0} led to good model performance. Among the several values we tried, in increments of 0.05, for the standard deviation of the components of the random noise factor 1 + ɛ^{g} that multiplies the training data, we chose the smallest value (0.20) for which all hybrid forecasts were stable.
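The latitude-dependent spectral radius could be configured with a small helper like the one below. The text only fixes the end points (0.3 at the equator, 0.7 at 45°). The linear dependence on |latitude| and the constant value poleward of 45° are our assumptions, and the function name is hypothetical.

```python
import numpy as np


def spectral_radius_for_latitude(lat_deg, rho_eq=0.3, rho_45=0.7):
    """Hypothetical rho(latitude) profile.

    Linear in |latitude| from rho_eq at the equator to rho_45 at 45 degrees,
    then held constant poleward (neither choice is specified in the text).
    """
    x = np.clip(np.abs(lat_deg) / 45.0, 0.0, 1.0)
    return rho_eq + (rho_45 - rho_eq) * x
```

Each local domain would then call make-style reservoir construction with the ρ for the latitude of its center.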
The time step Δt is another important hyperparameter to tune; we chose Δt = 6 hr, because using Δt = 1 hr or Δt = 3 hr (with other hyperparameters tuned accordingly) led to clearly poorer model performance. Moreover, we use a time step of Δt/24 = 0.25 hr for the numerical integration of SPEEDY, because longer time steps degraded the 6 hr forecast performance of SPEEDY. Since the temporal resolution of the ERA5 reanalyses is 1 hr (Δt_{a} = 1 hr), the training is done on Δt/Δt_{a} = 6 time series of data, each sampled every 6 hr and offset from the others by 1 hr.
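Extracting the Δt/Δt_{a} = 6 time series from the hourly record amounts to strided slicing, as in the sketch below. This is our reading of the sentence above. How the six series are then combined during training (for example, each with its own spin-up) is not specified here, and the function name is hypothetical.

```python
import numpy as np


def split_into_subseries(hourly, dt_hours=6, dta_hours=1):
    """Split an hourly record into dt/dta interleaved series sampled every dt hours.

    hourly: array whose first axis is time at dta_hours resolution.
    Series number `offset` starts at hour `offset` (0, 1, ..., 5 for dt = 6 hr).
    """
    stride = dt_hours // dta_hours
    return [hourly[offset::stride] for offset in range(stride)]


series = split_into_subseries(np.arange(24))
```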
3 Forecast Experiments
We compute forecast error statistics based on 100 21-day forecasts, with start times equally spaced every 4 days between 0000 UTC 27 June 2011 and 0000 UTC 28 July 2012. We evaluate the forecast performance of the hybrid model by comparing it to that of a variety of benchmark forecasts started from interpolated ERA5 reanalyses.
3.1 Benchmark Forecasts
The set of benchmark forecasts includes numerical forecasts produced by SPEEDY, a model based only on ML, and a model in which the 6 hr SPEEDY forecasts are corrected by linear regression rather than by ML. We call the latter benchmark SPEEDY-LLR, where LLR stands for local linear regression.
Comparing the performance of the hybrid model to that of a model based only on ML is important, because ML-only models (e.g., Arcomano et al., 2020; Rasp & Thuerey, 2021; Weyn et al., 2020) are considered a potential alternative to hybrid approaches for the use of ML in Earth system modeling. Our ML-only model is formally the same as our hybrid model except that we use the constraint W_{mod} = 0 in Equation 3, with Equations 4 and 5 modified accordingly, and the hyperparameters are different: D_{r} = 9,000, β_{res} = 10^{−6}, Δt = 3 hr, and ɛ has a standard deviation of 0.28. (The smaller reservoir size needed to obtain good results from the hybrid model, as compared to the ML-only model, is an important advantage of the hybrid model.) While this ML-only model is formally identical to the one described by Arcomano et al. (2020), its forecast performance is better, mainly thanks to using a time step of Δt = 3 hr rather than Δt = 1 hr and to adding the incoming solar radiation to the input of the reservoir.
SPEEDY-LLR is the same as the hybrid model except that W_{res} = 0. In this model, a larger regularization parameter is necessary to produce forecasts that remain stable for at least 10 days. We use β_{mod} = 1,600, which provides the most accurate short- and medium-range (1–5 days) forecasts that also remain stable for at least 10 days. The stability of the SPEEDY-LLR forecasts can be improved by further increasing β_{mod}, but only at the price of degrading the short- and medium-range forecast accuracy. (For β_{mod} → ∞, SPEEDY-LLR becomes SPEEDY, which produces stable forecasts for indefinitely long lead times.) Since SPEEDY-LLR does not include the nonlinear ML correction of the hybrid model (the second term on the right side of Equation 3), training is a simple linear regression of the numerical model forecast. With the help of this benchmark, we can assess the relative importance of making periodic corrections to the numerical forecasts based on linear regression of the model state alone versus making those corrections by the proposed hybrid technique.
To assess whether a model forecast has skill, the figures also include comparisons to forecasts based on persistence and daily climatology. The persistence forecasts are based on the assumption that the state of the atmosphere at the beginning of the forecast persists for the entire duration of the forecast, while the climatological forecasts are based on the daily climatological mean for the calendar day at the particular geographical location and pressure level for years 1990–2010.
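The two reference forecasts can be sketched with NumPy as follows. The array layouts are assumptions: the climatology inputs are fields from the climatology years stacked along the first axis, with an integer calendar-day label per time. The function names are hypothetical. The sketch does not address special handling of leap days.

```python
import numpy as np


def persistence_forecast(initial_state, n_leads):
    """Repeat the initial state at every lead time: shape (n_leads, *state.shape)."""
    return np.repeat(initial_state[np.newaxis, ...], n_leads, axis=0)


def daily_climatology(values, day_of_year):
    """Mean of `values` over all times that fall on each calendar day: {day: mean field}."""
    return {int(d): values[day_of_year == d].mean(axis=0) for d in np.unique(day_of_year)}


def climatology_forecast(clim, valid_days):
    """Look up the climatological mean for the calendar day of each lead time."""
    return np.stack([clim[int(d)] for d in valid_days])


values = np.array([[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]])  # toy fields at three times
days = np.array([1, 1, 2])                                 # their calendar days
clim = daily_climatology(values, days)
clim_fc = climatology_forecast(clim, [1, 2])
pers_fc = persistence_forecast(np.array([5.0, 6.0]), 3)
```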
3.2 The Measure of the Forecast Error
Here the subscript i, j refers to the value of a scalar state variable V for a specific forecast lead time at a particular pressure level at grid point i, j of the verification region defined by N_{lon} discrete longitudes and N_{lat} discrete latitudes. The RMSE is averaged over the 100 forecasts to obtain a single scalar measure of the forecast error for each state variable, pressure level, and forecast lead time. In what follows, the term forecast error refers to this scalar measure. We call one forecast more accurate than another if its forecast error is lower. In addition, we say that a model forecast has forecast value if its forecast error is lower than that of both persistence and climatology (the latter two are available without the substantial cost of preparing model forecasts). The qualitative behavior of the errors of the model forecasts with respect to the errors of these two references is well understood. In particular, if the model has a realistic climatology, in the sense that it represents the atmospheric variability (the variability of the atmospheric state) correctly, the error of the model forecasts and the error of persistence saturate at the same level. While the error is initially lower for persistence than for climatology, its saturation value is higher by a factor of √2 (e.g., Section 3.8 of Szunyogh (2014)).
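A sketch of the forecast error measure follows. The RMSE formula itself is not reproduced in this text, so whether grid points are area-weighted is an assumption. The sketch defaults to an unweighted mean over the N_{lat} × N_{lon} points and accepts optional latitude weights (e.g., cos(latitude)). The per-forecast RMSE is then averaged over all forecasts. The last lines check the √2 saturation ratio numerically. A "forecast" that is an independent draw from the climate, which is what persistence becomes at long lead times, has an RMSE about √2 times that of the climatological mean.

```python
import numpy as np


def rmse(forecast, verification, lat_weights=None):
    """RMSE of one field over an (N_lat, N_lon) region; optional per-latitude weights."""
    err2 = (forecast - verification) ** 2
    if lat_weights is None:
        return float(np.sqrt(err2.mean()))
    w = np.broadcast_to(np.asarray(lat_weights)[:, None], err2.shape)
    return float(np.sqrt((w * err2).sum() / w.sum()))


def forecast_error(forecasts, verifications, lat_weights=None):
    """Average the per-forecast RMSE over all forecasts (e.g. the 100 cases)."""
    return float(np.mean([rmse(f, v, lat_weights) for f, v in zip(forecasts, verifications)]))


# Saturation check with zero-mean, unit-variance "climate" states.
rng = np.random.default_rng(6)
truth = rng.standard_normal((100, 100))
unrelated_state = rng.standard_normal((100, 100))
ratio = rmse(unrelated_state, truth) / rmse(np.zeros_like(truth), truth)
```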
3.3 Comparisons of the Forecast Accuracy
3.3.1 Synopsis of the Forecast Verification Results
Figures 3 and 4 illustrate the temporal evolution of the forecast errors for the first five forecast days in the NH midlatitudes and Tropics, respectively. The errors are shown for the temperature (top row), meridional component of the wind vector (middle row), and specific humidity (bottom row) at forecast lead times day 1 (left column), day 3 (middle column), and day 5 (right column). In general, the hybrid forecasts (blue curves) have forecast value, except for the specific humidity at day 5 in the NH midlatitudes, for which they are only about as accurate as the forecasts based on climatology. In addition, the hybrid forecasts are either more accurate than all benchmark forecasts or about as accurate as the most accurate benchmark forecast. The hybrid model performance in the SH midlatitudes (not shown) is similar to that in the NH midlatitudes. The advantage of the hybrid model over the different benchmarks, however, strongly depends on the forecast variable and lead time. Next, we discuss this dependence, as it provides important insight into the mechanisms by which CHyPP improves the numerical forecasts.
3.3.2 Hybrid Versus SPEEDY Forecasts
Compared to SPEEDY, the advantage of the hybrid model is the largest for the temperature. While all hybrid temperature forecasts have substantial forecast value for the first five forecast days, the SPEEDY day five temperature forecasts have no forecast value in the Tropics and in the stratosphere in the NH midlatitudes. In addition, the SPEEDY forecasts have little forecast value at day five in the midlatitudes. The benefit of the ML correction is particularly striking in the tropical upper troposphere, where the SPEEDY forecasts have a large error with a maximum of 6 K at 200 hPa, while the error of the hybrid forecasts remains below 1 K.
In addition to the temperature, the hybrid forecasts are also substantially more accurate than the SPEEDY forecasts for the specific humidity, especially in the lower troposphere, where parameterizations play an important role in modeling the effects of moist atmospheric processes. While in the NH midlatitudes the hybrid forecasts degrade only to the level of the forecasts based on climatology by day five, the error of the SPEEDY forecasts reaches saturation by that time.
In both midlatitude regions, the state variable for which the advantage of the hybrid model over SPEEDY is smallest is the meridional component of the wind vector. This result is not surprising, as numerical models are known to capture synoptic-scale Rossby wave dynamics, which dominate the variability of weather in the midlatitudes. In contrast, in the Tropics, where wave dynamics is coupled to the parameterized process of deep convection, the advantage of the hybrid model for the meridional wind component is more substantial.
To explore the scale dependence of the performance of the hybrid and benchmark forecasts, we examine the spectrum of the errors for the meridional component of the wind at 500 hPa with respect to the zonal wave number (Figure 5). (This figure also shows results for day 10, in addition to the results for forecast days one, three, and five.) The left panel shows the results for the hybrid and the SPEEDY model. Because SPEEDY is a spectral transform model with cutoff wave number 30, the spectrum for SPEEDY has no power at all beyond that wave number, and it is heavily damped at wave numbers larger than about 20. Therefore, at the tail end of the spectrum, the errors of the hybrid forecasts, which have realistic power at all wave numbers, are expected to saturate at a higher level than those of SPEEDY. At day one, the hybrid forecasts have a clear advantage over the SPEEDY forecasts at the synoptic and large scales (zonal wave numbers lower than about 20). A smaller, but spectrally similar, advantage still exists at day three, while by about day five the advantage of the hybrid forecasts disappears, except at wave numbers five and six.
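A spectrum of forecast errors with respect to zonal wave number can be computed with a real FFT along longitude. The exact spectral definition used for Figure 5 is not given in this text. This sketch therefore assumes equally spaced longitudes, averages over latitude, and normalizes the one-sided spectrum so that it sums to the mean squared error (Parseval). The function name is hypothetical.

```python
import numpy as np


def zonal_error_spectrum(forecast, verification):
    """Error variance per zonal wave number, averaged over latitude.

    forecast, verification: (N_lat, N_lon) fields on equally spaced longitudes.
    The one-sided spectrum sums to the mean squared error.
    """
    err = forecast - verification
    n_lon = err.shape[1]
    power = np.abs(np.fft.rfft(err, axis=1) / n_lon) ** 2
    power[:, 1:(n_lon + 1) // 2] *= 2.0  # fold in negative wave numbers (not k = 0 or Nyquist)
    return power.mean(axis=0)


# A pure wave-number-3 error shows up only at k = 3.
lon = np.linspace(0.0, 2.0 * np.pi, 96, endpoint=False)
err_field = np.tile(np.cos(3.0 * lon), (48, 1))
spec = zonal_error_spectrum(err_field, np.zeros_like(err_field))
```

For SPEEDY output on a 96-point longitude grid, wave numbers above the T30 truncation would carry essentially no forecast power, which is why the hybrid and SPEEDY error curves are expected to separate at the tail end.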
3.3.3 Hybrid Versus ML-Only Forecasts
While the errors of the ML-only forecasts (orange curves in Figures 3–5) are only slightly larger than those of the hybrid forecasts at day one, they grow much faster over the next 4 days, and the ML-only forecasts typically have no value by day three. This result suggests that while the RC-based ML technique can produce accurate forecasts in the short range (days one to two), it is more effective in assisting SPEEDY than in directly predicting the weather beyond that range. A comparison of the left and middle panels of Figure 5 suggests that the information provided by SPEEDY to the hybrid model is particularly beneficial at the large scales (wave numbers lower than about six).
3.3.4 Hybrid Versus SPEEDY-LLR Forecasts
After the hybrid model, the benchmark that performs best in the medium forecast range (days two to five) is SPEEDY-LLR (purple curves). While the hybrid forecasts are more accurate than the SPEEDY-LLR forecasts, the forecast error differences between the two models are modest, except in the stratosphere. That the forecast error differences are smaller between the hybrid model and SPEEDY-LLR than between the hybrid model and SPEEDY indicates that the periodic, interactive correction of the SPEEDY forecasts itself makes an important contribution to the good performance of the hybrid model. The additional forecast improvement, however, is not the only benefit of using ML rather than local linear regression for the forecast correction: while the hybrid forecasts remain stable indefinitely (see Section 4), some of the SPEEDY-LLR forecasts fail as early as day 11, and only about 60% of them reach the intended 21-day lead time.
That local linear regression can efficiently correct the errors of a 6 hr forecast is not completely surprising, considering that linear regression can be used to model the short-term forecast error dynamics of even a state-of-the-art NWP model (Bishop et al., 2017), in which nonlinear effects are expected to play a more important role even at short lead times. It is a nontrivial result, however, that the information provided by such a linear approach can be used for the periodic, interactive correction of an evolving numerical forecast. It is also a nontrivial result that an RC-based ML technique stabilizes the resulting hybrid model indefinitely and leads to further forecast improvement in the short and medium (day 1–5) range.
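The idea of a periodic, interactive correction can be illustrated with a deliberately simple toy problem. The sketch below is hypothetical and is not the SPEEDY-LLR or hybrid implementation: the "truth" is an affine two-variable map, the "model" omits the forcing term and has one wrong coefficient, and the correction is an affine least-squares fit of the one-step model error as a function of the model forecast. The interactive element is that each corrected forecast is fed back as the initial condition of the next model step. All names (`truth_step`, `model_step`, `fit_linear_correction`, `corrected_step`, `run`) are illustrative.

```python
import numpy as np

# Hypothetical toy "truth": x_{k+1} = A x_k + b
A = np.array([[0.9, 0.2], [-0.2, 0.9]])
b = np.array([0.5, -0.3])
# Hypothetical imperfect "model": one wrong coefficient and no forcing term
A_model = np.array([[0.85, 0.2], [-0.2, 0.9]])


def truth_step(x):
    return A @ x + b


def model_step(x):
    return A_model @ x


def fit_linear_correction(states):
    """Least-squares fit of the one-step model error as an affine function
    of the model forecast, trained on a sequence of true states."""
    forecasts = np.array([model_step(x) for x in states[:-1]])
    errors = states[1:] - forecasts
    design = np.hstack([forecasts, np.ones((len(forecasts), 1))])  # intercept column
    coef, *_ = np.linalg.lstsq(design, errors, rcond=None)
    return coef


def corrected_step(x, coef):
    """One model step followed by the regression-based correction."""
    forecast = model_step(x)
    return forecast + np.append(forecast, 1.0) @ coef


def run(x0, nsteps, step):
    """Cycle `step`, feeding each output back in as the next initial state."""
    traj = [np.asarray(x0, dtype=float)]
    for _ in range(nsteps):
        traj.append(step(traj[-1]))
    return np.array(traj)


coef = fit_linear_correction(run([5.0, 5.0], 50, truth_step))  # "training" data
x0 = [3.0, -2.0]
truth = run(x0, 20, truth_step)
free = run(x0, 20, model_step)                           # uncorrected model run
hybrid = run(x0, 20, lambda x: corrected_step(x, coef))  # corrected at every step
print(np.abs(free[-1] - truth[-1]).max(), np.abs(hybrid[-1] - truth[-1]).max())
```

Because the toy model error is exactly affine in the forecast, the linear fit recovers it almost perfectly and the corrected run tracks the truth. In an atmospheric model the error is only approximately linear, so the same construction can accumulate imbalances, which is consistent with the occasional failures reported for SPEEDY-LLR.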
3.4 Global Mean and Spatially Varying Errors
To gain further insight into the ways the hybrid approach improves forecast performance, we decompose the global RMSE into a bias and a standard deviation component. (The sum of the squares of the two components equals the square of the root-mean-square error.) The bias measures the global mean error, while the standard deviation measures the spatially varying part of the forecast error. The time evolution of the two error components, averaged over the 100 forecasts, is shown in Figure 6 for three representative state variables.
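The decomposition follows from a one-line identity. With normalized area weights w_i (summing to one) and grid-point errors e_i, the bias is b = Σ w_i e_i and the standard deviation satisfies SD² = Σ w_i (e_i − b)². Expanding the square and using Σ w_i (e_i − b) = 0 gives RMSE² = b² + SD². The sketch below assumes weights proportional to the cosine of latitude; the weighting actually used for Figure 6 is not specified here, and the function name `decompose_rmse` is hypothetical.

```python
import numpy as np


def decompose_rmse(forecast, verification, lat_deg):
    """Split the area-weighted global RMSE into bias and standard deviation.

    forecast, verification: arrays of shape (nlat, nlon); lat_deg: (nlat,).
    Returns (rmse, bias, sd) with rmse**2 == bias**2 + sd**2 (up to rounding).
    """
    error = np.asarray(forecast, dtype=float) - np.asarray(verification, dtype=float)
    # Area weights proportional to cos(latitude), normalized to sum to one
    weights = np.cos(np.deg2rad(lat_deg))[:, None] * np.ones_like(error)
    weights /= weights.sum()
    bias = np.sum(weights * error)                       # global mean error
    sd = np.sqrt(np.sum(weights * (error - bias) ** 2))  # spatially varying part
    rmse = np.sqrt(np.sum(weights * error ** 2))
    return rmse, bias, sd


# Usage: a synthetic forecast with a uniform warm bias plus random error
rng = np.random.default_rng(0)
lat = np.linspace(-87.5, 87.5, 48)
truth = rng.normal(size=(48, 96))
fcst = truth + 0.75 + 0.5 * rng.normal(size=(48, 96))
rmse, bias, sd = decompose_rmse(fcst, truth, lat)
```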
For the temperature near the surface (at 950 hPa, top panel), SPEEDY rapidly develops a warm bias that oscillates with the diurnal cycle around a mean of 0.75 K. This bias is the result of SPEEDY using a single daily average value of the incoming solar radiation at the top of the atmosphere at all times of the day. The hybrid model greatly reduces the magnitude of the bias and also removes its diurnal oscillation. The biases of the ML-only model and SPEEDY-LLR are comparable in magnitude to that of the hybrid model, but the SPEEDY-LLR bias exhibits diurnal variability.
The spatially varying component of the low-level temperature error remains lower for the hybrid model than for SPEEDY throughout the 14-day period shown in the figure. The same component is initially similarly low for the hybrid and ML-only models, but it increases much more rapidly for the ML-only model. (Even with this rapid increase, the ML-only forecasts remain more accurate than the SPEEDY forecasts until about day 4.) This component is initially lower for the hybrid model than for SPEEDY-LLR, but their accuracies are essentially the same after about day 8. Also, while the curves for SPEEDY and the hybrid model saturate at the same level as persistence, the curve for the ML-only model saturates at a higher level, indicating that the ML-only model overestimates the spatial variability of the low-level temperature at the longer forecast times.
SPEEDY rapidly develops a positive specific humidity bias near the surface (950 hPa, middle panel) that saturates at about 1 g/kg by day 7. Both the hybrid model and the other two benchmarks eliminate most of this bias. The spatially varying component of the error behaves similarly to that for the low-level temperature, with the hybrid model outperforming the benchmarks at lead times from 1 to 7 days.
For the meridional wind component in the upper troposphere (200 hPa, bottom panel), none of the models develops a noteworthy bias. Thus, the differences in forecast performance are solely due to differences in the spatially varying component of the forecast error. This error component is smaller for the hybrid model than for SPEEDY for the first 9 forecast days, and smaller than for the other benchmarks for the first 6 forecast days.
3.5 Atmospheric Balance
Maintaining the delicate balance between the wind (momentum) and mass fields in a numerical model, especially at short forecast lead times, has been one of the biggest challenges of atmospheric modeling since the dawn of NWP (e.g., Lynch, 2006). In a modern NWP model, a weakened balance is a short-lived transient property, and the magnitude of the initial transient can be greatly reduced by initialization techniques (e.g., Section 8 of Lynch, 2006). In the hybrid model and SPEEDY-LLR, however, no initialization is done before a corrected 6 hr forecast is used as the initial condition of the next 6 hr numerical forecast. Hence, the corrections inevitably upset the balance in the numerical component of the hybrid forecasts every 6 hr. The forecast verification results discussed thus far suggest that these imbalances do not outweigh the positive effects of the corrections on the accuracy of the hybrid forecasts. But can the hybrid model also produce realistic surface pressure tendencies by correcting the surface pressure field for the effects of gravity waves excited by the imbalances? We investigate this question by examining the global root-mean-square of the surface pressure tendency in the forecasts of the hybrid and benchmark models (Figure 7). We assume that the value computed for ERA5 (red curve), about 0.4 hPa/h, provides a realistic estimate of the global root-mean-square surface pressure tendency in the atmosphere.
As can be expected from a numerical model started from an uninitialized initial condition, the initial tendency for SPEEDY (about 1 hPa/h) is higher than desired. As forecast time increases, the magnitude of the mean tendency drops, first rapidly and then at a decreasing rate, until it settles below the natural level at about 0.28 hPa/h. The latter behavior suggests that the diffusion built into the model to combat imbalances oversmooths the temporal variability of the forecasts beyond day 1. The magnitude of the mean tendency for the hybrid forecasts (about 0.38 hPa/h) is initially slightly smaller than the natural value and decreases further in the first 72–84 hr (to about 0.36 hPa/h), yet it remains closer to the natural value than those for the benchmark forecasts. SPEEDY-LLR is less effective than the hybrid model in eliminating the initial transient, and it also produces an average tendency at the later forecast times (about 0.30 hPa/h) that is further below the natural level. The ML-only model behaves similarly to the hybrid model for the first two forecast days, but its saturation value (about 0.33 hPa/h) is clearly lower than that of the hybrid model.
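A diagnostic like the one in Figure 7 can be sketched as below. This is an assumption-laden illustration, not necessarily how the figure was computed: it estimates the tendency by a forward finite difference of surface pressure fields sampled every `dt_hours`, and it weights grid points by the cosine of latitude. The function name `rms_surface_pressure_tendency` is hypothetical.

```python
import numpy as np


def rms_surface_pressure_tendency(ps, dt_hours, lat_deg):
    """Area-weighted global RMS of the surface pressure tendency.

    ps: surface pressure in hPa, shape (ntime, nlat, nlon), sampled every
    dt_hours. The tendency is estimated by a forward finite difference, so
    the result (in hPa/h) has one value per interval, shape (ntime - 1,).
    """
    ps = np.asarray(ps, dtype=float)
    tendency = np.diff(ps, axis=0) / dt_hours
    # Area weights proportional to cos(latitude), normalized to sum to one
    weights = np.cos(np.deg2rad(lat_deg))[:, None] * np.ones(ps.shape[1:])
    weights /= weights.sum()
    return np.sqrt(np.sum(weights * tendency ** 2, axis=(1, 2)))


# Usage: a field rising uniformly by 2.4 hPa every 6 hours, i.e., 0.4 hPa/h
lat = np.array([-45.0, -15.0, 15.0, 45.0])
ps = 1000.0 + 2.4 * np.arange(5)[:, None, None] * np.ones((5, 4, 8))
print(rms_surface_pressure_tendency(ps, 6.0, lat))
```

A finite difference over a 6 hr interval filters out faster oscillations, so a tendency computed from model time steps would generally be larger in the presence of high-frequency gravity waves.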
3.6 Sensitivity to Training Length
To test the sensitivity of the performance and stability of the hybrid model to the training length, we carry out a series of experiments with the same hyperparameters as before, but with shorter training periods. In particular, we train the model on 2, 5, or 10 years of reanalysis data, with the training always ending at 2300 UTC, 26 June 2011, as in the original forecast experiments. (We recall that the training period of the original experiments is 20.5 years long.) The results of these experiments for the usual 100 21-day forecast cases are summarized for selected variables in Figure 8.
While training the hybrid model for only 2 years already significantly improves the forecasts of the near-surface temperature and specific humidity compared to those of SPEEDY, extending the training length improves the forecasts further. The hybrid model trained for 2 years does not improve the forecasts of the meridional wind component in the upper troposphere, and it actually degrades them beyond 3 days. With longer training, the hybrid model initially performs better than SPEEDY, and the period of superior performance lengthens as the training period increases. The results shown in Figure 8 also suggest that further modest improvements in forecast performance could be achieved with a training period even longer than 20.5 years.
4 Climate Simulation Experiment
To evaluate the long-term stability of the hybrid model and its ability to simulate the climate, we carry out an 11-year free run with the model. For this simulation experiment, the hybrid model is trained on ERA5 reanalyses for the 19-year period from 1 January 1981 to 27 December 1999. The simulation starts from the ERA5 reanalysis valid at 0000 UTC, 1 January 2000. To suppress the effects of initial transients and the initial condition on the model diagnostics, we discard the first year of the simulation before computing the diagnostics. To compare the performance of the hybrid model and SPEEDY in simulating the climate, we assume that the two simulations attempt to reproduce the climate of the 10-year period from 2001 to 2010 as represented by ERA5.
4.1 Zonal Mean Biases
Figures 9 and 10 show the zonal mean biases of the simulations by SPEEDY (left panels) and the hybrid model (right panels) for boreal winter (December, January, and February) and boreal summer (June, July, and August), respectively. These figures can be used not only to compare the quality of the two simulations, but also to assess the average magnitude of the corrections made by the ML component of the hybrid model. In particular, the difference between a left panel and the corresponding right panel is the zonal mean of the ML correction for a particular state variable.
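A seasonal zonal-mean bias of the kind plotted in Figures 9 and 10 can be sketched as follows, assuming the simulation and the verifying reanalysis are stored as arrays of shape (time, level, latitude, longitude) with a calendar month attached to each time. The array layout and the function name `seasonal_zonal_mean_bias` are assumptions for illustration. Because time and zonal averaging are linear, the difference between the SPEEDY and hybrid biases equals the zonal mean of the seasonal-mean difference between the two simulations.

```python
import numpy as np


def seasonal_zonal_mean_bias(sim, ref, months, season=(12, 1, 2)):
    """Zonal-mean bias of a seasonal mean.

    sim, ref: arrays of shape (ntime, nlev, nlat, nlon) for the simulation and
    the verifying reanalysis; months: calendar month (1-12) of each time.
    Returns an array of shape (nlev, nlat).
    """
    in_season = np.isin(np.asarray(months), season)
    diff = np.asarray(sim, dtype=float)[in_season] - np.asarray(ref, dtype=float)[in_season]
    # Average over the selected times (axis 0) and over longitude (last axis)
    return diff.mean(axis=(0, -1))


# Usage: synthetic data biased by +1 in DJF and by +5 in JJA
months = np.array([12, 1, 2, 6, 7, 8])
ref = np.zeros((6, 3, 4, 5))
sim = ref + np.where(np.isin(months, (12, 1, 2)), 1.0, 5.0)[:, None, None, None]
djf = seasonal_zonal_mean_bias(sim, ref, months)
jja = seasonal_zonal_mean_bias(sim, ref, months, season=(6, 7, 8))
```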
The top left panels show that SPEEDY has a large upper-tropospheric warm bias in the tropical regions during both boreal winter and summer. In both polar regions, SPEEDY has a cold bias in the upper troposphere and stratosphere during boreal winter, and it has a warm (cold) bias in the southern (northern) polar region during boreal summer. The magnitude of the bias is not surprising given the coarse resolution and simplified parameterizations used in SPEEDY (Molteni, 2003). The top right panels show that the hybrid model greatly reduces, but does not completely eliminate, these biases when the model is cycled over a long period of time. The bias reduction is particularly notable in the Tropics and the midlatitudes. The largest remaining biases are in the polar regions.
The hybrid model reduces the bias of the zonal wind component, especially in the stratosphere and upper troposphere, and in the lower troposphere in the SH midlatitudes during boreal summer. The only exception is a positive bias in the zonal wind component that the hybrid model introduces in the tropical stratosphere. The hybrid model also greatly reduces the large positive humidity bias of SPEEDY, which has its maxima in the Tropics.
Figure 11 shows the mean surface pressure biases of the simulations by SPEEDY (left panels) and the hybrid model (right panels) for boreal winter (top row) and boreal summer (bottom row). The mottled small-scale patterns seen in the two left panels are due to the spectrally truncated topography of SPEEDY, which is much smoother than the topography underlying the interpolated ERA5 reanalyses used for the evaluation of the simulations and for the training of the hybrid model. In combination with the artifacts caused by the spectral truncation in SPEEDY, the large local differences in the mountainous regions lead to substantial surface pressure biases in the SPEEDY simulations. The hybrid model corrects the large local biases but still has large-scale biases of smaller magnitude. The wave-number-two structure of the large-scale hybrid model bias in the NH suggests that these biases are related to the low-resolution representation of the topography and the land-sea contrasts in the numerical model. The remaining biases are also relatively large in the polar regions, especially in boreal summer. We speculate that the bias of the hybrid model in the polar regions might be related to our particular strategy of performing the localization on a cylindrical (Mercator) map projection. On the other hand, the bias is not concentrated at the poles for the variables shown in Figures 9 and 10.
4.2 Temporal Variability
To investigate the temporal variability of the atmosphere in the SPEEDY and hybrid climate simulations, we examine the time evolution of the 950 hPa temperature at the four model grid points that fall in the Sahara Desert. The top two panels of Figure 12 show the power spectra of the temporal variability for the two models. These power spectra are computed by first applying a Hamming filter and then a discrete Fourier transform to the 10 years of 6-hourly simulation data, and finally computing the square of the absolute value of the Fourier coefficients. The results show that both simulations correctly capture the variability at time scales longer than about a week. At shorter time scales, however, SPEEDY increasingly underestimates the variability. The ML correction greatly reduces, but does not completely eliminate, this problem: the hybrid model only slightly underestimates the variability at time scales between 1 day and 1 week, and it reduces the underestimation by SPEEDY at even shorter time scales. Most importantly, unlike SPEEDY, the hybrid model has a strong diurnal cycle. It should be noted that an earlier version of the hybrid model, which did not include the incoming solar radiation at the top of the atmosphere as an input to the reservoir, lost the diurnal cycle around the end of year 4. This motivated us to add the incoming solar radiation as an input, even though it had no significant effect on forecast accuracy. We find it a noteworthy, nontrivial result that the earlier version of the hybrid model was able to learn the diurnal cycle strictly from the training data.
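The three steps described above (Hamming window, discrete Fourier transform, squared magnitude) can be sketched with NumPy as follows. This is a minimal illustration, not the code used for Figure 12; in particular, it does not remove the time mean or average over grid points, and the function name `power_spectrum` is hypothetical. Frequencies are returned in cycles per day, so a diurnal cycle appears at 1.

```python
import numpy as np


def power_spectrum(series, dt_hours=6.0):
    """Hamming-windowed power spectrum of an evenly sampled time series.

    Returns (freqs, power) with freqs in cycles per day. The zero-frequency
    entry reflects the (windowed) time mean of the series.
    """
    x = np.asarray(series, dtype=float)
    windowed = x * np.hamming(x.size)   # step 1: Hamming window
    coeffs = np.fft.rfft(windowed)      # step 2: discrete Fourier transform
    power = np.abs(coeffs) ** 2         # step 3: squared magnitude
    freqs = np.fft.rfftfreq(x.size, d=dt_hours / 24.0)
    return freqs, power


# Usage: 10 years (365-day years) of 6-hourly data containing only a diurnal cycle
n = 4 * 365 * 10
t_days = 0.25 * np.arange(n)
freqs, power = power_spectrum(5.0 * np.cos(2.0 * np.pi * t_days))
peak = freqs[np.argmax(power)]  # expected at 1 cycle per day
```

With 6-hourly sampling the highest resolvable frequency is 2 cycles per day, so the diurnal peak is the shortest time scale that such a spectrum can represent cleanly.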
That a simulation correctly captures the variability at a number of frequencies does not guarantee that the phases of the temporal changes (e.g., the timing of the seasons) are also correct. To rule out such a flaw in the simulations, we plot (bottom panel of Figure 12) the time series of the average 950 hPa temperature at the same four Saharan grid points for the last full year of the simulations. The points along these curves should fall within two standard deviations of the mean for the given date and time (the interval marked by gray shading) with an observed frequency of 95%. Based on the full 10 years of data, the observed frequency is 88.2% for SPEEDY and 98.0% for the hybrid model.
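The observed-frequency check can be sketched as below. The sketch assumes that the mean and standard deviation for each date and time of day are computed from a reference data set arranged as one row per year, with columns aligned by calendar date and time (leap days removed); how the statistics behind the gray shading were actually computed is not specified here. Both function names are hypothetical. For normally distributed values, about 95% should fall within two standard deviations of the mean.

```python
import numpy as np


def date_time_climatology(ref):
    """Mean and standard deviation for each date and time of day.

    ref: array of shape (nyears, nsteps), one row per year, with the columns
    aligned by calendar date and time of day (leap days removed).
    """
    ref = np.asarray(ref, dtype=float)
    return ref.mean(axis=0), ref.std(axis=0, ddof=1)


def fraction_within_two_sd(sim, clim_mean, clim_sd):
    """Observed frequency of sim falling inside clim_mean +/- 2 * clim_sd."""
    sim = np.asarray(sim, dtype=float)
    inside = np.abs(sim - clim_mean) <= 2.0 * np.asarray(clim_sd, dtype=float)
    return float(inside.mean())


# Usage: a tiny two-year, three-step reference and a three-value check
mean, sd = date_time_climatology([[0.0, 2.0, 4.0], [2.0, 4.0, 6.0]])
frac = fraction_within_two_sd([0.0, 1.0, 3.0], 0.0, 1.0)  # 2 of 3 values inside
```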
5 Conclusions
In this paper, we described results from the first implementation of the hybrid modeling approach CHyPP of Wikner et al. (2020) on a realistic atmospheric model. We used a low-resolution AGCM based on the full set of primitive equations, along with ERA5 reanalysis data for training and verification, to demonstrate the potential of CHyPP for both NWP and climate modeling. The spatiotemporal structure of the improvements in the forecasts and simulations suggests that the ML component of the model primarily corrects errors caused by the limitations of the parameterization schemes of the AGCM. While state-of-the-art numerical models have much higher resolution and more advanced parameterization schemes than SPEEDY, the weather forecasts and climate simulations they provide still have substantial biases. We expect the hybrid approach to effectively reduce these biases.
Because the ML component of the hybrid model is based on RC, training the model is computationally highly efficient. Specifically, the training described in this paper requires only 30 min of wall-clock time using 1,152 Intel Xeon E5-2670 v2 processors on a supercomputer that is much less powerful than those at the operational NWP centers. Using the same computational resources, preparing a 21-day forecast takes about 52 s, while carrying out a one-year simulation takes about 15 min. These numbers are only 25% higher than those for SPEEDY, and the extra time is mainly due to the overhead associated with the frequent restarts of SPEEDY.
Due to the parallel nature of the computational algorithm, we expect it to scale well to higher model resolutions and larger numbers of processors. A modification of the current implementation of our method that might help with scaling is vertical localization. By “vertical localization” we mean the use of local domains that, in addition to being limited in horizontal extent as shown in Figure 1, are also of limited height and are stacked vertically with overlap from ground level to the top of the atmosphere. Though we do not use vertical localization in this article, we plan to test it soon with SPEEDY for potential improvements.
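The vertical stacking described above can be made concrete with a small helper that partitions a model column into overlapping blocks. This is a hypothetical sketch of one possible layout, not part of the implementation used in this article: it assumes level index 0 is the lowest level, blocks of a fixed `depth`, and a fixed number of shared levels (`overlap`) between neighboring blocks; the 8-level column in the usage example is chosen for illustration only.

```python
def vertical_domains(nlev, depth, overlap):
    """Stack overlapping vertical blocks covering levels 0 .. nlev - 1.

    Level 0 is taken to be the lowest level. Each block spans `depth` levels
    (the topmost block may be shorter), and neighboring blocks share
    `overlap` levels. Returns a list of half-open (start, stop) index pairs.
    """
    if not 0 <= overlap < depth:
        raise ValueError("need 0 <= overlap < depth")
    step = depth - overlap
    domains = []
    start = 0
    while True:
        stop = min(start + depth, nlev)
        domains.append((start, stop))
        if stop == nlev:
            return domains
        start += step


print(vertical_domains(8, 4, 2))  # [(0, 4), (2, 6), (4, 8)]
print(vertical_domains(8, 8, 0))  # [(0, 8)]: one block, i.e., no vertical localization
```

Each block would then play the role that a full column plays in the current horizontal-only localization, with the overlap levels supplying information from the neighboring blocks above and below.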
The ideal size of a local domain still needs to be determined through additional experimentation, both for SPEEDY and for higher-resolution models. Thus, it is hard to make a precise quantitative projection for scaling, but the following comparison indicates feasibility for operational models. The current computer of ECMWF has 129,960 processors (about 100 times more than we used), and their operational model has 6.5 × 10^{6} horizontal grid points (about 180 times more than SPEEDY) (“IFS Documentation CY47R1–Part III: Dynamics and Numerical Procedures”, 2020). If the local regions for the ECMWF model were defined by four horizontal and all vertical grid points, as in our paper, each processor would have to handle less than twice as many local regions at ECMWF as in our model. Also, there is no obvious reason to believe that the computational overhead of the hybrid model would be substantially higher than the 25% we found for SPEEDY. The high computational efficiency of the approach would allow for a large number of experiments to find the optimal configuration of a future operational hybrid model. Nevertheless, developing an efficient, systematic approach to finding a near-optimal combination of the hyperparameters would be highly desirable and is one of the subjects of our ongoing research. An unknown factor that could have a very favorable impact on future scaling is the ongoing rapid development of alternative fast, cheap physical implementations of reservoir computing, for example, implementations based on photonics or on Field Programmable Gate Arrays.
We emphasize that while the ML component of the hybrid model is highly efficient in correcting the biases of the forecasts and simulations prepared by the host model, it is not an ML-based postprocessing technique. While a technique of the latter type corrects the numerical-model-based forecasts of a specific forecast variable or phenomenon (e.g., Chapman et al., 2019; Kim et al., 2021; Rasp & Lerch, 2018) without interacting with the numerical model, the ML component of the hybrid model makes frequent, periodic, interactive corrections to the numerical model solution. Hence, it also greatly improves the model's representation of the spatiotemporal variability of the atmospheric state.
We expect that the performance of the hybrid model can be further improved by investigating the relationship between the parameters of the ML model and the representation of basic atmospheric processes. Such an investigation could lead to further improvements of the model, similar to the way studies of the interactions between numerics and dynamics (e.g., Arakawa & Lamb, 1977) led to much improved physics-based numerical models. For instance, one potentially important fundamental question is the optimal relationship among the size of the local domains, the overlap between the local domains in the input of the reservoir, and the length of the time step Δt. That the ML component is more effective in correcting localized errors than errors at the larger scales in the current version of our hybrid model may be partly the result of using local domains and an overlap that are less than optimal for the selected time step. In our experiments, the size of the overlap was primarily dictated by the structure of our code and the available computer resources, but larger local domains and a larger overlap could be used in the future.
An intriguing possibility is to use the hybrid model for data assimilation in addition to forecasting, as data assimilation could greatly benefit from the higher accuracy and smaller biases of the short-term hybrid forecasts used as background. Furthermore, integrating ML and data assimilation may in the future allow online training of the ML component of the hybrid model on real-time observations rather than on canned reanalysis data. The availability of such a training procedure would make it possible to extend the hybrid modeling approach to numerical models for which high-quality reanalysis data are not available (e.g., an AGCM that also includes a sophisticated model of the upper atmosphere well beyond the lower stratosphere). It could also allow the ML component of the model to adjust to variability and changes in the climate. We have taken a first step toward this ambitious goal with an approach in which we iteratively use the hybrid model to prepare an updated set of analyses, which is then used to train the next iteration of the hybrid model (Wikner et al., 2021). Our plan is to test this approach with the hybrid model of the current paper.
Acknowledgments
This work was supported by DARPA contract DARPAPA1801 (HR111890044). The work of T. Arcomano and I. Szunyogh was also supported by ONR award N00014182509. The work of Alexander Wikner was supported in part by the National Science Foundation (NSF) (Award No. DGE1632976). Portions of this research were conducted with the advanced computing resources provided by Texas A&M High Performance Research Computing. This paper greatly benefitted from stimulating discussions with Sarthak Chandra, Michelle Girvan, Garrett Katz, and Andrew Pomerance. The constructive comments of the three anonymous reviewers helped us to greatly improve the presentation of our ideas and results.
Open Research
Data Availability Statement
The new data generated for this paper are available online at http://doi.org/10.5281/zenodo.5103176.
References
Erratum
Sections 2.2 and 2.4.1 of the originally published version of this article misstated two details of the method used to obtain the reported results. The results and conclusions of the paper were not affected. The errors have been corrected, and this may be considered the official version of record.