# A Machine Learning-Based Global Atmospheric Forecast Model

## Abstract

The paper investigates the applicability of machine learning (ML) to weather prediction by building a reservoir computing-based, low-resolution, global prediction model. The model is designed to take advantage of the massively parallel architecture of a modern supercomputer. The forecast performance of the model is assessed by comparing it to that of daily climatology, persistence, and a numerical (physics-based) model of identical prognostic state variables and resolution. Hourly resolution 20-day forecasts with the model predict realistic values of the atmospheric state variables at all forecast times for the entire globe. The ML model outperforms both climatology and persistence for the first three forecast days in the midlatitudes, but not in the tropics. Compared to the numerical model, the ML model performs best for the state variables most affected by parameterized processes in the numerical model.

## Key Points

- A low-resolution, global, reservoir computing-based machine learning (ML) model can forecast the atmospheric state
- The training of the ML model is computationally efficient on a massively parallel computer
- Compared to a numerical (physics-based) model, the ML model performs best for the state variables most affected by parameterized processes

## 1 Introduction

The ultimate goal of our research is to develop a *hybrid* (numerical-machine learning, ML) *weather prediction* model. We hope to achieve this goal by implementing algorithms developed by Pathak, Wikner, et al. (2018), Pathak, Hunt, et al. (2018), and Wikner et al. (2020): The first paper introduced an efficient ML algorithm for numerical-model-free prediction of large, spatiotemporal dynamical systems, based solely on the knowledge of past states of the system; the second paper showed how to combine an ML algorithm with an imperfect numerical model of a dynamical system to obtain a hybrid model that predicts the system more accurately than either component alone; and the third paper combined the techniques of the first two into a computationally efficient hybrid modeling approach. The present paper implements the parallel ML technique of Pathak, Wikner, et al. (2018) to build a model that predicts the weather in the same format as a global numerical model. We train and verify the model on hourly ERA5 reanalysis data from the European Centre for Medium-Range Weather Forecasts (Hersbach et al., 2019).

The work presented here can also be considered an attempt to develop an ML model that can predict the evolution of the three-dimensional, multivariate, global atmospheric state. To the best of our knowledge, the only similar prior attempts were those by Scher (2018) and Scher and Messori (2019), but they trained their three-dimensional, multivariate ML model on data produced by low-resolution numerical model simulations. In addition, Dueben and Bauer (2018) and Weyn et al. (2019, 2020) designed ML models to predict two-dimensional, horizontal fields of select atmospheric state variables. Similar to our verification strategy, they also verified the ML forecasts against reanalysis data. Compared to all of the aforementioned studies, an important new aspect of our work is that we employ *reservoir computing (RC)* (Jaeger, 2001; Lukoševičius & Jaeger, 2009; Lukoševičius, 2012; Maass et al., 2002) rather than *deep learning* (e.g., Goodfellow et al., 2016), a choice primarily motivated by the significantly lower computer wall clock time required to train an RC-based model. This difference in training efficiency would allow for a larger number of experiments to tune the ML model at higher resolutions.

The structure of the paper is as follows. Section 2 describes the ML model, while section 3 presents the results of the forecast experiments, using as benchmarks persistence of the atmospheric state, daily climatology, and numerical forecasts from a physics-based model of identical prognostic state variables and resolution. Section 4 summarizes our conclusions.

## 2 The ML Model

The *N* components of the state vector **v**^{m}(*t*) of the ML model are the grid point values associated with the spatially discretized fields of the Eulerian dependent variables of the model. Training the model requires the availability of a discrete time series of past *observation-based estimates* (analyses) **v**^{a}(*k*Δ*t*) (*k*=−*K*,−*K*+1,…,0) of the atmospheric states that use the same *N*-dimensional representation of the state as the model. Beyond the training period, the analyses **v**^{a}(*k*Δ*t*) (*k*=1,2,…) are used only to maintain the synchronization of the model state with the observed atmospheric state. An ML forecast can potentially be started at any analysis time *k*Δ*t* (*k*=0,1,…): The forecast is a discrete time series of model states
**v**^{m}(*k*′Δ*t*) (*k*′=*k*+1,*k*+2,…), where *k*Δ*t* is the *initial time*, **v**^{a}(*k*Δ*t*) is the *initial state*, Δ*t* is the *time step*, and (*k*′−*k*)Δ*t* is the *forecast time*. The computational algorithm of the model is designed to take advantage of a massively parallel computer architecture.

### 2.1 Representation of the Model State

#### 2.1.1 The Global State Vector

We define **v**^{m}(*t*) by the grid-based state vector of the physics-based numerical model SPEEDY (Kucharski et al., 2013; Molteni, 2003). While SPEEDY is a spectral transform model, it uses the grid-based state vector to represent the input and output state of the model and to compute the nonlinear and parameterized terms of the physics-based prognostic equations. The horizontal grid spacing is 3.75° × 3.75° and the model has *n*_{v}=8 vertical *σ* levels (at *σ* = 0.025, 0.095, 0.20, 0.34, 0.51, 0.685, 0.835, and 0.95), where *σ* is the ratio of pressure to the pressure at the surface. The model has four three-dimensional dependent variables (the two horizontal components of the wind vector, temperature, and specific humidity) and one two-dimensional dependent variable (the logarithm of surface pressure). Thus, the number of variables per horizontal location is *n*_{t}=4×*n*_{v}+1=33. Because there are *n*_{h}=96×48=4,608 horizontal grid points, the total number of model variables is *N*=*n*_{t}×*n*_{h}=1.52064 × 10^{5}. Before forming the state vector **v**^{m}(*t*), we standardize each state variable by subtracting its climatological mean and dividing by its climatological standard deviation at the particular model level in the local region.
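The dimension bookkeeping and the standardization step above can be checked with a short sketch (the variable names are ours, not the paper's):

```python
import numpy as np

# Bookkeeping for the grid-based state vector of section 2.1.1.
n_lon, n_lat = 96, 48        # 3.75° x 3.75° horizontal grid
n_v = 8                      # vertical sigma levels
n_t = 4 * n_v + 1            # 4 three-dimensional variables + log surface pressure
n_h = n_lon * n_lat          # horizontal grid points
N = n_t * n_h                # total number of model variables (152,064)

def standardize(field, clim_mean, clim_std):
    """Standardize a state variable before it enters the state vector."""
    return (field - clim_mean) / clim_std
```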

#### 2.1.2 Local State Vectors

The global model domain is partitioned into *L*=1,152 local regions. We use a Mercator (cylindrical) map projection to define the local regions, partitioning the three-dimensional model domain only in the two horizontal directions: Each local region has the shape of a rectangular prism with a 7.5° × 7.5° base (Figure 1). The model state in local region *ℓ* (*ℓ*=1,2,…,*L*) is represented by the *local state vector* **v**_{ℓ}^{m}(*t*), whose components are defined by the *D*_{v}=4×*n*_{t}=132 components of the global state vector in the local region. The model computes the *L* evolved local state vectors **v**_{ℓ}^{m}(*t*+Δ*t*) from **v**^{m}(*t*) in parallel, and the evolved global state vector **v**^{m}(*t*+Δ*t*) is obtained by piecing the *L* evolved local state vectors together.

### 2.2 The Computational Algorithm

#### 2.2.1 RC

The computation of **v**^{m}(*t*+Δ*t*) from **v**^{m}(*t*) requires the evaluation of a composite (chain) function for each local state vector. Because we use an RC algorithm, this composite function has only three layers: the *input layer*, the *reservoir*, and the *output layer*. A key feature of RC is that the trainable parameters of the model appear only in the output layer, which greatly simplifies the training process.

#### 2.2.2 The Input Layer and Reservoir

The reservoir state vector **r**_{ℓ}(*t*) evolves according to

**r**_{ℓ}(*t*+Δ*t*) = tanh[ **A**_{ℓ}**r**_{ℓ}(*t*) + **W**_{in,ℓ}[**u**_{ℓ}(*t*)] ],  (1)

where **W**_{in,ℓ}[·] is the input layer. The dimension *D*_{r} of the *reservoir state vector* **r**_{ℓ}(*t*) is much higher than the dimension *D*_{u} of the input vector **u**_{ℓ}(*t*). (The reservoir is a high-dimensional dynamical system.) The input vector **u**_{ℓ}(*t*) is an *extended local state vector* that represents the model state in an extended local region. In the present paper, we define **u**_{ℓ}(*t*) by the grid points of local region *ℓ* plus the closest grid points from the neighboring local regions (see Figure 1 for an illustration). In the terminology of Pathak, Wikner, et al. (2018), the *locality parameter* of our model is 1. Using a nonzero value of the locality parameter is essential, because otherwise no information can flow between the local regions. The dimension of the extended local state vectors is *D*_{u}=16×*n*_{t}=528 for most *ℓ*. The exceptions are the local regions nearest to the two poles, because for those, we add no extra grid points in the poleward direction. The dimension of the input vectors in these local regions is *D*_{u}=12×*n*_{t}=396.

The “local approach” of Dueben and Bauer (2018), which was introduced independently of the parallel technique of Pathak, Wikner, et al. (2018), employs a localization strategy that is formally similar to the one described here. There is, however, an important difference between the two localization techniques: Dueben and Bauer (2018) trained a single common neural network for the different local regions, while we train a different reservoir for each local region.
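The grid-point bookkeeping of the local and extended regions can be sketched as follows. The indexing conventions (lower-left corners, wraparound in longitude) are our assumptions; the counts are consistent with *D*_{u}=16×*n*_{t}=528 away from the poles and 12×*n*_{t}=396 next to them:

```python
# Each local region covers 2 x 2 grid points (a 7.5° base on the 3.75° grid);
# the extended region adds the closest ring of neighboring points (locality
# parameter 1), giving 4 x 4 points away from the poles.
n_lon, n_lat = 96, 48

def extended_indices(i0, j0):
    """Grid-point indices of the extended region whose local region has its
    lower-left corner at (i0, j0). Longitude wraps around the globe; latitude
    is clipped, so no extra points are added in the poleward direction."""
    lons = [(i0 + di) % n_lon for di in range(-1, 3)]
    lats = [j0 + dj for dj in range(-1, 3) if 0 <= j0 + dj < n_lat]
    return [(i, j) for j in lats for i in lons]
```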

**A**_{ℓ} is a *D*_{r}×*D*_{r} *weighted adjacency matrix* that represents a low-degree, directed, random graph (Gilbert, 1959). Each entry of **A**_{ℓ} has a probability *κ*/*D*_{r} of being nonzero, so that the expected degree of each vertex is a prescribed number *κ*. Thus, *κ* is the average number of incoming connections (edges) per vertex. The nonzero entries of **A**_{ℓ} are randomly drawn from a uniform distribution over the interval (0,1] and scaled so that the largest eigenvalue of **A**_{ℓ} is a prescribed number *ρ*. The parameter *ρ*, which controls the length of the memory of the ML model dynamics, is called the *spectral radius*.
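The construction of **A**_{ℓ} described above can be sketched as follows (a small illustration; the matrix size and random seed are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
D_r, kappa, rho = 200, 6, 0.6   # toy sizes; the paper uses D_r = 9,000

# Each entry is nonzero with probability kappa / D_r, so the expected
# in-degree of every vertex is kappa (a sparse, directed random graph).
mask = rng.random((D_r, D_r)) < kappa / D_r
A = np.where(mask, rng.uniform(0.0, 1.0, (D_r, D_r)), 0.0)

# Rescale so that the spectral radius (largest eigenvalue magnitude) is rho.
A *= rho / np.max(np.abs(np.linalg.eigvals(A)))
```

At the paper's full size a sparse matrix representation would be used instead of a dense array, since only about *κ*·*D*_{r} of the *D*_{r}² entries are nonzero.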

#### 2.2.3 The Output Layer

**W**_{out,ℓ}[·,·] is the output layer. This function is chosen such that it is linear in the *D*_{v}×*D*_{r} *matrix of trainable parameters* **P**_{ℓ}. To be precise,

**v**_{ℓ}^{m}(*t*+Δ*t*) = **W**_{out,ℓ}[**r**_{ℓ}(*t*+Δ*t*),**P**_{ℓ}],  (2)

**W**_{out,ℓ}[**r**_{ℓ}(*t*+Δ*t*),**P**_{ℓ}] = **P**_{ℓ}**r̃**_{ℓ}(*t*+Δ*t*),  (3)

where **r̃**_{ℓ}(*t*+Δ*t*) is a *D*_{r}-dimensional vector whose components are fixed (trainable-parameter-free) nonlinear functions of the components of the reservoir state **r**_{ℓ}(*t*+Δ*t*).

#### 2.2.4 Synchronization and Training

We define the *local analysis* **v**_{ℓ}^{a}(*k*Δ*t*) by the components of the global analysis **v**^{a}(*k*Δ*t*) (*k*=−*K*,−*K*+1,…) that describe the state in local region *ℓ*. In other words, **v**_{ℓ}^{a}(*k*Δ*t*) is the observation-based estimate of the desired value of the model state **v**_{ℓ}^{m}(*k*Δ*t*). Likewise, we define the *extended local analysis* **u**_{ℓ}^{a}(*k*Δ*t*) as the observation-based estimate of the extended local state vector **u**_{ℓ}(*k*Δ*t*) (*k*=−*K*,−*K*+1,…).

The synchronization and training of the ML model starts with feeding the past analyses to the reservoir, or more precisely, by substituting **u**_{ℓ}^{a}(*k*Δ*t*) (*k*=−*K*,−*K*+1,…,−1) for **u**_{ℓ}(*k*Δ*t*) in equation (1). Thus, the output layer, equation (3), is not needed to compute **r**_{ℓ}(*k*Δ*t*) for *k*=−*K*+1,−*K*+2,…,0: We generate **r**_{ℓ}(−*K*Δ*t*) randomly, discard the transient sequence **r**_{ℓ}(*k*Δ*t*), *k*=−*K*,−*K*+1,…,−*K*_{t}, and define **r**_{ℓ}(*k*Δ*t*) for *k*=−*K*_{t}+1,−*K*_{t}+2,…,0 according to equation (1), with **P**_{ℓ} as yet undetermined.

The training determines the matrix **P**_{ℓ} that minimizes the cost function

*J*_{ℓ}(**P**_{ℓ}) = ∑_{*k*=−*K*_{t}+1}^{0} ‖**W**_{out,ℓ}[**r**_{ℓ}(*k*Δ*t*),**P**_{ℓ}] − **v**_{ℓ}^{a}(*k*Δ*t*)‖^{2} + *β*‖**W**_{out,ℓ}‖^{2}.  (5)

The role of the regularization term *β*‖**W**_{out,ℓ}‖^{2} (Tikhonov et al., 1997) of *J*_{ℓ}(**P**_{ℓ}) is to improve the numerical stability of the computations and to prevent overfitting to the training data by penalizing large values of the components of **W**_{out,ℓ}. Because **W**_{out,ℓ} depends linearly on **P**_{ℓ}, the solutions of the *L* minimization problems can be obtained by a linear *ridge regression*. That is, **P**_{ℓ} is computed by solving the linear problem

(**R**_{ℓ}**R**_{ℓ}^{T} + *β***I**) **P**_{ℓ}^{T} = **R**_{ℓ}(**V**_{ℓ}^{a})^{T},  (6)

where the columns of the *D*_{r}×*K*_{t} matrix **R**_{ℓ} are the (nonlinearly transformed) reservoir states **r̃**_{ℓ}(*k*Δ*t*) (*k*=−*K*_{t}+1,−*K*_{t}+2,…,0) and the columns of the *D*_{v}×*K*_{t} matrix **V**_{ℓ}^{a} are **v**_{ℓ}^{a}(*k*Δ*t*) (*k*=−*K*_{t}+1,−*K*_{t}+2,…,0). Notice that the dimension of the linear problem of equation (6) does not depend on the length *K*_{t} of the training period. To conserve memory, the *D*_{r}×*K*_{t} matrix **R**_{ℓ} need not be stored; the *D*_{r}×*D*_{r} matrix **R**_{ℓ}**R**_{ℓ}^{T} and the *D*_{v}×*D*_{r} matrix **V**_{ℓ}^{a}**R**_{ℓ}^{T} can be built incrementally, passing the training data through the reservoir time step by time step (e.g., Lukoševičius & Jaeger, 2009; Lukoševičius, 2012).
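The incremental accumulation and the solution of the linear problem of equation (6) can be sketched with a toy example. Random vectors stand in for the reservoir states and local analyses, and the sizes are far smaller than the paper's; the stand-in "analyses" are generated from a known matrix so the regression can be checked:

```python
import numpy as np

rng = np.random.default_rng(0)
D_r, D_v, K_t = 100, 20, 500   # toy sizes for illustration
beta = 1e-5                    # regularization parameter

# Accumulate R R^T (D_r x D_r) and V R^T (D_v x D_r) one time step at a
# time, so the full D_r x K_t matrix R never has to be stored.
RRt = np.zeros((D_r, D_r))
VRt = np.zeros((D_v, D_r))
P_true = rng.standard_normal((D_v, D_r))   # generates the synthetic targets
for _ in range(K_t):
    r = rng.standard_normal(D_r)           # stand-in for a reservoir state
    v = P_true @ r                         # stand-in for the local analysis
    RRt += np.outer(r, r)
    VRt += np.outer(v, r)

# Solve (R R^T + beta I) P^T = (V R^T)^T for the trainable matrix P.
P = np.linalg.solve(RRt + beta * np.eye(D_r), VRt.T).T
```

With noise-free synthetic targets and a tiny *β*, the regression recovers the generating matrix almost exactly; the paper's implementation solves the analogous system with the LAPACK routine DGESV (section 2.3.2).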

### 2.3 Implementation on ERA5 Reanalysis Data

#### 2.3.1 Training

The global analyses **v**^{a}(*k*Δ*t*) (*k*=−*K*,−*K*+1,…) are hourly ERA5 reanalyses interpolated to the computational grid and adjusted to the topography of SPEEDY. The training starts at 0000 UTC 1 January 1981 and ends at 2000 UTC 24 January 2000 (*K*≈1.66×10^{5}). We add a small-magnitude random noise **ε**(*t*) to **u**_{ℓ}^{a}(*k*Δ*t*) (*k*=−*K*,−*K*+1,…,−1) before we substitute it for **u**_{ℓ}(*k*Δ*t*) in equation (1) in order to improve the robustness of the ML model to noise (Jaeger, 2001). The transient sequence of *K*−*K*_{t} discarded reservoir states corresponds to the first 43 days of training.

#### 2.3.2 Code Implementation and Performance

The current computer code of the ML model is written in Fortran, using both MPI and OpenMP for parallelization and the LAPACK routine DGESV to solve the linear problem of equation (6). The computations of both the training and forecast phases are carried out on 1,152 Intel Xeon E5-2670 v2 processors. Training the model takes 67 min of wall clock time and requires 2.2 GB of distributed memory per processor. Our current code is designed to minimize the wall clock execution time given the available memory on a particular supercomputer, but the memory usage could be reduced (e.g., by not keeping all training data in memory simultaneously or by using single-precision rather than double-precision arithmetic).

### 2.4 The Forecast Cycle

Beyond the training period, the analyses are used only to maintain the synchronization between the reservoirs and the atmosphere. We use the hourly reanalyses for synchronization but start a new 20-day forecast only once every 48 hr. (Preparing a 20-day forecast takes about 1 min of wall clock time.) We prepare a total of 171 forecasts for the period from 25 January to 28 December 2000. The forecast error statistics reported below are calculated based on these forecasts.
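The forecast cycle can be sketched for a single region as follows. This is a toy illustration, not the paper's code: in the full model the next input is the extended local state vector assembled from the predictions of the region and its neighbors, whereas here the prediction itself is simply reused, and the reservoir layers are untrained placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
D_r, D_u, D_v = 50, 8, 8      # toy sizes; input and output dimensions coincide here

A = np.zeros((D_r, D_r))                     # placeholder reservoir graph
W_in = rng.uniform(-0.5, 0.5, (D_r, D_u))    # placeholder input layer
P = 0.01 * rng.standard_normal((D_v, D_r))   # stand-in for the trained readout

def step(r, u):
    """One reservoir update driven by input u."""
    return np.tanh(A @ r + W_in @ u)

def forecast(r, past_inputs, n_steps):
    """Synchronize the reservoir on past analyses, then run autonomously,
    feeding each prediction back in as the next input."""
    for u in past_inputs[:-1]:
        r = step(r, u)                       # synchronization with analyses
    u = past_inputs[-1]                      # initial state of the forecast
    preds = []
    for _ in range(n_steps):
        r = step(r, u)
        v = P @ r                            # predicted state one step ahead
        preds.append(v)
        u = v                                # prediction becomes the next input
    return np.array(preds)
```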

#### 2.4.1 Selection of the Hyperparameters

The dimension *D*_{r} of the reservoir, vertex degree *κ* of the random network, spectral radius *ρ*, random noise **ε**, and regularization parameter *β* are the *hyperparameters* of the RC algorithm. We found suitable combinations of these parameters by numerical experimentation, monitoring the accuracy and stability of the forecasts. All results reported in this paper are for *D*_{r}=9,000, *κ*=6, and *β*=10^{−5}, while *ρ* monotonically increases from 0.3 at the equator to 0.7 at 45° and beyond. The components of **ε** are uncorrelated, normally distributed random numbers with mean zero and standard deviation 0.28. For this combination of the hyperparameters, the ML model predicts realistic values of all state variables for the entire globe and the entire 20-day forecast period.

## 3 Forecast Verification Results

### 3.1 Benchmark Forecasts

We use daily climatology, persistence, and numerical forecasts for the evaluation of the ML model forecasts. Persistence is based on the assumption that the initial atmospheric state will persist for the entire time of the forecast. The numerical forecasts are prepared by Version 42 of the SPEEDY model. While SPEEDY has been developed for research applications rather than weather prediction, it can be considered a low-resolution version of today's numerical weather prediction models. Most importantly, similar to all operational models, it solves the system of atmospheric primitive equations and has a realistic climate. It provides a good benchmark in the current stage of our research, in which the primary goal is to prove a concept rather than improve operational forecasts.

### 3.2 Results

We verify all forecasts against ERA5 reanalyses interpolated to the computational grid and adjusted to the SPEEDY orography. The magnitude of the forecast error is measured by the area-weighted root-mean-square difference between the forecasts and the verification data, averaged over all forecasts. Results are shown for selected variables in the Northern Hemisphere (NH) midlatitudes for the first 72 forecast hours (Figure 2). In this region, the ML model outperforms both persistence and climatology by a large margin in the first 48 forecast hours. While the ML model forecasts remain more accurate than persistence over the next 24 forecast hours, their skill, with the exception of the temperature forecasts, degrades to that of climatology. In the tropics (results not shown), the accuracy of the ML model is very similar to that of persistence and climatology.
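An area-weighted root-mean-square difference of the kind used here can be sketched as follows (a standard cos-latitude weighting; the function name and averaging conventions are ours, not necessarily the paper's exact formula):

```python
import numpy as np

def area_weighted_rmse(forecast, verification, lats_deg):
    """Root-mean-square difference weighted by cos(latitude).

    forecast, verification: arrays of shape (n_lat, n_lon);
    lats_deg: latitudes in degrees, shape (n_lat,).
    """
    w = np.cos(np.deg2rad(lats_deg))[:, None]      # area weight per latitude row
    w = np.broadcast_to(w, forecast.shape)         # expand to the full grid
    err2 = (forecast - verification) ** 2
    return np.sqrt(np.sum(w * err2) / np.sum(w))
```

The cosine weighting compensates for the convergence of the meridians, so that grid points near the poles do not dominate the global average.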

The performance of the ML model compared to SPEEDY is mixed: The ML forecasts are more accurate for the specific humidity near the surface, especially at 24- and 48-hr forecast times, while the SPEEDY forecasts are more accurate for the wind, particularly at the jet level. The ML temperature forecasts are also more accurate in the tropics (results not shown), where the SPEEDY forecasts rapidly develop a large bias in the upper troposphere.

To better understand the behavior of the root-mean-square error, we decomposed its square into a squared bias component and a variance component, and also investigated the power spectrum of the variance in the NH midlatitudes with respect to the zonal wavenumber (results not shown). On the positive side, the ML forecasts of the different variables have little or no bias, and the variance of the longer term forecasts saturates at a realistic level for zonal wave numbers larger than 6. On the negative side, the variance saturates at unrealistically high levels at the lower wave numbers, leading to an overprediction of the spatial variability of the forecast fields at the longer forecast times. The fast growth of the variance at the large scales, especially at Wave Number 4, is the main deficiency of the ML model in the midlatitudes. Fixing this problem could extend the time range of forecast skill by days.
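A zonal-wavenumber variance decomposition of the kind described above can be sketched with a real FFT around one latitude circle (our illustration; the paper's exact spectral conventions may differ):

```python
import numpy as np

def zonal_variance_spectrum(field):
    """Variance carried by each zonal wavenumber of a latitude-circle field.

    field: (n_lon,) values around one latitude circle.
    Returns an array whose k-th entry is the variance at wavenumber k.
    """
    n = field.size
    c = np.fft.rfft(field) / n           # complex Fourier coefficients
    spec = 2.0 * np.abs(c) ** 2          # combine +k and -k contributions
    spec[0] = 0.0                        # drop the zonal mean
    if n % 2 == 0:
        spec[-1] = np.abs(c[-1]) ** 2    # Nyquist wavenumber is not doubled
    return spec
```

For a pure wave of amplitude *a* at wavenumber *k*, the spectrum places the full variance *a*²/2 in bin *k*, which is how an overintensification of Wave Numbers 4–6 would show up in such a diagnostic.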

### 3.3 Near-Surface Humidity and Tropical Temperature Profiles

The short-term forecast advantage of the ML model over SPEEDY has two sources. First, while the SPEEDY forecasts rapidly develop a near-surface humidity bias, the ML model forecasts are free of such bias. Second, the variance of the ML model forecast errors is also lower initially. As forecast time increases, the advantage of the ML model remains in terms of the bias, but vanishes in terms of the variance. Because the variance becomes the dominant component at the later forecast times, climatology breaks even with the ML model forecasts by 72-hr forecast time (bottom right panel of Figure 2). The spatial distribution of the difference of the errors (Figure 3) suggests that the ML model performs better in regions where parameterized atmosphere-surface interactions play an important role in the moist processes in SPEEDY (e.g., regions of the ocean boundary currents). Likewise, the advantage of the ML model in predicting the tropical temperature profiles (not shown) is the result of large biases that are present only in the SPEEDY forecasts in the main regions of parameterized deep convection. Finally, it should be noted that while the current version of the ML model learns about atmosphere-surface interactions strictly from the atmospheric training data, SPEEDY uses a number of prescribed fields to describe the surface conditions (e.g., a spatiotemporally evolving sea surface temperature analysis).

### 3.4 Rossby Wave Propagation

The forecast variable for which SPEEDY clearly outperforms the ML model is the meridional component of the wind: while the accuracy of the wind forecasts by the two models is similar at 24 hr, the error of the ML model forecasts grows more rapidly beyond that time. The difference between the errors of the two models grows the fastest in the layer around the jet streams of the NH midlatitudes (between 400 and 200 hPa). Because the variability of the meridional wind in this layer is dominated by dispersive synoptic-scale Rossby waves, the aforementioned result suggests that the ML model may be inferior to the numerical model in describing the Rossby wave dynamics. To investigate this possibility, we plot Hovmöller diagrams of the meridional wind for both forecasts and the verification data (Figure 4).

A pattern of negative (positive) values followed by a pattern of positive (negative) values indicates a trough (ridge). Because the eastward group velocity of the dispersive Rossby waves at the synoptic scales is larger than their eastward phase velocity, new troughs and ridges can develop downstream of the original waves. Such developments are marked by the oriented dashed black lines in the figure. In the first three days, the ML model captures the dispersive dynamics of the wave packets accurately, but because the synoptic-scale wave packets are composed of Wave Number 4–11 waves (e.g., Zimin et al., 2003), the overintensification of the Wave Number 4–6 components at the later forecast times leads to a gradual shift of the carrier wave number toward lower values and a deceleration of the group velocity.

## 4 Conclusions

We demonstrated that an RC-based parallel ML model can predict the global atmospheric state in the same gridded format as a numerical (physics-based) global weather prediction model. We found that the 20-day ML model forecasts predicted realistic values of all state variables at all forecast times for the entire globe. The ML model predicted the weather in the midlatitudes more accurately than either persistence or climatology for the first three forecast days. This time range could be significantly extended by eliminating, or at least reducing, the overprediction of atmospheric spatial variability at the large scales (wave numbers lower than 7). The forecast variables for which the ML model performed best compared to a numerical (physics-based) model of identical prognostic state variables and resolution were the ones most affected by parameterized processes in the numerical model.

The results suggest that the current version of our ML model has potential in short-term weather forecasting. Because the parallel computational algorithm is highly scalable, it could be easily adapted to higher spatial resolutions on a larger supercomputer. As the algorithm is highly efficient in terms of wall clock time, it could be used for rapid forecast applications and could also be implemented in a limited-area rather than a global setting. The ML modeling technique described here could also be applied to other geophysical fluid dynamical systems.

## Acknowledgments

This work was supported by DARPA Contract DARPA-PA-18-01 (HR111890044). The work of T. A. and I. S. was also supported by ONR Award N00014-18-2509. This research was conducted with the advanced computing resources provided by Texas A&M High Performance Research Computing. This paper greatly benefited from stimulating discussions with Sarthak Chandra, Michelle Girvan, Garrett Katz, and Andrew Pomerance. Peter Dueben of ECMWF and an anonymous reviewer provided valuable comments to improve the paper. The new data generated for the paper are available online (http://doi.org/10.5281/zenodo.3712157).