The paper investigates the applicability of machine learning (ML) to weather prediction by building a reservoir computing-based, low-resolution, global prediction model. The model is designed to take advantage of the massively parallel architecture of a modern supercomputer. The forecast performance of the model is assessed by comparing it to that of daily climatology, persistence, and a numerical (physics-based) model of identical prognostic state variables and resolution. Hourly resolution 20-day forecasts with the model predict realistic values of the atmospheric state variables at all forecast times for the entire globe. The ML model outperforms both climatology and persistence for the first three forecast days in the midlatitudes, but not in the tropics. Compared to the numerical model, the ML model performs best for the state variables most affected by parameterized processes in the numerical model.
- A low-resolution, global, reservoir computing-based machine learning (ML) model can forecast the atmospheric state
- The training of the ML model is computationally efficient on a massively parallel computer
- Compared to a numerical (physics-based) model, the ML model performs best for the state variables most affected by parameterized processes
The ultimate goal of our research is to develop a hybrid (numerical-machine learning, ML) weather prediction model. We hope to achieve this goal by implementing algorithms developed by Pathak, Hunt, et al. (2018), Pathak, Wikner, et al. (2018), and Wikner et al. (2020): The first paper introduced an efficient ML algorithm for numerical model-free prediction of large, spatiotemporal dynamical systems, based solely on the knowledge of past states of the system; the second paper showed how to combine an ML algorithm with an imperfect numerical model of a dynamical system to obtain a hybrid model that predicts the system more accurately than either component alone; while the third paper combined the techniques of the first two into a computationally efficient hybrid modeling approach. The present paper implements the parallel ML technique of Pathak, Hunt, et al. (2018) to build a model that predicts the weather in the same format as a global numerical model. We train and verify the model on hourly ERA5 reanalysis data from the European Centre for Medium-Range Weather Forecasts (Hersbach et al., 2019).
The work presented here can also be considered an attempt to develop an ML model that can predict the evolution of the three-dimensional, multivariate, global atmospheric state. To the best of our knowledge, the only similar prior attempts were those by Scher (2018) and Scher and Messori (2019), but they trained their three-dimensional, multivariate ML model on data that were produced by low-resolution numerical model simulations. In addition, Dueben and Bauer (2018) and Weyn et al. (2019, 2020) designed ML models to predict two-dimensional, horizontal fields of select atmospheric state variables. Similar to our verification strategy, they also verified the ML forecasts against reanalysis data. Compared to all of the aforementioned studies, an important new aspect of our work is that we employ reservoir computing (RC) (Jaeger, 2001; Lukoševičius & Jaeger, 2009; Lukoševičius, 2012; Maass et al., 2002) rather than deep learning (e.g., Goodfellow et al., 2016), a choice primarily motivated by the significantly lower computer wall clock time required to train an RC-based model. This difference in training efficiency would allow for a larger number of experiments to tune the ML model at higher resolutions.
The structure of the paper is as follows. Section 2 describes the ML model, while Section 3 presents the results of the forecast experiments, using as benchmarks persistence of the atmospheric state, daily climatology, and numerical forecasts from a physics-based model of identical prognostic state variables and resolution. Section 4 summarizes our conclusions.
2 The ML Model
The N components of the state vector vm(t) of the ML model are the grid point values associated with the spatially discretized fields of the Eulerian dependent variables of the model. Training the model requires the availability of a discrete time series of past observation-based estimates (analyses) va(kΔt) (k=−K,−K+1,…,0) of the atmospheric states that use the same N-dimensional representation of the state as the model. Beyond the training period, the analyses va(kΔt) (k=1,2,…) are used only to maintain the synchronization of the model state with the observed atmospheric state. An ML forecast can potentially be started at any analysis time kΔt (k=0,1,…): The forecast is a discrete time series of model states vm(k′Δt) (k′=k+1,k+2,…), where kΔt is the initial time, va(kΔt) is the initial state, Δt is the time step, and (k′−k)Δt is the forecast time. The computational algorithm of the model is designed to take advantage of a massively parallel computer architecture.
2.1 Representation of the Model State
2.1.1 The Global State Vector
We define vm(t) by the grid-based state vector of the physics-based numerical model SPEEDY (Kucharski et al., 2013; Molteni, 2003). While SPEEDY is a spectral transform model, it uses the grid-based state vector to represent the input and output state of the model and to compute the nonlinear and parameterized terms of the physics-based prognostic equations. The horizontal grid spacing is 3.75° × 3.75° and the model has nv=8 vertical σ levels (at σ = 0.025, 0.095, 0.20, 0.34, 0.51, 0.685, 0.835, and 0.95), where σ is the ratio of pressure to the pressure at the surface. The model has four three-dimensional dependent variables (the two horizontal coordinates of the wind vector, temperature, and specific humidity) and one two-dimensional dependent variable (the logarithm of surface pressure). Thus, the number of variables per horizontal location is nt=4×nv+1=33. Because there are nh=96×48=4,608 horizontal grid points, the total number of model variables is N=nt×nh=1.52064 × 10^5. Before forming the state vector vm(t), we standardize each state variable by subtracting its climatological mean and dividing by its climatological standard deviation at the particular model level in the local region.
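The standardization step can be sketched as follows; the synthetic time series, its shape, and the per-column treatment are our own illustrative assumptions, not details from the paper:

```python
import numpy as np

# Synthetic stand-in for a time series of model variables at one level/region:
# 1,000 hourly samples of the 33 variables per horizontal location.
rng = np.random.default_rng(0)
series = 10.0 + 2.0 * rng.standard_normal((1000, 33))

clim_mean = series.mean(axis=0)   # climatological mean of each variable
clim_std = series.std(axis=0)     # climatological standard deviation of each variable
standardized = (series - clim_mean) / clim_std
```

After this transformation every variable has zero mean and unit variance over the training period, so variables with very different physical units and magnitudes contribute comparably to the state vector.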
2.1.2 Local State Vectors
The global model domain is partitioned into L=1,152 local regions. We use a Mercator (cylindrical) map projection to define the local regions, partitioning the three-dimensional model domain only in the two horizontal directions: Each local region has the shape of a rectangular prism with a 7.5° × 7.5° base (Figure 1). The model state in local region ℓ (ℓ=1,2,…,L) is represented by the local state vector vℓm(t), whose components are defined by the Dv=4×nt=132 components of the global state vector in the local region. The model computes the L evolved local state vectors vℓm(t+Δt) from vm(t) in parallel, and the evolved global state vector vm(t+Δt) is obtained by piecing the L evolved local state vectors together.
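The partitioning of the global grid into 2 × 2-column local regions and the reassembly of the evolved local vectors can be sketched with NumPy (the array layout is our own illustrative choice; the actual implementation is parallel Fortran):

```python
import numpy as np

NLAT, NLON, NVAR = 48, 96, 33   # 3.75-degree grid, 33 variables per column
PATCH = 2                        # a 7.5 x 7.5-degree region spans 2 x 2 grid columns

def split_into_regions(field):
    """Partition a (lat, lon, var) field into L local state vectors of length D_v."""
    blocks = field.reshape(NLAT // PATCH, PATCH, NLON // PATCH, PATCH, NVAR)
    blocks = blocks.transpose(0, 2, 1, 3, 4)              # group the 2x2 columns together
    return blocks.reshape(-1, PATCH * PATCH * NVAR)       # (L, D_v) = (1152, 132)

def merge_regions(local):
    """Piece the evolved local state vectors back into a global field."""
    blocks = local.reshape(NLAT // PATCH, NLON // PATCH, PATCH, PATCH, NVAR)
    return blocks.transpose(0, 2, 1, 3, 4).reshape(NLAT, NLON, NVAR)

field = np.arange(NLAT * NLON * NVAR, dtype=float).reshape(NLAT, NLON, NVAR)
local = split_into_regions(field)   # 1,152 local vectors of 132 components each
```

The split/merge round trip is lossless, which is what allows the L regions to be evolved independently and in parallel.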
2.2 The Computational Algorithm
The computation of vm(t+Δt) from vm(t) requires the evaluation of a composite (chain) function for each local state vector. Because we use an RC algorithm, this composite function has only three layers: the input layer, the reservoir, and the output layer. A key feature of RC is that the trainable parameters of the model appear only in the output layer, which greatly simplifies the training process.
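A minimal single-region sketch of this three-layer structure is shown below. The tanh reservoir update, the weight distributions, and the reservoir size are standard reservoir-computing conventions assumed for illustration; only the readout matrix P would be trained:

```python
import numpy as np

rng = np.random.default_rng(1)
D_in, D_r = 132, 500   # local state size from the text; small reservoir for illustration

W_in = rng.uniform(-0.5, 0.5, (D_r, D_in))       # input layer: fixed random weights
A = rng.uniform(-1.0, 1.0, (D_r, D_r))
A *= 0.4 / np.max(np.abs(np.linalg.eigvals(A)))  # reservoir: fixed, spectral radius 0.4
P = np.zeros((D_in, D_r))                        # output layer: the ONLY trained weights

def step(r, u):
    """Advance the reservoir state one step, then apply the linear readout."""
    r_next = np.tanh(A @ r + W_in @ u)
    return r_next, P @ r_next

r, v_out = step(np.zeros(D_r), np.ones(D_in))
```

Because W_in and A stay fixed, training reduces to a linear fit for P, which is what makes RC training so much cheaper than training a deep network end to end.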
2.2.2 The Input Layer and Reservoir
The “local approach” of Dueben and Bauer (2018), which was introduced independently of the parallel technique of Pathak, Wikner, et al. (2018), employs a localization strategy that is formally similar to the one described here. There is, however, an important difference between the two localization techniques: Dueben and Bauer (2018) trained a single common neural network for the different local regions, while we train a different reservoir for each local region.
2.2.3 The Output Layer
2.2.4 Synchronization and Training
We define the local analysis vℓa(kΔt) by the components of the global analysis va(kΔt) (k=−K,−K+1,…) that describe the state in local region ℓ. In other words, vℓa(kΔt) is the observation-based estimate of the desired value of the model state vℓm(kΔt). Likewise, we define the extended local analysis as the observation-based estimate of the extended local state vector at times kΔt (k=−K,−K+1,…).
The synchronization and training of the ML model starts with feeding the past analyses to the reservoirs, or more precisely, by substituting the extended local analyses (k=−K,−K+1,…,−1) for the reservoir inputs in equation (1). Thus, the output layer, equation (3), is not needed to compute rℓ(kΔt) for k=−K+1,−K+2,…,0: We generate rℓ(−KΔt) randomly, discard the transient sequence rℓ(kΔt), k=−K,−K+1,…,−Kt, and define rℓ(kΔt) for k=−Kt+1,−Kt+2,…,0 according to equation (1), with Pℓ as yet undetermined.
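The procedure described above, driving the reservoir with past analyses, discarding the initial transient, and then fitting only the output layer, can be sketched for a single region. The toy dimensions, the tanh update, and the Tikhonov-regularized least squares form are standard RC assumptions consistent with (but not copied from) the paper's equations:

```python
import numpy as np

rng = np.random.default_rng(2)
D, D_r, K, K_t = 20, 200, 2000, 200   # toy sizes for illustration
beta = 1e-5                           # regularization parameter, as quoted in the text

W_in = rng.uniform(-0.1, 0.1, (D_r, D))
A = rng.uniform(-1.0, 1.0, (D_r, D_r))
A *= 0.4 / np.max(np.abs(np.linalg.eigvals(A)))

analyses = rng.standard_normal((K, D))  # stand-in for the past local analyses

# Drive the reservoir with the analyses, discarding the first K_t transient states.
r = rng.standard_normal(D_r)            # randomly generated initial reservoir state
R, Y = [], []
for k in range(K - 1):
    r = np.tanh(A @ r + W_in @ analyses[k])
    if k >= K_t:
        R.append(r)
        Y.append(analyses[k + 1])       # desired output: the next analysis
R, Y = np.array(R).T, np.array(Y).T     # shapes (D_r, n) and (D, n)

# Ridge (Tikhonov) regression for the output weights P:
# solve (R R^T + beta * I) P^T = R Y^T, a dense linear problem.
P = np.linalg.solve(R @ R.T + beta * np.eye(D_r), R @ Y.T).T
```

Note that the expensive part is a single linear solve per region, not an iterative gradient descent, which is the source of the training-time advantage discussed in section 1.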
2.3 Implementation on ERA5 Reanalysis Data
The global analyses va(kΔt) (k=−K,−K+1,…) are hourly ERA5 reanalyses interpolated to the computational grid and adjusted to the topography of SPEEDY. The training starts at 0000 UTC 1 January 1981 and ends at 2000 UTC 24 January 2000 (K≈1.66 × 10^5). We add small-magnitude random noise ε(t) to the extended local analyses (k=−K,−K+1,…,−1) before substituting them for the reservoir inputs in equation (1), in order to improve the robustness of the ML model to noise (Jaeger, 2001). The transient sequence of K−Kt discarded reservoir states corresponds to the first 43 days of training.
2.3.2 Code Implementation and Performance
The current computer code of the ML model is written in Fortran, using both MPI and OpenMP for parallelization and the LAPACK routine DGESV to solve the linear problem of equation (6). The computations of both the training and forecast phases are carried out on 1,152 Intel Xeon E5-2670 v2 processors. Training the model takes 67 min of wall clock time and requires 2.2 GB of distributed memory per processor. Our current code is designed to minimize the wall clock execution time given the available memory on a particular supercomputer, but the memory usage could be reduced (e.g., by not keeping all training data in memory simultaneously or by using single-precision rather than double-precision arithmetic).
2.4 The Forecast Cycle
Beyond the training period, the analyses are used only to maintain the synchronization between the reservoirs and the atmosphere. We use the hourly reanalyses for synchronization but start a new 20-day forecast only once every 48 hr. (Preparing a 20-day forecast takes about 1 min of wall clock time.) We prepare a total of 171 forecasts for the period from 25 January to 28 December 2000. The forecast error statistics reported below are calculated based on these forecasts.
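The forecast cycle, teacher-forced synchronization on recent analyses followed by an autonomous (closed-loop) forecast, can be illustrated schematically. The toy dimensions and the untrained placeholder output matrix are our own assumptions; in the real model the readout is the trained Pℓ of each region:

```python
import numpy as np

rng = np.random.default_rng(3)
D, D_r = 10, 100
W_in = rng.uniform(-0.1, 0.1, (D_r, D))
A = rng.uniform(-1.0, 1.0, (D_r, D_r))
A *= 0.3 / np.max(np.abs(np.linalg.eigvals(A)))
P = 0.01 * rng.standard_normal((D, D_r))   # placeholder for a trained output layer

def synchronize(r, analyses):
    """Keep the reservoir synchronized with the atmosphere (outputs not fed back)."""
    for u in analyses:
        r = np.tanh(A @ r + W_in @ u)
    return r

def forecast(r, u0, n_steps):
    """Autonomous prediction: each predicted state becomes the next input."""
    out, u = [], u0
    for _ in range(n_steps):
        r = np.tanh(A @ r + W_in @ u)
        u = P @ r
        out.append(u)
    return np.asarray(out)

r = synchronize(np.zeros(D_r), rng.standard_normal((48, D)))  # 48 hourly analyses
traj = forecast(r, rng.standard_normal(D), 20 * 24)           # hourly 20-day forecast
```

The only difference between the two phases is where the reservoir input comes from: analyses during synchronization, the model's own output during the forecast.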
2.4.1 Selection of the Hyperparameters
The dimension Dr of the reservoir, average degree κ of the sparse random network, spectral radius ρ, random noise ε, and regularization parameter β are the hyperparameters of the RC algorithm. We found suitable combinations of these parameters by numerical experimentation, monitoring the accuracy and stability of the forecasts. All results reported in this paper are for Dr = 9,000, κ = 6, and β = 10^−5, while ρ increases monotonically from 0.3 at the equator to 0.7 at 45° latitude and remains 0.7 poleward of 45°. The components of ε are uncorrelated, normally distributed random numbers with mean zero and standard deviation 0.28. For this combination of the hyperparameters, the ML model predicts realistic values of all state variables for the entire globe and the full 20-day forecast period.
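Constructing a sparse random reservoir network with a prescribed average degree and a latitude-dependent spectral radius might look like the sketch below; the Erdős–Rényi-style construction and the linear ρ(latitude) ramp are our reading of the text, not code from the paper:

```python
import numpy as np

def make_reservoir(d_r, avg_degree, rho, rng):
    """Sparse random matrix with the given average degree, rescaled to spectral radius rho."""
    mask = rng.random((d_r, d_r)) < avg_degree / d_r   # ~avg_degree nonzeros per row
    a = np.where(mask, rng.uniform(-1.0, 1.0, (d_r, d_r)), 0.0)
    return a * (rho / np.max(np.abs(np.linalg.eigvals(a))))

def spectral_radius_for(lat_deg):
    """Linear ramp from 0.3 at the equator to 0.7 at 45 degrees and poleward."""
    return 0.3 + 0.4 * min(abs(lat_deg), 45.0) / 45.0

rng = np.random.default_rng(4)
rhos = [spectral_radius_for(lat) for lat in (0.0, 22.5, 45.0, 60.0)]
A = make_reservoir(300, 6, rhos[-1], rng)   # small d_r = 300 for illustration
```

Rescaling by the largest eigenvalue magnitude guarantees the requested spectral radius exactly, which controls the memory and stability of the reservoir dynamics.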
3 Forecast Verification Results
3.1 Benchmark Forecasts
We use daily climatology, persistence, and numerical forecasts for the evaluation of the ML model forecasts. Persistence is based on the assumption that the initial atmospheric state will persist for the entire time of the forecast. The numerical forecasts are prepared by Version 42 of the SPEEDY model. While SPEEDY has been developed for research applications rather than weather prediction, it can be considered a low-resolution version of today's numerical weather prediction models. Most importantly, similar to all operational models, it solves the system of atmospheric primitive equations and has a realistic climate. It provides a good benchmark in the current stage of our research, in which the primary goal is to prove a concept rather than improve operational forecasts.
We verify all forecasts against ERA5 reanalyses interpolated to the computational grid and adjusted to the SPEEDY orography. The magnitude of the forecast error is measured by the mean of the area-weighted root-mean-square difference between the forecasts and the verification data for all forecasts. Results are shown for selected variables in the Northern Hemisphere (NH) midlatitudes for the first 72 forecast hours (Figure 2). In this region, the ML model outperforms both persistence and climatology by a large margin in the first 48 forecast hours. While the ML model forecasts remain more accurate than persistence in the next 24 forecast hours, their skill, with the exception of the temperature forecasts, degrades to that of climatology. In the tropics (results not shown) the accuracy of the ML model is very similar to that of persistence and climatology.
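For a single variable on a regular latitude-longitude grid, the area-weighted root-mean-square error used above can be computed roughly as follows (the cos-latitude weighting is the standard convention; the latitude values and random fields are illustrative):

```python
import numpy as np

def area_weighted_rmse(forecast, verification, lats_deg):
    """RMS forecast-minus-verification difference with cos(latitude) area weights."""
    w = np.cos(np.deg2rad(lats_deg))[:, None]     # (nlat, 1), broadcast over longitude
    sq_err = (forecast - verification) ** 2
    return float(np.sqrt(np.sum(sq_err * w) / (np.sum(w) * sq_err.shape[1])))

lats = np.linspace(-88.125, 88.125, 48)           # 3.75-degree grid
rng = np.random.default_rng(5)
f = rng.standard_normal((48, 96))
err = area_weighted_rmse(f, f + 0.1, lats)        # uniform 0.1 error gives RMSE near 0.1
```

The cos(latitude) weights prevent the densely spaced polar grid points from dominating the global error statistic.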
The performance of the ML model compared to SPEEDY is mixed: The ML forecasts are more accurate for the specific humidity near the surface, especially at 24- and 48-hr forecast times, while the SPEEDY forecasts are more accurate for the wind, particularly at the jet level. The ML temperature forecasts are also more accurate in the tropics (results not shown), where the SPEEDY forecasts rapidly develop a large bias in the upper troposphere.
To better understand the behavior of the root-mean-square error, we decomposed its square into a squared bias component and a variance component and also investigated the power spectrum of the variance in the NH midlatitudes with respect to the zonal wave number (results not shown). On the positive side, the ML forecasts of the different variables have little or no bias, and the variance of the longer term forecasts saturates at a realistic level for zonal wave numbers larger than 6. On the negative side, the variance saturates at unrealistically high levels at the lower wave numbers, leading to an overprediction of the spatial variability of the forecast fields at the longer forecast times. The fast growth of the variance at the large scales, especially at Wave Number 4, is the main deficiency of the ML model in the midlatitudes. Fixing this problem could extend the time range of forecast skill by days.
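The decomposition MSE = bias² + variance and a zonal power spectrum can be illustrated with synthetic data (the error samples and latitude circle below are placeholders, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(6)
errors = 0.5 + rng.standard_normal(10_000)  # synthetic forecast-minus-truth errors

mse = np.mean(errors ** 2)
bias = np.mean(errors)
variance = np.var(errors)                   # population variance, so mse == bias**2 + variance

circle = rng.standard_normal(96)            # one latitude circle (96 longitudes)
power = np.abs(np.fft.rfft(circle)) ** 2    # variance contribution per zonal wave number 0..48
```

The identity holds exactly with the population variance; the rfft of a 96-point circle resolves zonal wave numbers 0 through 48, covering the wave-number ranges discussed above.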
3.3 Near-Surface Humidity and Tropical Temperature Profiles
The short-term forecast advantage of the ML model over SPEEDY has two sources. First, while the SPEEDY forecasts rapidly develop a near-surface humidity bias, the ML model forecasts are free of such bias. Second, the variance of the ML model forecast errors is also lower initially. As forecast time increases, the advantage of the ML model remains in terms of the bias, but vanishes in terms of the variance. Because the variance becomes the dominant component at the later forecast times, climatology breaks even with the ML model forecasts by 72-hr forecast time (bottom right panel of Figure 2). The spatial distribution of the difference of the errors (Figure 3) suggests that the ML model performs better in regions where parameterized atmosphere-surface interactions play an important role in the moist processes in SPEEDY (e.g., regions of the ocean boundary currents). Likewise, the advantage of the ML model in predicting the tropical temperature profiles (not shown) is the result of large biases that are present only in the SPEEDY forecasts in the main regions of parameterized deep convection. Finally, it should be noted that while the current version of the ML model learns about atmosphere-surface interactions strictly from the atmospheric training data, SPEEDY uses a number of prescribed fields to describe the surface conditions (e.g., a spatiotemporally evolving sea surface temperature analysis).
3.4 Rossby Wave Propagation
The forecast variable for which SPEEDY clearly outperforms the ML model is the meridional component of the wind: While the accuracy of the wind forecasts by the two models is similar at 24 hr, the error of the ML model forecasts grows more rapidly beyond that time. The difference between the errors of the two models grows the fastest in the layer around the jet streams of the NH midlatitudes (between 400 and 200 hPa). Because the variability of the meridional wind in this layer is dominated by dispersive synoptic-scale Rossby waves, the aforementioned result suggests that the ML model may be inferior to the numerical model in describing the Rossby wave dynamics. To investigate this possibility, we plot Hovmöller diagrams of the meridional wind for both forecasts and the verification data (Figure 4).
A pattern of negative (positive) values followed by a pattern of positive (negative) values indicates a trough (ridge). Because the eastward group velocity of the dispersive Rossby waves at the synoptic scales is larger than their eastward phase velocity, new troughs and ridges can develop downstream of the original waves. Such developments are marked by the dashed black lines in the figure. In the first three days, the ML model captures the dispersive dynamics of the wave packets accurately, but because the synoptic-scale wave packets are composed of Wave Number 4–11 waves (e.g., Zimin et al., 2003), the overintensification of the Wave Number 4–6 components at the later forecast times leads to a gradual shift of the carrier wave number toward lower values and a deceleration of the group velocity.
We demonstrated that an RC-based parallel ML model can predict the global atmospheric state in the same gridded format as a numerical (physics-based) global weather prediction model. We found that the 20-day ML model forecasts predicted realistic values of all state variables at all forecast times for the entire globe. The ML model predicted the weather in the midlatitudes more accurately than either persistence or climatology for the first three forecast days. This time range could be significantly extended by eliminating, or at least reducing, the overprediction of atmospheric spatial variability at the large scales (wave numbers lower than 7). The forecast variables for which the ML model performed best compared to a numerical (physics-based) model of identical prognostic state variables and resolution were the ones most affected by parameterized processes in the numerical model.
The results suggest that the current version of our ML model has potential in short-term weather forecasting. Because the parallel computational algorithm is highly scalable, it could be easily adapted to higher spatial resolutions on a larger supercomputer. As the algorithm is highly efficient in terms of wall clock time, it could be used for rapid forecast applications and could also be implemented in a limited-area rather than a global setting. The ML modeling technique described here could also be applied to other geophysical fluid dynamical systems.
This work was supported by DARPA Contract DARPA-PA-18-01 (HR111890044). The work of T. A. and I. S. was also supported by ONR Award N00014-18-2509. This research was conducted with the advanced computing resources provided by Texas A&M High Performance Research Computing. This paper greatly benefited from stimulating discussions with Sarthak Chandra, Michelle Girvan, Garrett Katz, and Andrew Pomerance. Peter Dueben of ECMWF and an anonymous reviewer provided valuable comments to improve the paper. The new data generated for the paper are available online (http://doi.org/10.5281/zenodo.3712157).
- Dueben, P. D., & Bauer, P. (2018). Challenges and design choices for global weather and climate models based on machine learning. Geoscientific Model Development, 11, 3999–4009.
- Gilbert, E. N. (1959). Random graphs. Annals of Mathematical Statistics, 30, 1141–1144.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. Cambridge, MA: MIT Press.
- Hersbach, H., et al. (2019). Global reanalysis: Goodbye ERA-Interim, hello ERA5. ECMWF Newsletter, 159, 17–24.
- Jaeger, H. (2001). The “echo state” approach to analyzing and training recurrent neural networks (GMD Report 148). German National Research Center for Information Technology.
- Kucharski, F., et al. (2013). On the need of intermediate complexity general circulation models: A “SPEEDY” example. Bulletin of the American Meteorological Society, 94, 25–30.
- Lukoševičius, M. (2012). A practical guide to applying echo state networks. In G. Montavon, G. B. Orr, & K.-R. Müller (Eds.), Neural networks: Tricks of the trade (2nd ed., pp. 659–686). Heidelberg: Springer.
- Lukoševičius, M., & Jaeger, H. (2009). Reservoir computing approaches to recurrent neural network training. Computer Science Review, 3, 127–149.
- Maass, W., Natschläger, T., & Markram, H. (2002). Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14, 2531–2560.
- Molteni, F. (2003). Atmospheric simulations using a GCM with simplified parameterizations I: Model climatology and variability in multi-decadal experiments. Climate Dynamics, 20, 175–191.
- Pathak, J., Hunt, B., Girvan, M., Lu, Z., & Ott, E. (2018). Model-free prediction of large spatiotemporally chaotic systems from data: A reservoir computing approach. Physical Review Letters, 120, 024102.
- Pathak, J., Wikner, A., Fussell, R., Chandra, S., Hunt, B. R., Girvan, M., & Ott, E. (2018). Hybrid forecasting of chaotic processes: Using machine learning in conjunction with a knowledge-based model. Chaos, 28, 041101.
- Scher, S. (2018). Toward data-driven weather and climate forecasting: Approximating a simple general circulation model with deep learning. Geophysical Research Letters, 45, 12,616–12,622.
- Scher, S., & Messori, G. (2019). Weather and climate forecasting with neural networks: Using general circulation models (GCMs) with different complexity as a study ground. Geoscientific Model Development, 12, 2797–2809.
- Tikhonov, A. N., & Arsenin, V. Y. (1977). Solutions of ill-posed problems. Washington, DC: Winston.
- Weyn, J. A., Durran, D. R., & Caruana, R. (2019). Can machines learn to predict weather? Using deep learning to predict gridded 500-hPa geopotential height from historical weather data. Journal of Advances in Modeling Earth Systems, 11, 2680–2693. https://doi.org/10.1029/2019MS001705
- Weyn, J. A., Durran, D. R., & Caruana, R. (2020). Improving data-driven global weather prediction using deep convolutional neural networks on a cubed sphere. https://doi.org/10.1002/essoar.10502543.1
- Wikner, A., et al. (2020). Combining machine learning with knowledge-based modeling for scalable forecasting and subgrid-scale closure of large, complex, spatiotemporal systems. https://arxiv.org/pdf/2002.05514.pdf
- Zimin, A. V., Szunyogh, I., Patil, D. J., Hunt, B. R., & Ott, E. (2003). Extracting envelopes of Rossby wave packets. Monthly Weather Review, 131, 1011–1017.