Temporal Subsampling Diminishes Small Spatial Scales in Recurrent Neural Network Emulators of Geophysical Turbulence
Abstract
The immense computational cost of traditional numerical weather and climate models has sparked the development of machine learning (ML) based emulators. Because ML methods benefit from long records of training data, it is common to use data sets that are temporally subsampled relative to the time steps required for the numerical integration of differential equations. Here, we investigate how this often overlooked processing step affects the quality of an emulator's predictions. We implement two ML architectures from a class of methods called reservoir computing: (a) a form of Nonlinear Vector Autoregression (NVAR), and (b) an Echo State Network (ESN). Despite their simplicity, it is well documented that these architectures excel at predicting low dimensional chaotic dynamics. We are therefore motivated to test these architectures in an idealized setting of predicting high dimensional geophysical turbulence as represented by Surface Quasi-Geostrophic dynamics. In all cases, subsampling the training data consistently leads to an increased bias at small spatial scales that resembles numerical diffusion. Interestingly, the NVAR architecture becomes unstable when the temporal resolution is increased, indicating that the polynomial based interactions are insufficient at capturing the detailed nonlinearities of the turbulent flow. The ESN architecture is found to be more robust, suggesting a benefit to the more expensive but more general structure. Spectral errors are reduced by including a penalty on the kinetic energy density spectrum during training, although the subsampling related errors persist. Future work is warranted to understand how the temporal resolution of training data affects other ML architectures.
Key Points
- Reducing training data temporal resolution by subsampling leads to overly dissipative small spatial scales in neural network emulators
- A quadratic autoregressive architecture is shown to be inadequate at capturing small scale turbulence, even when data are not subsampled
- Subsampling bias in Echo State Networks is mitigated but not eliminated by prioritizing kinetic energy spectrum during training
Plain Language Summary
The computer models that govern weather prediction and climate projections are extremely costly to run, causing practitioners to make unfortunate tradeoffs between accuracy of the physics and credibility of their statistics. Recent advances in machine learning have sparked the development of neural network-based emulators, that is, low-cost models that can be used as drop-in replacements for the traditional expensive models. Due to the cost of storing large weather and climate data sets, it is common to subsample these fields in time to save disk space. This subsampling also reduces the computational expense of training emulators. Here, we show that this pre-processing step hinders the fidelity of the emulator. We offer one method to mitigate the resulting errors, but more research is needed to understand and eventually overcome them.
1 Introduction
Weather and climate prediction requires the numerical integration of one or more computational models derived from the fundamental equations of motion and initialized with an estimate of the present-day system state (e.g., temperature, wind speeds, etc.). Due to the high cost of these computational models, prediction systems typically require suboptimal tradeoffs. On one hand, it is desirable to increase the credibility of the underlying numerical model as much as possible, for instance by increasing model grid resolution (e.g., Hewitt et al., 2016) or by explicitly simulating as many coupled components (e.g., atmosphere, land, ocean, ice) as possible (e.g., Penny et al., 2017). On the other hand, knowledge of the model initial conditions is imperfect and the governing equations will always contain necessary, inexact approximations of reality. As a result, prediction systems employ statistical methods like ensemble based forecasting in order to represent this uncertainty. Producing an ensemble with statistical significance requires integrating the underlying numerical model many times, ideally with far more ensemble members than is typically affordable in practice (Evensen et al., 2022). Therefore, the resulting computational costs require practitioners to compromise between the fidelity of the numerical model and credibility of the statistical method.
An ongoing area of research that aims to enable statistical forecasting subject to the dynamics of an expensive numerical model is surrogate modeling. The general approach relies on using a model that represents or “emulates” the dynamics of the original numerical model with “sufficient accuracy” for the given application, while being computationally inexpensive to evaluate. Historically, surrogate models have been an important tool for nonlinear optimization (e.g., Bouhlel et al., 2020; Li et al., 2019), and in the Earth sciences have been developed with techniques such as Linear Inverse Models (e.g., principal oscillation or interaction patterns; Hasselmann, 1988; Penland, 1989; Moore et al., 2022), kriging (Cressie, 1993), or polynomial chaos techniques (Najm, 2009), to name only a few. More recently, advances in computing power, the rise of general purpose graphics processing units, and the explosion of freely available data has encouraged the exploration of more expensive machine learning methods like neural networks for the emulation task (Schultz et al., 2021). A number of data-driven, neural network architectures have been developed to generate surrogate models for weather forecasting and climate projection applications. Some examples include models based on feed forward neural networks (Dueben & Bauer, 2018), convolutional neural networks (CNNs; Scher, 2018; Scher & Messori, 2019; Rasp & Thuerey, 2021; Weyn et al., 2019, 2020, 2021), recurrent neural networks (RNNs; Arcomano et al., 2020; X. Chen et al., 2021; Nadiga, 2021), graph neural networks (Keisler, 2022; Lam et al., 2022), Fourier neural operators (Pathak et al., 2022), and encoder-transformer-decoder networks (Bi et al., 2023).
A significant advancement in surrogate modeling for weather and climate prediction has been the rapid increase in spatial resolution. To the best of our knowledge, the highest resolution currently achieved by neural network emulators for global atmospheric dynamics is ∼0.25° (∼31 km) (Bi et al., 2023; Lam et al., 2022; Pathak et al., 2022), which is the same resolution as the ERA5 Reanalysis (Hersbach et al., 2020) used to train these models. At this resolution, General Circulation Models (GCMs) of the atmosphere are capable of explicitly capturing important small scale processes like low-level jets and interactions with mountainous topography (Orlanski, 1975). However, it is not yet clear that neural networks are able to represent the same dynamical processes as the training data. Instead, based on our own experimentation, we hypothesize that without careful architectural modifications, neural network emulators will effectively operate at a coarser resolution than the original data set used in training.
To make the discussion concrete, we present a sample prediction from our own surrogate model in Figure 1. The panels show the time evolution of Sea Surface Temperature (SST) in the Gulf of Mexico at 1/25° horizontal resolution, using data from a Navy HYCOM, 3D-Var-based reanalysis product as “Truth” (upper row; see Appendix B for data details). We generate the prediction (middle row) with an RNN architecture described more fully in Section 3.4. Generally speaking, the RNN captures the largest scales of the SST pattern over a 36 hr window. However, as time progresses, the SST pattern becomes overly smooth. The RNN is unable to capture the spatial details that are well resolved in the reanalysis data set, with the largest errors evolving along sharp SST fronts. We note that a similar smoothing behavior can be observed in other neural network based emulators; see, for example, Bi et al. (2023, Figure 3), Pathak et al. (2022, Figures 4c and 4d), and Keisler (2022, Figure 5).
There are a number of reasons that could cause this smoothing behavior to manifest in the predictions. As we show in Sections 4 and 5, the blurring of small scale features is a high frequency spectral bias, which has been studied in relation to the training of feedforward neural networks (Xu et al., 2022) and numerical instabilities of neural network predictions for turbulent flows (Chattopadhyay & Hassanzadeh, 2023). One potential reason that we observe spectral bias in our predictions is that the training uses a mean-squared error loss function, which is known to prioritize large over small scale features (Rossa et al., 2008). Here, we suggest that any blurring effect from such a loss function is exacerbated by more fundamental decisions in the experimental design. Our primary goal is to explore how temporal subsampling in the training data set adds to this blurring effect. We are motivated to study the impact of this subsampling because many existing emulators, including our example in Figure 1, rely on reanalysis products as training data (e.g., Arcomano et al., 2020; Bi et al., 2023; Keisler, 2022; Lam et al., 2022; Pathak et al., 2022; Weyn et al., 2021). While there are excellent reasons to leverage the existence of reanalysis products, namely that they are constrained to observational data, the sheer size of the data requires some degree of temporal subsampling. It is therefore important to understand how this highly routine data reduction step impacts the performance of data-driven prediction methods when used for training.
In our work, we explore the degree to which temporal subsampling impedes single layer autoregressive and recurrent neural network emulators from learning the true underlying dynamics of the system. In order to isolate this effect from the potential impacts of a data assimilation system and multivariate interactions, we do not rely on the Gulf of Mexico reanalysis data. Instead, we use a model for Surface Quasi-Geostrophic (SQG) turbulence (Blumen, 1978a; Held et al., 1995), which additionally gives us direct control over the data sets used for training, validation, and testing. The SQG model and data set generation are described more fully in Section 2.
The architectures that we use in this study stem from a broad class of machine learning techniques termed reservoir computing (RC), which was independently discovered as Echo State Networks (ESNs; Jaeger, 2001), Liquid State Machines (Maass et al., 2002), and the Decorrelation Backpropagation Rule (Steil, 2004). One defining characteristic of RC models is that all internal connections are adjusted by global or “macro-scale” parameters, significantly reducing the number of parameters that need to be trained. The relatively simplified structure and training requirements of RC make it an attractive architecture for large scale prediction because it enables rapid development, and could be useful in situations requiring online learning. More importantly though, we are motivated to use RC because past studies have repeatedly shown that it can emulate low dimensional chaotic systems while often outperforming more complex RNNs such as those with Long Short-Term Memory units (LSTMs) (e.g., Griffith et al., 2019; Lu et al., 2018; Pathak et al., 2018; Platt et al., 2022; Vlachas et al., 2020). Additionally, Penny et al. (2022) showed that RC can be successfully integrated with a number of data assimilation algorithms, either by generating samples for ensemble based methods like the Ensemble Kalman Filter, or by generating the tangent linear model necessary for 4D-Var. Finally, we note that Gauthier et al. (2021) proposed a further simplification to the RC architecture based on insights from Bollt (2021) that unifies versions of RC with nonlinear vector autoregression (NVAR). For a variety of chaotic systems, this architecture has shown excellent prediction skill even with low order, polynomial-based feature vectors (Barbosa & Gauthier, 2022; T.-C. Chen et al., 2022; Gauthier et al., 2021), despite requiring a much smaller hidden state and less training data. Considering all of these advancements, we are motivated to use these simple yet powerful single layer NVAR and ESN architectures to emulate turbulent geophysical fluid dynamics, and study how they are affected by temporal subsampling (see Section 3 for architecture details).
2 Surface Quasi-Geostrophic Turbulence
Our model configuration is discretized in space with Nx = Ny = 64 and Nz = 2, uses a periodic boundary in both horizontal directions, and uses a timestep of Δt = 5 min. To generate data sets for the neural networks, we initialize the model with Gaussian i.i.d. noise and spin it up for 360 days, which we define as one model year. The spinup period is discarded, and we then generate a 25 year-long data set that we partition into training (first 15 years), validation (next 5 years), and testing (final 5 years). For validation and testing, we randomly select, without replacement, 12 hr time windows from each respective data set.
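To make the subsampling step concrete, the following minimal sketch shows how a stored SQG trajectory could be partitioned and temporally subsampled; the array layout, function name, and use of NumPy are illustrative assumptions rather than a description of the actual data pipeline.

```python
import numpy as np

DT_MINUTES = 5                                   # SQG model timestep
STEPS_PER_YEAR = 360 * 24 * 60 // DT_MINUTES     # one 360-day model year

def partition_and_subsample(snapshots, n_sub):
    """Split a (time, Nz, Ny, Nx) trajectory into 15/5/5 year partitions and
    keep every n_sub-th snapshot of the training portion."""
    n_train, n_valid = 15 * STEPS_PER_YEAR, 5 * STEPS_PER_YEAR
    train = snapshots[:n_train:n_sub]            # effective timestep = n_sub * 5 min
    valid = snapshots[n_train:n_train + n_valid]
    test = snapshots[n_train + n_valid:]
    return train, valid, test

# n_sub = 1, 4, 16 correspond to training data at 5, 20, and 80 min resolution.
```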
3 Single Layer Autoregressive and Recurrent Neural Networks
The neural network architectures that we use employ a common structure that is relevant to the readout operator and training procedure; we discuss these details in Section 3.1. Additionally, we employ a similar strategy to parallelize the architecture for high dimensional systems, and this is discussed in Section 3.2. Finally, the specific form of f(⋅) for the NVAR and ESN architectures is provided in Sections 3.3 and 3.4, respectively.
3.1 Linear Readout and Training
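As an illustration of the readout training used throughout reservoir computing, a minimal ridge regression (Tikhonov-regularized least squares) sketch is given below; the closed-form solve and the regularization value are assumptions based on common practice rather than a verbatim reproduction of the training equations used here.

```python
import numpy as np

def train_readout(hidden_states, targets, beta=1e-6):
    """Closed-form ridge regression for the linear readout,
    W_out = V H^T (H H^T + beta * I)^(-1).

    hidden_states : H, shape (N_r, N_time), hidden/feature states
    targets       : V, shape (N_out, N_time), target output states
    beta          : Tikhonov regularization strength (illustrative value)
    """
    H, V = hidden_states, targets
    gram = H @ H.T + beta * np.eye(H.shape[0])
    return np.linalg.solve(gram, H @ V.T).T      # W_out, shape (N_out, N_r)
```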
3.2 Parallelization Strategy
The model architectures that we use inherit the gridded structure of the target state being emulated, and often require hidden states that are many times larger than the dimension of the target system. Atmosphere and ocean GCMs typically propagate very high dimensional state vectors, so representing the entire system with a single hidden state would be intractable. Thus, we employ a parallelization strategy to distribute the target and hidden states across many semi-independent networks. Our strategy is based on the algorithm introduced by Pathak et al. (2018), and follows a similar construction to that of Arcomano et al. (2020). We outline the procedure here and note an illustration of the process for the ESN architecture in Figure 3.
We subdivide the domain into Ng rectangular groups based on horizontal location, akin to typical domain decomposition techniques for atmosphere and ocean GCMs on structured grids. Each group contains a small block of horizontal grid cells, and all Nz vertical grid cells at each horizontal location. The global state vector, v, which consists of all state variables to be emulated at all grid cells, is partitioned into Ng local state vectors, vk. For example, Figure 3 shows a field v decomposed into nine groups, where each group is delineated by white lines. In our SQG predictions, we subdivide the Nx × Ny = 64 × 64 horizontal grid into Ng = 64 groups.
In order to facilitate interactions between nearby groups, each group has a designated overlap, or “halo,” region which consists of No elements from its neighboring groups. The local group and overlapping points are illustrated in Figure 3 with a black box. The local state vectors, plus elements from the overlap region, are concatenated to form local input state vectors. The result from the network is the local output state vector, which is expanded to fill the target group as illustrated by the white box on the prediction shown in Figure 3. Here we set No = 1, so that, with Nz = 2 vertical levels, each local input vector is larger than its corresponding local output vector by a single ring of halo points at each level.
During the training phase, each group acts completely independently from one another. Therefore, the training process is embarrassingly parallel and allows us to scale the problem to arbitrarily large state vectors across a distributed computing system, subject to resource constraints. During the prediction phase, neighboring elements must be passed between groups in order to fill each overlap region at each time step with the most accurate state estimate possible, to ensure spatial consistency across the domain.
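A minimal sketch of this decomposition, assuming square groups and the doubly periodic SQG boundaries, is given below; the function name and arguments are illustrative.

```python
import numpy as np

def split_into_groups(field, n_side, n_overlap=1):
    """Partition a (Nz, Ny, Nx) field with periodic boundaries into square
    groups of n_side x n_side horizontal cells, each padded by a halo of
    n_overlap cells taken from neighboring groups.

    Returns (inputs, outputs): flattened local input vectors (with halo) and
    local output/target vectors (without halo), one row per group.
    """
    nz, ny, nx = field.shape
    padded = np.pad(field, ((0, 0), (n_overlap,) * 2, (n_overlap,) * 2),
                    mode="wrap")                 # periodic halo
    inputs, outputs = [], []
    for j0 in range(0, ny, n_side):
        for i0 in range(0, nx, n_side):
            outputs.append(field[:, j0:j0 + n_side, i0:i0 + n_side].ravel())
            inputs.append(padded[:, j0:j0 + n_side + 2 * n_overlap,
                                 i0:i0 + n_side + 2 * n_overlap].ravel())
    return np.stack(inputs), np.stack(outputs)

# If the groups are square, a 64 x 64 horizontal grid with Ng = 64 groups
# corresponds to n_side = 8; with No = 1 and Nz = 2, each local input then has
# 10 * 10 * 2 = 200 elements and each local output 8 * 8 * 2 = 128 elements.
```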
3.3 Nonlinear Vector Autoregression Design
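As a point of reference, the sketch below builds a quadratic NVAR feature vector in the style of Gauthier et al. (2021); the inclusion of a constant term and of all unique pairwise products, rather than a restricted subset, are assumptions of this sketch.

```python
import numpy as np

def nvar_features(lagged_inputs):
    """Quadratic NVAR feature vector built from N_lag lagged local inputs,
    [u(t), u(t - dtau), ..., u(t - (N_lag - 1) * dtau)].

    Returns the concatenation of a constant term, the linear (lagged) terms,
    and all unique quadratic products of the linear terms.
    """
    linear = np.concatenate(lagged_inputs)
    iu, ju = np.triu_indices(linear.size)        # unique pairs (i <= j), incl. squares
    quadratic = linear[iu] * linear[ju]
    return np.concatenate(([1.0], linear, quadratic))

# A linear readout, trained as in Section 3.1, then maps this feature vector
# to the next local state (or state increment).
```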
3.4 Echo State Network Design
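For reference, a minimal sketch of a leaky-integrator ESN update with the hyperbolic tangent activation used here is shown below; the specific combination and naming of the macro-scale parameters are illustrative stand-ins for Equation 7 rather than its exact form.

```python
import numpy as np

def esn_step(r, u, W, W_in, b, leak=0.9, sigma=0.5, rho=0.9):
    """One hidden-state update of a leaky-integrator ESN with tanh activation.

    r        : current hidden state, shape (N_r,)
    u        : current local input vector (with halo), shape (N_in,)
    W, W_in  : normalized adjacency and input matrices (see Appendix A)
    b        : bias vector
    leak, sigma, rho : examples of the global "macro-scale" parameters, theta_ESN
    """
    return (1.0 - leak) * r + leak * np.tanh(rho * (W @ r) + sigma * (W_in @ u) + b)

# In prediction mode, the readout v_hat = W_out @ r is fed back as the next input.
```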
4 Nonlinear Vector Autoregression Prediction Skill
We compute these quantities based on 50 twelve-hour predictions initialized from a random set of initial conditions taken from an unseen test data set. To compactly visualize the skill over all samples, each lineplot in the following subsections shows a sample-average value with a solid line, and the 99% confidence interval with shading. We note that in some cases the model trajectory becomes unstable to the point that infinite values are produced. In the event that any single sample from a distribution has produced infinity, we take the more conservative approach and cut off any statistical averaging or confidence interval computation at that point in time and carry it no further. Therefore, some plots of NRMSE over time do not extend over the full 12 hr window, even though some sample trajectories are still valid, for example, Figure 5 (left).
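The truncation rule described above can be made explicit with the following sketch; the use of sample quantiles for the 99% confidence band is an assumption about how the shaded interval is computed.

```python
import numpy as np

def summarize_nrmse(nrmse, level=0.99):
    """Sample-mean NRMSE and confidence band, truncated at the first time step
    where any individual prediction has produced non-finite values.

    nrmse : array of shape (N_samples, N_time), NRMSE(t) for each prediction
    """
    all_finite = np.isfinite(nrmse).all(axis=0)
    cutoff = nrmse.shape[1] if all_finite.all() else int(all_finite.argmin())
    valid = nrmse[:, :cutoff]
    mean = valid.mean(axis=0)
    lower, upper = np.quantile(valid, [(1 - level) / 2, (1 + level) / 2], axis=0)
    return mean, lower, upper
```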
4.1 Temporal Subsampling
Figure 4 shows a qualitative comparison of NVAR predictions as a function of Nsub, that is, how frequently the training data are sampled and the model makes predictions. For this figure, we set Nlag = 1, and note that both the NRMSE and a snapshot of the KE density relative error corresponding to this configuration are shown in Figure 5.
At the model timestep (Δτ = Δt = 5 min; Nsub = 1), the NVAR predictions are qualitatively similar to the truth for short forecast lead times. That is, the NRMSE is near 0, and many of the small scale features that exist in the truth are also evident in the predictions. However, at longer lead times the predictions become unstable. NRMSE spikes rapidly at about 4 hr after numerical instabilities are generated, which causes the NVAR model to produce physically unrealistic results. For reference, Figure S1 shows a view of what these numerical instabilities look like at their onset.
As the temporal resolution of the data is reduced, that is, as Nsub increases, the predictions are generally stable for a longer period of time. Figure 5 shows that for Nsub = 4, predictions are stable for roughly 6 hr, and for Nsub = 16 no predictions generate numerical instabilities over the 12 hr window. However, this stability comes with a cost: as the temporal resolution is reduced, the model's representation of small scale features diminishes as these features become more blurry or smoothed. This blurring effect is apparent in Figure 4, where the prediction is qualitatively more blurry as Nsub increases in each panel from left to right.
This smoothing behavior is captured quantitatively in the right panel of Figure 5, which shows the KE relative error as in Equation 11. Here, we plot the KE relative error after only 1.33 hr in order to capture the behavior before instabilities dominate the Nsub = 1 predictions. The plot indicates the degree of spectral bias in each solution, which is largest at the smaller spatial scales, corresponding to higher wavenumbers.
At Nsub = 1 there is a small positive bias at the smallest resolved spatial scales, indicating that this is when numerical instabilities are beginning to develop. The subsampled runs, Nsub = {4, 16}, show a negative bias, which corresponds to a dampened energy spectrum at the scales that are not resolved in the qualitatively smooth predictions shown in Figure 4. This negative bias is clearly larger with higher subsampling, or reduced temporal resolution, suggesting that as the data are subsampled, the network becomes incapable of tracking the small scale dynamics. The result is an averaged view of what may be occurring in between each time stamp.
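For completeness, the sketch below shows one way to compute a radially binned KE density spectrum and its relative error; the exact normalization and binning in Equation 11 may differ, and the use of horizontal velocity components as inputs is an assumption.

```python
import numpy as np

def ke_spectrum(u, v, dx_km):
    """Isotropically binned kinetic energy density spectrum of a 2D flow."""
    ny, nx = u.shape
    ke2d = 0.5 * (np.abs(np.fft.fft2(u))**2 + np.abs(np.fft.fft2(v))**2) / (nx * ny)**2
    kx = 2.0 * np.pi * np.fft.fftfreq(nx, d=dx_km)
    ky = 2.0 * np.pi * np.fft.fftfreq(ny, d=dx_km)
    kmag = np.hypot(*np.meshgrid(kx, ky))
    dk = 2.0 * np.pi / (nx * dx_km)
    k_bins = np.arange(1, nx // 2) * dk          # nominal (lower-edge) wavenumbers, rad/km
    idx = np.digitize(kmag.ravel(), k_bins)
    ke_binned = np.bincount(idx, weights=ke2d.ravel(), minlength=k_bins.size + 1)
    return k_bins, ke_binned[1:]                 # drop the |K| ~ 0 (domain mean) bin

def ke_relative_error(u_pred, v_pred, u_true, v_true, dx_km):
    """Per-wavenumber KE density relative error, in the spirit of Equation 11."""
    k, ke_p = ke_spectrum(u_pred, v_pred, dx_km)
    _, ke_t = ke_spectrum(u_true, v_true, dx_km)
    return k, (ke_p - ke_t) / ke_t
```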
4.2 Prediction Skill as a Function of Memory
A key feature of RNNs and autoregressive models is that they retain memory of previous system states. Given the explicit nature of the NVAR architecture, we explore the effect of adding memory by increasing Nlag, the number of lagged states used to create the feature vector. We first summarize how memory impacts prediction skill in Figure 6, which shows the NRMSE as a function of Nlag (colors) for each subsampling factor Nsub = {1, 4, 16} (panels). For any value of Nsub, adding memory (increasing Nlag) reduces the short term error. However, adding memory also tends to increase error by the end of the forecast, often leading to the development of numerical instabilities and an incoherent solution. Similarly, for any fixed value of Nlag, increasing the temporal resolution (decreasing Nsub) shows the same behavior.
To shed some light on how this additional memory impacts the solution, we show the KE relative error for the case of Nsub = 16 as a function of time (panels) and Nlag (colors) in Figure 7. For about the first 4 hr, increasing memory improves prediction skill at all spatial scales. However, beyond this point, the overall NRMSE grows rapidly, the improvement at small scales (|K| > 4 ⋅ 10⁻³ rad km⁻¹) is more muted, and error is propagated rapidly into the larger spatial scales.
We surmise that adding memory degrades the long term prediction skill in the quadratic NVAR because the relationships between points further back in history are governed by higher order nonlinear interactions that are incorrectly represented by the simple local-quadratic relation used here. As more terms that are incorrectly represented are added, the model becomes more and more unstable. We make this supposition based on the fact that, despite theoretical similarities between NVAR and ESNs as highlighted by Bollt (2021), we attain stable predictions using an ESN architecture with a hyperbolic tangent activation function in Section 5.
The question for the NVAR architecture is therefore how to retain the short term benefit of added memory capacity throughout the forecast horizon while maintaining a stable trajectory. While it may seem natural to explore higher order polynomials to properly represent this history, we do not explore this further because the size of the feature vector grows dramatically with the polynomial order (T.-C. Chen et al., 2022). Another option would be to explore entirely different basis functions. While this could be a potential option for future work, we note the findings of Zhang and Cornelius (2022), who show the extreme sensitivity of NVAR to the form of nonlinearity imposed. Given that it is an entirely open question on how to represent the smallest scales of geophysical turbulence, we do not explore other basis functions, and instead turn to the more general ESN architecture.
5 Echo State Network Prediction Skill
5.1 Soft Constraints on Spectral Error
It is well known that ESN prediction skill is highly dependent on the global or “macro-scale” parameters noted in Equation 7 (θESN, e.g., Platt et al., 2022; Lukoševičius, 2012). Following the success of previous studies in using Bayesian Optimization methods to systematically tune these parameters (Griffith et al., 2019; Penny et al., 2022; Platt et al., 2022), we use the Bayesian Optimization algorithm outlined by Jones et al. (1998) and implemented by Bouhlel et al. (2019) to find optimal parameter values.
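Each candidate parameter set evaluated by the optimizer is scored by combining forecast error with the spectral penalty; the additive, γ-weighted form below is an assumed reading of Equation 12, and run_validation_forecasts is a hypothetical helper.

```python
def macro_cost(theta_esn, gamma, run_validation_forecasts):
    """Outer-loop ("macro-scale") cost for one candidate set of global ESN
    parameters, as evaluated during the Bayesian optimization.

    run_validation_forecasts is a hypothetical helper that trains the ESN with
    theta_esn and returns sample-median NRMSE and KE_NRMSE over validation windows.
    """
    nrmse, ke_nrmse = run_validation_forecasts(theta_esn)
    return nrmse + gamma * ke_nrmse   # gamma = 0 recovers selection on NRMSE alone
```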
Figure 8 shows a qualitative view of how penalizing the KE density impacts ESN prediction skill when it operates at the original timestep of the SQG model (i.e., Nsub = 1). At γ = 0, the ESN parameters are selected based on NRMSE alone, and the prediction is relatively blurry. However, as γ increases to 10⁻¹, the prediction becomes sharper as the small scale features are better resolved.
Figure 9 gives a quantitative view of how the KE density penalty changes ESN prediction skill, once again with Nsub = 1. The first two panels show that there is a clear tradeoff between NRMSE and KE error: as γ increases the NRMSE increases but the spectral representation improves. The final panel in Figure 9 shows the spatial scales at which the spectral error manifests in these different solutions. When γ = 0, the macro-scale parameters are chosen to minimize NRMSE, leading to blurry predictions and a dampened spectrum at the higher wavenumbers, especially for |K| > 2 ⋅ 10⁻³ rad km⁻¹. We note that Lam et al. (2022) report the same behavior when using a cost function that is purely based on mean-squared error. On the other hand, when γ = 10⁻¹, the global parameters are chosen to minimize both NRMSE and KE density error, where the latter treats all spatial scales equally. In this case, KE relative error is reduced by more than a factor of two and the spectral bias at higher wavenumbers is much more muted.
Of course, the tradeoff for the reduced spectral error is larger NRMSE, resulting from slight mismatches in the position of small scale features in the forecast. However, our purpose is to generate forecasts that are as representative of the training data as possible. Overly smoothed forecasts are not desirable, because this translates to losing local extreme values, which are of practical importance in weather and climate. Additionally, a key aspect of ensemble forecasting is that the truth remains a plausible member of the ensemble (Kalnay et al., 2006). Therefore, representing the small scale processes, at least to some degree, will be critical for integrating an emulator into an ensemble based prediction system.
Finally, we note that using a cost function with only KE_NRMSE produced inconsistent results. Therefore, we consider it important to keep the NRMSE term in the cost function, as this prioritizes the position of small scale features; that is, it helps maintain phase information. Additionally, we note that there is some irreducible high wavenumber error, which is most clearly seen by comparing the prediction skill to a persistent forecast. While the sample median NRMSE for each γ value beats persistence, the KE_NRMSE is more than double that of persistence, due to this error at the small spatial scales. Ideally, our forecasts would beat persistence in both of these metrics, but obtaining the “realism” in the small spatial scales necessary to dramatically reduce this spectral error should be addressed in future work.
5.2 Temporal Subsampling
The NVAR predictions shown in Section 4.1 indicate that subsampling the training data systematically increases error at small spatial scales. However, the architecture was not specifically designed or constrained to have a good spectral representation of the underlying dynamics. On the other hand, the previous section (Section 5.1) showed that the spectral bias at high wavenumbers can be reduced by optimizing the global ESN parameters to the true KE density spectrum. Given these two results, we explore the following question: does temporal subsampling still increase spectral bias in the more general ESN framework, even when parameters are chosen to minimize this bias?
Figures 10 and 11 show that even when the macro-scale parameters are chosen to prioritize the KE density representation (i.e., γ = 10⁻¹ is fixed), temporal subsampling does lead to an apparently inescapable spectral bias. This effect is shown qualitatively in Figure 10, where the predictions become smoother as the temporal subsampling factor, Nsub, increases. The effect is similar to what was seen with NVAR, except that the blurring is less pronounced. Quantitatively, Figure 11b shows that as Nsub increases, error in the KE density spectrum generally increases, while panel (c) shows that this KE error is concentrated in the small spatial scales, |K| > 2 ⋅ 10⁻³ rad km⁻¹. We note that the degree of spectral bias at Nsub = 16 is smaller than what was achieved with NVAR for the same Nsub value, cf. Figure 7, indicating that the optimization was successful in reducing the spectral bias.
Interestingly, there is little difference between NRMSE obtained by the ESNs at different Nsub values. Additionally, Figure 12 shows that there is little difference in both NRMSE and KE_NRMSE when γ = 0, that is, when NRMSE is the only criterion for parameter selection. This result shows that NRMSE alone is not a good criterion for model selection, given that we have shown success in reducing spectral errors by prioritizing the spectrum appropriately.
5.3 Impact of the Hidden Layer Dimension
The dimension of the hidden layer, Nr, also known as the reservoir size, determines the memory capacity available to the ESN (Jaeger, 2001; Lukoševičius, 2012). For systems with high dimensional input signals, it is crucial to use a sufficiently large hidden layer to afford the memory capacity necessary for accurate predictions (Hermans & Schrauwen, 2010). In all of the preceding sections we fixed Nr = 6,000 for each local group, which is far larger than each group's local input and output dimensions. Here, we briefly address the effect of doubling the hidden layer dimension, while keeping the input and output dimensions constant, in order to test how sensitive our conclusions are to this crucial hyperparameter. Due to the computational expense of the parameter optimization discussed in Section 5.1, we only perform this experiment for Nsub = 16.
The impact of doubling Nr on prediction skill is shown in Figure 13, where for the sake of brevity we only show results for the case when γ = 10⁻¹ in Equation 12. The left panel shows that the larger hidden layer actually increases the NRMSE slightly. However, the middle and right panels show that this increase is due to the improved spectral representation. The improvement in KE_NRMSE is nearly proportional to the improvement achieved by increasing the temporal resolution of the data. That is, doubling the hidden layer width reduces the average KE_NRMSE by 14%, while increasing the temporal resolution of the data by a factor of 4 reduces the KE_NRMSE by 30%. These results indicate a potential brute force approach to overcoming the subsampling related spectral errors. However, the larger hidden layer dimension has to be constrained with enough training data, and requires more computational resources.
5.4 Impact of Training Data Set Size
In all of the preceding experiments, the length of training time was fixed to 15 years, meaning that there are fewer training samples when the data are subsampled, that is, as Nsub grows. Specifically, 15 years of data at an original model timestep of 5 min means that there are approximately 1.6 ⋅ 10⁶, 3.9 ⋅ 10⁵, and 9.72 ⋅ 10⁴ samples for each case previously shown: Nsub = 1, 4, and 16, respectively. Here, we show that even when the number of training samples is fixed, the subsampling related spectral errors are still present.
Figure 14 shows the prediction skill in terms of NRMSE and spectral errors when the number of training samples is fixed to 9.72 ⋅ 10⁴. With this number of samples, the training data are exactly the same for Nsub = 16, but only span 3.75 and 0.94 years for Nsub = 4 and Nsub = 1, respectively. However, we see the same general trend as before: subsampling the data improves NRMSE slightly but increases the KE_NRMSE. As before, the spectral error is largest in the higher wavenumbers, |K| > 2 ⋅ 10⁻³ rad km⁻¹. We note that the difference in performance between Nsub = 4 and Nsub = 16 is marginal. The only notable difference between these two cases is that the ESN is less consistent, that is, the KE_NRMSE distribution is broader, when Nsub = 16. However, it is clear that spectral error is lowest when the data are not subsampled at all, even though less than a year of data is used. This result indicates that there could be a benefit to training an RNN on a relatively shorter model trajectory that is untouched, rather than a longer data set that is subsampled in time.
6 Discussion
Weather and climate forecasting necessitates the integration of expensive numerical models to make accurate predictions and projections. The computational cost of these models often results in tradeoffs, where practitioners must balance the spatial resolution of their model with other factors, such as the number of integrated model components or the ensemble size that can be afforded in the system. Model emulation or surrogate modeling aims to enable such predictions by emulating the dynamical system with adequate accuracy at a much lower computational expense. In this study, our primary interest was to shed light on the spatial scales that can be resolved by single layer autoregressive and recurrent neural network emulators in order to better understand the effective resolution that could be achieved in weather and climate applications. We used two relatively simple, single layer autoregressive and recurrent neural network architectures, mainly because it has been shown that they can successfully emulate low dimensional chaotic dynamics over multiple Lyapunov timescales (Gauthier et al., 2021; Pathak et al., 2017; Platt et al., 2022; Vlachas et al., 2020). We implemented a multi-dimensional parallelization scheme based on the concept introduced by Pathak et al. (2018) and similar to that of Arcomano et al. (2020) in order to scale up these architectures and test them in high dimensional systems. We note that an in-depth discussion of our software implementation using the task-based scheduling system in Python, Dask (Dask Development Team, 2016), will be covered in a forthcoming paper.
6.1 Main Result and Connections to Previous Work
Our main result is that we observe an inherent spectral bias that occurs when training data are subsampled in time, such that as the temporal resolution is reduced, the resolution of small scale features in NVAR and ESN predictions is diminished. High wavenumber spectral bias is a phenomenon that has been studied in the context of training feed forward neural networks (see Xu et al., 2022, for a comprehensive review on the topic). The authors show that while numerical Partial Differential Equation (PDE) solvers typically resolve small spatial scales first and iteratively refine the larger spatial scales, spectral biases arise while training neural networks because the reverse happens: the large scales are uncovered first and small spatial scales are slowly refined.
Here, we showed a similar bias that arises in NVAR and ESN architectures in relation to their temporal resolution. Given the sensitivity to model time step, this phenomenon bears resemblance to the Courant-Friedrichs-Lewy (CFL) condition, which poses an upper bound on the time step size that can be used in the numerical solution of PDEs. The CFL condition is therefore a barrier to weather and climate model efficiency. However, sensitivity to the time step size manifests very differently in neural networks and numerical PDEs. While violating the CFL condition with too large of a time step leads to fundamental issues of numerical instability in numerical PDEs, here we see that increasing the time step adds a sort of numerical dissipation, which can actually stabilize an otherwise unstable model architecture (Section 4.1). Our results show that this occurs because the small scales are “lost” within the recurrent and autoregressive time stepping relations. Because of this, the models are trained to take on an interpolated or spatially averaged view of the intermediate dynamical behavior, which generates a blurred prediction.
We note that Bi et al. (2022) discuss a similar phenomenon relating to the timestepping of their autoregressive transformer model. Specifically, they devise a “Hierarchical Temporal Aggregation” scheme to make more stable and accurate forecasts (in terms of RMSE) over longer periods of time than would otherwise be possible using the original 1 hr cadence of the ERA5 data set. However, it is not clear how well small scale features are preserved with this approach. This is unclear first because they use a cost function that is purely based on RMSE. Second, the approach requires training multiple models at successively larger time intervals, and a forecast is made using the largest available interval first. For instance, with trained models operating on 1 and 6 hourly increments, a 7 hr forecast would be made by first taking a 6 hr step and then a 1 hr step. Our results indicate that this could be problematic, as the model making the 6 hr prediction would filter out small scale features that would otherwise be captured by the second model, operating on a 1 hr timestep.
Finally, Chattopadhyay and Hassanzadeh (2023) show the connection between high wavenumber spectral bias and instabilities in neural network predictions of turbulent flows. Their focus was on achieving long term stability in neural network time stepping for climate applications, while the focus in our work has been on short term forecasting for weather applications; capturing the long term climate statistics of turbulent geophysical fluid dynamics with an ESN or NVAR is left for future work. However, both works (a) draw some connection between high frequency spectral bias and the time stepping of the neural network, and (b) offer potential solutions by penalizing the solution's spectrum. In our work, we show that some of the spectral bias stems from the timestep size of the data used for training, while Chattopadhyay and Hassanzadeh (2023) devise a Runge-Kutta scheme to reduce the bias on subsampled data. Additionally, they use a spectral loss to train the internal weights of the network, along with the addition of a “corrector” network to make predictions of only the small scales. On the other hand, we use a spectral loss to guide the optimization of 5 “macro-scale” parameters, but the training of the network weights and operation of the network remain the same. Despite the differences in approach, the similarity of these two works indicates that the details of neural network time stepping schemes are crucial to their stability and accuracy in representing small scale processes. Additionally, it is clear that these small scale processes must be prioritized in some way, for instance through a loss function, and potentially additional “corrector” networks that propagate the small scales explicitly.
6.2 Implications for Training Data Sets in Weather and Climate
Our results have important implications for the rapidly developing field of neural network emulation for weather and climate forecasting because they show a potential limit to the effective resolution of an emulator relative to the original training data. If an emulator is used as a parameterization scheme for subgrid-scale dynamics, then a high wavenumber spectral bias will be detrimental to performance. Additionally, we surmise that such errors will reduce ensemble spread within data assimilation algorithms, which could limit their usefulness within a forecasting system (e.g., Kalnay et al., 2006). Our findings are pertinent to the field of neural network emulation development because of the widespread usage of reanalysis data sets for training. Currently, most existing neural network emulators in this field use the ERA5 reanalysis data set (Hersbach et al., 2020) for training (e.g., Arcomano et al., 2020; Bi et al., 2023; Keisler, 2022; Lam et al., 2022; Pathak et al., 2022; Weyn et al., 2021). Of course, reanalyses like ERA5 are an obvious choice for many reasons: the data sets are made freely available, they present a multi-decadal view of weather and climate, and, most importantly, they are constrained to observational data. However, we note that reanalysis products are imperfect for at least the following reasons: they contain jumps in the system state at the start of each DA cycle, they may contain inconsistencies reflective of changes in observational coverage, and they are only made available at large time intervals relative to the time step of the underlying integrated numerical model dynamics, due to the massive size of the data. Our study only addressed the latter of these issues, and showed that this simple space-saving step can have a negative impact on data-driven prediction methods. While we showed that adding spectral error as a weak constraint in the neural network training can reduce this time step related spectral bias, our results indicate that the underlying issue persists (Section 5.2). Moreover, as long as the data are not subsampled, we showed that ESNs perform only slightly worse when <1 year of data are used, compared to 15 years of training data (Section 5.4). This result suggests that it may be more effective to design an RNN-based emulator with a relatively short model trajectory that is not subsampled, rather than a long trajectory that is subsampled. In contrast to training the emulator on a reanalysis data set, a pure model-based emulator could then be used within a data assimilation system as demonstrated by Penny et al. (2022) in order to additionally benefit from observational constraints.
6.3 Implications and Future Work Relating to Model Architecture
Due to the fact that RNNs require long, sequential data streams in order to learn the governing dynamics, it could be the case that RNNs suffer most dramatically from temporal subsampling. This hypothesis could be one reason why the RNNs used by Agarwal et al. (2021) performed worse than other models on data that were subsampled every 10 days. Additionally, if RNNs are most dramatically affected by temporal subsampling, then they could be a suboptimal architecture choice for model emulators in cases where representing small scale dynamics is important but a coarse time step is required. This is especially true when designing a parameterization scheme for subgrid-scale dynamics, where the emulator should ideally run at the same time step as the “large-scale” model.
However, given that we can qualitatively observe some degree of spectral error in a wide variety of neural network architectures that use subsampled data for training (e.g., Bi et al., 2023; Keisler, 2022; Lam et al., 2022; Pathak et al., 2022), the issue could be more general to other neural network architectures. Moreover, the similarities between our work and Chattopadhyay and Hassanzadeh (2023) as well as the reasons behind the hierarchical time stepping scheme introduced by Bi et al. (2022) (both discussed in Section 6.1) imply that the time stepping related spectral bias is a general issue. Therefore, future work should be directed at understanding the degree to which temporal resolution affects architectures other than RNNs. Potential avenues could include exploring how attention mechanisms (Dosovitskiy et al., 2021; Vaswani et al., 2017) handle this phenomenon. Additionally, in light of our results indicating that wider networks can mitigate the spectral bias at least to some degree (Section 5.3), it would be instructive to understand how successively adding layers to a neural network affects the spectral bias. Finally, we note the work of Duncan et al. (2022) who show success in using adversarial training to mitigate the spectral bias observed in FourCastNet, and suggest that such techniques deserve additional study to understand their robustness.
Of course, our neural network implementations are imperfect, and here we list some future avenues to improve their predictive capabilities. Both of the architectures relied on a mean-squared error micro-scale cost function to learn the readout matrix weights, even in the ESN models where the spectral errors were penalized in the macro-scale cost function. However, even when the spectrum was penalized and the data were not subsampled, the ESNs maintained a high wavenumber bias that resulted in KE_NRMSE far greater than that of a persistent forecast. While additional testing shows that a periodic sine activation function can reduce the high frequency bias in KE_NRMSE, following work by Sitzmann et al. (2020), the underlying problem still remains (see additional analysis in the Supporting Information S1). Therefore, in order to further reduce the high frequency bias, it may be necessary to move the spectral penalties to the micro-scale cost function, that is, the cost function used to learn the readout matrix weights in the case of reservoir computing. The time stepping, spectral loss, and “small scale corrector network” employed by Chattopadhyay and Hassanzadeh (2023) would be appropriate starting points for such future work.
The NVAR architecture that we employed is incredibly simple. While we supposed that the local quadratic feature vector could learn quantities like derivatives and fluxes necessary to step the model forward in time, it is apparently not robust enough given the dramatic sensitivity to time step used. Future work could explore the possibility of using a larger library of analytic functions to improve the nonlinear expressions in the model, with the caution that this will lead to very high dimensional feature vectors. Such developments must sufficiently address the “Catch-22” described by Zhang and Cornelius (2022), who show that NVAR is inherently sensitive to the types of nonlinearity chosen. It is entirely possible, though, that an appropriate set of such basis functions exist for weather and climate emulation.
The ESN architecture that we employed is also relatively straightforward, and can undoubtedly be improved. In this work we took a somewhat brute force approach to emulate arbitrarily high dimensional systems by partitioning the system into subdomains and deploying parallel ESNs on each group. However, this process comes with overhead and can still lead to rather large networks on each group. The memory costs associated with these large networks coupled with any additional computational costs associated with timestepping, either by increasing the frequency or by using a more expensive method to represent small scale processes, will likely make the ESN implementation shown here too expensive to be considered for practical applications. Future work could explore dimension reduction techniques involving proper orthogonal decomposition (Jordanou et al., 2022), autoencoders (Heyder et al., 2022), or approaches involving self-organizing or scale invariant maps (Basterrech et al., 2011). Similarly, Whiteaker and Gerstoft (2022) show success in deriving a controllability matrix for the ESN, which leads to a reduced network size with minimal reduction in error. Finally, a number of studies claim to have developed ESN architectures that can capture dynamics occurring at many scales (Gallicchio et al., 2017, 2018; Ma et al., 2020; Malik et al., 2017; Moon et al., 2021), and these could be explored for geophysical turbulence emulation as well.
7 Conclusions
Recent advances in neural network based emulators of Earth's weather and climate indicate that forecasting centers could benefit greatly from incorporating neural networks into their future prediction systems. However, a common issue with these data-driven models is that they produce relatively blurry predictions, and misrepresent the small spatial scale features that can be resolved in traditional, physics-based forecasting models. Here, we showed that the simple space saving step of subsampling the training data used to generate recurrent neural network emulators accentuates this small scale error. While we show some success in mitigating the effects of this subsampling related, high wavenumber bias through an inner/outer loop optimization framework, the problem persists. Many neural network emulators use subsampled data sets for training, most prominently the ERA5 Reanalysis. While our work suggests that there could be a benefit to using a training data set based on a relatively shorter model trajectory that is not subsampled, rather than a longer trajectory that is subsampled, addressing the subsampling issue directly would provide more confidence in using already existing, freely available data sets like reanalyses. We therefore suggest that future work should focus on how other architectures and techniques like attention or adversarial training can address this subsampling related bias at the small spatial scales of turbulent geophysical fluid dynamics.
Acknowledgments
T.A. Smith and S.G. Penny acknowledge support from NOAA Grant NA20OAR4600277. S.G. Penny and J.A. Platt acknowledge support from the Office of Naval Research (ONR) Grants N00014-19-1-2522 and N00014-20-1-2580. T.A. Smith thanks Nora Loose for comments and discussion that improved the manuscript. The authors thank Jeffrey S. Whitaker for developing the open source SQG Turbulence model code which was the basis for our training, validation, and testing data set. The authors thank three anonymous reviewers, the associate editor, and editor Stephen Griffies for comments that improved the manuscript.
Appendix A: Matrix and Data Normalization for Echo State Networks
Here we describe several aspects of our ESN implementation that are unique with respect to previous works. Additionally, we provide some empirical justification for these choices, using the Lorenz96 model as a testbed (Lorenz, 1996); see Appendix A4 for a description of the data sets generated for these tests.
A1 Input Matrix Scaling
Additionally, we found empirical evidence to suggest that this normalization is advantageous even for small systems. Figure A1 shows the VPT achieved with the 20-Dimensional Lorenz96 system (Appendix A4), using a variety of normalization strategies for the input and adjacency matrices. In Figure A1, the two schemes used for the input matrix are (a) no normalization (indicated by cWin) and (b) normalization by the largest singular value (indicated by σmax(Win)). For a variety of reservoir sizes, Nr, we found that using the largest singular value often performed better, usually by about 0.5 MTU.
A2 Adjacency Matrix Scaling
In our experimentation, we have found a slight benefit from using the largest singular value to normalize the adjacency matrix. Figure A1 shows that, for fixed input matrix normalization, using the largest singular value rather than the spectral radius achieves similar, and up to ∼0.3 MTU longer, valid predictions. While the improvement may seem subtle, we note that using the largest singular value has the following practical benefit for our Python-based implementation: the singular values can be computed directly on a Graphical Processing Unit using CuPy (Okuta et al., 2017), while a general, non-symmetric eigenvalue decomposition is not readily available.
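A minimal dense-matrix sketch of this normalization is given below; in practice the adjacency matrix is sparse, and, as noted above, the singular values can be computed on the GPU with CuPy.

```python
import numpy as np

def normalize_by_sigma_max(matrix):
    """Rescale a matrix so that its largest singular value equals one."""
    sigma_max = np.linalg.svd(matrix, compute_uv=False)[0]
    return matrix / sigma_max

# A macro-scale scaling factor (e.g., the input or adjacency matrix scaling)
# then multiplies the normalized matrix inside the ESN update.
```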
A3 Data Normalization
Figure A2 compares the prediction skill when these two normalization strategies are used. Using the standard deviation normalization as in Equation A3 leads to an average VPT increase of 2 MTU. We suggest that this improvement is due to the fact that when the data are normalized by the full range, then all values are in the range [−1, 1]. In this case, once the input is mapped into the hidden space, it is more likely to lie on the linear regime of the tanh(⋅) activation function. While a large enough input scaling could eliminate this problem, it is apparently not easily obtained during the Bayesian optimization.
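The two normalization strategies compared in Figure A2 can be written as follows; whether the mean is removed and whether the statistics are computed globally or per variable are assumptions of this sketch.

```python
import numpy as np

def normalize_by_std(data):
    """Equation A3-style normalization: center and scale by the standard deviation."""
    return (data - data.mean()) / data.std()

def normalize_by_range(data):
    """Alternative strategy: scale by the full data range so values lie in [-1, 1]."""
    center = 0.5 * (data.max() + data.min())
    half_range = 0.5 * (data.max() - data.min())
    return (data - center) / half_range
```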
A4 Lorenz96 Data Sets
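Since the Lorenz96 system is standard, a minimal sketch for generating such a data set is shown below; the forcing F = 8, the RK4 scheme, the time step, and the spinup length are illustrative assumptions rather than the exact configuration used for Figures A1 and A2.

```python
import numpy as np

def lorenz96_tendency(x, forcing=8.0):
    """Standard Lorenz96 tendency: dx_i/dt = (x_{i+1} - x_{i-2}) * x_{i-1} - x_i + F."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + forcing

def generate_lorenz96(n_dim=20, n_steps=100_000, dt=0.01, forcing=8.0, n_spinup=5_000):
    """Integrate an n_dim Lorenz96 system with RK4 and discard a spinup period.
    Time is measured in model time units (MTU)."""
    rng = np.random.default_rng(0)
    x = forcing + 0.01 * rng.standard_normal(n_dim)   # small perturbation about F
    out = np.empty((n_steps + n_spinup, n_dim))
    for n in range(n_steps + n_spinup):
        k1 = lorenz96_tendency(x, forcing)
        k2 = lorenz96_tendency(x + 0.5 * dt * k1, forcing)
        k3 = lorenz96_tendency(x + 0.5 * dt * k2, forcing)
        k4 = lorenz96_tendency(x + dt * k3, forcing)
        x = x + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
        out[n] = x
    return out[n_spinup:]
```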
Appendix B: Gulf of Mexico Data Set and ESN Prediction
The Gulf of Mexico reanalysis data set used to generate the prediction in Figure 1 was provided by HYCOM (2016). The data consist of 3 hourly snapshots of 2D sea surface height and 3D temperature, salinity, and zonal and meridional velocities, covering 1993–2012 (inclusive). We used only the top level of temperature, and used the first 18 years as training, and the last 2 years as test data. Here we apply a parallelized ESN architecture as in Section 3.2, using No = 1 and Nr = 6,000 for each local group. Because we use only the top level of temperature, Nz = 1. The grid cells that represent continental land are ignored in the input and output vectors, and in the corresponding rows of Wout. Therefore, the effect of the boundary conditions on the neighboring grid cells is implicitly learned from the data.
Open Research
Data Availability Statement
The model configurations used to generate the results in this manuscript can be found at https://github.com/timothyas/rc-gfd, and specifically used the version at Smith (2023). The repository uses a modified version of the SQG Turbulence model code developed by Jeffrey S. Whitaker at https://github.com/jswhit/sqgturb.