Applications of Deep Learning to Ocean Data Inference and Subgrid Parameterization

Oceanographic observations are limited by sampling rates, while ocean models are limited by finite resolution and high viscosity and diffusion coefficients. Therefore, both data from observations and ocean models lack information at small and fast scales. Methods are needed to either extract information, extrapolate, or upscale existing oceanographic data sets, to account for or represent unresolved physical processes. Here we use machine learning to leverage observations and model data by predicting unresolved turbulent processes and subsurface flow fields. As a proof of concept, we train convolutional neural networks on degraded data from a high‐resolution quasi‐geostrophic ocean model. We demonstrate that convolutional neural networks successfully replicate the spatiotemporal variability of the subgrid eddy momentum forcing, are capable of generalizing to a range of dynamical behaviors, and can be forced to respect global momentum conservation. The training data of our convolutional neural networks can be subsampled to 10–20% of the original size without a significant decrease in accuracy. We also show that the subsurface flow field can be predicted using only information at the surface (e.g., using only satellite altimetry data). Our results indicate that data‐driven approaches can be exploited to predict both subgrid and large‐scale processes, while respecting physical principles, even when data are limited to a particular region or external forcing. Our in‐depth study presents evidence for the successful design of ocean eddy parameterizations for implementation in coarse‐resolution climate models.


Introduction
Satellite observations have produced a wealth of information on the ocean circulation (Abernathey & Marshall, 2013; Chelton et al., 2007; Greatbatch et al., 2010; Le Traon & Morrow, 2001; Morrow et al., 1994; Scott & Wang, 2005). However, raw satellite altimetry data subsample the ocean and do not measure subsurface quantities. Temporally, measurements at the same location are made twice every orbital cycle, while the spatial sampling depends upon the distance between ground tracks. To improve the sampling rates, measurements from multiple satellites are combined (Le Traon et al., 1998) to produce an optimal estimate. The process of combining measurements from multiple satellites includes spatiotemporal filtering, which leads to a smoother view of the dynamical processes at the ocean's surface, removing variability due to mesoscale and submesoscale eddies. The filtering can also introduce spurious physical signals, as shown by Arbic et al. (2013), who found that filtering data can lead to exaggerated forward cascades of energy. The new Surface Water and Ocean Topography mission will have a large swath of 120 km, providing unprecedented detail on the ocean's surface. Despite the high spatial sampling rate, measurements may still be limited by the temporal sampling rate of 11 days (Durand et al., 2010).

Similar to satellite observations, Ocean General Circulation Models (OGCMs) are useful for studying ocean dynamics. However, high-resolution models are computationally expensive, and the current resolution of models is not high enough to fully resolve the first baroclinic deformation radius at midlatitudes (Hallberg, 2013). Also, due to their finite resolution, they require large viscosity and diffusion coefficients in order to remain numerically stable (Jochum et al., 2008). The combination of finite resolution and artificially high viscosity diffuses momentum and smooths out features such as jets and mesoscale eddies (Hewitt et al., 2016; Kjellsson & Zanna, 2017).
Therefore, both observations and models are missing the interactions of oceanic turbulence at small scales, which play an important role in maintaining the large-scale circulation (Greatbatch, Zhai, Claus, et al., 2010; Greatbatch, Zhai, Kohlmann, & Czeschel, 2010; Kang & Curchitser, 2015; Waterman & Jayne, 2010; Waterman et al., 2011), with satellite observations only providing surface information. We thus consider the general problem: given some smoothed view of the ocean's surface, what information can be generated on small-scale turbulent interactions and subsurface quantities? Illuminating unresolved quantities using observed quantities would extend the reach of existing data sets and could potentially improve the representation of unresolved eddies in OGCMs.
We tackle this problem with machine learning. Machine learning has grown in popularity in recent years and has been applied to weather prediction (Esteves et al., 2018; McGovern et al., 2017), climate model parameter sensitivity studies (Anderson & Lucas, 2018), chaotic dynamical systems forecasting (Pathak, Hunt, et al., 2018; Pathak, Wikner, et al., 2018; Vlachas et al., 2018), and parameterizing unresolved atmospheric processes (Brenowitz & Bretherton, 2018; Gentine et al., 2018; Jiang et al., 2018; O'Gorman & Dwyer, 2018). The foundational principle of machine learning is extracting information from data. When used to improve our understanding of the Earth system, these data-driven methods are an empirical bottom-up approach, whereas the rationalist top-down approach considers physical principles and mechanisms. Here we take the empirical route by exploiting recent developments in machine learning.
Using empirical methods to leverage ocean observations is not new. For example, using satellite altimetry data, Keating et al. (2012) constructed a stochastic model to super-resolve the velocity field and predict the velocity at depth. Similarly, Keating and Smith (2015) used a stochastic model to produce a super-resolved sea surface temperature (SST) field, given a low-resolution observation of SST. With regard to machine learning, Chapman and Charantonis (2017) constructed a form of neural network known as a self-organizing map to reconstruct subsurface velocities in the Southern Ocean using satellite altimetry data and Argo floats. Other studies have used random forests to predict subsurface temperature anomalies (Su et al., 2018) and Southern Ocean oxygen content (Giglio et al., 2018).
In the previous studies that leverage oceanic observations, there is an abundance of coarse-resolution data (satellite altimetry) but limited data on the desired quantities (e.g., high-resolution SST or Argo subsurface velocities); the same is true of OGCMs, where high-resolution data are less readily available due to the computational cost. A similar challenge arises when data are only available for particular regions, such as mooring data (Hogg, 1992) or gliders (Davis et al., 2008; Rudnick et al., 2004). A machine learning algorithm trained on region-limited data would have to adapt to new regions with different physics; this task is well suited to deep neural networks, which are known for a strong ability to generalize (Goodfellow et al., 2016; Krizhevsky et al., 2012; LeCun et al., 2015). However, deep neural networks are typically considered a black box; that is, they lack simple interpretations. It is therefore difficult to assess whether such data-driven methods respect physical principles (e.g., conservation of energy or momentum). For example, neural networks have been used to develop Reynolds-averaged turbulence models (Kutz, 2017; Tracey et al., 2015), where the studies of Ling, Jones, and Templeton (2016) and Ling, Kurzawski, and Templeton (2016), in particular, show that a neural network can respect Galilean invariance by utilizing the invariant tensors of Pope (1975). These studies are important in moving toward data-driven approaches that respect the physical properties of the system.
In this paper we focus on a particular machine learning algorithm, namely, convolutional neural networks, in order to leverage observations and coarse-resolution model data. Our aim is to test whether they can be used to reveal information on unresolved turbulent processes and subsurface flow fields, and to determine if they are suited to situations where data are limited to a particular region. To move toward these aims, as a proof of concept, we will address the following questions:

1. Can convolutional neural networks represent the spatiotemporal variability of the subgrid eddy momentum forcing?
2. How sensitive are the neural networks to the physical processes occurring within each region, and how well do they generalize to ocean models in different configurations?
3. Is it possible to physically constrain neural networks to respect global momentum conservation?
4. Using only information at the surface, can neural networks predict the subsurface flow fields?

By using data from an idealized high-resolution ocean model, we show that convolutional neural networks can represent both the spatial and temporal variability of the eddy momentum forcing. The region the neural network is trained on, and therefore the dynamical processes occurring within that region, significantly impacts the performance of the neural network. In particular, training on the most turbulent region produces the best overall performing neural network. The neural networks successfully generalize to models with different viscosity coefficients and external wind forcings. Initially, momentum is not conserved globally, but the neural networks can be constrained to respect momentum conservation without a significant reduction in accuracy. A neural network can accurately predict the subsurface flow field when there is a strong barotropic component to the flow.
The paper is organized as follows. The quasi-geostrophic ocean model, the degrading of model data, and the convolutional neural networks are introduced in section 2. Performance diagnostics of the neural networks, in terms of nonlocal predictions and generalization to different model configurations, are presented in section 3. We explore methods of physically constraining the neural networks in section 4. Section 5 presents a neural network trained to predict subsurface flow fields using only information at the surface. We summarize and discuss our results in section 6.

Quasi-Geostrophic Ocean Model
We use the PEQUOD model, which solves the three-dimensional baroclinic quasi-geostrophic (QG) potential vorticity equation, with constant wind forcing on a beta plane (e.g., Berloff, 2005). The model has a bounded square domain with a flat bottom.
The configuration of this model leads to two large-scale circulation gyres separated latitudinally by a strong meandering zonal jet. The model is configured to represent an idealized version of current systems such as the Gulf Stream in the North Atlantic or the Kuroshio Extension in the North Pacific; both these current systems exhibit vigorous eddies interacting with a strong mean flow. The time mean stream function, which illustrates the double-gyre flow structure, can be seen in Figure 1a of Mana and Zanna (2014).
The potential vorticity q is given by

q = ∇²ψ + f + ∂/∂z [(f_0²/N²) ∂ψ/∂z],   (1)

where f = f_0 + βy is the planetary vorticity, f_0 is the Coriolis parameter, β = df/dy is the Rossby parameter, ∇ = (∂/∂x, ∂/∂y) is the horizontal gradient operator, N = [−(g/ρ)(∂ρ/∂z)]^(1/2) is the Brunt-Väisälä frequency, g is gravity, ρ is density, and ψ is the stream function for the nondivergent horizontal velocity u = (−∂ψ/∂y, ∂ψ/∂x).
The model has three layers (m = 1 upper, m = 2 middle, and m = 3 bottom), with thicknesses H_m of 250, 750, and 3,000 m, respectively. For each layer, the following prognostic equation is solved:

∂q_m/∂t + (u_m · ∇)q_m = D + F,   (2)

where D = ν∇⁴ψ − r∇²ψ δ_{m,3} is the dissipation and F = (∇ × τ)_z δ_{m,1}/(ρ_0 H_1) is the applied wind stress curl forcing, with δ_{i,j} the Kronecker delta function. The horizontal resolution of the model is 7.5 km, such that the model is eddy resolving. The first term in the dissipation is a fourth-order term equivalent to Laplacian viscosity, with viscosity coefficient ν. The second dissipation term parameterizes the presence of an Ekman layer with bottom drag coefficient r (and therefore acts only on the bottom m = 3 layer). The wind stress curl forcing applied to the upper m = 1 layer (equation (3)) is an asymmetric double-gyre profile tilted along the line y = g(x) = L/2 + 0.2(x − L/2), where L = 3,840 km is the domain length and ρ_0 is the reference density. After the model has been integrated from rest to a statistically steady state, we save 10 years of model output at daily resolution of the turbulent double-gyre circulation. For further details on the QG model, see Mana and Zanna (2014) and Zanna et al. (2017); for a list of the model parameters, see Table 1. We use the data generated by the ocean model to train various neural networks, but only after degrading the data to make it resemble observations or low-resolution model output.

Degrading High-Resolution Data
We degrade the fields from the high-resolution QG model using a spatial 2-D low-pass filter, in order to produce data that are similar to satellite altimetry or to a model with a large numerical dissipation. From the filtering of the model data, we can then calculate the forcing from unresolved small-scale turbulent processes.
At every time slice in the data, we take a high-resolution variable a at a particular layer and apply a two-dimensional spatial Gaussian filter. We denote filtered variables as ā, and subfilter variables as the deviation from the filtered variable, a′ = a − ā. The value of a function a(x, y) after the Gaussian low-pass filtering operation G ⋆ a, at a point (x_0, y_0), is given by

(G ⋆ a)(x_0, y_0) = 1/(2πσ²) ∬ a(x, y) exp(−[(x − x_0)² + (y − y_0)²]/(2σ²)) dx dy,   (4)

where σ = 30 km is the standard deviation of the Gaussian filter, which determines the length scale below which information is removed. The filter therefore removes information on dynamical processes at spatial scales smaller than 30 km.
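As an illustration, this degrading step can be sketched in Python (a minimal sketch assuming SciPy; the "reflect" boundary handling and the random stand-in field are our assumptions, not specified above):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

DX = 7.5e3      # model grid spacing (m)
SIGMA = 30.0e3  # standard deviation of the Gaussian filter (m), equation (4)

def degrade(field_2d):
    """Low-pass filter a 2-D snapshot, removing scales below ~30 km."""
    # sigma is converted from metres to grid points
    return gaussian_filter(field_2d, sigma=SIGMA / DX, mode="reflect")

snapshot = np.random.randn(512, 512)  # stand-in for a high-resolution field
a_bar = degrade(snapshot)             # filtered variable  a_bar
a_prime = snapshot - a_bar            # subfilter variable a' = a - a_bar
```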
Using the low-pass filter defined in equation (4), we can now express the effects of the unresolved (subfilter) variables on the resolved (filtered) variables. Ignoring vertical effects and planetary vorticity, the horizontal momentum equation is given by

∂u/∂t + (u · ∇)u = F + D,   (5)

where F and D are the momentum forcing and dissipation, respectively. Applying the low-pass filter to equation (5) gives

∂ū/∂t + G ⋆ [(u · ∇)u] = F̄ + D̄,   (6)

and then adding (ū · ∇)ū to both sides of the equation leads to

∂ū/∂t + (ū · ∇)ū = F̄ + D̄ + S,   (7)

where

S = (ū · ∇)ū − G ⋆ [(u · ∇)u]   (subfilter eddy momentum forcing).   (8)

The low-pass filtering operation results in an additional forcing term in equation (7) for the filtered momentum; the additional momentum forcing S, given by equation (8), is the divergence of a Reynolds stress. The vector S = (S_x, S_y) represents the effects of the subfilter momentum field on the filtered momentum field, that is, the interaction between small-scale eddies and the large-scale flow. As the subfilter eddy forcing S depends on the subfilter variables, it requires a physical parameterization or closure.
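The diagnostic computation of S from a pair of velocity snapshots can be sketched as follows (assuming NumPy/SciPy; the finite differencing and boundary handling are assumptions):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def advect(u, v, dx):
    """Apply (u . grad) to the velocity field itself, component-wise."""
    dudy, dudx = np.gradient(u, dx)  # np.gradient returns (d/dy, d/dx)
    dvdy, dvdx = np.gradient(v, dx)
    return u * dudx + v * dudy, u * dvdx + v * dvdy

def subfilter_forcing(u, v, dx=7.5e3, sigma_pts=30.0e3 / 7.5e3):
    """S = (u_bar . grad)u_bar - G*[(u . grad)u], as in equation (8)."""
    filt = lambda a: gaussian_filter(a, sigma_pts)
    u_bar, v_bar = filt(u), filt(v)
    adv_x_bar, adv_y_bar = advect(u_bar, v_bar, dx)  # resolved advection
    adv_x, adv_y = advect(u, v, dx)                  # full advection
    return adv_x_bar - filt(adv_x), adv_y_bar - filt(adv_y)  # (S_x, S_y)
```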

Predictive Algorithm: Convolutional Neural Networks
Convolutional Neural Networks (CNNs) have proven successful in many areas of computer vision (Dong et al., 2016; Krizhevsky et al., 2012; Simonyan & Zisserman, 2014), where the primary objective is to extract information from an image in order to perform a particular task. CNNs work by applying successive layers of convolutions (a form of spatial filtering) to the input; the complexity of the extracted information increases with the number of convolution layers. The powerful property of CNNs is that the filters of each convolution are learned as part of the training process; they are not specified a priori. Therefore, CNNs learn to extract the most useful information from the input variable, given training on a particular data set.
We chose to use CNNs, as opposed to a deep neural network of multiple fully connected layers, due to their superior performance in computer vision tasks where the inputs have a two-dimensional structure (Krizhevsky et al., 2012). We wanted a machine learning algorithm that could exploit the two-dimensional lateral structure of turbulent fluids. Spatial filtering of the equations of motion of turbulent fluids is not new and is used in large eddy simulation (Moeng, 1984; Sagaut, 2006). Therefore, the learned filtering operations of a CNN appeared to be a natural choice of data-driven algorithm to apply to geophysical flows.
The training process involves the minimization of an appropriately defined loss function, which measures the difference between the output of the CNN and the desired targets. If the optimization procedure is successful, such that the loss function on previously unseen data converges, the CNN will have learned to extract the most important information from the input. The CNN then uses this information to predict continuous values. The CNN constructs the final prediction through a linear regression layer, which regresses the desired output onto the final feature maps (feature maps are the intermediate results of each convolution layer).

Here we use CNNs to represent the subfilter eddy momentum forcing. The input is the filtered-stream function ψ̄ of the upper vertical layer, which represents the resolved variable that the neural networks will extract information from. The output variables are the zonal S_x and meridional S_y components of the subfilter momentum forcing S, defined by equation (8). An example input and output is shown in Figure 1. Separate CNNs are trained for each component of the subfilter momentum forcing, S_x and S_y. We only consider data from the upper layer of the model; this is because the flow is surface-intensified, and we are assuming that our filtered quantities are similar to satellite altimetry data, which only provide information at the surface.
In addition to testing whether it is possible to train a neural network to predict S_x and S_y from ψ̄, we explore how a neural network trained on one region performs on another previously unseen region, that is, how important local versus nonlocal information is for different regions. We therefore construct three different data sets from the QG model data, one for each region being studied. We choose regions which differ most in their dynamical behavior, shown in Figure 1a. Region 1 is near the jet separation point of the western boundary, where there is a strong, inertial zonal jet. Region 2 is near the eastern boundary downstream of the jet extension, where the dynamics are more wave-like in nature. Region 3 is in the center of the southern gyre, which is energetically less active than regions 1 and 2.
Data from the three regions are split temporally into training and validation data sets. The 10 years of daily data (3,650 days) are split into the first ∼9 years (3,300 days) to train the neural networks, while the final year (350 days) is set aside for validation. To reduce the computational cost and the number of parameters of each CNN, we split each region spatially from the initial 160 × 160 grid points into sixteen 40 × 40 grid point subregions, as depicted in Figure 1c. Reducing the input and output size of the neural network from 160 × 160 to 40 × 40 significantly decreases the number of trainable weights, and therefore the computational cost (we attempted to make predictions for the full 160 × 160 of each training region, but this led to a neural network with over 250,000,000 parameters, which was computationally impractical).
Making predictions for a 40 × 40 area instead of a 160 × 160 area also increases the amount of training and validation data by a factor of 16, from 3,300 and 350 samples to 52,800 and 5,600, respectively, where a sample is defined as a single input-output pair of the neural network. We therefore have 52,800 spatial maps (of size 40 × 40 grid points) of input-output pairs to train the neural networks, and 5,600 spatial maps of input-output pairs set aside for validation.
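The temporal and spatial splitting can be sketched as follows (a NumPy sketch; the random array stands in for the model output):

```python
import numpy as np

def to_subregions(region, n=40):
    """Split (time, 160, 160) region data into (16 * time, 40, 40) samples."""
    t, ny, nx = region.shape
    tiles = region.reshape(t, ny // n, n, nx // n, n)
    return tiles.transpose(0, 1, 3, 2, 4).reshape(-1, n, n)

data = np.random.randn(3650, 160, 160)  # stand-in for one training region
train, val = data[:3300], data[3300:]   # temporal split: ~9 years / 1 year
X_train = to_subregions(train)          # (52800, 40, 40)
X_val = to_subregions(val)              # (5600, 40, 40)
```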
We train CNNs to separately predict S_x and S_y, using data from three different regions of the model; this gives a total of six neural networks. Each neural network is denoted by N_i(ψ̄, w_R), where i = (x, y) refers to the component of S being predicted, w_R are the trained weights of the neural network, and R = 1, 2, 3 refers to the region on which the neural network has been trained. For example, the neural network trained on region 2 to predict the meridional component S_y is denoted by N_y(ψ̄, w_2).
To distinguish predictions from the true values, we label the neural network predictions as S̃_x = N_x(ψ̄, w_R) and S̃_y = N_y(ψ̄, w_R), while the true values of the subfilter momentum forcing remain S_x and S_y. We use the mean-square error as the loss function, which quantifies the difference between the neural network predictions and the truth:

L = (1/N) Σ_n (S_i − S̃_i)²,   (9)

where the summation is over all N samples. The neural networks are trained (i.e., optimized) using a form of stochastic gradient descent, namely, the Adam optimization algorithm (Kingma & Ba, 2014), which minimizes the loss function L defined in equation (9). The training of each neural network N_i(ψ̄, w_R) iteratively adjusts the values of the weights w_R such that the loss function in equation (9) is minimized. Therefore, each neural network has a different set of weights w_R; it is these weights which determine how each neural network extracts information and makes predictions.
The architecture used for each N_i(ψ̄, w_R) contains three convolution layers, a max pooling layer, and a final fully connected layer (Figure 1). The max pooling layer reduces the dimensionality of the previous layer by selecting the maximum value within a 2 × 2 grid point area; max pooling is effective when there is significant correlation between points in the feature maps. The specific architecture was constructed by adjusting all parameters and observing which configuration most effectively minimizes the loss function on the validation data. See Table 1 for more details of the architecture and training procedure. The total number of parameters of each neural network is 325,728.
We train and implement each neural network using Keras (Chollet, 2015) with the Tensorflow backend (Abadi et al., 2016). Before training, all data sets are separately normalized to zero mean and unit variance. Each CNN is trained for 200 epochs (1 epoch = 1 full pass of all the training data through the optimization algorithm), taking approximately 10 CPU hours, after which there is negligible change in the loss function of the validation data.
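A minimal Keras sketch of such a network is given below. The filter counts and kernel sizes are illustrative assumptions (the actual values are those of Table 1; this sketch does not reproduce the exact total of 325,728 parameters):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(n=40):
    """Three conv layers, one 2 x 2 max pooling layer, and a final dense
    (linear regression) layer mapping the feature maps to an n x n field."""
    model = models.Sequential([
        tf.keras.Input(shape=(n, n, 1)),  # filtered stream function tile
        layers.Conv2D(16, 8, padding="same", activation="relu"),
        layers.Conv2D(8, 4, padding="same", activation="relu"),
        layers.Conv2D(8, 4, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dense(n * n),              # regression onto the feature maps
        layers.Reshape((n, n)),
    ])
    model.compile(optimizer="adam", loss="mse")  # Adam + mean-square error
    return model

# Inputs and targets are normalized to zero mean and unit variance beforehand:
# model.fit(X_train[..., None], Y_train, epochs=200,
#           validation_data=(X_val[..., None], Y_val))
```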
Once all six neural networks are trained, we make the predictions S̃_x and S̃_y using the filtered-stream function ψ̄ from the validation data set, that is, the final year of withheld data. We make predictions for the full domain to determine how each neural network generalizes to unseen, dynamically distinct regions. As the input and output size of each neural network is 40 × 40 grid points, we tile together predictions for the full domain of size 512 × 512; the tiling leads to errors at the boundaries of each tile, where discontinuities can emerge. To reduce the tiling error, we make predictions using overlapping tiles and then average the results at each grid point.
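The overlapping-tile averaging can be sketched as follows (the stride of 20 grid points is an assumption; the text above states only that the tiles overlap):

```python
import numpy as np

def _starts(size, n, stride):
    """Tile start indices; the last tile is shifted to touch the boundary."""
    s = list(range(0, size - n + 1, stride))
    if s[-1] != size - n:
        s.append(size - n)
    return s

def tiled_prediction(predict, psi_bar, n=40, stride=20):
    """Average overlapping n x n predictions over the full domain.

    `predict` maps an (n, n) input tile to an (n, n) prediction.
    """
    ny, nx = psi_bar.shape
    total = np.zeros((ny, nx))
    count = np.zeros((ny, nx))
    for j in _starts(ny, n, stride):
        for i in _starts(nx, n, stride):
            total[j:j + n, i:i + n] += predict(psi_bar[j:j + n, i:i + n])
            count[j:j + n, i:i + n] += 1
    return total / count  # per-grid-point average reduces tile-edge jumps
```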
In order to make predictions of the subsurface flow field, using only information at the surface, we train a new neural network. The new neural network has an architecture identical to those discussed previously and is trained to predict the middle-layer stream function using the upper-layer stream function as the input; this neural network is described in more detail in section 5.

Nonlocal Predictions
The filtered-stream function represents, for example, observational measurements from satellite altimetry or coarse-resolution model data. The subfilter eddy momentum forcing represents unresolved turbulent processes. Our goal is to replicate the complex spatiotemporal variability of S_x and S_y using the neural networks N_i(ψ̄, w_R). However, observational data such as moorings (Hogg, 1992) or gliders (Davis et al., 2008; Rudnick et al., 2004) may only be available for a particular region; we therefore only train the neural networks using data from specific regions of the full domain, as described in section 2.3. Our aims are both to successfully train the neural networks and to study how they generalize to previously unseen regions.
We study the spatiotemporal variability of S_x and S̃_x by examining snapshots, the time mean, and the standard deviation, shown in Figure 2. Diagnostics are calculated over the full 512 × 512 domain, using the final year of withheld data. Both the spatial and temporal variability of the true S_x are dominated by the jet dynamics (Figures 2a, 2e, and 2i). In particular, strong meanders which extend eastward from the western boundary are visible. The amplitude of the spatiotemporal variability of S_x (1.4 × 10⁻⁶ m/s²) is of similar magnitude to the time mean (1.5 × 10⁻⁶ m/s²).
All neural networks trained on the three different regions, shown in Figure 1a and described in section 2.3, successfully reproduce the spatial patterns of the true S_x, as shown by snapshots of the predictions S̃_x (Figures 2b-2d). Their magnitudes, however, vary significantly. The predictions of N_x(ψ̄, w_1), trained on data from the western boundary, are almost identical to the true S_x and successfully reproduce the correct amplitude and variability (Figures 2b, 2f, and 2j). The neural network N_x(ψ̄, w_2), trained on data from the eastern boundary, underestimates the magnitude of the true S_x by approximately 50%, despite reproducing the correct spatial patterns. The predictions of N_x(ψ̄, w_3), trained on the southern gyre, underestimate the true S_x by an order of magnitude (Figures 2d, 2h, and 2l).
As the variability of S_x is dominated by the jet, it is difficult to assess the accuracy of the neural network predictions S̃_x in quiescent regions such as the eastern boundary or within the gyres. We therefore calculate the Pearson correlation, a dimensionless quantity, between the true S_x and the predictions S̃_x. The predictions of N_x(ψ̄, w_1) and N_x(ψ̄, w_2) are highly correlated with the truth (r > 0.9) within the jet but tend toward zero or negative correlation near the eastern boundary (Figures 2m and 2n). The predictions of N_x(ψ̄, w_3) have a more consistent positive correlation across the gyres and other more quiescent regions (Figure 2o).

Figure 2. The first column contains the diagnostics using the true zonal subfilter momentum forcing S_x, while columns two, three, and four use predictions S̃_x from the neural networks N_x(ψ̄, w_1), N_x(ψ̄, w_2), and N_x(ψ̄, w_3), respectively. All diagnostics were produced using the validation data.

We observe similar results for the spatial and temporal variability of S_y, shown in Figure 3: the variability within the jet dominates, with an amplitude (1 × 10⁻⁶ m/s²) similar to S_x. The meandering of the jet again produces complex spatial patterns in S_y, which, when averaged in time, produce a distinct sign change moving latitudinally across the jet. For the predictions S̃_y, the neural network trained on the western boundary, N_y(ψ̄, w_1), most effectively reproduces the true S_y. However, the time mean of N_y(ψ̄, w_1) (Figure 3f) has a positive bias everywhere in the domain, whereas the time means of N_y(ψ̄, w_2) and N_y(ψ̄, w_3) (Figures 3g and 3h, respectively) do not.
The correlations between S_y and S̃_y are similar to those of the zonal component: N_y(ψ̄, w_1) and N_y(ψ̄, w_2) are highly correlated (r > 0.8) within the jet but not in the gyres. In contrast, N_y(ψ̄, w_3) has a consistently positive correlation across the full domain, despite failing to reproduce the amplitude within the jet. In fact, the correlation of N_y(ψ̄, w_3) within the jet (Figure 3o) is negative (r ≈ −0.3). The negative correlation implies that the dynamical processes occurring within region 3, the southern gyre, have an opposite effect to the eddy momentum forcing occurring within region 1. The opposing effects of eddies could be an example of regional variation in eddy forcing, as in Waterman and Jayne (2010), who found that whether eddies were driving the large-scale flow depended critically on along-stream position.
Across all neural networks, the correlation decreases at the eastern boundary, which is partly caused by the subfilter momentum forcing being orders of magnitude lower than elsewhere in the domain. The low magnitude of S_x and S_y is due to the wave-like behavior of the flow having a larger spatial scale. The larger spatial scale at the eastern boundary leads to little variability at small scales, reducing the eddy momentum forcing to almost zero and therefore causing the performance of the neural networks to deteriorate.
Overall, we see that training neural networks on the western boundary is most successful when generalizing to other areas of the domain (in terms of correlations and reproducing the variability). Training on the eastern boundary produced good correlations in the western boundary but underestimated the magnitude of the eddy forcing by approximately 50%. Training on the southern gyre did not correlate well within the western boundary and underestimated the truth by an order of magnitude.
Hence, to successfully reproduce the correct amplitude and variability across the domain, the training data must contain a diverse range of scale interactions, which here corresponds to training on the most turbulent region. However, training on the turbulent regions can lead to significant net biases in the predictions, as seen in Figure 3f. How to correct for such biases will be discussed in section 4.

Generalizing to Different Reynolds Numbers
In section 3.1, we investigated how neural networks trained on different regions of the domain generalize to other previously unseen regions. We now test how the neural networks generalize to different regimes, in particular, different Reynolds numbers. In section 3.1, we found that the neural networks trained on region 1, the western boundary, successfully generalized to different regions; we therefore apply N_x(ψ̄, w_1) to new model data with different wind stress amplitudes and viscosity coefficients to test its performance. We use models with higher and lower wind forcings, to test regimes which are both more and less turbulent than the original model, which had a wind stress amplitude of τ_0 = 0.8 N/m² and viscosity ν = 75 m²/s.
We use the low-pass filter on the upper-layer stream function from each different model run, with the following configurations: ν = 200 m²/s, and τ_0 = 0.3, 0.6, and 0.9 N/m². We then apply the already trained neural network N_x(ψ̄, w_1) to generate predictions S̃_x. The standard deviation of the true S_x, the standard deviation of the N_x(ψ̄, w_1) predictions S̃_x, and the correlation between them are shown in Figure 4.
The neural network N_x(ψ̄, w_1) reproduces the variability within the jet almost exactly, across all runs, as can be seen by comparing the standard deviations in the first and second columns, which represent the standard deviation of the true S_x and predicted S̃_x, respectively. The correlation within the jet remains high (r > 0.9) in all runs, including the model with an increased wind forcing (τ_0 = 0.9 N/m²) in Figure 4o.
The correlations weaken at the eastern boundary for the lowest wind forcing (τ_0 = 0.3 N/m²), shown in Figure 4f; this may be caused by an increase in the wave-like behavior at the eastern boundary, which is not well captured by the neural networks. In general, the higher the Reynolds number, the better the correlations, that is, more dark red areas of r > 0.8.
The mean biases of the predictions of the new models are similar in magnitude to the biases of the original model configuration. These biases showed no relationship with the Reynolds number and are therefore not discussed further.

Figure 4. Examining the ability to generalize to new regimes: using the trained neural network N_x(ψ̄, w_1), we make predictions for model runs of different viscosities and wind forcings. From each model run, we use 1 year of the upper-layer filtered-stream function to generate predictions S̃_x from N_x(ψ̄, w_1) to see how they compare to the true S_x. We study a run of higher viscosity (a-c) ν = 200 m²/s, and runs with wind stress amplitude (d-f) τ_0 = 0.3, (g-i) 0.6, (j-l) 0.8, and (m-o) 0.9 N/m². Note that N_x(ψ̄, w_1) was trained on a run with ν = 75 m²/s and τ_0 = 0.8 N/m², the standard deviation and correlation maps of which are included again here in panels (j), (k), and (l).

Sensitivity of Neural Networks to Undersampling
We have so far trained the neural networks with densely sampled data; that is, we have data at each grid point for both the input and output variables. However, most observational data sets are spatially sparse, for example, Argo floats (Roemmich et al., 2009). We therefore explore the impact of undersampling with a new collection of neural networks trained on region 1 to predict S_x, but with the training data subsampled. At each time slice of the training data, we randomly sample (without replacement) N points of the 40 × 40 input variables ψ̄ and output variables S_x. Using these N randomly sampled values, we use a cubic interpolation to reconstruct the full 40 × 40 grid point input and output (with a nearest-neighbor interpolation for grid points that fall outside the convex hull of the cubic interpolation).
These reconstructed time slices from subsampled data are used to train a new set of neural networks. We vary the number of points N subsampled from >90% to <5% of the original 1,600 points of the input and output variables, with one neural network for each value of N, the subsampling rate. Using the neural networks trained on undersampled data, we calculate the root-mean-square error (RMSE) on the final year of validation data over the entire domain. The validation data are not subsampled, providing a stronger and more accurate test of the neural network's performance.
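The subsample-and-reconstruct step might look as follows (a sketch assuming SciPy's griddata; 300 points corresponds to the 18.75% sampling rate discussed below):

```python
import numpy as np
from scipy.interpolate import griddata

def subsample_and_reconstruct(field, n_points, rng):
    """Randomly keep n_points of a 40 x 40 field, then rebuild the full grid
    by cubic interpolation, falling back to nearest-neighbor interpolation
    outside the convex hull of the sampled points."""
    ny, nx = field.shape
    yy, xx = np.mgrid[0:ny, 0:nx]
    idx = rng.choice(ny * nx, size=n_points, replace=False)
    pts = np.column_stack([yy.ravel()[idx], xx.ravel()[idx]])
    vals = field.ravel()[idx]
    cubic = griddata(pts, vals, (yy, xx), method="cubic")
    nearest = griddata(pts, vals, (yy, xx), method="nearest")
    return np.where(np.isnan(cubic), nearest, cubic)

rng = np.random.default_rng(0)
rebuilt = subsample_and_reconstruct(np.random.randn(40, 40), 300, rng)  # 18.75%
```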
The RMSE is shown as a function of the percentage of points sampled (Figure 5c). We find that the RMSE increases significantly only when the percentage of spatial points sampled drops below 10% (the error doubles at a subsampling rate of 4.7%). Note that the RMSE is not a monotonic function of the percentage of points sampled, due to the stochastic nature of the training procedure and the use of a nonlinear interpolation. The spatial map of RMSE of the neural network trained with 18.75% subsampled data (Figure 5b) shows minimal changes relative to the neural network trained on the original (unaltered) training data (Figure 5a). The result further suggests that sparse interpolated observations can be used to accurately train the networks and predict the eddy momentum forcing, as shown in sections 3.1 and 3.2.
We also tested an alternative method of undersampling, where the 40 × 40 input and output grid of the neural network is spaced out over the entire domain. In other words, we subsample the input and output variables of the original 512 × 512 grid to a regularly spaced 40 × 40 grid. However, training a convolutional neural network with this methodology did not work and led to severe overfitting (i.e., increasing validation loss during training). The neural networks presented in section 2 learn to take first- and second-order derivatives of the input stream function (see GitHub repository), which correspond to the velocities and velocity shears. Both velocities and velocity shears are important features for accurate predictions of the eddy momentum forcing. By severely subsampling the input stream function, the local information relevant to estimating velocities and velocity shears is lost.
Therefore, the success of training convolutional neural networks on observational data sets will likely depend on the horizontal sampling rate of the product being considered. Our results suggest that high horizontal sampling resolution is necessary to capture small-scale gradients and avoid the sharp rise in error in Figure 5. We anticipate that (interpolated) data sets with a horizontal resolution of 1° or finer (e.g., Argo floats, altimetry) are needed to train the neural networks and predict the eddy momentum forcing.

Physically Constrained Neural Networks
We proceed to examine the net input of momentum from the neural network predictions S̃_x and S̃_y, which should vanish. If neural networks are used to leverage observational data sets and coarse-resolution models, then spurious sources of momentum would violate physical conservation laws. We therefore need to constrain the neural networks to respect the physical properties of the system. Here we diagnose the momentum biases of the neural networks N_i(ψ̄, w_R) and then explore different methods of imposing conservation of momentum globally.

Momentum Biases
Each subregion (including those used to train the neural networks) may have a nonzero spatially integrated momentum tendency. However, globally, the true subfilter momentum forcing S should redistribute momentum and not act as a source or sink, that is, ∬ S dx dy = 0. We therefore need the neural networks to not introduce spurious sources of momentum, in order to respect the physical properties of the system. By training each neural network on a subregion, we expect imperfect momentum conservation, which will depend upon the particular dynamical processes within each region. For example, if eddies within a particular region are driving the mean flow, then we would expect a positive source of momentum locally; a neural network trained on such a region would likely generalize the (local) input of momentum to the rest of the domain. A net source or sink of momentum will manifest as a nonzero bias after spatial averaging.
At a single point in space, the time series of the predictions S̃_x and S̃_y show that the neural networks trained on regions 1 and 2 track the true S_x and S_y closely (Figures 6a and 6b), reproducing a significant proportion (>80%) of the variance. However, if at each time step we spatially average the neural network predictions S̃_x and S̃_y over the full domain (Figures 6c and 6d, respectively), we observe significant nonzero biases.
Consider the zonal component of the eddy momentum forcing in Figure 6c: N_x(ψ̄, w_1) has a net positive bias, implying a global increase of zonal momentum at all times, while both N_x(ψ̄, w_2) and N_x(ψ̄, w_3) have negative biases, indicating a net decrease in zonal momentum. We can estimate the magnitude of the resulting change in zonal velocity from these net biases, over a period of a year, by assuming Δu = ⟨S̃_x⟩Δt, where ⟨·⟩ denotes the spatial average over the full domain. For N_x(ψ̄, w_1), N_x(ψ̄, w_2), and N_x(ψ̄, w_3), we obtain values of ⟨S̃_x⟩ = 0.03, 0.02, and 0.0008 (× 10⁻⁶ m/s²), respectively; this leads to zonal velocity changes of Δu = 0.95, 0.63, and 0.025 m/s. These changes are of similar magnitude to the time mean zonal flow, which peaks at approximately 0.9 m/s within the jet core.
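The arithmetic behind these estimates is simple to reproduce (the approximate number of seconds per year is the only assumption):

```python
SECONDS_PER_YEAR = 3.15e7  # approximate

# net biases <S~_x> of the three networks, in m/s^2
for bias in (0.03e-6, 0.02e-6, 0.0008e-6):
    print(f"du = {bias * SECONDS_PER_YEAR:.3f} m/s")  # 0.945, 0.630, 0.025
```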

There are also significant biases in the predictions of the meridional component S̃_y, shown in Figure 6d. The positive bias of N_y(ψ̄, w_1) is visible in the time mean S̃_y shown in Figure 3f. We can again estimate the change in meridional velocities by assuming Δv = ⟨S̃_y⟩Δt. Using values of ⟨S̃_y⟩ = 0.02, −0.01, and 0.002 (× 10⁻⁶ m/s²) for N_y(ψ̄, w_1), N_y(ψ̄, w_2), and N_y(ψ̄, w_3), respectively, leads to the following changes: Δv = 0.63, −0.31, and 0.06 m/s. Some of these changes are of the same magnitude as the time mean meridional flow.

Toward Momentum-Conserving Neural Networks
The predictions of the neural networks N_x(ψ̄, w_1) and N_y(ψ̄, w_1), described in section 3.1, reproduce the correct amplitude and variability of the true eddy momentum forcing S_x and S_y, as seen in Figures 2 and 3. However, training on region 1 also produced some of the largest nonzero biases in S̃_x and S̃_y after spatial averaging at each time step. We therefore test whether we can reduce the biases when training on region 1, while preserving the accuracy of the predictions from the neural network. We try three approaches (A, B, and C) to reduce the biases identified in Figures 6c and 6d:
(A) Architecture alteration: train neural networks on region 1, but with the final fully connected layer modified such that the spatial mean is removed from the final output.
(B) Training-data alteration: remove the spatial mean from each sample of the training data before training.
(C) Postprocessing: remove the spatial mean from each prediction after it is made.
The associated neural networks of each approach are labeled N_i(ψ̄, w_1^A), N_i(ψ̄, w_1^B), and N_i(ψ̄, w_1^C), respectively, where i = (x, y) denotes either the zonal S_x or meridional S_y component being predicted.
All neural networks are optimized using the same training parameters given in Table 1. Approach A, which alters the architecture, and approach B, which alters the training data, enforce momentum conservation not just globally but within the 40 × 40 subregion being predicted. This local conservation is useful for enforcing global conservation. However, local conservation may not be desirable if there is convergence of eddy momentum fluxes in a particular region, which can impact the large-scale flow; for example, if eddies are fluxing momentum into the jet at a particular along-stream position, enforcing local conservation in a neural network may lead to missing these effects. Therefore, caution must be taken when restricting architectures in this way.
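Approach A can be sketched as a final constraint layer in Keras (a sketch of the idea only; as described above, the paper modifies the final fully connected layer itself rather than appending a separate layer):

```python
import tensorflow as tf
from tensorflow.keras import layers

class RemoveSpatialMean(layers.Layer):
    """Subtract the spatial mean of each predicted tile, so every
    40 x 40 prediction integrates to zero (approach A)."""
    def call(self, s):
        return s - tf.reduce_mean(s, axis=[1, 2], keepdims=True)

# e.g. appended to a Keras model: model.add(RemoveSpatialMean())
#
# Approach C needs no retraining: subtract the spatial mean of the tiled
# full-domain prediction as a postprocessing step,
#   S_constrained = S_pred - S_pred.mean()
```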
We now explore the performance of the newly constrained neural networks and their net momentum input relative to that of the original neural networks trained on region 1, N_x(ψ̄, w_1) and N_y(ψ̄, w_1). The spatial averages of the neural networks based on approaches A, B, and C are shown in Figure 7, with the same axis scales as in Figure 6.
Approach B has significant biases of approximately −0.01 and −0.015 (× 10⁻⁶ m/s²) in the zonal and meridional components, respectively; the optimization procedure aims to reproduce the variability in the training data, not its spatial means, so preprocessing the training data does not remove the biases. Compared to the original neural networks trained on region 1, the biases of approaches A and C are 3 to 5 orders of magnitude lower, in both the zonal and meridional components. The bias of the postprocessing approach C is exactly zero by construction, while that of the altered-architecture approach A is not exactly zero, due to the overlapping tiling procedure. The biases of N_x(ψ̄, w_1^A) and N_y(ψ̄, w_1^A) are approximately −0.002 and −0.0005 (× 10⁻⁶ m/s²), which, over the course of a year, would lead to velocity changes of Δu = −0.06 and Δv = −0.01 m/s, respectively; these are now an order of magnitude smaller than the time mean flow.
The correlation maps of all momentum-conserving approaches (not shown) change little from the original correlation maps of N_x(ψ̄, w_1) and N_y(ψ̄, w_1), shown in Figures 2m and 3m, respectively. All approaches reproduce the correct spatial patterns of the true S_x and S_y (e.g., Figure 7 for standard deviations). However, approaches A and B underestimate the amplitude of S_x and S_y by approximately 20-30%, whereas there is little difference between approach C and the truth (<10%).
In summary, approach C, postprocessing, successfully enforces momentum conservation without sacrificing accuracy in the predictions of the eddy momentum forcing. Approach B, altering the training data, was not efficacious at reducing the net biases. The physically constrained architecture of approach A successfully reduced the net bias, but at the expense of 20-30% accuracy. However, further alteration of the architecture (e.g., increasing the number of convolution layers and filters) or of the training procedure (e.g., decreasing the learning rate with an increased number of training epochs) could reduce this drop in accuracy by countering the restriction placed on the architecture.

Predicting Subsurface Flow
We have shown that neural networks, using the filtered-stream function as the input variable, can provide information on unresolved turbulent processes, namely, the subfilter momentum forcing. We have assumed that the filtered-stream function represents some limited set of observations or data from a coarse-resolution ocean model. However, coarse-resolution ocean models still produce data below the surface, whereas satellite observations do not. Here we address the issue of inferring subsurface information solely from surface fields. Our approach is conceptually similar to Chapman and Charantonis (2017),

which used a form of neural network called a self-organizing map to reconstruct subsurface velocities in the Southern Ocean, using satellite altimetry and Argo float data. Using the QG model data described in section 2.1, we test whether a neural network can predict the middle-layer stream function, using only the surface-filtered-stream function.
We train a new neural network ψ̃_2 = N(ψ̄_1, W) (which has the same architecture as before, but with a different output and weights) to minimize the mean-square-error loss function L ∝ (ψ_2 − ψ̃_2)², where ψ̄_1 is the filtered-stream function of the upper layer, ψ_2 is the true stream function of the middle layer, and ψ̃_2 is the neural network prediction. Again, to assess the ability to generalize to unseen regions, we only train the neural network on the western boundary (training region 1). Diagnostics of the true ψ_2 and the predictions ψ̃_2, including the correlation between them, are shown in Figures 8a-8e. The neural network accurately reproduces the time mean and standard deviation of the middle-layer stream function within the jet region, capturing the correct amplitude of the true ψ_2 there, but underestimates the amplitude by ≈50% within the gyres. Independent of the amplitude, the predictions ψ̃_2 are highly correlated (r > 0.8) with the true ψ_2 almost everywhere in the domain.
The decrease in accuracy in the gyres is likely due to training only within the western boundary, where the stream functions of the upper and middle layers are more tightly coupled due to the strong barotropic nature of the flow. Within the gyres, the barotropic component is not as dominant; this could cause the neural networks to underestimate the amplitude away from the jet. Alternatively, the adjustment time scales of the upper and middle layers are not the same, which perhaps requires more training data in order to capture interactions over longer time scales.
We take the approach one step further by predicting the bottom-layer stream function, using the same neural network and its weights N(ψ̄_1, W), but now using the predictions of the middle-layer stream function as the input, that is, ψ̃_3 = N(ψ̃_2, W). We test whether a neural network trained to predict the middle-layer stream function can provide any information on the bottom-layer stream function (without retraining) by feeding it the middle-layer stream function as an input. Mathematically, this is written as ψ̃_3 = N(N(ψ̄_1, W), W).
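This stacked use of the network can be sketched as follows (reusing the illustrative build_cnn sketch given earlier; an untrained model and random tiles stand in for the trained network and data):

```python
import numpy as np

subsurface_net = build_cnn(40)             # stands in for the trained network
psi1_bar = np.random.randn(16, 40, 40, 1)  # stand-in surface tiles

psi2_pred = subsurface_net.predict(psi1_bar)              # surface -> middle
psi3_pred = subsurface_net.predict(psi2_pred[..., None])  # middle -> bottom,
                                                          # without retraining
```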
Diagnostics of the true (ψ_3) and predicted (ψ̃_3) bottom-layer stream functions are shown in Figures 8f-8j. Despite a moderate correlation of r ≈ 0.5 across the domain, the predictions fail to reproduce the correct time mean, which has a circulation in the opposite direction to the truth. This is due to the neural network being trained to predict the middle-layer flow, which on average is more aligned with the upper layer. Therefore, when the neural network is given the middle-layer stream function as an input, it predicts the bottom-layer flow as being, on average, in the same direction, which is not the case. The neural network also has not been trained to predict the effects of the additional bottom drag, decreasing the accuracy further; more data could improve this issue, as the longer time scales associated with bottom drag may be absent from the training data set.
An alternative approach would be to train a new neural network to map directly from the surface flow to the bottom-layer flow, that is, ψ̃_3 = N(ψ̄_1, W′), with newly trained weights W′. Having separate neural networks for the middle and bottom layers, one could then reconstruct the flow at all depths using just information at the surface (although an additional neural network does increase computational costs). Independent of the abyssal flow, however, we have shown that neural networks can provide information on the flow at intermediate depths.

Summary
In this study, we have demonstrated as a proof of concept that machine learning algorithms can provide information on unresolved turbulent processes when given a smoothed view of the dynamics (i.e., the filtered-stream function). We degrade data from a high-resolution eddy-resolving QG model using a spatial low-pass filter and train convolutional neural networks to predict the relationship between turbulent processes and their effect on the large-scale flow, that is, the eddy momentum forcing. Our results show that convolutional neural networks can successfully represent both the spatial and temporal variability of the eddy momentum forcing.
We determine how neural networks trained on one area of the domain perform in other previously unseen areas (Figures 2 and 3), representing cases where observational data are limited to particular regions, for example, mooring data (Hogg, 1992) or gliders (Davis et al., 2008; Rudnick et al., 2004). Training on a subregion tests the sensitivity of the neural network performance to the underlying physical processes. We find that the region on which the neural network is trained significantly impacts the accuracy, as well as the mean bias, which affects momentum conservation. In particular, training on the least energetically active region, the southern gyre, leads to the lowest accuracy; these neural networks could not reproduce the variability in more energetic regions, such as within the meandering jet. However, training on the western boundary leads to the best generalization, in terms of reproducing the correct amplitude of the eddy momentum forcing in the rest of the domain.
The variation in performance between regions implies that training on the most turbulent region leads to the best performing neural networks for eddy momentum forcing prediction. It is possible that data from the most turbulent regions exhibit the highest variance or contain the most diverse range of scale interactions. However, two regions may be as turbulent or energetically active as each other while the nature of the eddy-mean flow interactions within them differs. For example, Waterman and Jayne (2010) showed that in an idealized model the effect of eddies on the mean flow depended critically on along-stream position: upstream, eddies are generated by an unstable jet, while downstream the eddies drive the time mean circulation. Therefore, training neural networks on different along-stream positions may lead to different dynamical processes being learned, despite both regions being energetically active. Here we have shown how the performance varies between regions of differing energetic activity, but how the specific effects of eddies (for example, eddies driving the mean flow versus eddies extracting momentum and energy from the jet) impact the neural network performance remains to be determined.
Without further training, we show that a neural network trained on one QG model configuration generalizes exceedingly well to QG models with different viscosity coefficients and wind forcings (Figure 4). Within the jet, the neural network reproduces the correct spatiotemporal variability (<10% error) in all configurations, and the correlation between the predicted S̃_x and the true S_x within the gyres increases with the Reynolds number of the model configuration. While the neural networks do not conserve momentum globally (Figures 6c and 6d), we show that momentum conservation can be enforced without a significant reduction in accuracy (Figure 7), through either a physically constrained architecture or postprocessing of the predictions.
We also show that a new neural network can be trained to predict the middle-layer stream function, using only the upper-layer stream function as the input, that is, predicting the flow at depth using information at the surface (Figure 8). The highest accuracy occurs where the barotropic component of the flow is most dominant, which coincides with a strong zonal mean flow. However, when using the predicted middle-layer stream function to predict the bottom-layer stream function, the neural network captures some of the variability but fails to replicate the time mean of the true bottom-layer stream function ψ_3 (Figure 8), primarily due to the presence of bottom drag.

Implications for Leveraging Observations
Our work has implications for inference from sparse observations. While previous studies have used machine learning to leverage observational data sets (Chapman & Charantonis, 2017; Giglio et al., 2018; Su et al., 2018), the present work demonstrates that convolutional neural networks in particular are an excellent tool for such tasks. Neural networks should be further tested and exploited in the future for data inference due to

• their resilience, such that accurate predictions for the full domain can be generated by training on a subregion;
• their generalization to different external forcings, without any further training, such that predictions outside the regime trained on can be successful; and
• their ability to be successfully trained with undersampled data (Figure 5).
Collectively, these results suggest that sparse interpolated observational data sets can be leveraged by such data-driven techniques. For example, satellite altimetry data can be used to predict the subsurface flow, or data from moorings deployed in Drake Passage as part of the Diapycnal and Isopycnal Mixing Experiment in the Southern Ocean can be used to infer eddy momentum or heat flux divergences in other parts of the Southern Ocean. In addition, data sets from Argo floats (Chapman & Sallée, 2017), mooring data, ADCPs, and SSH from altimetry could be combined to reconstruct physically and biogeochemically important quantities such as energy reservoirs, air-sea fluxes, and the interior transport and/or storage of heat, carbon, and oxygen in the ocean (Giglio et al., 2018; Su et al., 2018).

Implications for Parameterizations
Although we have motivated our study through the leveraging of observations and coarse-resolution model data, our results have implications for eddy parameterizations of momentum and, more generally, for subgrid parameterizations. As discussed previously, machine learning has been used to parameterize unresolved processes in the atmosphere (Brenowitz & Bretherton, 2018; Gentine et al., 2018; Jiang et al., 2018; O'Gorman & Dwyer, 2018).
We have shown that neural networks can successfully represent the spatiotemporal variability of the eddy momentum forcing, implying potential for data-driven oceanic turbulence closures in the future, as suggested by Zanna et al. (2018).

Figure 8. Predicting the middle- and bottom-layer stream functions ψ_2 and ψ_3 using the upper-layer filtered-stream function ψ̄_1. We first train a new neural network to predict ψ_2 from ψ̄_1, that is, ψ̃_2 = N(ψ̄_1, W); diagnostics of the true ψ_2 and the predictions ψ̃_2 are shown in the top half of the figure (a-e). We then take the same neural network that was trained to predict ψ_2 from ψ̄_1 and predict the bottom-layer stream function ψ_3 using the predicted middle-layer stream function as the input, that is, ψ̃_3 = N(ψ̃_2, W); diagnostics of the true ψ_3 and the predictions ψ̃_3 are shown in the bottom half of the figure (f-j).
The generalization ability of the neural networks shows that only a limited amount of observations or high-resolution model data may be needed to successfully represent subgrid-scale processes. While the CNNs are successful at representing the relationship between the eddy momentum forcing and its effect on the resolved flow, low-resolution climate models might have biases that are too severe (e.g., weak transport and velocity shears) to lead to a successful representation of the eddy momentum forcing from CNNs as trained here. Nonetheless, our results show that the CNNs perform very well for different amplitudes of forcing and dissipation. Therefore, until the CNNs are implemented into a coarse-resolution ocean model, their success in improving numerical simulations remains speculative but deserves to be investigated.
Whether neural networks are being used to leverage observations or, more importantly, to construct a data-driven eddy parameterization, caution must be taken to ensure that the laws of physics are respected. More work on physically constrained machine learning algorithms is crucial, and successful applications of data-driven techniques should incorporate physical knowledge. Indeed, the neural network turbulence model of Ling, Kurzawski, and Templeton (2016) outperformed simpler linear models only when the Galilean invariant stress tensors of Pope (1975) were used, which are also a key ingredient of the eddy parameterization proposed by Anstey and Zanna (2017). As previously discussed, we successfully enforce global momentum conservation in the present work, such that future implementations of data-driven parameterizations, despite being semiempirical, can be altered to respect physical principles. Specifically, physical constraints can be incorporated into the architecture of the predictive algorithms.
One potential disadvantage of convolutional neural networks may be the computational cost of the matrix operations in each convolution layer when making a prediction from a given input. The total time complexity (ignoring any fully connected layers) of a CNN (He & Sun, 2015) is given by

$$O\left(\sum_{l=1}^{d} n_{l-1}\, s_l^2\, n_l\, m_l^2\right),$$

where $d$ is the total number of convolution layers, $l$ is the index of a convolution layer, $n_l$ is the number of filters in layer $l$ (so $n_{l-1}$ is its number of input channels), $s_l$ is the filter size, and $m_l$ is the size of the output feature map. This time complexity is larger than that of a traditional eddy closure (e.g., a simple Laplacian dissipation of momentum, which involves only a few matrix additions and subtractions). One way to reduce the time complexity is to use depth-wise separable convolution layers (e.g., Howard et al., 2017), which treat the input channels of a convolution layer more independently; this reduces the number of parameters and hence the computational cost. An alternative way of reducing the time complexity is simply to reduce the sizes of the inputs and outputs, that is, to make predictions for a region smaller than 40 × 40 grid points, although the amount of information available to make each prediction is then also reduced. The computational cost is an area which needs addressing if CNNs are to be routinely implemented in models in the future. However, unlike other parameterizations, the training of the neural networks is only done once.
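As a concrete illustration of the parameter (and hence cost) savings from depth-wise separable convolutions, the sketch below compares a standard convolution layer with its depthwise-plus-pointwise factorization in PyTorch; the channel counts and kernel size are hypothetical.

```python
import torch.nn as nn

n_in, n_out, k = 32, 32, 3  # illustrative channel counts and kernel size

# Standard convolution: n_in * n_out * k * k weights.
standard = nn.Conv2d(n_in, n_out, kernel_size=k, padding=1, bias=False)

# Depth-wise separable convolution (cf. Howard et al., 2017): one k x k
# filter per input channel (groups=n_in), followed by a 1x1 pointwise
# convolution that mixes the channels.
depthwise = nn.Conv2d(n_in, n_in, kernel_size=k, padding=1,
                      groups=n_in, bias=False)
pointwise = nn.Conv2d(n_in, n_out, kernel_size=1, bias=False)

count = lambda m: sum(p.numel() for p in m.parameters())
print("standard: ", count(standard))                      # 32*32*3*3 = 9216
print("separable:", count(depthwise) + count(pointwise))  # 288 + 1024 = 1312
```

For these (assumed) sizes, the factorization reduces the per-layer multiply count by roughly a factor of $k^2 n_{\mathrm{out}} / (k^2 + n_{\mathrm{out}}) \approx 7$.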

Future Work
Our study is a step toward using convolutional neural networks to extend the reach of currently available observational or model data. Our proof-of-concept study was conducted in an idealized QG model. The next stage involves training neural networks on actual observational data sets (as described in section 6.2) or on more realistic model data (e.g., a 1/40° global model that resolves the mesoscale and submesoscale eddy fields, such as in Rocha et al., 2016).
We used 9 years of data to train the neural networks and 1 year for validation. Gentine et al. (2018) showed, with regard to parameterizing convection with neural networks, that the training data set could be reduced from 12 to 3 months with little change in the overall mean-square error. The sensitivity of our neural networks to reductions in the amount of training data needs to be systematically explored. We have only determined the impact of spatial undersampling on the neural networks; further work is needed to determine the impact of using fewer time slices (e.g., using 3 years of training data as opposed to 9 years), as sketched below.
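The following sketch illustrates both undersampling experiments on synthetic arrays: randomly masking spatial points at each time slice (as studied in Figure 5) and thinning the record in time (the proposed follow-up). All shapes, and the data itself, are stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
n_t, ny, nx = 9 * 365, 40, 40               # 9 "years" of daily snapshots
S_x = rng.standard_normal((n_t, ny, nx))    # stand-in training targets

# Spatial undersampling: keep 18.75% of points per snapshot (cf. Figure 5).
mask = rng.random(S_x.shape) < 0.1875
S_x_sparse = np.where(mask, S_x, np.nan)    # unsampled points flagged as NaN

# Temporal undersampling: train on 3 "years" instead of 9.
S_x_short = S_x[: 3 * 365]
```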
Training on the western boundary produces the best performance. However, the high skill within the jet does not fully translate to high skill in all parts of the gyres. The best correlations in the gyres occur instead when training on the southern gyre, and not the western or eastern boundaries (Figures 2 and 3). This implies that there may be an optimal combination of the predictions of the neural networks trained on different regions, in order to produce the best overall generalization and potentially include nonlocal effects. For example, if each neural network is given a weight $a_i$, the optimal prediction for the full domain is a combination of all neural networks,

$$\tilde S_x^{\mathrm{OPT}} = \sum_i a_i\, \mathcal{N}_x(\bar\psi, \mathbf{w}_i),$$

where the summation is over all regions and $\tilde S_x^{\mathrm{OPT}}$ is the corresponding optimal prediction (with an analogous $\tilde S_y^{\mathrm{OPT}}$ for the meridional component). Combining predictions from multiple neural networks in this manner could be a useful way of capturing the distinct eddy-mean flow interactions observed by Waterman and Jayne (2010); a minimal sketch of such a weighted combination follows this paragraph. Alternatively, if the computational resources are available, one could train a single neural network on data from all three regions, in the hope that it learns the physical processes occurring in each region. The risk with this approach is a loss of specialization, with reduced skill as the single neural network simply averages the effects of the three regions together. We will attempt to implement the neural networks (as trained here, or as a combination of neural networks) into a coarse-resolution version of the QG model to test their performance as a subgrid-scale parameterization.
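A minimal sketch of the weighted combination, with the regional weights $a_i$ fit by least squares against held-out truth; the arrays and the fitting choice are illustrative assumptions rather than the procedure used here.

```python
import numpy as np

def optimal_weights(preds, truth):
    """Fit weights a_i minimizing ||sum_i a_i * preds[i] - truth||^2.

    preds: array of shape (n_regions, ny, nx); truth: array of shape (ny, nx).
    """
    A = preds.reshape(len(preds), -1).T        # (n_points, n_regions)
    a, *_ = np.linalg.lstsq(A, truth.ravel(), rcond=None)
    return a

rng = np.random.default_rng(1)
truth = rng.standard_normal((512, 512))        # stand-in for the true S_x
preds = np.stack([truth + 0.5 * rng.standard_normal(truth.shape)
                  for _ in range(3)])          # stand-ins for the 3 networks
a = optimal_weights(preds, truth)
S_x_opt = np.tensordot(a, preds, axes=1)       # S_x^OPT = sum_i a_i * preds_i
```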
Although this study is a proof of concept, the merging of data-driven methods with physical knowledge has the potential to change the way the physics of the ocean is studied, including the design of future parameterizations. The combination of physical theory and machine learning could prove more effective than either component in isolation.

Figure 1. Panel (a) illustrates the upper-layer filtered stream function $\bar\psi$ of the quasi-geostrophic model, including the three regions in which we train the neural networks: region 1 (white dashed) is on the western boundary, region 2 (black solid) is on the eastern boundary, and region 3 (gray dash-dotted) is centered on the southern gyre. Panel (b) shows a close-up of the filtered stream function $\bar\psi$ within training region 1, while panel (c) illustrates how training region 1 is split into sixteen 40 × 40 grid point subregions; the size of the input and output arrays of the neural network is 40 × 40 grid points. The input variable of each neural network is the filtered stream function $\bar\psi$, and the output variable is either the zonal component $\tilde S_x$ or the meridional component $\tilde S_y$ of the subfilter eddy momentum forcing. The architecture of the convolutional neural network, with an example input $\bar\psi$ and output $\tilde S_x$, is illustrated underneath panels (a)-(c).

Figure 2. Examining the nonlocal prediction ability. Comparisons of the true zonal component of the subfilter momentum forcing $S_x$ with the neural networks trained using data from three different regions. The first three rows compare (a-d) snapshots, (e-h) time means, and (i-l) the standard deviation, respectively, while the bottom row (m-o) shows the correlation between the true $S_x$ and the predictions $\tilde S_x$. The first column contains the diagnostics using the true zonal subfilter momentum forcing $S_x$, while columns two, three, and four use predictions $\tilde S_x$ from the neural networks $\mathcal{N}_x(\bar\psi, \mathbf{w}_1)$, $\mathcal{N}_x(\bar\psi, \mathbf{w}_2)$, and $\mathcal{N}_x(\bar\psi, \mathbf{w}_3)$, respectively. All diagnostics were produced using the validation data.

Figure 3. The same diagnostics as Figure 2, but for the meridional component of the subfilter momentum forcing: the true $S_y$ and the predictions $\tilde S_y$ from the neural networks $\mathcal{N}_y(\bar\psi, \mathbf{w}_1)$, $\mathcal{N}_y(\bar\psi, \mathbf{w}_2)$, and $\mathcal{N}_y(\bar\psi, \mathbf{w}_3)$.
Figure 4. Examining the ability to generalize to new regimes: using the trained neural network $\mathcal{N}_x(\bar\psi, \mathbf{w}_1)$, we make predictions for model runs with different viscosities and wind forcings. From each model run, we use 1 year of the upper-layer filtered stream function to generate predictions $\tilde S_x$ from $\mathcal{N}_x(\bar\psi, \mathbf{w}_1)$ and compare them to the true $S_x$. We study a run of higher viscosity, (a-c) $\nu$ = 200 m²/s, and runs with wind stress amplitude (d-f) $\tau_0$ = 0.3, (g-i) 0.6, (j-l) 0.8, and (m-o) 0.9 N/m². Note that $\mathcal{N}_x(\bar\psi, \mathbf{w}_1)$ was trained on a run with $\nu$ = 75 m²/s and $\tau_0$ = 0.8 N/m², the standard deviation and correlation maps of which are included again here in panels (j), (k), and (l).

Figure 5. Determining how undersampling of the training data impacts neural network error. Panel (a) shows the root-mean-square error (RMSE) of the neural network $\mathcal{N}_x(\bar\psi, \mathbf{w}_1)$ trained with dense (unaltered) training data, while panel (b) shows the RMSE of the neural network trained with subsampled (18.75%) data. Panel (c) shows the RMSE as a function of the percentage of spatial points sampled at each time slice of the training data. Note that the RMSE is calculated over the full domain during the validation period (the final year of data).

Figure 6. Panels (a) and (b) show time series of the zonal and meridional components of the subfilter momentum forcing, respectively, at a single point near the middle of the domain. Panels (c) and (d) also show time series of the zonal and meridional components of the subfilter momentum forcing, but this time spatially averaged over the entire domain.

Figure 7. The standard deviation and spatial-average time series of the predictions $\tilde S_x$ and $\tilde S_y$ for the momentum-conserving approaches A, B, and C. Panels (a), (b), and (c) show the standard deviation of $\tilde S_x$ from $\mathcal{N}_x(\bar\psi, \mathbf{w}_1^A)$, $\mathcal{N}_x(\bar\psi, \mathbf{w}_1^B)$, and $\mathcal{N}_x(\bar\psi, \mathbf{w}_1^C)$, respectively, while panels (e), (f), and (g) show the standard deviation of $\tilde S_y$ from $\mathcal{N}_y(\bar\psi, \mathbf{w}_1^A)$, $\mathcal{N}_y(\bar\psi, \mathbf{w}_1^B)$, and $\mathcal{N}_y(\bar\psi, \mathbf{w}_1^C)$, respectively. The spatial averages of these predictions $\tilde S_x$ and $\tilde S_y$ are shown in panels (d) and (h).

Table 1. Details on the Following: The Quasi-Geostrophic Ocean Model Parameters, the Data Sets Used to Train the Neural Networks, the Architecture Parameters, and the Optimization Parameters

To give the networks the ability to learn nonlinear functions, activation functions are added between layers. Here we use the scaled exponential linear unit (SELU; Klambauer et al., 2017). SELU activation functions scale the data toward zero mean and unit variance, removing the need for batch normalization; batch normalization enforces zero mean and unit variance at each stage of the network but requires additional training.

(A) Altered architecture: Train on region 1 with the architecture modified so that momentum conservation is intrinsically embedded; the neural networks are therefore trained to reproduce the subfilter momentum forcing using the same training data, but an altered architecture. The motivation behind this approach is that if the local source of momentum within the 40 × 40 output grid is zero, then this may reduce the global net source of momentum. (B) Preprocessing of input: Train on region 1 with the original architecture described in Table 1, but with the spatial mean removed from each 40 × 40 training snapshot of $S_x$; in other words, remove the local bias from each training snapshot. If the local source of momentum of each 40 × 40 sample is zero within the training data, then training on such data may reduce local biases when making future predictions on unseen data. However, this does not guarantee that subsequent predictions will have zero local bias; this approach tries to increase the probability of learning local momentum conservation. (C) Postprocessing of output: Train on region 1 and enforce global momentum conservation after the predictions have been made, that is, no changes to the training data or architecture, but with additional processing of the full-domain predictions $\tilde S_x$ and $\tilde S_y$; a minimal sketch of this postprocessing step is given below.
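A minimal sketch of approach (C), assuming the full-domain predictions are stacked as a (time, ny, nx) array: remove the domain mean from each predicted snapshot so the net momentum source vanishes exactly.

```python
import numpy as np

def enforce_global_conservation(S_pred):
    """Remove the spatial mean of each snapshot: zero net momentum source."""
    return S_pred - S_pred.mean(axis=(1, 2), keepdims=True)

S_x_pred = np.random.default_rng(2).standard_normal((10, 512, 512))
S_x_cons = enforce_global_conservation(S_x_pred)
assert np.allclose(S_x_cons.mean(axis=(1, 2)), 0.0)
```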