Spatially Extended Tests of a Neural Network Parametrization Trained by Coarse-graining

General circulation models (GCMs) typically have a grid size of 25--200 km. Parametrizations are used to represent diabatic processes such as radiative transfer and cloud microphysics and account for sub-grid-scale motions and variability. Unlike traditional approaches, neural networks (NNs) can readily exploit recent observational datasets and global cloud-system resolving model (CRM) simulations to learn subgrid variability. This article describes an NN parametrization trained by coarse-graining a near-global CRM simulation with a 4~km horizontal grid spacing. The NN predicts the residual heating and moistening averaged over (160 km)^2 grid boxes as a function of the coarse-resolution fields within the same atmospheric column. This NN is coupled to the dynamical core of a GCM with the same 160 km resolution. A recent study described how to train such an NN to be numerically stable when coupled to specified time-evolving advective forcings in a single column model, but feedbacks between NN and GCM components cause spatially-extended simulations to crash within a few days. Analyzing the linearized response of such an NN reveals that it learns to exploit a strong synchrony between precipitation and the atmospheric state above 10 km. Removing these variables from the NN's inputs stabilizes the coupled simulations, which predict the future state more accurately than a coarse-resolution simulation without any parametrizations of sub-grid-scale variability, although the mean state slowly drifts.


Introduction
Current global climate and weather models cannot explicitly resolve many important physical processes because of computational expense. Global weather and short-range climate forecast models have grid sizes of 10 to 50 km (ECMWF, 2018; NOAA, 2018), while current climate models typically have 25- to 100-km grids. Unresolved processes, including cumulus convection, turbulence, and subgrid-cloud variability, are approximated by subgrid-scale parametrizations (Palmer, 2001). These are usually designed by human physical intuition, informed by high-resolution simulations and observations. Cumulus convection in the tropics is one of the most dynamically important processes in the atmosphere, yet it is extremely difficult to parameterize because of the multiscale nature of moist flows (Majda, 2007; Nakazawa, 1988). There is a long-standing debate about whether convective parametrizations should be based on moisture convergence or quasi-equilibrium closures (Arakawa, 2004). Moreover, important mean state biases in climate modeling such as the double Intertropical Convergence Zone bias are sensitive to the convective parametrization (Woelfle et al., 2018; Zhang & Wang, 2006). Climate models also struggle to simulate observed aspects of tropical variability such as the diurnal cycle of continental precipitation (Stratton & Stirling, 2012) and the Madden-Julian Oscillation (Jiang, 2017; Jiang et al., 2015). Thus, an improved convective parametrization could help improve weather and climate simulations.
With sufficient computational resources, global cloud-system resolving models (GCRMs) with sub-5-km grid spacing can be run in global domains without deep convective parametrizations (Bretherton & Khairoutdinov, 2015; Satoh et al., 2008). Recent studies have used such simulations to study the moist static energy budget (Bretherton & Khairoutdinov, 2015) and cloud feedbacks (Narenpitak et al., 2017). Centennial-scale climate GCRM simulations are not yet feasible, so improving parametrizations in coarse-resolution models remains an important goal.
It remains challenging for human experts to translate the vast volume of information in new high-resolution data sets and models into better parametrizations of moist physical processes incorporating subgrid variability. On the other hand, parametrizations can also be built using powerful regression tools developed in the field of machine learning (ML; Brenowitz & Bretherton, 2018).

Review of ML Parametrization
One attractive use of ML is to automatically tune existing GCM parametrizations, which build in physical insights and constraints that may help them apply across a range of climates. Proposed techniques include ensemble Kalman inversion (Schneider et al., 2017), adjoint-based tuning (Lyu et al., 2018), and genetic algorithms (Langenbrunner & Neelin, 2017). These techniques can tune a few free parameters but may not scale to larger numbers of parameters. Moreover, existing parametrizations may not be flexible enough to be realistic in part because they have so few parameters. For instance, adjusting the entrainment rates in a convection scheme can improve mean state bias but harm the variability (Kim et al., 2011; Mapes & Neale, 2011).
A more ambitious approach replaces an existing GCM parametrization with an ML model, which has enough parameters to capture arbitrarily complex relationships (Cybenko, 1989; Hanin, 2017) present within a training data set. ML models are typically trained by minimizing a loss function, such as the mean-square error (MSE) compared to some reference outputs from the training data; the choice of the loss function is subjective and a key to good performance. Early studies (Chevallier et al., 1998; Krasnopolsky et al., 2005) trained NNs (Goodfellow et al., 2016) to emulate the outputs of radiative transfer codes in order to decrease computational expense. Subsequent studies also trained NNs to emulate existing convective parametrizations (O'Gorman & Dwyer, 2018) and superparametrizations (SPs; Rasp et al., 2018). In each of these instances, the ML model is trained with a nearly ideal data set, where the inputs are the state of the atmosphere, and the output is a tendency that has actually driven a GCM.
Training a parametrization by coarse-graining a GCRM is more difficult because the data are not cleanly divided into coarse-grid dynamics and parametrized physics. On the other hand, GCRMs are the most realistic models available because they do not impose an arbitrary scale separation as SP does. In this setting, how does one define the target tendency? In the first study of its kind, Krasnopolsky et al. (2010, 2013) defined the output as the apparent heating and moistening (Yanai et al., 1973) of limited-area simulations. They diagnosed heating rates and cloud fractions with high accuracy but did not present any prognostic simulations. Brenowitz and Bretherton (2018) extended this work in two ways. First, their training data set was a near-global CRM with rich multiscale convective organization. Second, they found that single-column model (SCM) simulations with NNs trained in this way diverged to infinity after just a few time steps. They avoided this divergence by minimizing the loss accumulated over several predicted time steps.
In summary, when coarse-graining a GCRM simulation, there is no clear target to emulate. While Brenowitz and Bretherton (2018) developed an approach for ensuring stability in a SCM with prescribed dynamics, they did not test it within a full three-dimensional GCM where the dynamics interact with the NN parametrization.

Training Data
We use the same training data set as Brenowitz and Bretherton (2018): a near-global aquaplanet simulation (NG-Aqua) using the System for Atmospheric Modeling (SAM) version 6.10 (Khairoutdinov & Randall, 2003). SAM is run in a cloud-system resolving configuration with a horizontal grid spacing of 4 km in a tropical channel domain measuring 20,480 km zonally by 10,240 km meridionally. The simulation has 34 vertical levels with a spacing that increases from 75 m at the surface to 1,200 m in the troposphere and stratosphere. The atmospheric circulation is driven by a zonally symmetric sea surface temperature (SST) profile which peaks at 300.15 K at the equator (y = 5,120 km) and decreases to 278.15 K at the poleward boundaries. Because SAM was originally designed for small-scale modeling, the grid is Cartesian and no spherical effects are included, except for a meridionally varying Coriolis parameter. Despite these simplifications, NG-Aqua features realistic multiscale organization of tropical and midlatitude systems.
The simulation uses the radiation scheme from version 3 of the Community Atmosphere Model with a zonally symmetric diurnal cycle and a bulk microphysics scheme (Khairoutdinov & Randall, 2003). It was initialized with small random perturbations to the temperature, spun up on a 20-km grid, then interpolated to 4 km and run for 100 days, storing the full 3-D outputs every 3 hr. To allow for the circulation to statistically equilibrate at 4 km, all analyses described below are performed on the final 80 days. More details about the model configuration are described by Brenowitz and Bretherton (2018) and Narenpitak et al. (2017).

Coarse-Resolution Model Configuration
We test the NN schemes within an atmospheric model with a grid resolution of 160 km, which is representative of coarse-resolution GCMs. This scale is coarse enough that the coarse-grid moist-convective dynamics have sufficient memory to be resolved by the 3-hr sampling rate of the NG-Aqua data; the grid mean precipitation has an autocorrelation of 0.54 at a lag of 3 hr.
We use SAM rather than a more traditional GCM as our coarse-grid model to ensure that it has the same geometry and dry dynamics as the training data. This model, coarse-SAM (cSAM), is run with the same vertical grid as the NG-Aqua simulation. Its default microphysics scheme has the same three prognostic thermodynamic variables as SAM: the total nonprecipitating water mixing ratio q_T; the precipitating water mixing ratio q_p, which is a fast variable that we will neglect in the parametrizations below; and the liquid-ice static energy s_L. See Khairoutdinov and Randall (2003, Appendix A) for more details.
SAM is typically used for cloud-resolving simulations, so running it efficiently and stably at coarse resolution required several modifications. Most importantly, horizontal hyper-diffusion −K_4∇^4 of s_L, q_T, and the three wind components u, v, and w was added to suppress grid-scale oscillations and model blow-up (divergence of the solutions to machine infinity in a finite amount of time). With a value of K_4 = 1 × 10^15 m^4/s, simulations have more near-grid-scale power in kinetic energy than NG-Aqua, and maps of humidity and other tracers appear grainy compared to NG-Aqua. Increasing K_4 to 1 × 10^16 m^4/s damps these grid-scale oscillations, so we hereafter use that value.
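The scale selectivity of the ∇^4 operator explains why this works: a minimal numerical sketch (our illustration, not SAM code; the wavelengths are chosen for the 160-km grid and 10,240-km domain of this study) shows that grid-scale waves are damped orders of magnitude faster than domain-scale ones.

```python
# Sketch: for a sinusoid of wavenumber k, hyper-diffusion -K4*del^4 damps the
# amplitude at the e-folding rate K4 * k**4, so damping is sharply scale-selective.
import numpy as np

def hyperdiffusion_rate(wavelength_m, K4):
    """e-folding damping rate (1/s) of -K4*del^4 acting on a sinusoid."""
    k = 2 * np.pi / wavelength_m
    return K4 * k ** 4

K4 = 1e16                                        # m^4/s, the value adopted above
two_dx = hyperdiffusion_rate(2 * 160e3, K4)      # shortest resolvable wave
domain = hyperdiffusion_rate(10_240e3, K4)       # domain-scale wave

# 2*dx noise decays in roughly ten minutes; the domain scale is barely touched.
assert 300 < 1 / two_dx < 1200                   # damping time in seconds
assert np.isclose(two_dx / domain, 32 ** 4)      # rate ratio is (L1/L2)^4
```

Because the rate grows as the fourth power of wavenumber, a single K_4 can remove 2Δx noise within minutes while leaving planetary scales essentially untouched.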
For NN simulations, we use a simplified momentum damping rather than SAM's default turbulence closure, which severely limits the time step on the 160-km grid. This forcing damps the velocity toward 0 at a rate which decreases linearly from k_f = 1 day^−1 at the surface to 0 at a σ-level of 0.7 (Held & Suarez, 1994). Because this differs from NG-Aqua, it leads to an inevitable drift of cSAM simulations away from NG-Aqua over the course of a few days. In the future, we hope to use learned coarse-grained momentum tendencies from NG-Aqua in place of this approach.
We compare simulations with cSAM coupled to our NN parametrization against a base version of cSAM with the microphysics and radiation tendencies calculated from grid-scale mean thermodynamics. Unlike the NN simulations, the surface fluxes are computed interactively and coupled to SAM's default turbulence scheme. The base cSAM will only precipitate when a coarse-resolution grid cell achieves saturation, so we expect it will produce noisy simulations. Ideally, we would have preferred to compare the NN parametrizations against a traditional GCM parametrization suite, as did Rasp et al. (2018) and Brenowitz and Bretherton (2018), but no such scheme has been implemented in SAM. In this article, we assess the accuracy of 10-day weather simulations. They are initialized with the coarse-grained outputs from a particular time, Day 100.625, of the NG-Aqua simulation. Our goal is to obtain cSAM simulations that best match the actual evolution of NG-Aqua over the following 10 days. All of the simulations presented in this paper use a 120-s time step. The thermodynamic variables and vertical velocity from the 4-km NG-Aqua data are averaged over 40 × 40 grid-cell blocks, but the horizontal velocities must be averaged along the interfaces of these blocks to respect the anelastic mass conservation equation.
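The block-averaging step for the cell-centered variables can be sketched as follows (a minimal numpy illustration on a small synthetic subdomain; the interface averaging for velocities is omitted):

```python
# A minimal sketch of the coarse-graining step: fields on the 4-km grid are
# averaged over non-overlapping 40 x 40 blocks to produce (160 km)^2 cells.
# (As noted above, velocities must instead be averaged along block interfaces
# to respect anelastic mass conservation; that step is omitted here.)
import numpy as np

def block_average(field, block=40):
    """Average a 2-D field over non-overlapping block x block cells."""
    ny, nx = field.shape
    assert ny % block == 0 and nx % block == 0
    return field.reshape(ny // block, block, nx // block, block).mean(axis=(1, 3))

fine = np.random.rand(80, 160)      # a small subdomain of the 4-km grid
coarse = block_average(fine)        # 2 x 4 coarse cells at 160 km
assert coarse.shape == (2, 4)
assert np.isclose(coarse.mean(), fine.mean())  # block-averaging preserves the mean
```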

Coarse-Graining Problem
An NN will parameterize the unknown sources of the thermodynamic variables q_T and s_L. Their coarse-resolution budgets are given by

∂s̄_L/∂t = g_s + Q_1,  (1)

∂q̄_T/∂t = g_q + Q_2,  (2)

where X̄ is the horizontal average of any variable X over the coarse-grid boxes. The first terms on the right-hand sides of these budgets, g_s and g_q, are the GCM's tendencies, which include advection and any other explicitly treated process. The apparent heating Q_1 and moistening Q_2 are defined as budget residuals from these GCM tendencies (Yanai et al., 1973). Because they are residuals, they contain the effects of all unresolved and untreated physics, including latent heating, turbulent mixing, and radiation. This parametrization neglects precipitating water q_p, which evolves on a much faster time scale than q_T and s_L.
The goal of any parametrization is to approximate Q_1 and Q_2 as functions of the coarse-resolution variables alone. However, Palmer (2001) argues that stochastic and nonlocal effects must be included for certain problems. Because radiative heating and moist physical processes couple the atmosphere more strongly in the vertical than in the horizontal direction, we follow the assumption of horizontal locality typically made in moist physics parametrizations; this reduces the dimensionality of the training problem, allowing robust results to be obtained from our NG-Aqua training data set. For simplicity, we also treat Q_1 and Q_2 as deterministic functions of the inputs.
The inputs are the vertical profiles of the prognostic variables q_T and s_L, as well as the SST and the downwelling insolation (SOLIN) at the top of the atmosphere, which are external parameters. Unlike Brenowitz and Bretherton (2018), we do not use the sensible or latent heat fluxes because they depend on the surface winds, which tend to be inaccurate because they are forced by the simplified damping scheme described in section 3 rather than a coarse-grained source from NG-Aqua.
Let the vector x_i^o concatenate the observed prognostic variables q_T and s_L over all coarse-grained grid levels in grid column i, and let y_i^o contain the auxiliary variables SST and SOLIN at this location. With these assumptions, (1) and (2) can be combined into

dx_i^o/dt = g_i(t) + f(x_i^o, y_i^o) + ε_i,  (3)

where the subscript i is the index of the given horizontal grid cell and the superscript o indicates that these are observed quantities. The function f and its parameters θ are invariant for all locations. Here, g is the GCM tendency from coarse-grid advection (which involves other grid columns and hence is nonlocal); f is the portion of the apparent sources Q_1 and Q_2 that can be modeled deterministically, which the NN will be used to parameterize; and ε_i includes stochastic and structural error. As in Brenowitz and Bretherton (2018), f will be parameterized as an NN whose parameters θ include the weights and biases of the network's layers.
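The residual diagnosis behind this decomposition can be illustrated with a toy one-variable budget (synthetic data, our notation): the storage term is estimated by finite differences from the 3-hourly snapshots and the known resolved tendency is subtracted, leaving the apparent source.

```python
# Sketch of diagnosing the apparent source as a budget residual: the storage
# term is estimated by centered finite differences from 3-hourly snapshots and
# the known "GCM" tendency g is subtracted (synthetic linear data here).
import numpy as np

dt = 3 * 3600.0                          # 3-hr sampling interval (s)
t = np.arange(48) * dt
g = 2e-5                                 # resolved (advective) tendency
Q = 1e-5                                 # unresolved source to be recovered
x = (g + Q) * t                          # column state evolving under both

storage = (x[2:] - x[:-2]) / (2 * dt)    # centered-difference dx/dt
Q_residual = storage - g                 # the budget residual
assert np.allclose(Q_residual, Q)
```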

Loss Functions for Stable Parametrizations
A key challenge for ML parameterizations is that the predicted state should not diverge to machine infinity after multiple steps. Traditionally, there are two ways this type of instability can arise. First, numerical instability arises from the spatial and/or temporal discretization of an otherwise stable dynamical system. Second, dynamical instability arises from the dynamical system itself.
Since the parameterization f represents a time-continuous process but is trained and evaluated with discrete data, this instability is hard to classify into these categories. With that acknowledgment, we simply define a "stable" scheme as one which does not produce diverging solutions in time. Brenowitz and Bretherton (2018) trained a stable NN for use in SCM simulations by minimizing the multiple-step prediction error (MSPE) over a 1- to 3-day prediction window T. SCM dynamics decouple the dynamics of an atmospheric column from its surroundings and are described mathematically by

dx_i^SCM/dt = g_i(t) + f(x_i^SCM, y_i^o(t)),  (4)

where x_i^SCM(t; t_0, θ) is the single-column time series for a given location i, starting prediction time t_0, and set of parameters θ. Compared to (3), the global state x^o and the auxiliary variables y^o are prescribed functions of time, which are decoupled from the local dynamics. Brenowitz and Bretherton (2018) define the MSPE as

MSPE = Σ_{i,t_0} ∫_{t_0}^{t_0+T} ‖x_i^SCM(t; t_0, θ) − x_i^o(t)‖² dt,  (5)

where the sum ranges over all possible prediction start times t_0 and spatial locations i. The inner norm is given by

‖x‖² = (1/M) ∫ ρ_0 (w_s s_L² + w_q q_T²) dz,  (6)

where ρ_0 is SAM's reference density profile. M = ∫ ρ_0 dz is the mass of an atmospheric column, while w_s and w_q are weight constants for the temperature and humidity, respectively. In this study, we find that when setting w_q = 1.0 kg^2/g^2 and w_s = 1.0 K^−2, each variable contributes comparably to the overall loss. This resulted in a highly accurate single-column time-stepping scheme which closely replicated the q_T and s_L time series of the NG-Aqua data.
Unfortunately, MSPE training produces NN schemes which implicitly depend on the temporal discretization and time step used to simulate (4). The main symptom of this is a "spin-up" error where the SCM only produces accurate tendencies after a few time steps of prediction (see supporting information Figure S1): the spun-up predictions (panels c and d) more closely resemble the truth (panel a) than the instantaneous prediction (panel b) does. Because this spin-up process occurs after 3 hr, it did not cause large errors in the single-column simulations of Brenowitz and Bretherton (2018). However, for spatially extended simulations, the coupling between the GCM dynamics and the NN occurs at every 120-s time step, so the NN must produce accurate predictions from the start. Thus, MSPE-trained NNs are not suitable for use within cSAM because they spin up on a time scale longer than the cSAM time step.
Our goal in this article is to develop a loss function which produces an NN that gives spatially extended coarse-grid simulations that are stable, unbiased, and insensitive to the model time step. To encourage the NN to both predict Q_1 and Q_2 accurately and remain stable in time, this loss includes two terms:

J_total = J_instant + J_stability.  (7)

The first term, J_instant, is the mass-weighted MSE of the predicted Q_1 and Q_2, which is given by

J_instant = Σ_{i,t} ‖f(x_i^o(t), y_i^o(t)) − (dx_i^o/dt − g_i(t))‖².

In practice, the storage terms dx_i^o/dt are estimated from the 3-hourly sampled data using finite differences. Thus, minimizing J_instant demands that f approximate the apparent heating and moistening.
On the other hand, stability requires that small perturbations to the inputs x do not grow under the dynamics given by (4). In practice, we enforce this using a heuristic penalty

J_stability = Σ_i ∫_{t_0}^{t_0+T} ‖x_i^mean(t, t_0) − ⟨x_i^o⟩‖² dt,

where ⟨·⟩ is the time average operator and x_i^mean(t, t_0) is the single-column simulation under the action of the time-mean forcing, whose dynamics are given by

dx_i^mean/dt = ⟨g_i⟩ + f(x_i^mean, ⟨y_i^o⟩).

Its initial condition is the time-mean state ⟨x_i^o⟩. This penalty demands that the SCM simulation does not diverge too quickly for a prediction of length T (2.5 days). Because J_stability uses constant forcing, it is less sensitive than the MSPE to the temporal discretization of (4). In practice, NNs trained to minimize J_total avoid spin-up errors and accurately predict Q_1 and Q_2 at the initial time. Like MSPE-trained NNs, they do produce stable, albeit less accurate, SCM simulations. Since our main goal is to produce accurate coupled GCM-NN simulations, this deteriorated SCM performance is acceptable.
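The interplay of the two terms can be illustrated with a toy single-column model (the functional forms and shapes below are schematic, our choices rather than the paper's code): the instantaneous term scores the tendency fit, while the stability term penalizes drift of a constant-forcing rollout away from the mean state.

```python
# Toy illustration of the two-term loss: j_instant scores the instantaneous
# tendency fit, while j_stability penalizes drift of a single-column rollout
# driven by the time-mean forcing (2.5 days of 600-s forward-Euler steps).
import numpy as np

def j_instant(f, x_obs, dxdt_obs, g_obs):
    """Mean-square error of the predicted apparent source vs. the residual."""
    return np.mean((f(x_obs) - (dxdt_obs - g_obs)) ** 2)

def j_stability(f, x_mean, g_mean, dt=600.0, n_steps=360):
    """Mean-square departure of a constant-forcing rollout from the mean state."""
    x = x_mean.copy()
    penalty = 0.0
    for _ in range(n_steps):
        x = x + dt * (g_mean + f(x))      # forward-Euler single-column step
        penalty += np.mean((x - x_mean) ** 2)
    return penalty / n_steps

f_stable = lambda x: -0.5 * x / 86400.0       # damps perturbations
f_unstable = lambda x: 0.5 * x / 86400.0      # amplifies them
x_mean = np.ones(34)
g_mean = -f_stable(x_mean)                    # forcing balancing the stable f

assert j_stability(f_stable, x_mean, g_mean) == 0.0
assert j_stability(f_unstable, x_mean, g_mean) > 1.0
```

A scheme that damps departures from the mean state incurs no stability penalty, while an amplifying scheme is penalized even if its instantaneous fit is good.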

Coupled Instability
The methods discussed above still do not prevent model blow-up when the NN is coupled to cSAM, as we will show in section 5. This occurs because the single-column training strategy does not account for feedbacks between the cSAM and NN components, so the corresponding instabilities are never penalized.
A similar feedback causes grid-scale storms when a GCM is coupled to a moisture-convergence closure (Arakawa, 2004). Emanuel et al. (1994) argue this closure fundamentally confuses cause and effect and that precipitation drives moisture convergence rather than vice versa. In fact, such misleading correlations are common in nonlinear dynamical systems such as the atmosphere, a phenomenon known as synchronization (Pecora et al., 1997), so data-driven models must be taught to ignore inputs that are synchronized to the desired outputs.

Interpreting the Trained Networks Using Saliency Maps
What input variables do our NNs depend on most sensitively? Saliency maps are a common tool for interpreting the dependency structure of an NN. They measure the sensitivity of the output variables to small perturbations in the inputs (Simonyan et al., 2013). Mathematically, they are given by the Jacobian matrix ∂f/∂x, where the derivative is with respect to the phase space variables x. For ease of discussion, we defer the model architecture and training details to section 4.4. We first train an NN using the stability-penalized loss in (7) and then compute its saliency map for the mean equatorial sounding (Figure 1) using automatic differentiation (Paszke et al., 2017).
Our NN predicts that heating and moistening at all heights are very sensitive to the humidity near and above the tropopause (200 mbar). A 1-g/kg increase of humidity at this level leads to more than 10 g·kg^−1·day^−1 of drying at all lower levels. While this result is potentially consistent with formation of high clouds, it is not desirable that the NN depends so strongly on these upper atmospheric inputs, especially because it does not accurately predict Q_1 and Q_2 at these levels (cf. Figure 4). Thus, the NN depends most sensitively on the inputs that it performs worst on. To demonstrate the robustness of this sensitivity to changes in the base-state profile, Figure S2 shows ∇_{q_T} Q_2 for 20 profiles randomly drawn from between 3°S and 3°N. While the saliency maps do vary from sample to sample, the sensitivity to the upper atmospheric humidity is robust.
The saliency maps are analogous to the so-called linearized response functions (LRFs) of Kuang (2018), which were computed by perturbing the steady forcing of a CRM. The LRFs have an upward dependence structure (a mainly upper-triangular structure in the panels of Figure 2 of Kuang, 2018), where the heating and, to a lesser extent, the moistening at a given vertical level depend only on data from below. This is physically realistic since cumulus convection initiates at lower elevations and deepens into cumulonimbus; evaporation of precipitation leads to some lower-triangular structure as well. Unfortunately, we cannot compare our saliency maps directly to the LRFs of Kuang (2018) for several reasons: they use different thermodynamic variables (e.g., temperature and humidity); they ignore moisture inputs above 150 mbar; and they disaggregate radiation from other heat sources. Nonetheless, some loose comparisons are possible. For instance, both our saliency maps and the LRFs of Kuang (2018) show that decreasing the lower tropospheric temperature, which destabilizes the atmosphere, leads to heating/drying throughout the column (Figure 1d). Likewise, lower tropospheric moisture induces convective heating above (Figure 1c). Future studies will explore the comparison with LRFs more completely.
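The saliency map itself is just the Jacobian of the network outputs with respect to its column inputs. The paper computes it with PyTorch automatic differentiation; the same object can be sketched generically by finite differences for any vectorized function (a minimal illustration with a linear toy map, whose Jacobian is known exactly):

```python
# Saliency map as the Jacobian J[m, n] = d f_m / d x_n at a base profile x,
# approximated here by one-sided finite differences (autodiff in the paper).
import numpy as np

def saliency_map(f, x, eps=1e-6):
    """Finite-difference Jacobian of a vector-valued function f at x."""
    y0 = f(x)
    J = np.empty((y0.size, x.size))
    for n in range(x.size):
        xp = x.copy()
        xp[n] += eps                     # perturb one input level at a time
        J[:, n] = (f(xp) - y0) / eps
    return J

A = np.array([[2.0, -1.0], [0.0, 3.0]])
J = saliency_map(lambda x: A @ x, np.array([1.0, 1.0]))
assert np.allclose(J, A)                 # for a linear map, the Jacobian is A
```

For a trained network, each column of J shows how heating and moistening at every level respond to a perturbation at one input level, which is exactly what Figures 1 and 3 visualize.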
The NN is likely sensitive to upper-level humidity because the latter is synchronized to the convective processes below. This occurs because upper-level total water is (1) strongly damped and (2) forced by the supply of moisture from advection. Figure 2a shows the effective damping rate

λ(z) = −⟨Q_2 q_T⟩ / ⟨q_T²⟩,

where ⟨·⟩ is the zonal and time averaging operator. Near the tropopause, the moisture has a fast damping time scale of 3 hr, which is the sampling rate of the NG-Aqua data. A conceptual mathematical model for this behavior is given by

∂q_T/∂t = −λ q_T + F_adv,

where F_adv is the supply of moisture by advection, which we treat as an external forcing. When the right-hand side of this equation dominates over the left, which is the case for the upper atmospheric humidity, the total water is approximately given by q_T ≈ λ^−1 F_adv. In other words, the total water in the upper atmosphere becomes synchronized to the supply of moisture from below, as shown in Figure 2b. Kuang (2018) justified ignoring the humidity variables above 150 mbar on these grounds. Likewise, this synchronization could harm our ML procedure by polluting the inputs with information about the desired outputs.
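This synchronization can be demonstrated with the conceptual model above, written as dq_T/dt = −λ q_T + F_adv (our notation): integrating it with a 3-hr damping time and a slowly varying forcing, the humidity locks onto the quasi-steady balance F_adv/λ, so the "input" carries the forcing's signal almost perfectly.

```python
# Toy version of the synchronization argument: integrate dq/dt = -lam*q + F(t).
# When the damping time 1/lam (3 hr, as near the tropopause) is short compared
# to the forcing's time scale, q tracks F/lam, i.e., q is synchronized to F.
import numpy as np

dt, n = 60.0, 5000                         # 1-min Euler steps, ~3.5 days
t = np.arange(n) * dt
F = 1e-3 * (1.0 + 0.2 * np.sin(2 * np.pi * t / 864000.0))  # slow (10-day) forcing
lam = 1.0 / (3 * 3600.0)                   # 3-hr damping time scale

q = np.zeros(n)
for k in range(n - 1):
    q[k + 1] = q[k] + dt * (-lam * q[k] + F[k])

# After spin-up, q follows the quasi-steady balance F/lam to within a few percent.
tail = slice(n // 2, None)
assert np.allclose(q[tail], F[tail] / lam, rtol=0.1)
```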
If used as a closure assumption, the synchronization of the upper atmospheric total water to the convective processes below could cause a dynamical instability as follows. A positive moisture anomaly in the tropical upper troposphere will increase the predicted precipitation and heating. This will then increase the upward velocity, which closes the feedback loop by increasing the supply of moisture to the upper atmosphere. Similarly to moisture-convergence closures, this feedback can lead to grid-scale storms.
This instability might be less problematic had we been able to customize the 4-km NG-Aqua simulations for the training process so as to avoid this synchronization issue, for example, by using higher output time resolution and calculation of instantaneous heating and moistening rates. However, we were forced to work around it, as described in the next section.

Stabilizing the NN by Ignoring Upper-Atmospheric Inputs
We regularize the NN by not using the humidity and temperature above certain heights as predictors. This is motivated by (1) the synchronization argument above (Figure 2) and (2) the NN's poor skill at predicting the upper atmospheric Q_1 and Q_2. In any event, we obtain a stable simulation without grid-scale storms by training the NN without including q_T and s_L at or above levels 15 (450 mbar) and 19 (225 mbar), respectively, as inputs. However, the NN still predicts the apparent heating and moistening at these levels. We have run cSAM simulations with this configuration for up to 100 days without model blow-up or grid point storms.
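In code, this regularization amounts to truncating the profiles before they are stacked into the input vector. The sketch below assumes zero-based level indices and a simple concatenation ordering (both our assumptions, for illustration only):

```python
# Sketch of the input truncation: humidity and temperature above the cutoff
# levels are simply dropped before the column is fed to the network, while
# the network still predicts Q1 and Q2 at every level.
import numpy as np

N_LEV, QT_TOP, SL_TOP = 34, 15, 19     # keep q_T levels 0..14, s_L levels 0..18

def make_inputs(q_T, s_L, sst, solin):
    """Concatenate the truncated profiles with the scalar external inputs."""
    return np.concatenate([q_T[:QT_TOP], s_L[:SL_TOP], [sst, solin]])

x = make_inputs(np.zeros(N_LEV), np.zeros(N_LEV), 300.15, 400.0)
assert x.size == QT_TOP + SL_TOP + 2   # 36 inputs instead of the full 70
```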
Admittedly, this is a crude approach that future iterations of this work should improve. Figure 3 shows the saliency map of the modified NN. This modification did not greatly alter the sensitivity of the NN to inputs below the ignored levels. Overall, the modified NN is slightly more sensitive to tropospheric s_L and q_T, but the structure is similar to Figure 1. That said, there are some strong oscillations in the response of Q_1 and Q_2 to q_T near 500 mbar. Thus, removing the upper atmospheric inputs did not harm the physically plausible characteristics of the original model.

Network Architecture and Training
NNs are nonlinear regression models which are built by sequentially passing an input vector through alternating linear transformations and nonlinear function applications. The action of each layer on an input vector x is given by σ(Ax + b), where the matrix A and vector b are free parameters which will be tuned through the training process. The nonlinear "activation function" σ(·) is applied element-wise to the output of the linear transformation. For problems with continuous output spaces such as parameterization, the final layer has the identity as its activation function. We refer the interested reader to Goodfellow et al. (2016) for more details. Rasp et al. (2018) found that an NN needed at least nine hidden layers to be stable when coupled to a dynamical core. Unfortunately, we could not replicate this result: increasing the depth of our NNs had no impact on stability. However, since we are training with global data, we use a slightly deeper network than the single-layer model Brenowitz and Bretherton (2018) used for the tropics alone. Each model trained in this paper has three densely connected hidden layers with 256 nodes each and rectified linear unit activations. Altogether, this architecture has 160,266 free parameters.
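The forward pass of such a network can be sketched in a few lines of numpy (a schematic, not the trained model; the input and output dimensions of 70 and 68 are our assumptions about the stacked column vector, so the parameter count of this sketch need not match the figure quoted above):

```python
# Minimal sketch of the architecture: three dense hidden layers of 256 ReLU
# units and a linear (identity-activation) output layer. Weights are random
# placeholders; the row-vector convention x @ A is used instead of A x.
import numpy as np

rng = np.random.default_rng(0)
sizes = [70, 256, 256, 256, 68]          # assumed input/output dimensions
layers = [(rng.standard_normal((m, n)) * 0.01, np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(x):
    """Alternate affine maps with element-wise ReLU; linear last layer."""
    for i, (A, b) in enumerate(layers):
        x = x @ A + b
        if i < len(layers) - 1:          # identity activation on the output layer
            x = np.maximum(x, 0.0)
    return x

y = mlp(rng.standard_normal(70))
assert y.shape == (68,)                  # Q1 and Q2 at every vertical level
```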
Prior to input, each input variable is normalized by subtracting the global mean and dividing by the standard deviation plus a small number (10^−7) for each level independently. The small number ensures that the denominator is nonzero. This level-by-level normalization is a departure from past studies (Brenowitz & Bretherton, 2018; Rasp et al., 2018) and avoids prejudicing the training procedure against input variables with small variance. That said, the choice of normalization has little impact upon our results. In particular, NNs trained with inputs normalized as in Brenowitz and Bretherton (2018) also learn to depend strongly on the upper atmospheric humidity and suffer from the coupled instabilities discussed above.
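The normalization step is simple enough to state exactly (a sketch with our array layout of samples by levels):

```python
# Level-by-level normalization: each vertical level (column of a
# (samples, levels) array) is centered by its global mean and scaled by its
# standard deviation plus a small constant, so constant levels stay finite.
import numpy as np

def normalize(x, eps=1e-7):
    """Normalize each level independently; eps keeps the denominator nonzero."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)

x = np.column_stack([np.random.randn(1000) * 5.0 + 2.0,  # high-variance level
                     np.full(1000, 3.0)])                # zero-variance level
z = normalize(x)
assert np.allclose(z.mean(axis=0), 0.0)   # centered level by level
assert np.all(np.isfinite(z))             # eps guards the constant level
```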
The network's weights and biases are optimized by stochastic gradient descent (SGD) with the popular Adam optimizer (Kingma & Ba, 2014) and an initial learning rate of 0.005. If a sample is defined as an individual atmospheric column for a given horizontal location and time step, then there are 5,242,880 samples in the training data set. These samples are randomly shuffled in horizontal space and formed into batches of 64 time series, each containing the full 80 days of data. Five epochs (i.e., passes through the entire training data set) of training are performed. In practice, the SGD converges quickly, but the time and zonal mean bias of the predictions fluctuates from epoch to epoch. This occurs because mean state biases in the predicted heating and moistening account for only a small proportion of the overall error, because these are such noisy fields. To mitigate this bias, we select for further analysis the epoch which minimizes the global mean-square bias of the predicted net precipitation. Future work should attempt to improve the convergence of SGD. The NN training and linear regression are implemented in Python using PyTorch (Paszke et al., 2017). The NN models are then coupled to cSAM using a Python interface. The predicted tendencies are disabled within three grid cells (480 km) of the poleward boundaries because the NN is very inaccurate there as measured by its R^2 coefficient compared to Q_1 and Q_2 (cf. Figure 4). We suspect this inaccuracy occurs because the flow in these regions is greatly influenced by the no-normal-flow boundary conditions there. In particular, these boundary conditions could either introduce nonlocal dependencies that violate the horizontal locality assumption of the parameterization or simply inject harmful noise into the training data.
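The epoch-selection criterion can be sketched as follows (the numbers are made up for illustration; only the selection logic reflects the text):

```python
# Sketch of the epoch-selection step: after each epoch, score the global
# mean-square bias of the predicted net precipitation and keep the epoch
# with the smallest bias, since the MSE loss alone weakly constrains it.
import numpy as np

def select_epoch(pred_by_epoch, truth):
    """Return the index of the epoch with the smallest mean-square bias."""
    biases = [np.mean(pred - truth) ** 2 for pred in pred_by_epoch]
    return int(np.argmin(biases))

truth = np.zeros(100)
preds = [truth + 0.5, truth - 0.1, truth + 0.3]  # epoch-mean biases 0.5, -0.1, 0.3
assert select_epoch(preds, truth) == 1
```

Note that the bias is squared after averaging, so a noisy but unbiased epoch beats a smooth but biased one, which is the point of the criterion.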

Results
We now present our spatially extended simulations. As described in section 3, the simulations are initialized with the NG-Aqua data at Day 100.625, and our discussion focuses on the prediction accuracy in the first 10 forecast days. Two NN parametrizations will be compared with the training data and with the base simulation with resolved-scale microphysics and radiation. The first NN simulation, labeled "NN-All," uses the NN which includes all levels as input. The discussion in section 4.3 indicates that this setup leads to instability. The NN in the second simulation, labeled "NN-Lower," only uses humidities below level 16 (375 mbar) and temperatures below level 19 (226 mbar), as described in section 4.3.2. When evaluated directly on the NG-Aqua data, NN-Lower has an R^2 value of around 60% for Q_1 and Q_2 in the lower troposphere (Figure 4). Interestingly, Figure 5 shows that the predicted heating and moistening for a single time point are less noisy than the "true" apparent heating Q_1 and apparent moistening Q_2. Thus, the NN-Lower parametrization performs well in an instantaneous sense. Figure 6b demonstrates that the NN-Lower configuration is stable and produces a reasonable prediction of the precipitable water (PW) after 5.5 days, albeit with a visible loss of large-scale moisture variability in the tropics. On the other hand, NN-All shows a narrow line of moisture concentrated at the equator, coinciding with intense precipitation and ascent (not shown). Within this line are grid-scale artifacts devoid of humidity, and this simulation became unstable and crashed immediately after this snapshot. Likewise, the base simulation suffers from grid point storms in the tropics and frontal regions in the extratropics, and crashes (i.e., produces infinite vertical velocity) after 9 days with a 120-s time step.
It is possible that a smaller time step could stabilize this simulation, but 9 days is sufficient to evaluate the accuracy of its forecasts. Figure 7 shows the root-mean-square error (RMSE) of s_L, q_T, and the vertical component of vorticity for each configuration. The RMSE values are vertically integrated with mass weighting and averaged over different meridional bands. The tropics (11.5°S to 11.5°N) lie within 1,280 km of the equator, the subtropics (11.5–23°) are the 1,280-km-wide bands lying north and south of the tropics, and the extratropics (23–46°) are the outermost 2,560 km of the domain. For all regions and variables, NN-Lower has the lowest RMSE. This improvement is most striking for the thermodynamic variables q_T and s_L, which have parametrized source terms. Compared to the base simulation, NN-Lower has 50% lower errors in both s_L and q_T. The vorticity error improves less, which is not surprising since we do not attempt to represent the source of momentum (Q_3). However, in the extratropics, both NN-Lower and NN-All outperform the base simulation.
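A vertically integrated, mass-weighted RMSE of the kind plotted in Figure 7 can be sketched as follows. Weighting by a reference density profile times the layer depth is one plausible reading of "mass weighted" for an anelastic model; all names here are illustrative.

```python
import numpy as np

def mass_weighted_rmse(pred, truth, rho, dz):
    """Mass-weighted, vertically integrated RMSE.

    pred, truth : arrays of shape (npoints, nz) on height levels
    rho         : reference density profile, kg/m^3, shape (nz,)
    dz          : layer depths, m, shape (nz,)
    """
    w = rho * dz                       # mass per unit area in each layer
    sq_err = (pred - truth) ** 2
    # Average the squared error over horizontal points, then take the
    # mass-weighted vertical mean before the square root.
    return np.sqrt((sq_err.mean(axis=0) * w).sum() / w.sum())
```

Averaging the squared error before the square root (rather than averaging per-level RMSEs) keeps the result a true root-mean-square under the mass weighting.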

Short-Term Forecast Accuracy of the Coupled Simulations
Overall, NN-Lower produces the most accurate predictions.
The NN-Lower simulation also predicts the net precipitation rate more accurately than the base simulation does. The net precipitation rate is given by P − E, where P is the precipitation and E is the evaporation (Yanai et al., 1973). Net precipitation is the closest analog to precipitation that our NNs can predict because they predict the combined effect of precipitation and evaporation.
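Since the NNs predict a moistening profile rather than surface rainfall, net precipitation must be diagnosed as a mass-weighted column integral. A minimal sketch, assuming (purely for illustration) that the apparent moistening is given in g/kg/day on pressure levels and that its column integral equals E − P:

```python
import numpy as np

def net_precipitation(q2, p_interface, g=9.81):
    """Diagnose net precipitation P - E (mm/day) from a moistening profile.

    q2          : apparent moistening per layer, g/kg/day, shape (nz,)
                  (sign/unit convention assumed here for illustration)
    p_interface : bounding pressures of the nz layers, Pa, shape (nz+1,)
    """
    dp = np.diff(p_interface)              # layer thickness (Pa), positive
    # (g/kg/day) * (dp/g in kg/m^2) -> column moistening in g/m^2/day
    column = np.sum(q2 * dp / g, axis=-1)
    # column moistening = E - P, and 1 g water per m^2 = 1e-3 mm depth
    return -column * 1e-3                  # mm/day

# usage: uniform drying of -2 g/kg/day over a 500-hPa-deep layer
q2 = np.full(5, -2.0)
p_int = np.linspace(500e2, 1000e2, 6)
print(round(net_precipitation(q2, p_int), 2))  # ≈ 10.19 mm/day
```

The sign and unit conventions above are assumptions for the sketch; the paper's own definition of the apparent moistening fixes them in practice.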
The net precipitation is much less noisy in NN-Lower than in the base run. Figure 8 shows maps of net precipitation after 2 days of prediction (Day 102.625) for the NG-Aqua training data and for the NN-Lower and base simulations. Overall, the NN-Lower simulation produces the correct extratropical pattern of net precipitation but fails to produce enough variability in the tropics. In NG-Aqua, the tropical precipitation occurs in small-scale clusters and in a larger-scale Kelvin wave centered around x = 15,000 km at Day 102.625; elsewhere there is column moistening due to evaporation. On the other hand, the base simulation produces far too strong and noisy precipitation, as expected for a scheme which only rains when a large-scale grid cell becomes saturated.
Does the NN predict overly smooth net precipitation at the initial time, or does this smoothness only appear as the coupled simulation evolves? The NN's "semiprognostic" net precipitation (i.e., predicted from the true coarse-grained NG-Aqua state at that time) is also shown in Figure 8b. Even in this instantaneous sense, NN-Lower prefers to rain lightly throughout the tropics rather than moisten in certain regions and strongly dry in others. This suggests that ML parametrizations are still susceptible to the same biases as traditional schemes. Figure 9 shows the pattern correlation with NG-Aqua of the net precipitation for the NN-Lower and base simulations. NN-Lower has higher pattern correlations than the base simulation at all times and retains a correlation of 0.5 up to Day 101.5 in the tropics and beyond Day 102.5 in the extratropics and subtropics. By this measure, the base simulation has little predictive skill in the tropics.
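The pattern correlation used for this comparison is the standard centered (Pearson) spatial correlation between a forecast field and the reference over a region; a minimal sketch:

```python
import numpy as np

def pattern_correlation(field, reference):
    """Centered spatial (Pearson) pattern correlation between a
    forecast field and a reference field over the same region."""
    a = field - field.mean()
    b = reference - reference.mean()
    return float((a * b).sum() / np.sqrt((a**2).sum() * (b**2).sum()))
```

A value of 1 indicates a perfectly co-located anomaly pattern regardless of amplitude, which is why it complements, rather than duplicates, the RMSE comparison of Figure 7.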

Mean State Bias
Short-term predictions with NN-Lower are reasonably accurate, but the mean state drifts away from the true climate. While the short-term prediction results above are robust, the details of this drift are sensitive to the noise in the SGD process used to train the NN. For example, the biases of an NN saved after four epochs of training will differ from those after five. With that caveat, Figure 10 shows the time evolution of the zonal mean PW and net precipitation for this particular NN-Lower simulation. The fields are shown for Days 100-120 for NG-Aqua and the NN-Lower simulation; the base simulation crashed on Day 110. The tropical net precipitation and PW in NN-Lower are reasonably similar to NG-Aqua, but the width of the rainy/moist region shrinks throughout the simulation. Also, there is much less net evaporation (i.e., negative net precipitation) in the subtropical regions. On the other hand, the net precipitation in the base simulation becomes too concentrated at the equator, and the atmosphere dries out significantly. Overall, NN-Lower outperforms the base case. Figure 11 shows the zonal mean bias versus NG-Aqua of the zonal winds, meridional stream function, humidity q_T, and temperature s_L in the NN-Lower simulation, time-averaged over forecast Days 5-10. The humidity bias is relatively small, with an amplitude of less than 2 g/kg. The bias of s_L is also small and peaks near the meridional boundaries of the domain, where the NN is not active. On the other hand, the meridional circulation substantially weakens, from a peak mass transport of 80 × 10^12 to 35 × 10^12 kg/s. This is associated with the decreasing P − E in the tropics seen in Figure 10. The zonal mean zonal winds also differ substantially. The surface westerlies and eddy-driven jets become weaker, likely due to the Held-Suarez momentum damping. At the same time, the jets shift southward and superrotation develops near 250 mbar, likely as a transient response to the weakening Hadley circulation.
Overall, the small biases in the thermodynamic variables q_T and s_L belie growing biases in the circulation and zonal wind. In addition, the biases in q_T and s_L fluctuate and are sensitive to the NN training, but the biases in the circulation and zonal wind are quite robust.
Several aspects of our coarse-graining procedure could cause the mean state to drift, which Rasp et al. (2018) did not observe with their SP data. First, unlike SP, NG-Aqua contains many subgrid-scale sources of momentum, which we neglect altogether. This is consistent with the large biases in the zonal winds and the meridional stream function. Second, SP assumes a perfect separation between resolved and unresolved scales (Majda, 2007), which is only approximately true when coarse-graining GCRM data like NG-Aqua. For instance, it is unknown how precisely the apparent heating and moistening should approximate the coarse-grid average of the radiative and microphysical processes. Third, our training procedure depends implicitly on the resolution of the training data, but cSAM has both explicit dissipation (e.g., hyperdiffusion and damping) and implicit numerical dissipation which the training data lack. Thus, the effective resolution of the coupled simulations is less than that of the training data, so developing a scheme which adapts across an array of resolutions could ameliorate these biases. Finally, the biases could develop because the parametrization is deterministic rather than stochastic. To elaborate on this final point, Figure 12 shows that the joint distribution of PW and net precipitation has much less variance in NN-Lower (panel b) than in NG-Aqua (panel a). Furthermore, the predicted net precipitation has too little variance even when evaluated on the true state (panel c). Such a lack of variability could lead to mean state bias: because the expected precipitation depends exponentially on PW (Bretherton et al., 2004; Rushley et al., 2018), a reduction in the zonal variance of PW (e.g., fewer extremely moist or dry coarse grid cells) will tend to decrease the zonal mean precipitation for a given zonal mean PW.
Thus, enhancing the variance of the conditional distributions of net precipitation could ameliorate the circulation biases in these simulations.
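This variance argument can be made explicit. Assuming a schematic exponential precipitation-PW relation of the form used by Bretherton et al. (2004), where the sensitivity a and the critical value w_c below are illustrative parameters rather than values from this study, convexity of the exponential gives, via Jensen's inequality,

```latex
P \approx e^{a\,(w - w_c)}
\quad\Longrightarrow\quad
\overline{P} \;\approx\; \overline{e^{a\,(w - w_c)}}
\;\ge\; e^{a\,(\overline{w} - w_c)},
```

where $w$ is the precipitable water and $\overline{(\cdot)}$ denotes a zonal mean. The gap in the inequality grows with the zonal variance of $w$, so a simulation that underpredicts that variance (or the variance of $P$ given $w$) rains less on average at the same mean PW.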

Conclusions
We use an NN to build a unified physics parametrization of diabatic heating and moistening, which includes the effects of unresolved subgrid processes. The NN is trained by coarse-graining a near-global aquaplanet simulation with a grid size of 4 km to a resolution of 160 km. It learns the coarse-grid heating and moistening profiles, calculated as residuals from the coarse-grid dynamics, from the temperature and moisture profiles, plus auxiliary parameters. A key challenge of the coarse-graining approach is that it lacks the natural hierarchical structure of other training data sets used in the literature, such as SP (Rasp et al., 2018). In the SP application, for instance, the target model and the training model share the same software interface for the physical process tendencies. In the SP training data set, these tendencies are predicted by a high-resolution submodel, and the NN parametrization must learn them for application to the target model. This application is thus a clear target for emulation.
We extend the SCM results of Brenowitz and Bretherton (2018) to a spatially extended simulation. The coarse-grid dynamical core is the same anelastic atmospheric model that generated the 4-km training data set but is run with a 160-km resolution. This model needed additional damping in the form of horizontal hyperdiffusion and Newtonian relaxation near the surface to run stably in a base configuration with its default resolved-scale microphysics and radiation schemes, or with our NN unified-physics parametrization. We developed a deeper version of our NN capable of simulating the diabatic heating and moistening over the entire 46°S to 46°N domain of our training simulations, rather than just the tropical subdomain used by Brenowitz and Bretherton (2018).
An attempt to couple a preliminary version of this NN to this GCM caused the model to blow up. Analyzing the linearized response of that NN showed that it inadvertently learns to exploit a strong correlation between upper atmospheric humidity and precipitation. This correlation is likely an artifact of the short lifetime of water vapor at those heights and the strong forcing from atmospheric convection below.
The spatially extended simulations were stabilized by removing the upper-atmospheric humidity and temperature from the NN inputs. This configuration runs stably indefinitely. Stabilizing the parametrization thus required a rather crude human intervention. Future studies will need to explore automatic ways to discover true causal relationships and forestall model blow-up in a dynamically coupled setting.
The resulting NN predicted the weather of NG-Aqua with much higher forecast skill and lower bias than a base coarse-grid simulation with only resolved microphysics and radiation parametrizations, the only reference case readily available to us. For a fair comparison, we will need to train and implement the ML parametrization in a global model whose coarse-grid version is typically run with a traditional suite of physical parametrizations.
More work is needed to quantify and reduce the long-term biases in these simulations. While temperature and humidity biases are small in the first 10 days, the Hadley circulation is dramatically weakened. To better