# Earth System Modeling 2.0: A Blueprint for Models That Learn From Observations and Targeted High-Resolution Simulations

## Abstract

Climate projections continue to be marred by large uncertainties, which originate in processes that need to be parameterized, such as clouds, convection, and ecosystems. But rapid progress is now within reach. New computational tools and methods from data assimilation and machine learning make it possible to integrate global observations and local high-resolution simulations in an Earth system model (ESM) that systematically learns from both and quantifies uncertainties. Here we propose a blueprint for such an ESM. We outline how parameterization schemes can learn from global observations and targeted high-resolution simulations, for example, of clouds and convection, through matching low-order statistics between ESMs, observations, and high-resolution simulations. We illustrate learning algorithms for ESMs with a simple dynamical system that shares characteristics of the climate system; and we discuss the opportunities the proposed framework presents and the challenges that remain to realize it.

## Key Points

- Earth system models (ESMs) and their parameterization schemes can be radically improved by data assimilation and machine learning
- ESMs can integrate and learn from global observations from space and from local high-resolution simulations
- Ensemble Kalman inversion and Markov chain Monte Carlo methods show promise as learning algorithms for ESMs

## 1 Introduction

Climate models are built around models of the atmosphere, which are based on the laws of thermodynamics and on Newton's laws of motion for air as a fluid. Since they were first developed in the 1960s (Kasahara & Washington, 1967; Manabe et al., 1965; Mintz, 1965; Smagorinsky, 1963, 1965), they have evolved from atmosphere-only models, via coupled atmosphere-ocean models with dynamic oceans, to Earth system models (ESMs) with dynamic cryospheres and biogeochemical cycles (Bretherton et al., 2012; Intergovernmental Panel on Climate Change, 2013). Atmosphere and ocean models compute approximate numerical solutions to the laws of fluid dynamics and thermodynamics on a computational grid. For the atmosphere, the computational grid currently consists of *O*(10^{7}) cells, spaced *O*(10 km)–*O*(100 km) apart in the horizontal; for the oceans, the grid consists of *O*(10^{8}) cells, spaced *O*(10 km) apart in the horizontal. But scales smaller than the mesh size of a climate model cannot be resolved, yet they are essential for its predictive capabilities. The unresolved scales are modeled by a variety of semiempirical parameterization schemes, which represent the dynamics on subgrid scales as parametric functions of the resolved dynamics on the computational grid (Stensrud, 2007). For example, the dynamical scales of stratocumulus clouds, the most common type of boundary layer clouds, are *O*(10 m) and smaller, which will remain unresolvable on the computational grid of global atmosphere models for the foreseeable future (Wood, 2012; Schneider et al., 2017). Similarly, the submesoscale dynamics of oceans that may be important for biological processes near the surface have length scales of *O*(100 m), which will also remain unresolvable for the foreseeable future (Fox-Kemper et al., 2014). Such smaller-scale dynamics in the atmosphere and oceans must be represented in climate models through parameterization schemes.
Additionally, ESMs contain parameterization schemes for many processes for which the governing equations are not known or are only poorly known, for example, ecological or biogeochemical processes.

All of these parameterization schemes contain parameters that are uncertain, and the structure of the equations underlying them is uncertain itself. That is, there is parametric and structural uncertainty (Draper, 1995). For example, entrainment and detrainment rates are parameters or parametric functions of state variables such as the vertical velocity of updrafts. They control the interaction of convective clouds with their environment and affect cloud properties and climate. But how they depend on state variables is uncertain, as is the structure of the closure equations in which they appear (e.g., de Rooy et al., 2013; Holloway & Nee, 2009; Neelin et al., 2009; Nie & Kuang, 2012; Romps & Kuang, 2010; Stainforth et al., 2005). Or, as another example, the residence times of carbon in different reservoirs (e.g., soil, litter, and plants) control how rapidly and where in the biosphere carbon accumulates. They affect the climate response of the biosphere. But they are likewise uncertain, differing by *O*(1) factors among models (Bloom et al., 2016; Friend et al., 2014; Friedlingstein et al., 2006, 2014). Typically, parameterization schemes are developed and parameters in them are estimated independently of the model into which they are eventually incorporated. They are tested with observations from field studies at a relatively small number of locations. For processes such as boundary layer turbulence that are computable if sufficiently high resolution is available, parameterization schemes are increasingly also tested with data generated computationally in local process studies with high-resolution models (e.g., Jakob, 2003, 2010).
After the parameterization schemes are developed and incorporated in a climate model or ESM, modelers adjust (“tune”) parameters to satisfy large-scale physical constraints, such as a closed energy balance at the top of the atmosphere (TOA), or selected observational constraints, such as reproduction of the twentieth century global-mean surface temperature record. This model tuning process currently relies on knowledge and intuition of the modelers about plausible ranges of the tunable parameters and about the effect of parameter changes on the simulated climate of a model (Flato et al., 2013; Golaz et al., 2013; Hourdin et al., 2013, 2017; Mauritsen et al., 2012; Randall & Wielicki, 1997). But because of the nonlinear and interacting multiscale nature of the climate system, the simulated climate can depend sensitively and in unexpected ways on settings of tunable parameters (e.g., Suzuki et al., 2013; Zhao et al., 2016). It also remains unclear to what extent the resulting parameter choice is optimal, or how uncertain it is. Moreover, typically only a minute fraction of the available observations is used in the tuning process, usually only highly aggregated data such as global or large-scale mean values accumulated over periods of years or more. In part, this may be done to avoid overfitting, but more importantly, it is done because the tuning process usually involves parameter adjustments by hand, each of which must be evaluated by a forward integration of the model. This makes the tuning process tedious and precludes adjustments of a larger set of parameters to fit more complex observational data sets or a wider range of high-resolution process simulations. It also precludes quantification of uncertainties (Hourdin et al., 2017; Schirber et al., 2013).

Climate models have improved over the past decades, leading, for example, to better simulations of El Niño, storm tracks, and tropical waves (Flato et al., 2013; Guilyardi et al., 2009; Hung et al., 2013). Weather prediction models, the higher-resolution siblings of the climate models' atmospheric component, have undergone a parallel evolution. Along with data assimilation techniques for the initialization of weather forecasts, this has led to great strides in the accuracy of weather forecasts (Bauer et al., 2015). But the accuracy of climate projections has not improved as much, and unacceptably large uncertainties remain. For example, if one asks how high CO_{2} concentrations can rise before Earth's surface will have warmed 2°C above preindustrial temperatures—the warming target of the 2015 Paris Agreement, of which about 1°C remains because about 1°C has already been realized—the answers range from 480 to 600 ppm across current climate models (Schneider et al., 2017). A CO_{2} concentration of 480 ppm will be reached in the late 2030s or early 2040s; 600 ppm may not be reached before 2060 even if CO_{2} emissions continue to increase rapidly. Between these extremes lie vastly different optimal policy responses and socioeconomic costs of climate change (Hope, 2015).

These large and long-standing uncertainties in climate projections have their root in uncertainties in parameterization schemes. Parameterizations of clouds dominate the uncertainties in physical processes (Brient & Schneider, 2016; Bony et al., 2006; Cess et al., 1989, 1990; Soden & Held, 2006; Stephens, 2005; Vial et al., 2013; Webb et al., 2013). There are uncertainties both in the representation of the turbulent dynamics of clouds and in the representation of their microphysics, which control, for example, the distribution of droplet sizes in a cloud, the fraction of cloud condensate that precipitates out, and the phase partitioning of cloud condensate into liquid and ice (e.g., Bodas-Salcedo et al., 2014; Golaz et al., 2013; Jiang et al., 2012; Kay et al., 2016; Stainforth et al., 2005; Suzuki et al., 2013; Zhao et al., 2016). Additionally, there are numerous other parameterized processes that contribute to uncertainties in climate projections. For example, it is not precisely known what fraction of the CO_{2} that is emitted by human activities will remain in the atmosphere, and so it is uncertain which emission pathways will lead to a given atmospheric CO_{2} concentration target (Friedlingstein, 2015; Knutti et al., 2008; Meinshausen et al., 2009). Currently, only about half the emitted CO_{2} accumulates in the atmosphere. The other half is taken up by oceans and on land. It is unclear in particular what fraction of the emitted CO_{2} terrestrial ecosystems will take up in the future (Canadell et al., 2007; Friend et al., 2014; Friedlingstein et al., 2006, 2014; Knorr, 2009; Le Quéré et al., 2013; Todd-Brown et al., 2013).
Reducing such uncertainties through the traditional approach to developing and improving parameterization schemes—attempting to develop one “correct” global parameterization scheme for each process in isolation, on the basis of observational or computational process studies that are usually focused on specific regions—has met only limited success (Jakob, 2003, 2010; Randall, 2013).

Here we propose a new approach to improving parameterization schemes. The new approach invests considerable computational effort up front to exploit global observations and targeted high-resolution simulations through the use of data assimilation and machine learning within physical, biological, and chemical process models. We first outline in broad terms how we envision ESMs to learn from global observations and targeted high-resolution simulations (section 2). Then we discuss in more concrete terms the framework underlying such learning ESMs (section 3). We illustrate the approach by learning parameters in a relatively simple dynamical system that mimics characteristics of the atmosphere and oceans (section 4). We conclude with an outlook of the opportunities the framework we outline presents and of the research program that needs to be pursued to realize it (section 5).

## 2 Learning From Observations and Targeted High-Resolution Simulations

### 2.1 Information Sources for Parameterization Schemes

*Global observations*. We live in the golden age of Earth observations from space (L'Ecuyer et al., 2015). A suite of satellites flying in the formation known as the A-train has been streaming coordinated measurements of the composition of the atmosphere and of physical variables in the Earth system. We have nearly simultaneous measurements of variables such as temperature, humidity, and cloud and sea ice cover, with global coverage for more than a decade (Jiang et al., 2012; Simmons et al., 2016; Stephens et al., 2002, 2017). Space-based measurements of biogeochemical tracers and processes, such as measurements of column average CO_{2} concentrations and of photosynthesis in terrestrial ecosystems, are also beginning to become available (e.g., Bloom et al., 2016; Crisp et al., 2004; Eldering et al., 2017; Frankenberg et al., 2011; Frankenberg et al., 2014; Joiner et al., 2011; Liu et al., 2017; Sun et al., 2017; Yokota et al., 2009), and so are more detailed observations of the cryosphere (e.g., Gardner et al., 2013; Shepherd et al., 2012; Vaughan et al., 2013). Parameterization schemes can learn from such space-based global data, which can be augmented and validated with more detailed local observations from the ground and from field studies.

*Local high-resolution simulations*. Some processes parameterized in ESMs are in principle computable; only the globally achievable resolution precludes their explicit computation. For example, the turbulent dynamics (though currently not the microphysics) of clouds can be computed with high fidelity in limited domains in large-eddy simulations (LES) with grid spacings of *O*(10 m) (Khairoutdinov et al., 2009; Matheou & Chung, 2014; Pressel et al., 2015, 2017; Siebesma et al., 2003; Stevens et al., 2005; Schalkwijk et al., 2015). Increased computational performance has made LES domain widths of *O*(10 km)–*O*(100 km) feasible in recent years, while the horizontal mesh size in atmosphere models has shrunk, to the point that the two scales have converged. Thus, while global LES that reliably resolve low clouds such as cumulus or stratocumulus will not be feasible for decades, it is possible to nest LES in selected grid columns of atmosphere models and conduct high-fidelity local simulations of cloud dynamics in them (Schneider et al., 2017). Local high-resolution simulations of ocean mesoscale turbulence or sea ice dynamics can be conducted similarly. Parameterization schemes can learn from such nested high-resolution simulations.

Of course, both observations and high-resolution simulations have been exploited in the development of parameterization schemes for some time. For example, data assimilation techniques have been used to estimate parameters in parameterization schemes from observations. Parameters especially in cloud, convection, and precipitation parameterizations have been estimated by minimizing errors in short-term weather forecasts over time scales of hours or days (e.g., Aksoy et al., 2006; Emanuel & Živković Rothman, 1999; Grell & Dévényi, 2002; Ruiz et al., 2013, 2015; Schirber et al., 2013) or by minimizing deviations between simulated and observed longer-term aggregates of climate statistics, such as global-mean TOA radiative fluxes accumulated over seasons or years (e.g., Jackson et al., 2008; Jarvinen et al., 2010; Neelin et al., 2010; Solonen et al., 2012; Tett et al., 2013). High-resolution simulations have been used to provide detailed dynamical information such as vertical velocity and turbulence kinetic energy profiles in convective clouds, which are not easily available from observations. They have often been employed to augment observations from local field studies, and parameterization schemes have been fit to and evaluated with the observations and the high-resolution simulations used in tandem (e.g., de Rooy et al., 2013; Hohenegger & Bretherton, 2011; Liu et al., 2001; Siebesma et al., 2003, 2007; Stevens et al., 2005; Romps, 2016). High-resolution deep convection-resolving simulations with *O*(1 km) horizontal grid spacing and, most recently, LES with *O*(100 m) horizontal grid spacing have also been nested in small, usually two-dimensional subdomains of atmospheric grid columns, as a parameterization surrogate that explicitly resolves some aspects of cloud dynamics (e.g., Grabowski & Smolarkiewicz, 1999; Grabowski, 2001; Grabowski, 2016; Khairoutdinov & Randall, 2001; Khairoutdinov et al., 2005; Randall et al., 2003; Randall, 2013; Parishani et al., 2017). 
Such multiscale modeling approaches, often called superparameterization, have led to markedly improved simulations, for example, of the Asian monsoon, of tropical surface temperatures, and of precipitation and its diurnal cycle, albeit at great computational expense (e.g., Benedict & Randall, 2009; Pritchard & Somerville, 2009a, 2009b; Stan et al., 2010; DeMott et al., 2013). However, multiscale modeling relies on a scale separation between the global-model mesh size and the domain size of the nested high-resolution simulation (Weinan et al., 2007). Multiscale modeling is computationally advantageous relative to global high-resolution simulations as long as it suffices for the nested high-resolution simulation to subsample only a small fraction of the footprint of a global-model grid column, and to extrapolate the information so obtained to the entire footprint on the basis of statistical homogeneity assumptions. As the mesh size of global atmosphere models shrinks to horizontal scales of kilometers—resolutions that are already feasible in short integrations or limited areas and that will become routine in the next decade (Ban et al., 2015; Ohno et al., 2016; Palmer, 2014; Schneider et al., 2017)—the scale separation to the minimum necessary domain size of nested high-resolution simulations will disappear, and with it the computational advantage of multiscale modeling.

What we propose here combines elements of these existing approaches in a novel way. At its core are still parameterization schemes that are based on physical, biological, or chemical process models, whose mathematical structure is developed on the basis of theory, local observations, and, where possible, high-resolution simulations. But we propose that these parameterization schemes, when they are embedded in ESMs, learn directly from observations and high-resolution simulations that both sample the globe. High-resolution simulations are employed in a targeted way—akin to targeted or adaptive observations in weather forecasting (Bishop et al., 2001; Lorenz & Emanuel, 1998; Palmer et al., 1998)—to reduce uncertainties where observations are insufficient to obtain tight parameter estimates. Instead of incorporating high-resolution simulations globally in a small fraction of the footprint of each grid column like in multiscale modeling approaches, the ESM we envision deploys them locally, in entire grid columns, albeit only in a small subset of them. High-resolution simulations can be targeted to grid columns selected based on measures of uncertainty about model parameters. If the nested high-resolution simulations feed back onto the ESM, this corresponds to a locally extreme mesh refinement; however, two-way nesting may not always be necessary (e.g., Moeng et al., 2007; Zhu et al., 2010). The model learns parameters from observations and from nested high-resolution simulations in a computationally intensive learning phase, after which it can be used in a computationally more efficient manner, like models in use today. Nonetheless, even in simulations of climates beyond what has been observed, bursts of targeted high-resolution simulations can continue to be deployed to refine parameters and estimate their uncertainties.
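
The targeting step described above can be sketched as follows. This is a minimal illustration, assuming each grid column carries a scalar uncertainty score (for instance, the spread of a parameter estimate in that column). The function name `select_target_columns`, the scores, and the budget are hypothetical and do not come from the source.

```python
from typing import Dict, List


def select_target_columns(uncertainty: Dict[int, float], budget: int) -> List[int]:
    """Return the ids of the `budget` grid columns with the largest uncertainty.

    `uncertainty` maps a (hypothetical) grid-column id to a scalar measure of
    parameter uncertainty in that column; `budget` is the number of nested
    high-resolution simulations that can be afforded.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    # Sort by descending uncertainty; break ties by column id for determinism.
    ranked = sorted(uncertainty, key=lambda col: (-uncertainty[col], col))
    return ranked[:budget]


# Toy example: five columns, two affordable high-resolution simulations.
scores = {0: 0.10, 1: 0.85, 2: 0.40, 3: 0.85, 4: 0.05}
targets = select_target_columns(scores, budget=2)
print(targets)  # columns 1 and 3 carry the largest scores
```

In a real system the score would presumably come from the learning algorithm itself; a plain ranking is used here only to make the idea of a small, targeted subset concrete.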

### 2.2 Computable and Noncomputable Parameters

Learning from high-resolution simulations and observations is aimed at determining two different kinds of parameters in parameterization schemes: *computable* and *noncomputable* parameters. (Since parameters and parametric functions of state variables play essentially the same role in our discussion, we simply use the term parameter, with the understanding that this can include parametric functions and even nonparametric functions.) Computable parameters are those that can, in principle, be inferred from high-resolution simulations alone. They include parameters in radiative transfer schemes, which can be inferred from detailed line-by-line calculations; dynamical parameters in cloud turbulence parameterizations, such as entrainment rates, which can be inferred from LES; or parameters in ocean mixing parameterizations, which can be inferred from high-resolution simulations. Noncomputable parameters are parameters that, currently, cannot be inferred from high-resolution simulations, either because computational limitations make it necessary for them to also appear in parameterization schemes in high-resolution simulations, or because the microscopic equations governing the processes in question are unknown. They include parameters in cloud microphysics parameterizations, which are still necessary to include in LES, and many parameters characterizing ecological and biogeochemical processes, whose governing equations are unknown. Cloud microphysics parameters will increasingly become computable through direct numerical simulation (Devenish et al., 2012; Grabowski & Wang, 2013), but ecological and biogeochemical parameters will remain noncomputable for the foreseeable future. Both computable and noncomputable parameters can, in principle, be learned from observations; the only restrictions to their identifiability come from the well-posedness of the learning problem and its computational tractability. 
But only computable parameters can be learned from targeted high-resolution simulations. To be able to learn computable parameters, it is essential to represent noncomputable aspects of a parameterization scheme consistently in the high-resolution simulation and in the parameterization scheme that is to learn from the high-resolution simulation. For example, radiative transfer and microphysical processes need to be represented consistently in a high-resolution LES and in a parameterization scheme if the parameterization scheme is to learn computable dynamical parameters such as entrainment rates from the LES.

This approach presents challenges for parameter learning, since it implies the need to use observational data and high-resolution simulations in tandem to improve model parameterizations. But it also presents an opportunity: in doing so, the reliability and predictive power of ESMs can be improved, and uncertainties in parameters and predictions can be quantified.

### 2.3 Objectives: Bias Reduction and Exploitation of Emergent Constraints

Computational tractability is paramount for the success of any parameter learning algorithm for ESMs (e.g., Annan & Hargreaves, 2007; Jackson et al., 2008; Neelin et al., 2010; Solonen et al., 2012). The central issue is the number of times the objective function needs to be evaluated, and hence an ESM needs to be run, in the process of parameter learning. Standard parameter estimation and inverse problem approaches may require *O*(10^{5}) function or derivative evaluations to learn *O*(100) parameters, especially if uncertainty in the estimates is also required (Cotter et al., 2013). This many forward integrations and/or derivative evaluations of ESMs are not feasible if each involves accumulation of longer-term climate statistics. Fast parameterized processes in climate models often exhibit errors within a few hours or days of integration that are similar to errors in the mean state of the model (Klocke & Rodwell, 2014; Ma et al., 2013; Phillips et al., 2004; Rodwell & Palmer, 2007; Xie et al., 2012). This has given rise to hopes that it may suffice to evaluate objective functions by weather hindcasts over time scales of only hours, making many evaluations of an objective function feasible (Aksoy et al., 2006; Ruiz et al., 2013; Wan et al., 2014). But experience has shown that such short-term optimization may not always lead to the desired improvements in climate simulations (Schirber et al., 2013). Additionally, slower parameterized processes, for example, involving biogeochemical cycles or the cryosphere, require longer integration times to accumulate statistics entering any meaningful objective function. Therefore, we focus on objective functions involving climate statistics accumulated over windows that we anticipate to be wide compared with the *O*(10 days) time scale over which the atmosphere forgets its initial condition. Then the accumulated statistics do not depend sensitively on atmospheric initial conditions. 
This reduces the onus of correctly assimilating atmospheric initial conditions in parameter learning, which would be required if one were to match simulated and observed trajectories, as in approaches that assimilate model parameters jointly with the state of the system by augmenting state vectors with parameters (e.g., Aksoy et al., 2006; Anderson et al., 2009; Dee, 2005). The minimum window over which climate statistics will need to be accumulated will vary from process to process, generally being longer for slower processes (e.g., the cryosphere) than faster processes (e.g., the atmosphere). For slower processes whose initial condition is not forgotten over the accumulation window, it will remain necessary to correctly assimilate initial conditions.

The objective functions to be minimized in the learning phase can be chosen to directly minimize biases in climate simulations, for example, precipitation biases such as the longstanding double-ITCZ bias in the tropics (Adam et al., 2016, 2017; Lin, 2007; Li & Xie, 2014; Zhang et al., 2015), or cloud cover biases such as the “too few–too bright” bias in the subtropics (Karlsson et al., 2008; Nam et al., 2012; Webb et al., 2001; Zhang et al., 2005). Because the sensitivity with which an ESM responds to increases in greenhouse gas concentrations correlates with the spatial structure of some of these biases in the models (e.g., Tian, 2015; Siler et al., 2017), minimizing regional biases will likely reduce uncertainties in climate projections, in addition to leading to more reliable simulations of the present climate. To minimize biases, the objective function needs to include mean-field terms penalizing mismatch between spatially and at least seasonally resolved simulated and observed mean fields, for example, of precipitation, ecosystem primary productivity, and TOA radiative energy fluxes.

Additionally, there is a growing literature on “emergent constraints,” which typically are fluctuation-dissipation relationships that relate measurable fluctuations in the present climate to the response of the climate system to perturbations (Collins et al., 2012; Hall & Qu, 2006; Klein & Hall, 2015). For example, how strongly tropical low-cloud cover covaries with surface temperature from year to year or even seasonally in the present climate correlates in climate models with the amplitude of the cloud response to global warming (Qu et al., 2014, 2015; Brient & Schneider, 2016). Therefore, the observable low-cloud cover covariation with surface temperature in the present climate can be used to constrain the cloud response to global warming. Or as another example, how strongly atmospheric CO_{2} concentrations covary with surface temperature in the present climate correlates in climate models with the amplitude of the terrestrial ecosystem response to global warming (e.g., the balance between CO_{2} fertilization of plants and enhanced soil and plant respiration under warming) (Cox et al., 2013; Wenzel et al., 2014). Therefore, the observable CO_{2} concentration covariation with surface temperature can be used to constrain the terrestrial ecosystem response to global warming. Such emergent constraints are usually used post facto, in the evaluation of ESMs. They lead to inferences about the likelihood of a model given the measured natural variations, and they therefore can be used to assess how likely it is that its climate change projections are correct (e.g., Brient & Schneider, 2016). But emergent constraints usually are not used directly to improve models. In what we propose, they are used directly to learn parameters in ESMs and to reduce uncertainties in the climate response. 
To do so, covariance terms (e.g., between surface temperature and cloud cover or TOA radiative fluxes, or between surface temperature and CO_{2} concentrations) need to be included in the objective function.

The choice of objective functions to be employed is key to the success of what we propose. The use of time-averaged statistics such as mean-field and covariance terms will make the objective functions smoother and hence reduce the computational cost of minimization, compared with minimizing objective functions that directly penalize mismatch between simulated and observed trajectories of the Earth system. From the point of view of statistical theory, the objective functions should contain the sufficient statistics for the parameters of interest, but what these are is not usually known a priori. In practice, the choice of objective functions will be guided by expertise specific to the relevant subdomains of Earth system science, as well as computational cost. Given that current ESM components such as clouds and the carbon cycle exhibit large seasonal biases (e.g., Lin et al., 2014; Keppel-Aleks et al., 2012; Karlsson & Svensson, 2013), and their response to long-term warming in some respects resembles their response to seasonal variations (e.g., Brient & Schneider, 2016; Wenzel et al., 2016), accumulating seasonal statistics in the objective functions suggests itself as a starting point.
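
As an illustration of seasonal accumulation, the sketch below pools monthly values into the four conventional meteorological seasons before any mismatch is computed. The function name `seasonal_means` and the precipitation numbers are made up for the example and are not taken from the source.

```python
from typing import Dict, List, Sequence

# Conventional meteorological seasons by calendar month (1 = January).
SEASONS: Dict[str, List[int]] = {
    "DJF": [12, 1, 2],
    "MAM": [3, 4, 5],
    "JJA": [6, 7, 8],
    "SON": [9, 10, 11],
}


def seasonal_means(monthly: Sequence[float]) -> Dict[str, float]:
    """Average a series of monthly values (starting in January) by season.

    The series may span several whole years; all Decembers, Januaries, and
    Februaries are pooled into DJF, and so on.
    """
    if len(monthly) == 0 or len(monthly) % 12 != 0:
        raise ValueError("expected a whole number of years of monthly values")
    means = {}
    for name, months in SEASONS.items():
        values = [v for i, v in enumerate(monthly) if (i % 12) + 1 in months]
        means[name] = sum(values) / len(values)
    return means


# Made-up monthly precipitation (mm/day) for one year, January to December.
precip = [1.0, 1.0, 2.0, 2.0, 2.0, 4.0, 4.0, 4.0, 3.0, 3.0, 3.0, 1.0]
print(seasonal_means(precip))
```

Applying the same reduction to simulated and observed fields yields seasonally resolved statistics that can enter a mean-field term of an objective function.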

## 3 Machine Learning Framework for Earth System Models

### 3.1 Models and Data

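Let $\boldsymbol{\theta} = (\boldsymbol{\theta}_c, \boldsymbol{\theta}_n)$ denote the vector of model parameters to be learned, consisting of computable parameters $\boldsymbol{\theta}_c$ that can be learned from high-resolution simulations, and noncomputable parameters $\boldsymbol{\theta}_n$ that can only be learned from observations (for example, because high-resolution simulations themselves depend on $\boldsymbol{\theta}_n$). The parameters $\boldsymbol{\theta}$ appear in parameterization schemes in a model, which may be viewed as a map $\mathcal{G}$, parameterized by time $t$, that takes the parameters $\boldsymbol{\theta}$ to the state variables $\boldsymbol{x}$,

$$\boldsymbol{x}(t) = \mathcal{G}(\boldsymbol{\theta}, t). \tag{1}$$

The state variables $\boldsymbol{x}$ can include temperatures, humidity variables, and cloud, cryosphere, and biogeochemical variables, and the map $\mathcal{G}$ may depend on initial conditions and time-evolving boundary or forcing conditions. The map $\mathcal{G}$ typically represents a global ESM. The state variables $\boldsymbol{x}$ are linked to observables $\boldsymbol{y}$ through a map $\mathcal{H}$ representing an observing system, so that

$$\boldsymbol{y}(t) = \mathcal{H}\bigl(\boldsymbol{x}(t)\bigr). \tag{2}$$

The observables $\boldsymbol{y}$ might represent surface temperatures, CO_{2} concentrations, or spectral radiances emanating from the TOA. The map $\mathcal{H}$ in practice will be realized through an observing system simulator, which simulates how observables $\boldsymbol{y}$ are impacted by a multitude of state variables $\boldsymbol{x}$. The actual observations (e.g., space-based measurements) are denoted by $\tilde{\boldsymbol{y}}$, so $\boldsymbol{y} - \tilde{\boldsymbol{y}}$ is the mismatch between simulations and observations. Since $\boldsymbol{y}$ is parameterized by $\boldsymbol{\theta}$, while $\tilde{\boldsymbol{y}}$ is independent of $\boldsymbol{\theta}$, mismatches between $\boldsymbol{y}$ and $\tilde{\boldsymbol{y}}$ can be used to learn about $\boldsymbol{\theta}$.

Local high-resolution simulations nested in a grid column of an ESM may be viewed as a time-dependent map $\mathcal{L}$ from the state variables $\boldsymbol{x}$ of the ESM to simulated state variables $\tilde{\boldsymbol{z}}$,

$$\tilde{\boldsymbol{z}}(t) = \mathcal{L}(\boldsymbol{\theta}_n, t; \boldsymbol{x}). \tag{3}$$

The map $\mathcal{L}$ is parameterized by the noncomputable parameters $\boldsymbol{\theta}_n$ and time $t$, and it can involve the time history of the state variables $\boldsymbol{x}$ up to time $t$. The vector $\tilde{\boldsymbol{z}}$ contains statistics of high-resolution variables whose counterparts in the ESM are computed by parameterization schemes, such as the mean cloud cover or liquid water content in a grid box. The corresponding variables $\boldsymbol{z}$ in the ESM are obtained by a time-dependent map $\mathcal{S}$ that takes state variables $\boldsymbol{x}$ and parameters $\boldsymbol{\theta}$ to $\boldsymbol{z}$,

$$\boldsymbol{z}(t) = \mathcal{S}(\boldsymbol{\theta}, t; \boldsymbol{x}). \tag{4}$$

The map $\mathcal{S}$ typically represents a single grid column of the ESM with its parameterization schemes, taking as input $\boldsymbol{x}$ from the ESM. It is structurally similar to $\mathcal{L}$. Crucially, however, $\mathcal{S}$ generally depends on all parameters $\boldsymbol{\theta} = (\boldsymbol{\theta}_c, \boldsymbol{\theta}_n)$, while $\mathcal{L}$ only depends on noncomputable parameters $\boldsymbol{\theta}_n$. Thus, the mismatch $\boldsymbol{z} - \tilde{\boldsymbol{z}}$ can be used to learn about the computable parameters $\boldsymbol{\theta}_c$.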
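
The structure of this learning problem can be sketched with toy stand-ins for the global model, the observing system, the nested high-resolution simulation, and the single-column model. Every function below, and the setup with one computable and one noncomputable parameter, is a hypothetical illustration rather than an actual ESM component. The point is only that the column model depends on both kinds of parameters while the high-resolution simulation depends on the noncomputable one alone, so their mismatch carries information about the computable parameter.

```python
from typing import List, Tuple

Params = Tuple[float, float]  # (theta_c, theta_n): computable, noncomputable


def esm_state(theta: Params, t: float) -> float:
    """Toy global model: state x at time t depends on all parameters."""
    theta_c, theta_n = theta
    return theta_c * t + theta_n


def observe(x: float) -> float:
    """Toy observing system: observable y as a function of the state x."""
    return 2.0 * x


def high_res_sim(theta_n: float, x: float) -> float:
    """Toy nested high-resolution simulation: depends on theta_n only.

    The factor 0.5 stands in for resolved dynamics that need no parameter.
    """
    return 0.5 * x + theta_n


def column_model(theta: Params, x: float) -> float:
    """Toy single-column parameterization: depends on theta_c and theta_n."""
    theta_c, theta_n = theta
    return theta_c * x + theta_n


def column_mismatch(theta: Params, times: List[float]) -> List[float]:
    """Mismatch between the column model and the high-resolution run."""
    out = []
    for t in times:
        x = esm_state(theta, t)
        out.append(column_model(theta, x) - high_res_sim(theta[1], x))
    return out


times = [0.0, 1.0, 2.0]
observables = [observe(esm_state((0.5, 1.0), t)) for t in times]
print(observables)
print(column_mismatch((0.5, 1.0), times))  # zero when theta_c equals 0.5
print(column_mismatch((0.7, 1.0), times))  # nonzero: informs theta_c
```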

The same framework also covers other ways of learning about parameterization schemes from data. For example, the map $\mathcal{G}$ from parameters to state variables may represent a single grid column of an ESM, driven by time-evolving boundary conditions from reanalysis data at selected sites. Observations at the sites can then be used to learn about the parameterization schemes in the column (Neggers et al., 2012). Or, similarly, the map $\mathcal{G}$ may represent a local high-resolution simulation driven by reanalysis data, with parameterization schemes, for example, for cloud microphysics, about which one wants to learn from observations.

### 3.2 Objective Functions

Objective functions are defined through mismatch between the simulated data $\boldsymbol{y}$ and observations $\tilde{\boldsymbol{y}}$, on the one hand, and simulated data $\boldsymbol{z}$ and high-resolution simulations $\tilde{\boldsymbol{z}}$, on the other hand. We define mismatches using time-averaged statistics, because they do not suffer from sensitivity to atmospheric initial conditions; indeed, matching trajectories directly requires assimilating atmospheric initial conditions, which would make it difficult to disentangle mismatches due to errors in climatically unimportant atmospheric initial conditions from those due to parameterization errors. However, the time averages can still depend on initial conditions for slowly evolving components of the Earth system, such as ocean circulations or ice sheets.

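We denote the time average of a function $\phi(t)$ over the time interval $[t_0, t_0 + T]$ by

$$\langle \phi \rangle_T = \frac{1}{T} \int_{t_0}^{t_0 + T} \phi(t)\, dt. \tag{5}$$

The observational objective function then takes the least squares form

$$J_o(\boldsymbol{\theta}) = \frac{1}{2} \left\| \langle \boldsymbol{f}(\boldsymbol{y}) \rangle_T - \langle \boldsymbol{f}(\tilde{\boldsymbol{y}}) \rangle_T \right\|_{\Sigma_y}^2, \tag{6}$$

where $\| \cdot \|_{\Sigma_y} = \| \Sigma_y^{-1/2}\, \cdot\, \|$ is the Euclidean norm weighted by the error covariance matrix $\Sigma_y$. The function $\boldsymbol{f}$ of the observables typically involves first- and second-order quantities, for example,

$$\boldsymbol{f}(\boldsymbol{y}) = \begin{pmatrix} \boldsymbol{y} \\ y_i' y_j' \end{pmatrix}, \tag{8}$$

where, for any observable $\phi$, $\phi' = \phi - \langle \phi \rangle_T$ denotes the fluctuation of $\phi$ about its mean $\langle \phi \rangle_T$. With $\boldsymbol{f}$ given by 8, the objective function penalizes mismatch between the vectors of mean values $\langle \boldsymbol{y} \rangle_T$ and $\langle \tilde{\boldsymbol{y}} \rangle_T$ and between the covariance components $\langle y_i' y_j' \rangle_T$ and $\langle \tilde{y}_i' \tilde{y}_j' \rangle_T$ for some indices $i$ and $j$. The least squares form of the objective function 6 follows from assuming an error model

$$\langle \boldsymbol{f}(\tilde{\boldsymbol{y}}) \rangle_T = \langle \boldsymbol{f}(\boldsymbol{y}) \rangle_T + \boldsymbol{\eta},$$

with the matrix $\Sigma_y$ encoding an assumed covariance structure of the noise vector $\boldsymbol{\eta}$. The relevant components of $\Sigma_y$ may be chosen very small for quantities that are used as constraints on the ESM (e.g., the requirement of a closed global energy balance at TOA).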

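For the high-resolution simulations, we denote the average over an ensemble of such simulations by $\langle \phi \rangle_E$, and define an objective function analogously to that for the observations through

$$J_s(\boldsymbol{\theta}) = \frac{1}{2} \left\| \langle \boldsymbol{g}(\boldsymbol{z}) \rangle_E - \langle \boldsymbol{g}(\tilde{\boldsymbol{z}}) \rangle_E \right\|_{\Sigma_z}^2. \tag{10}$$

Like the function $\boldsymbol{f}$ above, the function $\boldsymbol{g}$ typically involves first- and second-order quantities, and the least squares form of the objective functions follows from the assumed covariance structure $\Sigma_z$ of the noise.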
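As a minimal sketch of how such a least-squares objective could be evaluated, the following Python snippet stacks time means and distinct covariance components into a moment vector and weights the mismatch by a diagonal covariance. The function names, the restriction to a diagonal covariance, and the synthetic Gaussian data are illustrative assumptions, not part of the framework described above.

```python
import numpy as np

def moment_vector(series):
    """Stack time means and distinct covariance components of a (T, d) time series."""
    mean = series.mean(axis=0)
    fluct = series - mean                      # fluctuations about the time mean
    cov = fluct.T @ fluct / series.shape[0]    # time-averaged products of fluctuations
    rows, cols = np.triu_indices(series.shape[1])
    return np.concatenate([mean, cov[rows, cols]])

def objective(simulated, observed, sigma_diag):
    """0.5 * squared moment mismatch, weighted by an (assumed) diagonal covariance."""
    diff = moment_vector(simulated) - moment_vector(observed)
    return 0.5 * np.sum(diff**2 / sigma_diag)

# Synthetic stand-ins for simulated and observed data (hypothetical, illustration only).
rng = np.random.default_rng(0)
observed = rng.normal(size=(5000, 3))
unbiased = rng.normal(size=(5000, 3))
biased = unbiased + 0.5                        # simulation with a mean bias
sigma_diag = np.full(9, 0.01)                  # 3 means + 6 covariance components

j_unbiased = objective(unbiased, observed, sigma_diag)
j_biased = objective(biased, observed, sigma_diag)
print(j_unbiased < j_biased)
```

The biased simulation is penalized through the mean components of the moment vector, while its covariance components are unchanged by the constant offset.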

### 3.3 Learning Algorithms

Learning algorithms attempt to choose parameters ***θ*** that minimize *J*_{o} and *J*_{s}. However, minimization of *J*_{o} and *J*_{s} does not always determine the parameters uniquely, for example, if there are strongly correlated parameters or if the number of parameters to be learned exceeds the number of available observational degrees of freedom. In such cases, regularization is necessary to choose a good solution for the parameters among the multitude of possible solutions. This may be achieved in various ways: by adding regularizing penalty terms that incorporate prior knowledge about the parameters to the least-squares objective functions 6 and 10 (Engl et al., 1996), by Bayesian probabilistic regularization (Kaipio & Somersalo, 2005), or by restriction of the parameters to a subset, as in ensemble Kalman inversion (Iglesias et al., 2013).

- Classical regularized least squares leads to an optimization problem that is typically tackled by gradient descent or Gauss-Newton methods, in which derivatives of the parameter-to-data map are employed (Nocedal & Wright, 2006). Such methods usually require *O*(10^{2}) integrations of the forward model or evaluations of its derivatives with respect to parameters.
- Bayesian inversions usually employ Markov chain Monte Carlo (MCMC) methods (Brooks et al., 2011) and variants such as sequential Monte Carlo (Del Moral et al., 2006) to approximate the posterior probability density function (PDF) of parameters, given data and a prior PDF. A PDF of parameters provides much more information than a point estimate, and consequently MCMC methods typically require many more forward model integrations, sometimes on the order of *O*(10^{5}). The computational demands can be decreased by an order of magnitude by judicious use of derivative information where available (see Beskos et al., 2017, and references therein) or by improved sampling strategies (e.g., Jackson et al., 2008; Jarvinen et al., 2010; Solonen et al., 2012). Nonetheless, the cost remains orders of magnitude higher than for optimization techniques.
- Ensemble Kalman methods are easily parallelizable, derivative-free alternatives to the classical optimization and Bayesian approaches (Houtekamer & Zhang, 2016). Although theory for them is less well developed, empirical evidence demonstrates behavior similar to derivative-based algorithms in complex inversion problems, with a comparable number of forward model integrations (Iglesias, 2016). Ensemble methods for joint state and parameter estimation have recently been systematically developed (Bocquet & Sakov, 2013, 2014; Carrassi et al., 2017), and they are emerging as a promising way to solve inverse problems and to obtain qualitative estimates of uncertainty. However, numerical experiments have indicated that such uncertainty information is qualitative at best: the Kalman methods invoke Gaussian assumptions, which may not be justified, and even if the Gaussian approximation holds, the ensemble sizes needed for uncertainty quantification may not be practical (Iglesias et al., 2013; Law & Stuart, 2012).
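To make the role of regularizing penalty terms concrete, here is a small sketch with a hypothetical linear forward model whose data constrain only the sum of two parameters. The matrix `A`, the prior mean, and the penalty weight `lam` are made-up values chosen for illustration; the quadratic penalty is one common choice, not necessarily the one an ESM application would use.

```python
import numpy as np

# Hypothetical linear forward model that constrains only theta_1 + theta_2,
# so unregularized least squares has infinitely many minimizers.
A = np.array([[1.0, 1.0],
              [2.0, 2.0]])
y = np.array([3.0, 6.0])

prior_mean = np.array([1.0, 1.0])   # assumed prior knowledge about the parameters
lam = 0.1                           # assumed penalty weight

# Minimize 0.5*||A theta - y||^2 + 0.5*lam*||theta - prior_mean||^2
# by solving the corresponding normal equations.
theta_reg = np.linalg.solve(A.T @ A + lam * np.eye(2), A.T @ y + lam * prior_mean)

# Rank 1: the data alone cannot separate the two parameters.
rank = np.linalg.matrix_rank(A.T @ A)
print(rank, theta_reg)
```

The penalty selects, among all parameter pairs that fit the data nearly equally well, the one closest to the prior mean.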

An important consideration is how to blend the information about parameters contained in the high-resolution simulations and in the observations. One approach is as follows, although others may turn out to be preferable. Minimizing the high-resolution objective function *J*_{s} in principle gives the computable parameters *θ*_{c} as an implicit function of the noncomputable parameters *θ*_{n}. This implicit function may then be used as prior information to minimize the observational objective function *J*_{o} over ***θ***. Bayesian MCMC approaches may be feasible for fitting *J*_{s}, since the single-column model is relatively cheap to evaluate, and the ensemble of high-resolution simulations needed may not be large. Although Bayesian approaches may not be feasible for fitting *J*_{o}, for which accumulation of statistics of the model is required, this hierarchical approach does have the potential to incorporate detailed uncertainty estimates coming from the high-resolution simulations.

The choice of normalization (i.e., *Σ*_{y} and *Σ*_{z}) in the objective functions plays a significant role in parameter learning, and learning about it has been demonstrated to have considerable impact on data assimilation for weather forecasts (Dee, 1995; Stewart et al., 2014). We will not discuss this issue in any detail, but note it may be addressed by the use of hierarchical Bayesian methodology and ensemble Kalman analogs. Nor will we dwell on the important issue of structural uncertainty—model error—other than to note that this can, in principle, be addressed through the inverse problem approach advocated here: additional unknown parameters, placed judiciously within the model to account for model error, can be learned from data (Dee, 2005; Kennedy & O'Hagan, 2001). The choice of normalization is especially important in this context as it relates to disentangling learning about model error from learning about the other parameters of interest.

- Minimization of the objective functions *J*_{o} and *J*_{s} may be performed by online filtering algorithms, akin to those used in the initialization of weather forecasts, which sequentially update parameters as information becomes available (Law et al., 2015). This can reduce the number of forward model integrations required for parameter estimation, and it can allow parameterization schemes to learn adaptively from high-resolution simulations during the course of a global simulation.
- Where to employ targeted high-resolution simulations can be chosen to optimize aspects of the learning process. The simplest approach would be to deploy them randomly, for example, by selecting regions with a probability proportional to their climatological cloud fraction for high-resolution simulations of clouds. More efficient would be techniques of optimal experimental design (see Alexanderian et al., 2016, and references therein), within online filtering algorithms. With such techniques, high-resolution simulations could be generated to order, to update aspects of parameterization schemes that have the most influence on the global system with which they interact.

Progress along these lines will require innovation. For example, filtering algorithms need to be adapted to deal with strong serial correlations such as those that arise when averages $\langle \phi \rangle_{T_i}$ are accumulated over increasing spans $T_i < T_{i+1}$ and parameters are updated from one average $\langle \phi \rangle_{T_i}$ to a longer average $\langle \phi \rangle_{T_{i+1}}$. And optimal experimental design techniques require the development of cheap computational methods to evaluate sensitivities of the ESM to individual aspects of parameterization schemes.

## 4 Illustration With Dynamical System

We envision ESMs eventually to learn parameters online, with targeted high-resolution simulations triggering parameter updates on the fly. Here we want to illustrate in off-line mode some of the opportunities and challenges of learning parameters in a relatively simple dynamical system. We use the Lorenz-96 model (Lorenz, 1996), which has nonlinearities resembling the advective nonlinearities of fluid dynamics and a multiscale coupling of slow and fast variables similar to what is seen in ESMs. The model has been used extensively in the development and testing of data assimilation methods (e.g., Anderson, 2001; Lorenz & Emanuel, 1998; Ott et al., 2004).

### 4.1 Lorenz-96 Model

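The Lorenz-96 model consists of $K$ slow variables $X_k$ ($k = 1, \dots, K$), each of which is coupled to $J$ fast variables $Y_{j,k}$ ($j = 1, \dots, J$) (Lorenz, 1996):

$$\frac{dX_k}{dt} = -X_{k-1}\left(X_{k-2} - X_{k+1}\right) - X_k + F - hc\,\bar{Y}_k, \tag{11}$$

$$\frac{1}{c}\frac{dY_{j,k}}{dt} = -b\,Y_{j+1,k}\left(Y_{j+2,k} - Y_{j-1,k}\right) - Y_{j,k} + \frac{h}{J} X_k. \tag{12}$$

Here the overbar denotes the mean value over $j$,

$$\bar{Y}_k = \frac{1}{J} \sum_{j=1}^{J} Y_{j,k}.$$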

Both the slow and fast variables are taken to be periodic in *k* and *j*, forming a cyclic chain with *X*_{k + K} = *X*_{k}, *Y*_{j,k + K} = *Y*_{j,k}, and *Y*_{j + J,k} = *Y*_{j,k + 1}. The slow variables *X* may be viewed as resolved-scale variables and the fast variables *Y* as unresolved variables in an ESM. Each of the *K* slow variables *X*_{k} may represent a property such as surface air temperature in a cyclic chain of grid cells spanning a latitude circle. Each slow variable *X*_{k} affects the *J* fast variables *Y*_{j,k} in the grid cell, which might represent cloud-scale variables such as liquid water path in each of *J* cumulus clouds. In turn, the mean value of the fast variables over the cell, *Ȳ*_{k}, feeds back onto the slow variables *X*_{k}. The strength of the coupling between fast and slow variables is controlled by the parameter *h*, which represents an interaction coefficient, for example, an entrainment rate that couples cloud-scale variables to their large-scale environment. Time is nondimensionalized by the linear-damping time scale of the slow variables, which we nominally take to be 1 day, a typical thermal relaxation time of surface temperatures (Swanson & Pierrehumbert, 1997). The parameter *c* controls how rapidly the fast variables are damped relative to the slow; it may be interpreted as a microphysical parameter controlling relaxation of cloud variables, such as a precipitation efficiency. The parameter *F* controls the strength of the external large-scale forcing and *b* the amplitude of the nonlinear interactions among the fast variables. Following Lorenz (1996), albeit relabeling parameters, we choose *K* = 36, *J* = 10, *h* = 1, and *F* = *c* = *b* = 10, which ensures chaotic dynamics of the system.

The quadratic nonlinearities in this dynamical system resemble advective nonlinearities, for example, in the sense that they conserve the quadratic invariants (“energies”) $\sum_k X_k^2$ and $\sum_{j,k} Y_{j,k}^2$ (Lorenz & Emanuel, 1998). The interaction between the slow and fast variables conserves the “total energy” $\sum_k \bigl( X_k^2 + \sum_j Y_{j,k}^2 \bigr)$. Energies are damped by the linear terms; they are prevented from decaying to zero by the external forcing *F*. Eventually, the system approaches a statistically steady state in which driving by the external forcing *F* balances the linear damping.
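The structure of the model and its energy budget can be checked numerically. The sketch below assumes the two-scale form in which the slow tendency contains the coupling term $-hc\,\bar{Y}_k$ and the fast tendency, after multiplication by $c$, contains $(h/J)X_k$; the function names, the time step, and the random initial state are illustrative choices, not taken from the paper's code.

```python
import numpy as np

def l96_tendencies(X, Y, F=10.0, h=1.0, c=10.0, b=10.0):
    """Tendencies of the two-scale system; X has shape (K,), Y has shape (K, J)."""
    K, J = Y.shape
    Ybar = Y.mean(axis=1)                              # cell mean of the fast variables
    dX = -np.roll(X, 1) * (np.roll(X, 2) - np.roll(X, -1)) - X + F - h * c * Ybar
    y = Y.reshape(-1)                                  # one cyclic chain of length K*J
    adv = -b * np.roll(y, -1) * (np.roll(y, -2) - np.roll(y, 1))
    dY = c * (adv.reshape(K, J) - Y + (h / J) * X[:, None])
    return dX, dY

def rk4_step(X, Y, dt, **params):
    """One classical fourth-order Runge-Kutta step."""
    k1x, k1y = l96_tendencies(X, Y, **params)
    k2x, k2y = l96_tendencies(X + 0.5 * dt * k1x, Y + 0.5 * dt * k1y, **params)
    k3x, k3y = l96_tendencies(X + 0.5 * dt * k2x, Y + 0.5 * dt * k2y, **params)
    k4x, k4y = l96_tendencies(X + dt * k3x, Y + dt * k3y, **params)
    return (X + dt / 6 * (k1x + 2 * k2x + 2 * k3x + k4x),
            Y + dt / 6 * (k1y + 2 * k2y + 2 * k3y + k4y))

rng = np.random.default_rng(1)
K, J = 36, 10
X, Y = rng.normal(size=K), rng.normal(size=(K, J))

# Nonlinear and coupling terms drop out of the total energy budget,
# leaving only linear damping and external forcing (F = c = 10 here).
dX, dY = l96_tendencies(X, Y)
energy_tendency = np.sum(X * dX) + np.sum(Y * dY)
damping_and_forcing = -np.sum(X**2) + 10.0 * np.sum(X) - 10.0 * np.sum(Y**2)

for _ in range(200):                                   # short illustrative integration
    X, Y = rk4_step(X, Y, dt=0.001)
```

Flattening `Y` into a single chain implements the cyclic condition that the last fast variable of one cell neighbors the first fast variable of the next cell.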

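In the statistically steady state, all slow variables $X_k$ are statistically identical, as are all fast variables $Y_{j,k}$, so we can use the generic symbols $X$ and $Y$ in statistics of the variables. Multiplication of 11 by $X_k$, using that all variables $X_k$ are statistically identical, and averaging shows that, in the statistically steady state, second moments of the slow variables satisfy

$$\langle X^2 \rangle_\infty = F \langle X \rangle_\infty - hc\, \langle X \bar{Y} \rangle_\infty, \tag{14}$$

where $\langle \cdot \rangle_\infty$ denotes the average 5 in the limit $T \to \infty$. Similarly, multiplication of 12 by $Y_{j,k}$ and averaging shows that second moments of the fast variables satisfy

$$\langle \overline{Y^2} \rangle_\infty = \frac{h}{J} \langle X \bar{Y} \rangle_\infty,$$

where the overbar again denotes the mean value over $j$. That is, the interaction coefficient $h$ can be determined from estimates of the one-point statistics $\langle \overline{Y^2} \rangle_\infty$ and $\langle X \bar{Y} \rangle_\infty$. Its inverse is proportional to the regression coefficient of the fast variables onto the slow: $h^{-1} \propto \langle X \bar{Y} \rangle_\infty / \langle \overline{Y^2} \rangle_\infty$. So the regression of the fast variables onto the slow can be viewed as providing an “emergent constraint” on the system, insofar as the interaction coefficient $h$ affects the response of the system to perturbations (e.g., in $F$). Estimates of $\langle X \rangle_\infty$ and $\langle X^2 \rangle_\infty$ provide an additional constraint 14 on the parameters $F$ and $c$. Taking mean values of the dynamical equations 11 and 12 would provide further constraints on these parameters, as well as on $b$, in terms of two-point statistics involving shifts in $k$ and $j$, for example, covariances of $X_k$ and $X_{k-1}$.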
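The step from the slow-variable equation to the second-moment constraint can be spelled out as follows. This sketch assumes the two-scale form of the slow dynamics, $dX_k/dt = -X_{k-1}(X_{k-2} - X_{k+1}) - X_k + F - hc\,\bar{Y}_k$, and writes $\langle \cdot \rangle$ for averages in the statistically steady state; both are notational assumptions made here for illustration.

$$
\begin{aligned}
\frac{1}{2} \frac{d}{dt} \langle X_k^2 \rangle
  &= -\langle X_k X_{k-1} X_{k-2} \rangle + \langle X_{k+1} X_k X_{k-1} \rangle
     - \langle X_k^2 \rangle + F \langle X_k \rangle - hc\, \langle X_k \bar{Y}_k \rangle, \\
\langle X_{k+1} X_k X_{k-1} \rangle
  &= \langle X_k X_{k-1} X_{k-2} \rangle
  \quad \text{(all } X_k \text{ statistically identical, so statistics are invariant under shifts in } k\text{)}, \\
0 &= -\langle X^2 \rangle + F \langle X \rangle - hc\, \langle X \bar{Y} \rangle
  \quad \text{(steady state)}.
\end{aligned}
$$

The cubic terms cancel because of the shift invariance, so only the linear damping, the forcing, and the coupling to the fast variables remain in the steady-state balance.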

In what follows, we demonstrate the performance of learning algorithms in a perfect-model setting, first focusing on one-point statistics to show how to learn about parameters in the full dynamical system from them. Subsequently, we use two-point statistics to learn about parameters in a single “grid column” of fast variables only.

### 4.2 Parameter Learning in Perfect-Model Setting

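We consider the dynamical system 11 and 12 with parameters $\boldsymbol{\theta} = (F, h, c, b)$ set to $\tilde{\boldsymbol{\theta}} = (10, 1, 10, 10)$. The role of “observations” in the perfect-model setting is played by data $\tilde{X}$ and $\tilde{Y}$ generated by the dynamical system with parameters $\boldsymbol{\theta}$ set to their “true” values $\tilde{\boldsymbol{\theta}}$. That is, the dynamical system 11 and 12 with parameters $\boldsymbol{\theta}$ stands for the global model $\mathcal{G}$, the observing system map $\mathcal{H}$ is the identity, and the data $\tilde{X}$ and $\tilde{Y}$ generated by the dynamical system with parameters $\tilde{\boldsymbol{\theta}}$ is a surrogate for observations. The parameters $\boldsymbol{\theta}$ of the dynamical system are then learned by matching statistics $\langle \phi \rangle_T$ accumulated over $T = 100$ days (with 1 day denoting the unit time of the system), using discrete sums in place of the time integral in the average 5 and minimizing the “observational” objective function

$$J_o(\boldsymbol{\theta}) = \frac{1}{2} \left\| \langle \boldsymbol{f}(X, Y) \rangle_T - \langle \boldsymbol{f}(\tilde{X}, \tilde{Y}) \rangle_\infty \right\|_{\Sigma}^2.$$

The moment function

$$\boldsymbol{f}(X, Y) = \left( X, \bar{Y}, X^2, X \bar{Y}, \overline{Y^2} \right)^T$$

is evaluated for each of the $K = 36$ indices $k$, giving a vector of length $5K = 180$. The noise covariance matrix $\Sigma$ is chosen to be diagonal, with entries that are proportional to the sample variances of the moments contained in the vector $\boldsymbol{f}$,

$$\Sigma = r^2\, \mathrm{diag}\left( \mathrm{var}(X), \mathrm{var}(\bar{Y}), \mathrm{var}(X^2), \mathrm{var}(X \bar{Y}), \mathrm{var}(\overline{Y^2}) \right). \tag{18}$$

Here $\mathrm{var}(\phi)$ denotes the variance of $\phi$, and $r$ is an empirical parameter indicating the noise level. The variances $\mathrm{var}(\phi)$ and the “true moments” $\langle \boldsymbol{f}(\tilde{X}, \tilde{Y}) \rangle_\infty$ are estimated from a long (46,416 days) control simulation of the dynamical system with the true parameters $\tilde{\boldsymbol{\theta}}$.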
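The construction of the moment vector and the diagonal noise covariance can be sketched in a few lines. The choice of five one-point moments per cell, the scaling of the variances by $r^2$, and the use of random numbers in place of model trajectories are assumptions made for this illustration; the function names are hypothetical.

```python
import numpy as np

def moment_samples(X, Y):
    """Instantaneous one-point moments for each cell k.

    X: (T, K) slow variables; Y: (T, K, J) fast variables.
    Returns an array of shape (T, 5K). The five moments used here
    (X, Ybar, X^2, X*Ybar, mean of Y^2) are an assumed choice.
    """
    Ybar = Y.mean(axis=2)
    Y2bar = (Y**2).mean(axis=2)
    return np.concatenate([X, Ybar, X**2, X * Ybar, Y2bar], axis=1)

def observational_objective(f_sim, f_true, var_f, r):
    """0.5 * sum of squared moment mismatches, each scaled by r^2 * variance."""
    return 0.5 * np.sum((f_sim - f_true) ** 2 / (r**2 * var_f))

# Synthetic placeholder data standing in for model output (not Lorenz-96 trajectories).
rng = np.random.default_rng(2)
T, K, J = 400, 36, 10
X_true, Y_true = rng.normal(size=(T, K)), rng.normal(size=(T, K, J))
X_sim, Y_sim = rng.normal(size=(T, K)) + 0.2, rng.normal(size=(T, K, J))

samples_true = moment_samples(X_true, Y_true)
f_true, var_f = samples_true.mean(axis=0), samples_true.var(axis=0)
f_sim = moment_samples(X_sim, Y_sim).mean(axis=0)

j_low_noise = observational_objective(f_sim, f_true, var_f, r=0.1)
j_high_noise = observational_objective(f_sim, f_true, var_f, r=1.0)
```

Because the noise level enters only through the factor $r^2$, increasing $r$ from 0.1 to 1.0 reduces the objective by a factor of 100 in this sketch, which is how a larger noise level downweights the data relative to a prior.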

As an illustrative example, we use normal priors for $(\theta_1, \theta_2, \theta_4) = (F, h, b)$, with mean values $(\mu_1, \mu_2, \mu_4) = (10, 0, 5)$ and variances $(\sigma_1^2, \sigma_2^2, \sigma_4^2)$. Enforcing positivity of $c$, we use a log-normal prior for $\theta_3 = c$, with a mean value $\mu_3 = 2$ and variance $\sigma_3^2$ for $\log c$ (i.e., a mean value of 7.4 for $c$). We take the parameters a priori to be uncorrelated, so that the prior covariance matrix is diagonal.

Figure 1 (top row) illustrates the resulting posterior PDF through its potential energy (the negative logarithm of the posterior PDF),

$$J_o(\boldsymbol{\theta}) - \sum_i \log p_i(\theta_i),$$

where $p_i(\theta_i)$ is the prior PDF of parameter $\theta_i$. The figure shows the marginal potential energies obtained as one parameter at a time is varied and the objective function $J_o(\boldsymbol{\theta})$ is accumulated by forward integration, while the other parameters are held fixed at their true values. As the noise level $r$ increases, the contribution of the log-likelihood of the data ($-J_o$) is downweighted relative to the prior, the posterior modes shift toward the prior modes, and the posterior is smoothed. Here the objective function $J_o(\boldsymbol{\theta})$ for each parameter setting is accumulated over a long period ($10^4$ days) to minimize sampling variability. However, even with this wide accumulation window, sampling variability remains in some parameter regimes and there noticeably affects $J_o(\boldsymbol{\theta})$. An example is the roughness around $c = 17$, which appears to be caused by metastability on time scales longer than the accumulation window. The roughness could be smoothed by accumulating over periods that are yet longer, or by averaging over an ensemble of initial conditions, but analogous smoothing might be impractical for ESMs. Time-averaged ESM statistics may exhibit similarly rough dependencies on some parameters (e.g., Suzuki et al., 2013; Zhao et al., 2016), although the dependence on other parameters appears to be relatively smooth (e.g., Neelin et al., 2010), perhaps because ESM parameters targeted for tuning are chosen for the smooth dependence of the climate state on them. Roughness of the potential energy landscape can present challenges for learning algorithms, which may get stuck in local minima. Note also the bimodality in $b$, which arises because the one-point statistics we fit cannot easily distinguish prograde wave modes of the system (which propagate toward increasing $k$) from retrograde modes (cf. Lorenz & Emanuel, 1998).

#### 4.2.1 Bayesian Inversion

We use the random-walk Metropolis (RWM) MCMC algorithm (Brooks et al., 2011) for a full Bayesian inversion of parameters in the dynamical system 11 and 12, thereby sampling from the posterior PDF. To reduce burn-in (MCMC spin-up) time, we initialize the algorithm close to the true parameter values with the result of an ensemble Kalman inversion (see below). The RWM algorithm is then run over 2,200 iterations, the first 200 iterations are discarded as burn-in, and the posterior PDF is estimated by binning every other of the remaining 2,000 samples. The objective function for each sample is accumulated over *T* = 100 days, using the end state of the previous forward integration as initial condition for the next one, without discarding any spin-up after a parameter update.

The resulting marginal posterior PDFs do not all peak exactly at the true parameter values, but the true parameter values lie in a region that contains most of the posterior probability mass (Figure 1, middle row). The posterior PDFs indicate the uncertainties inherent in estimating the parameters. The posterior PDF of *c* has the largest spread, in terms of standard deviation normalized by mean, indicating relatively large uncertainty in this parameter. The uncertainty appears to arise from the roughness of the potential energy (Figure 1, top row), which reflects inherent sensitivity of the system response to parameter variability; additional roughness of the posterior PDFs may be caused by sampling variability from finite-time averages (Wang et al., 2014). For all four parameters, the posterior PDFs differ significantly from the priors, demonstrating the information content provided by the synthetic data. Finally, although these results have been obtained with *O*(10^{3}) forward model integrations and objective function evaluations, more objective function evaluations may be required for more complex forward models, such as ESMs.
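For readers unfamiliar with RWM, the following sketch shows the accept/reject logic together with the burn-in and thinning settings described above. It samples a made-up two-dimensional Gaussian potential rather than the Lorenz-96 objective function; the potential, the proposal step size, and the function name are assumptions for illustration only.

```python
import numpy as np

def random_walk_metropolis(potential, theta0, step, n_iter, rng):
    """Sample a PDF proportional to exp(-potential(theta)) with Gaussian random-walk proposals."""
    theta = np.asarray(theta0, dtype=float)
    u = potential(theta)
    chain = np.empty((n_iter, theta.size))
    for i in range(n_iter):
        proposal = theta + step * rng.normal(size=theta.size)
        u_prop = potential(proposal)
        # Accept with probability min(1, exp(u - u_prop)).
        if np.log(rng.uniform()) < u - u_prop:
            theta, u = proposal, u_prop
        chain[i] = theta
    return chain

# Hypothetical toy potential: Gaussian centered at (1, -2) with standard deviation 0.5.
target_mean = np.array([1.0, -2.0])

def potential(theta):
    return 0.5 * np.sum((theta - target_mean) ** 2) / 0.5**2

rng = np.random.default_rng(3)
chain = random_walk_metropolis(potential, theta0=target_mean, step=0.5, n_iter=2200, rng=rng)
samples = chain[200::2]          # discard 200 burn-in iterations, keep every other sample
posterior_mean = samples.mean(axis=0)
```

In the setting of this section, each evaluation of the potential would require a forward integration of the dynamical system, which is what makes the number of iterations the dominant cost.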

#### 4.2.2 Ensemble Kalman Inversion

Ensemble Kalman inversion may be an attractive learning algorithm for ESMs when Bayesian inversion with MCMC is computationally too demanding. To illustrate its performance, we use the algorithm of Iglesias et al. (2013), initializing ensembles of size $M$ with parameters drawn from the prior PDFs. In the analysis step of the Kalman inversion, we perturb the target data by addition of noise with zero mean and variance given by 18, that is, replacing $\langle \boldsymbol{f}(\tilde{X}, \tilde{Y}) \rangle_\infty$ by $\langle \boldsymbol{f}(\tilde{X}, \tilde{Y}) \rangle_\infty + \boldsymbol{\eta}^{(j)}$ with $\boldsymbol{\eta}^{(j)} \sim N(0, \Sigma)$ for each ensemble member $j$. As in the MCMC algorithm, the objective function for each parameter setting is accumulated over $T = 100$ days, without discarding any spin-up after each parameter update. As initial state for the integration of the ensemble, we use a state drawn from the statistically steady state of a simulation with the true parameters.
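A single analysis step with perturbed target data can be sketched as below. This is a generic ensemble Kalman inversion update applied to a made-up linear forward model, under the assumption of Gaussian perturbations; it is not the code used for the experiments reported here, and the matrix `A`, the noise covariance, and the ensemble settings are hypothetical.

```python
import numpy as np

def eki_update(thetas, outputs, target, noise_cov, rng):
    """One ensemble Kalman inversion step with perturbed target data.

    thetas: (M, p) parameter ensemble; outputs: (M, d) forward-model statistics;
    target: (d,) target statistics; noise_cov: (d, d) noise covariance.
    """
    M = thetas.shape[0]
    dtheta = thetas - thetas.mean(axis=0)
    dG = outputs - outputs.mean(axis=0)
    C_tg = dtheta.T @ dG / M                  # parameter-output cross covariance, (p, d)
    C_gg = dG.T @ dG / M                      # output covariance, (d, d)
    noise = rng.multivariate_normal(np.zeros(target.size), noise_cov, size=M)
    # gain = C_tg (C_gg + noise_cov)^{-1}, computed with a linear solve.
    gain = np.linalg.solve(C_gg + noise_cov, C_tg.T).T
    return thetas + (target + noise - outputs) @ gain.T

# Hypothetical linear forward model used only to exercise the update.
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0], [2.0, 1.0]])
theta_true = np.array([1.0, 2.0])
target = A @ theta_true
noise_cov = 0.01**2 * np.eye(5)

rng = np.random.default_rng(4)
thetas = rng.normal(scale=2.0, size=(100, 2))  # ensemble drawn from an assumed prior
for _ in range(5):
    outputs = thetas @ A.T
    thetas = eki_update(thetas, outputs, target, noise_cov, rng)
ensemble_mean = thetas.mean(axis=0)
```

The update needs only forward-model outputs for each ensemble member, not derivatives, and the members can be integrated in parallel.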

Table 1 summarizes the solutions obtained by this ensemble Kalman inversion after
iterations, for different ensemble sizes *M* and noise levels *r*. The ensemble mean of the Kalman inversion provides reasonable parameter estimates. But the ensemble standard deviation does not always provide quantitatively accurate uncertainty information. For example, for low noise levels, the true parameter values often lie more than 2 standard deviations away from the ensemble mean. The ensemble spread also differs quantitatively from the posterior spread in the MCMC simulations. In experiments in which we did not perturb the target data, the smaller ensembles (*M* = 10) occasionally collapsed, with each ensemble member giving the same point estimate of the parameters. In such cases, the ensemble contains no uncertainty information, illustrating potential pitfalls of using ensemble Kalman inversion for uncertainty quantification. However, with the perturbed data and for larger ensembles, the ensemble standard deviation is qualitatively consistent with the posterior PDF estimated by MCMC (Figure 1, middle row). It provides some uncertainty information, especially for higher noise levels, for example, in the sense that the parameter *c* is demonstrably the most uncertain (Table 1 and Figure 2b). Methods such as localization and variance inflation can help with issues related to ensemble collapse and can also be used to improve ensemble statistics more generally (see Law et al., 2015, and references therein). However, systematic principles for their application with the aim of correctly reproducing Bayesian posterior statistics have not been found, and so we have not adopted this approach.

Table 1. Parameters *θ* = (*F*, *h*, *c*, *b*) Obtained by Ensemble Kalman Inversions for Different Ensemble Sizes *M* and Different Noise Levels *r*

| Noise | Mean (*M* = 10) | Mean (*M* = 100) | Std (*M* = 100) |
|---|---|---|---|
| *r* = 0.1 | (9.62, 0.579, 9.37, 2.63) | (9.71, 0.992, 8.70, 9.95) | (0.023, 0.001, 0.104, 0.022) |
| *r* = 0.2 | (9.57, 0.516, 7.90, 3.15) | (9.77, 0.994, 9.07, 10.04) | (0.107, 0.005, 0.524, 0.103) |
| *r* = 0.5 | (9.77, 0.522, 9.29, 5.31) | (9.63, 0.982, 8.34, 9.93) | (0.295, 0.017, 1.477, 0.350) |
| *r* = 1.0 | (9.70, 0.633, 7.68, 6.13) | (9.53, 0.952, 7.97, 9.37) | (0.385, 0.039, 1.964, 0.701) |

The ensemble Kalman inversion typically converges within a few iterations (Figure 2 indicates 5 iterations when *M* = 100). Larger ensembles lead to solutions closer to the truth (Figure 2a). Convergence within five iterations for ensembles of size 10 or 100 implies 50 or 500 objective function evaluations, representing substantial computational savings over the MCMC algorithm with 2,000 objective function evaluations. These computational savings come at the expense of detailed uncertainty information. Where the optimal trade-off lies between computational efficiency, on the one hand, and precision of parameter estimates and uncertainty quantification, on the other hand, remains to be investigated.

### 4.3 Parameter Learning From Fast Dynamics

The fast dynamics 12 with a fixed slow variable $X_k$ and parameters $\boldsymbol{\theta}$ play the role of the single-column model $\mathcal{S}$, which generates data $\boldsymbol{z} = Y$. We choose $k = 1$ and fix $X_1 = 2.556$, a value taken from the statistically steady state of the full dynamics. There are three parameters to learn from the fast dynamics: $(\theta_2, \theta_3, \theta_4) = (h, c, b)$. The one-point statistics of the fast variables are not enough to recover all three. Therefore, we consider the moment function

$$\boldsymbol{g}(Y) = \left( Y_j, Y_j Y_{j'} \right)^T, \qquad 1 \le j \le j' \le J,$$

which contains all first moments and all distinct second moments of the fast variables, giving a vector of length $J + J(J + 1)/2 = 65$. We minimize the “high-resolution” objective function $J_s$ for this moment function $\boldsymbol{g}$, with a noise covariance matrix chosen analogously to the noise covariance matrix 18. The variances of the statistics are estimated from a long control integration of the fast dynamics with fixed $X_1 = 2.556$. Because the fast variables $Y$ evolve more rapidly than the slow variables $X$, we accumulate statistics over a shorter time window than for the full dynamics.
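The count of 65 statistics can be verified with a short sketch that assembles first moments and distinct second moments of the fast variables in one cell. Using raw products rather than covariances, and random numbers in place of a model trajectory, are assumptions made for this illustration; the function name is hypothetical.

```python
import numpy as np

def fast_moment_vector(Y):
    """Y: (T, J) time series of the fast variables in one cell.

    Returns J time means followed by the J(J+1)/2 distinct time-mean products Y_j * Y_j'.
    """
    first = Y.mean(axis=0)
    second = Y.T @ Y / Y.shape[0]                 # (J, J) matrix of time-mean products
    rows, cols = np.triu_indices(Y.shape[1])      # keep each pair (j, j') with j <= j' once
    return np.concatenate([first, second[rows, cols]])

rng = np.random.default_rng(5)
Y = rng.normal(size=(1000, 10))                   # synthetic placeholder time series, J = 10
g = fast_moment_vector(Y)
print(g.size)                                     # 10 + 10 * 11 / 2 = 65
```

The off-diagonal products are two-point statistics in $j$, which is what allows all three parameters to be constrained from the fast dynamics alone.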

Bayesian inversion with RWM, with the same priors and algorithmic settings as before and with noise level
, again gives marginal posterior PDFs with modes close to the truth (Figure 1, bottom row). The posterior PDFs exhibit similar multimodality and reflect similar uncertainties and biases of posterior modes as those obtained from the full dynamics, especially with respect to the relatively large uncertainties in *c* (cf. Figure 1, middle row).

These examples illustrate the potential of learning about parameters from observations and from local high-resolution simulations under selected conditions (here for just one value of the slow variable *X*_{1}). An important question for future investigations is to what extent such results generalize to imperfect parameterization schemes, whose dynamics is usually not identical to the data-generating dynamics, so that structural in addition to parametric uncertainties arise. This issue can be studied for the Lorenz-96 system, for example, by using approximate models as parameterizations of the fast dynamics (e.g., Crommelin & Vanden-Eijnden, 2008; Fatkullin & Vanden-Eijnden, 2004; Wilks, 2005).

## 5 Outlook

Just as weather forecasts have made great strides over the past decades, thanks to improvements in the assimilation of observations (Bauer et al., 2015), climate projections can advance similarly by harnessing observations and modern computational capabilities more systematically. New methods from data assimilation, inverse problems, and machine learning make it possible to integrate observations and targeted high-resolution simulations in an ESM that learns from both and uses both to quantify uncertainties. As an objective of such parameter learning we propose the reduction of biases and exploitation of emergent constraints through the matching of mean values and covariance components between ESMs, observations, and targeted high-resolution simulations.

Coordinated space-based observations of crucial processes in the climate system are now available. For example, more than a decade's worth of coordinated observations of clouds, precipitation, temperature, and humidity with global coverage is available; parameterizations of clouds, convection, and turbulence can learn from them. Or simultaneous measurements of CO_{2} concentrations and photosynthesis are becoming available; parameterizations of terrestrial ecosystems can learn from them. So far, such observations have been primarily used to evaluate models and identify their deficiencies. Their potential to improve models has not yet been harnessed. Additionally, it is feasible to conduct faithful local high-resolution simulations of processes such as the dynamics of clouds or sea ice, which are in principle computable but are too costly to compute globally. Parameterizations can also learn from such high-resolution simulations, either online by nesting them in an ESM or off-line by creating libraries of high-resolution simulations representing different regions and climates to learn from. Such a systematic approach to learning parameterizations from data allows the quantification of uncertainties in parameterizations, which in turn can be used to produce ensembles of climate simulations to quantify the uncertainty in predictions.

The machine learning of parameterizations in our view should be informed by the governing equations of subgrid-scale processes whenever they are known. The governing equations can be systematically coarse-grained, for example, by modeling the joint PDF of the relevant variables as a mixture of Gaussian kernels and generating moment equations for the modeled PDF from the governing equations (cf. Firl & Randall, 2015; Golaz et al., 2002; Guo et al., 2015; Lappen & Randall, 2001a). The closure parameters that necessarily arise in any such coarse graining of nonlinear governing equations can then be learned from a broad range of observations and high-resolution simulations, as parametric or nonparametric functions of ESM state variables (cf. Parish & Duraisamy, 2016). The fineness of the coarse graining (measured by the number of Gaussian kernels in the above example) can adapt to the information available to learn closure parameters. Such equation-informed machine learning will provide a more versatile means of modeling subgrid-scale processes than the traditional approach of fixing closure parameters ad hoc or on the basis of a small sample of observations or high-resolution simulations. Because parameterizations learned within the structure of the known governing equations respect the relevant symmetries and conservation laws to within the closure approximations, they likely have greater out-of-sample predictive power than unstructured parameterization schemes, such as neural networks that are fit to subgrid-scale processes without explicit regard for symmetries and conservation laws (e.g., Krasnopolsky et al., 2013). Out-of-sample predictive power will be crucial if high-resolution simulations performed in selected locations and under selected conditions are to provide information globally and in changed climates. 
However, for noncomputable processes whose governing equations are unknown, like many ecological or biogeochemical processes, more empirical, data-driven parameterization approaches may well be called for.

- We need innovation in learning algorithms. Our relatively simple example showed that parameters in a perfect-model setting can be learned effectively and efficiently by ensemble Kalman inversion. It remains to investigate questions such as the optimal ensemble size in Kalman inversions, how to adapt inversion algorithms to imperfect models, and how to quantify uncertainties. To increase computational efficiency, online filtering algorithms need to be developed that update parameters on the fly as Earth system statistics are being accumulated.
- We need investigations of the best metrics to use when learning parameterization schemes from observations or high-resolution simulations. For example, are least-squares objective functions the best ones to use? Which covariance components or other statistics should be included in the objective functions? There are trade-offs between the number of covariance components that can be estimated from data and the information they can provide about parameterization schemes.
- We need innovation in how learning from observations should interact with learning from targeted high-resolution simulations. How should high-resolution simulations be targeted? Where is the optimum trade-off between the added computational cost of conducting high-resolution simulations and the marginal information about parameterization schemes they provide?
- We need innovation in parameterization schemes themselves, to design them such that they can learn effectively from diverse data sources and can be systematically refined when more information becomes available. It will be important to develop parameterizations that treat subgrid-scale motions (e.g., boundary layer turbulence, shallow convection, and deep convection) in a unified manner, to eliminate artificial spectral gaps that do not exist in nature and to reduce the number of correlated parameters in the schemes (e.g., Guo et al., 2015; Lappen & Randall, 2001a, 2001b; Köhler et al., 2011; Suselj et al., 2013; Park, 2014a, 2014b). Novel approaches that exploit ideas ranging from stochastic parameterization to systematic coarse-graining likely have roles to play here (e.g., Berner et al., 2017; Lucarini et al., 2014; Klein & Majda, 2006; Majda et al., 2003; Majda et al., 2008; Majda, 2012; Palmer & Williams, 2010; Palmer et al., 2005; Wouters et al., 2016; Wouters & Lucarini, 2013). Furthermore, as the resolution of ESMs increases, it will also be necessary to revisit the common practice of modeling subgrid-scale dynamics in grid columns, because the lateral exchange of subgrid-scale information across grid columns will play increasingly important roles.

The time is right to seize the opportunities that the available global observations and our computational resources present. Fundamentally reengineering atmospheric parameterization schemes, such as cloud and boundary layer parameterizations, will become a necessity as atmosphere models, within the next decade, reach horizontal grid spacings of 1–10 km and begin to resolve deep convection (Schneider et al., 2017). At such resolutions, common assumptions made in existing parameterization schemes, such as that clouds and the planetary boundary layer adjust instantaneously to changes in resolved-scale dynamics, will become untenable. Additionally, advances in high-performance computing (e.g., many-core computational architectures based on graphical processing units) will soon require a redesign of the software infrastructure of ESMs (Bretherton et al., 2012; Schulthess, 2015; Schalkwijk et al., 2015). So it is timely now to reengineer ESMs and parameterization schemes and design them from the outset so that they can learn systematically from observations and targeted high-resolution simulations.

Integrating observations and targeted high-resolution simulations in an Earth system modeling framework would have multiple attendant benefits. Solving the inverse problems of learning about parameterizations from observations requires observing system simulators that map model state variables to observables (Figure 3). The same observing system simulators, integrated in an Earth system modeling framework, can be used to answer questions about the value new observations would provide, for example, in terms of reduced uncertainties in ESMs. Addressing such questions in observing system simulation experiments (OSSEs) is increasingly required before the acquisition of new observing systems (e.g., as part of the U.S. Weather Research and Forecasting Innovation Act of 2017). They are naturally answered within the framework we propose.

## Acknowledgments

We gratefully acknowledge financial support by Charles Trimble, by the Office of Naval Research (grant N00014-17-1-2079), and by the President's and Director's Fund of Caltech and the Jet Propulsion Laboratory. We also thank V. Balaji, Michael Keller, Dan McCleese, and John Worden for helpful discussions and comments on drafts, and Momme Hell for preparing Figure 3. The program code used in this paper is available at climate-dynamics.org/publications/. Part of this research was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration.