Continuous Structural Parameterization: A Proposed Method for Representing Different Model Parameterizations Within One Structure Demonstrated for Atmospheric Convection

Continuous structural parameterization (CSP) is a proposed method for approximating different numerical model parameterizations of the same process as functions of the same grid‐scale variables. This allows systematic comparison of parameterizations with each other and observations or resolved simulations of the same process. Using the example of two convection schemes running in the Met Office Unified Model (UM), we show that a CSP is able to capture concisely the broad behavior of the two schemes, and differences between the parameterizations and resolved convection simulated by a high resolution simulation. When the original convection schemes are replaced with their CSP emulators within the UM, basic features of the original model climate and some features of climate change are reproduced, demonstrating that CSP can capture much of the important behavior of the schemes. Our results open the possibility that future work will estimate uncertainty in model projections of climate change from estimates of uncertainty in simulation of the relevant physical processes.


Introduction
Numerical models of weather and climate contain "parameterizations," which are physically motivated but approximate algorithms that represent processes that cannot be simulated explicitly on the model grid. One example is atmospheric convection, which could be represented by the same equations of fluid dynamics and thermodynamics used to simulate larger-scale atmospheric dynamics, but which typically occurs below the grid scale of contemporary climate models and some numerical weather prediction models. Another example is land surface vegetation, for which we do not even know the governing equations. The aim of parameterization is to relate the behavior of interest to resolved processes on the model grid. Parameterizations are derived semiempirically using insights from process understanding, observations, or high-resolution simulations that do capture the relevant processes explicitly but that would be too expensive to run inside a weather or climate model.
A body of literature suggests that parameterizations are the chief cause of differences between predictions of future climate change taken from different climate models (e.g., Geoffroy et al., 2017; Mauritsen et al., 2012; Sherwood et al., 2014; Webb et al., 2013). What is not known, however, is exactly how the parameterizations that we have are different from each other and whether the differences are representative of our uncertainty in the relevant processes. This poses a problem for climate prediction because it is unclear how to translate climate model output into probability distributions of possible future climate change. The difficulty arises partly because different parameterizations of the same process can have different physical bases, meaning that they may be written in terms of different equations and even different variables, and partly because it is not clear how best to write parameterizations in a way that is directly comparable to observations or resolved simulations of the same process.
Previous work has endeavored to address some of these problems. Perturbed physics ensembles (PPEs) are groups of general circulation model (GCM) simulations derived from one base climate model but with their uncertain parameterization parameters perturbed over the ranges of values considered possible by relevant experts (e.g., Murphy et al., 2004; Sanderson, 2011; Sexton et al., 2019). PPEs explore the uncertainty associated with one set of parameterizations systematically because the difference between different ensemble members is unambiguously defined by the differences in their parameters. However, the approach is trapped within one model structure and cannot fully explore the set of plausible parameterizations. Another set of parameterizations can be introduced into the ensemble (e.g., Shiogama et al., 2013), but the ability to define systematic differences is lost.
Meanwhile, the impulse-response method of Kuang (2010) and Herman and Kuang (2013) does allow systematic comparison of parameterizations in a way that is agnostic to their structure by testing the effect of idealized perturbations in the model resolved grid-scale variables on parameterization and then encapsulating those responses in a response matrix. As Arakawa (2004) and Herman and Kuang (2013) stated, one view is that the important question is "What does each scheme actually do [at the gridscale]?" The internal machinery of each parameterization is secondary. This is particularly true where different parameterizations have different physical motivations, because mechanistic comparison of the internal workings of each parameterization may not then be possible. Further, because the impulse-response method is written as a function of the resolved variables only, it is possible in principle to do the same analysis for high-resolution simulations or observations of the same process, as Herman and Kuang (2013) demonstrated for atmospheric convection.
The derived response matrices must also be put in the GCM in place of the original parameterization, as was done by Kelly et al. (2017) and Mapes et al. (2019) for the impulse-response method. This is necessary if we are to demonstrate that the matrix representation captures the essence of the parameterization relevant to modeling. We can then test the effects of multiple structurally distinct schemes using one parameterization code and define and explore the unknown parameter space between them in a GCM. If the matrix representation was sufficiently accurate, then the extent to which a particular parameterized process is responsible for intermodel differences when all other model components remain the same could be determined without the expensive overhead of having to port a range of structurally different parameterizations to one GCM. As with PPEs, the systematic differences between model versions would be known, and it would be possible to determine quantitatively how available parameterizations differ from one another and how well they sample the possible "structural" parameter space defined by the response matrices compared with observations or high-resolution simulations. If differences in GCM simulation of some aspect of climate change were strongly controlled by parameterization of one or more processes, then oversampling or undersampling of regions of the relevant parameter space could be taken into account when providing projections of future climate change. This would be an alternative to rewarding each GCM in the ensemble with one vote, as is frequently done in ensemble studies of climate change (e.g., Collins et al., 2013). The oversampling or undersampling of regions of the structural parameter space could also assist the direction of future model development.
A variety of studies have shown the potential for "machine learning" techniques to represent complex atmospheric processes and replace traditional parameterizations running within a GCM. Krasnopolsky (2010) used a neural network to replace the radiation parameterization within the Community Atmosphere Model (CAM). Errors were comparable with the GCM's natural internal variability for a fully coupled ocean-atmosphere simulation, and the simulation was substantially faster than with the original parameterization. O'Gorman and Dwyer (2018) used a random forest algorithm to parameterize convection in an idealized version of the Geophysical Fluid Dynamics Laboratory model coupled to a slab ocean, trained on data from a conventional convective parameterization. An accurate representation of both the climatological and climate change features of the original GCM containing the conventional parameterization was achieved. Rasp et al. (2018) used a neural network to represent all parameterized processes in the model atmosphere of the superparameterized CAM (SPCAM). Superparameterization means that there is a high-resolution simulation within each GCM gridbox, so the Rasp et al. (2018) neural network was effectively emulating a high-resolution explicit representation of sub-GCM grid-scale processes (although processes are not shared between GCM grid boxes). The neural network parameterization provided an accurate simulation of precipitation, atmospheric heating, and wave structure when compared with SPCAM, and was superior to the conventionally parameterized CAM. Brenowitz and Bretherton (2018) trained a neural network to emulate all subgrid-scale processes in a high-resolution simulation. Using this neural network parameterization in the CAM led to a superior simulation when compared with the conventionally parameterized CAM.
These studies suggest that applying machine learning techniques to parameterization will be useful for improving GCM accuracy and computational speed. Combined with impulse response or other statistical techniques, they can also be useful for understanding how to parameterize processes (O'Gorman & Dwyer, 2018), although direct interpretation of what complex neural networks or random forest techniques are doing remains difficult.
In this paper we describe continuous structural parameterization (CSP), which is a method for writing parameterizations of the same process at a given model resolution as functions of the same grid-scale variables, making parameterizations with distinct structures formally comparable to one another while retaining enough skill to replace the original parameterizations in a GCM. We base our discussion around a candidate CSP for atmospheric convection derived using linear algebra. In some ways the approach is similar to the forward method of Kuang (2010) and Herman and Kuang (2013). Where it differs is in the attempt to achieve efficient descriptions of parameterizations through a set of orthogonal modes most important to GCM simulation. This allows easy analysis of how parameterizations differ from one another or from observations or high-resolution simulations of the same physical process. Orthogonality also allows fitting of our statistical model to output from standard GCM simulations. CSP has four broad goals:
1. Build a statistical emulator that expresses the grid-scale outputs of parameterizations as simple functions of their grid-scale inputs.
2. Provide low-dimensional descriptions of the most important differences between parameterizations and high-resolution simulations or observations using a diagram or other easily interpretable method.
3. Replace original parameterizations with CSP statistical emulators in the GCM to assess the degree to which relevant processes are captured.
4. Test the importance of errors introduced by a given parameterization type in ensembles of models used to predict climate change.
The overall aim is not to replace conventional parameterization nor to improve GCM integration speed but to understand our parameterizations in the context of process knowledge and to provide tools for parameterization development and interpretation of climate model projections. Here we approach Goals 1-3 for convective parameterization using an example CSP, following the earlier work described above and recognizing that convection is believed to be one of the key processes causing model error in current GCMs (Sherwood et al., 2014; Webb et al., 2015). When our CSP emulators are run in a GCM in place of the original parameterizations, we find that basic features of climate and some features of climate change are preserved. Our results are less accurate than those achieved when machine learning techniques are applied, but we retain the ability to explain differences between parameterizations and a high-resolution data set. The remainder of the paper is organized as follows. Section 2 describes the GCM experiments that we use to build and test CSP, section 3 presents our statistical methodology, section 4 presents our results for both parameterized and high-resolution representations of convection, section 5 is a discussion of the implications of our results, and section 6 presents our main conclusions.

Model Experiments
To train and test our statistical emulators, we take data both from coarse simulations run with parameterized convection and from high-resolution, convection-permitting simulations.

UM Simulations
Our coarse simulations with parameterized convection are run using the Global Atmosphere 7.0 configuration of the Met Office Unified Model (UM) (Walters et al., 2019). The UM solves the fully compressible, deep-atmosphere, nonhydrostatic Navier-Stokes equations using a semi-implicit, semi-Lagrangian approach. Parameterizations of atmospheric radiation, boundary layer turbulence, and large-scale and convective cloud and precipitation are included. The model resolution is 2.5° longitude by 2° latitude with 38 vertical levels and a time step of 20 min. Two convection schemes are used in our study: the well-established Gregory-Rowntree (GR) mass-flux scheme of Gregory and Rowntree (1990), with improvements described by Walters et al. (2019), and the Lambert-Lewis (LLCS) simple moist adjustment scheme of the authors' devising described in Appendix A. The statistical emulation of these two schemes and their differences is the basis for our demonstration of CSP. The model atmosphere is coupled to a 2.5 m deep "slab" ocean with thermodynamics but no representation of ocean dynamics (Boutle et al., 2017). The model is free to find its own equilibrium state by bringing top of atmosphere radiative fluxes into balance. This choice provides a stronger test of the accuracy and stability of our emulators than holding surface temperatures constant. It also allows us to test the climate change response when an emulator is used.
A number of simplifications to the simulations were made to ease the process of coding the statistical emulators and to simplify the behavior that needs to be predicted. The UM was run in aquaplanet mode with no continents or sea ice. Planetary obliquity was set to zero, meaning that seasons are only represented through the effect of Earth's orbital eccentricity of 0.0167. The sophisticated prognostic cloud scheme (PC2) and the radiative effect of clouds were switched off to simplify the relationship between GR and grid-scale moisture.
To compensate for the absence of cloud albedo, we increased the surface albedo to 0.3. (Surface albedo is 0.07 for ice-free ocean surfaces in the UM when radiatively active clouds are present.) The UM's targeted diffusion parameterization was switched on, as it was found that very occasional grid point storms occurred when running the LLCS CSP emulator. (Grid point storms are large values of grid-scale precipitation and upward vertical velocity that occur when physically unrealistic resolved convection arises.) Targeted diffusion disperses boundary layer water vapor to adjacent grid boxes when grid-scale vertical velocity crosses a threshold (0.2 m s−1 in our simulations).
We run 10 year control (0.5941 g kg−1 atmospheric CO2) and 4 × CO2 (2.3764 g kg−1 atmospheric CO2) simulations for LLCS and GR, and for the cases where the original convection schemes are replaced by their CSP statistical emulators between latitudes 30°N and 30°S with no convection scheme used poleward of 30° (GREMU and LLCSEMU). (It would be preferable to run the original parameterizations poleward of 30°, but this is technically difficult for GR. A test with LLCS shows that similar results are found for the original parameterization and no parameterization poleward of 30° cases, as shown in Figure S2 in the supporting information.) For the original parameterization GR and LLCS cases, we also run one 30 day simulation for January and one 30 day simulation for July for which values of potential temperature, θ, and specific humidity, q, are output on every model level at every time step directly before and after convection between 30°N and 30°S, allowing us to collect cases that we will use to train the statistical emulators in section 3. These simulations are spun off from 1 January and 1 July of Year 5 of the relevant 10 year simulation. We also run control and 4 × CO2 cases for two perturbed physics setups with the original LLCS parameterization in which the value of the critical relative humidity for initiation of moist convection, r_c, is perturbed from its standard value of 0.8 to 0.7 and 0.9. All the simulations are summarized in Table 1. Comparisons of the GR and LLCS climatology with more complete GCM simulations, including fully coupled atmosphere-ocean GCMs, are shown in Text S1 (Danabasoglu, 2019; Danabasoglu et al., 2019, 2020; Ridley et al., 2018; Webb et al., 2017; Williams et al., 2018).

Cascade High-Resolution Simulations
We use data derived from the 4 km convection-permitting simulations of the Cascade experiment (Holloway et al., 2012). As above, the Cascade simulations are run with the UM, but the 4 km resolution allows the convective parameterization to be switched off so that the explicit dynamics of the model dynamical core represent convection. The expectation is that a much more faithful simulation of convection should be achieved than when a parameterization is used, making Cascade a good tool against which to benchmark parameterizations (e.g., Guichard et al., 2004). Christensen et al. (2018b) produced a coarse-grained version of the 4 km Cascade data to provide forcing data for the European Centre for Medium-Range Weather Forecasts (ECMWF) Integrated Forecasting System (IFS) single-column model (SCM). The SCM was then run forced by the coarse-grained Cascade input data. This is very useful for our study because both the coarse-grained overall tendency of the Cascade data and the dynamical and parameterized tendencies of the IFS SCM were archived by Christensen et al. (2018a), allowing us to construct an emulator of a high-resolution simulation of convection.
We take the coarse-grained overall tendency of the Cascade data from the last 9 days of the simulation (avoiding the spin-up) over a region of the Indian Ocean (54-90°E longitude and 21°S to 4.5°N latitude) and subtract the radiative, boundary layer, and coarse dynamical tendencies of the SCM, obtaining an estimate of the remaining dynamical processes that ought to be represented by a convection scheme. The estimate is not likely to be highly accurate, since Cascade was created using the Met Office UM and the SCM is an ECMWF product. Cascade data are also only archived once per hour, in contrast to the 15 min time steps of the SCM, meaning that the SCM may drift substantially from the Cascade state, as it is reinitialized only once every four time steps. We further average the data in the horizontal from its Christensen et al. (2018b) resolution of 0.3° × 0.3° to as close as possible to the UM grid of 2.5° longitude by 2° latitude without horizontal interpolation, and interpolate in the vertical to the UM grid. Differences across each 15 min time step are rescaled by multiplying by 4/3 to allow comparison with UM data, which are on 20 min time steps. Given these limitations, we analyze these data as a demonstration rather than a definitive investigation of treating high-resolution simulation of convection with CSP.
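The time step rescaling described above is a simple linear factor. A minimal sketch (our own illustration, not code from the study):

```python
import numpy as np

# Sketch of the time step rescaling described above: SCM differences are
# accumulated over 15 min steps, while UM data use 20 min steps, so
# changes are multiplied by 20/15 = 4/3 before comparison.
SCM_DT = 15.0 * 60.0  # SCM time step (s)
UM_DT = 20.0 * 60.0   # UM time step (s)

def rescale_tendency(delta_scm_step: np.ndarray) -> np.ndarray:
    """Rescale a change accumulated over one SCM step to one UM step."""
    return delta_scm_step * (UM_DT / SCM_DT)
```

For example, a 0.3 K change over one 15 min SCM step corresponds to a 0.4 K change over a 20 min UM step.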

Linear Models
In this section the statistical techniques we use to build convection emulators using training data are presented. First, we take n vertical columns of potential temperature, θ, and specific humidity, q, on m model levels and their respective changes due to convection across the time step, Δθ and Δq, from a GCM or the high-resolution simulation. There are other variables that are typically input to and output from convective parameterizations, but θ and q are the most important and are the data we use for this first study. θ and q are converted to components of moist enthalpy, c_pθ and Lq, where c_p is the specific heat capacity of dry air at constant pressure and L is the latent heat of vaporization. Placing the c_pθ and Lq values for each of the m model levels for one horizontal grid point at one time step into a single vector, we form a "case." Combining the n input cases into an n × 2m matrix and subtracting the mean of c_pθ and Lq on each vertical level yields the input matrix X, with elements X_{j,i} = c_pθ_{j,i} − c_pθ̄_i and X_{j,m+i} = Lq_{j,i} − Lq̄_i for level i = 1, …, m, where c_pθ̄_i and Lq̄_i are the means of c_pθ and Lq on the ith level, respectively. The dry and moist components of enthalpy are then of similar sizes, putting them on the same footing for statistical modeling. Similar benefits can be achieved by normalizing each θ and q component by its mean and variance, but using enthalpy units has the convenient property that the sum of c_pΔθ and LΔq over levels within a column is close to zero, as enthalpy is conserved by convection. (It is not exactly zero because we neglect the latent heat of fusion associated with freezing, which is simulated by GR but not LLCS.) We find the matrix of eigenvectors, U (2m × 2m), and their corresponding weights, P (n × 2m), such that X = PU, by taking the singular value decomposition of the covariance matrix X^T X.
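The construction of X, U, and P just described can be sketched in a few lines of numpy (the constants for c_p and L and all variable names are our own assumptions, not the study's code; taking the SVD of the centered data matrix yields the same eigenvectors as the covariance X^T X):

```python
import numpy as np

CP = 1004.0   # specific heat of dry air at constant pressure (J kg-1 K-1), assumed
LV = 2.5e6    # latent heat of vaporization (J kg-1), assumed

def build_inputs(theta: np.ndarray, q: np.ndarray):
    """theta, q: (n_cases, m_levels) arrays. Returns the centered input
    matrix X (n x 2m), the level means, the eigenvector matrix U, and
    the weights P, such that X = P @ U."""
    X = np.hstack([CP * theta, LV * q])   # cases in enthalpy units (n x 2m)
    means = X.mean(axis=0)                # mean of cp*theta and L*q per level
    Xc = X - means
    # Rows of U are orthonormal modes; equivalent to eigenvectors of X^T X.
    _, _, U = np.linalg.svd(Xc, full_matrices=False)
    P = Xc @ U.T                          # weights, so Xc = P @ U
    return Xc, means, U, P
```

The returned U and P satisfy the decomposition X = PU exactly, which is what allows the training data to be discarded once the modes and coefficients are stored.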
Similarly, columns of c_pΔθ and LΔq have their means on each level removed and are combined to form the output matrix, Y (n × 2m), which is written in terms of its eigenvectors, V (2m × 2m), and their corresponding weights, Q (n × 2m), such that Y = QV, by taking the singular value decomposition of Y^T Y. The aim then is to predict unknown values of the output Q, and hence Y, from known values of the inputs P. We predict Q from P rather than Y from X because this avoids correlations between values of θ and q on different vertical levels that could cause large errors in our statistical analysis (e.g., see the discussion of multiple regression in Hastie et al. (2008), Chapter 3). A two-step linear statistical emulator is used that first predicts whether convection is occurring and then, when convection is predicted to occur, predicts Δθ and Δq on model levels. The two-step choice is helpful because convection is a rare event even in the tropical atmosphere: it is difficult to represent large numbers of cases with no or little convection and small numbers of cases with large convection simultaneously using a single linear model. Two similar steps are also used by many convection schemes, including the ones analyzed in this paper.
Whether or not convection occurs is predicted using logistic regression. For the ith case in P, an estimate of the probability that convection will occur is

C_i = 1 / (1 + exp(−P_i β)),     (1)

where β (a 2m component vector) contains the coefficients to be determined, one for each input eigenvector. Nominally, convection is expected when C_i > 0.5, but experience with data can lead us to shift the decision boundary in practice. If it is predicted that convection is occurring, then Q, and hence Δθ and Δq on levels, are predicted from a linear model:

Q = Pγ + ϵ,     (2)

where γ (2m × 2m) contains the coefficients to be determined for each output eigenvector in terms of each input eigenvector and ϵ is the error. Both β and γ are estimated using ridge regression, which is a constrained variant

Journal of Advances in Modeling Earth Systems
of ordinary least squares regression that penalizes large components of β and γ via a tunable coefficient λ.
To estimate β, we maximize the ridge-penalized log-likelihood

∑_i [y_i ln C_i + (1 − y_i) ln(1 − C_i)] − λ β^T β,

where y_i = 1 if convection occurred for the ith training case and y_i = 0 otherwise. The best estimate of γ is

γ̂ = (P^T P + λI)^{−1} P^T Q.

Provided λ is positive and nonzero, the analysis is not very sensitive to its precise value. We use λ = 10 for logistic regression estimates of convective triggering and λ = 2 for linear regression estimates of convective strength throughout. Ridge and related techniques such as the Bayesian lasso are powerful tools for constraining regression parameters when correlations between components of the input weights permit a large range of coefficients. That should not be an issue here because the singular value decomposition almost eliminates correlations in the training data. However, experience with convecting data shows that chance correlations between small features in the input data and significant features in the output data can still lead to very large coefficients and large prediction errors when the statistical model is used to predict outputs for an unseen input data set. The ridge models used avoid these problems because large coefficients are suppressed. More details of the logistic and ridge regression methods are given by, for example, Hastie et al. (2008).
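Ridge regression for γ has the standard closed-form solution (P^T P + λI)^{−1} P^T Q, which can be sketched directly in numpy (an illustration of the general technique under our own naming, not the study's code):

```python
import numpy as np

def ridge_coefficients(P: np.ndarray, Q: np.ndarray, lam: float = 2.0) -> np.ndarray:
    """Closed-form ridge estimate gamma = (P^T P + lam*I)^{-1} P^T Q.
    A positive lam suppresses the very large coefficients that chance
    correlations in the training data can otherwise produce."""
    k = P.shape[1]
    return np.linalg.solve(P.T @ P + lam * np.eye(k), P.T @ Q)
```

As λ grows the coefficients shrink toward zero; as λ → 0 the ordinary least squares solution is recovered.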
Finally, the output matrix is estimated via Ŷ ≃ PγV. The level means of c_pΔθ and LΔq are added to Ŷ, yielding estimates of Δθ and Δq. Hence, the information in the training data is encoded in the eigenvectors U and V, the coefficients β and γ, and the level means of c_pθ, Lq, c_pΔθ, and LΔq. The training data can then be discarded and the statistical models tested against unseen data to assess their accuracy.
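Putting the two steps together, the prediction path for unseen cases can be sketched as follows (a minimal illustration under our own conventions; in particular, returning zero change for nonconvecting cases is our assumption about how a scheme behaves when it does not trigger):

```python
import numpy as np

def predict_convection(P_new, beta, gamma, V, mean_dY, threshold=0.5):
    """Two-step emulator sketch: a logistic trigger followed by a linear
    response. Returns the predicted output Y_hat and the trigger mask."""
    # Step 1: probability of convection for each case (logistic model).
    C = 1.0 / (1.0 + np.exp(-P_new @ beta))
    convecting = C > threshold
    # Step 2: Y_hat = P gamma V plus the level means, convecting cases only;
    # nonconvecting cases are left as zero change (our assumption).
    Y_hat = np.zeros((P_new.shape[0], V.shape[1]))
    Y_hat[convecting] = (P_new[convecting] @ gamma) @ V + mean_dY
    return Y_hat, convecting
```

The decision boundary is left adjustable because, as discussed in the results, a threshold above 0.5 can better balance errors between convecting and nonconvecting cases.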

Model Training and Truncation
The statistical emulators for the LLCS and GR parameterizations are trained for the tropics (30°N to 30°S) using output from the 30 day January and July simulations described in section 2.1. These simulations output several million cases each, of which around 2-3% show appreciable convection. We calculate c_pΔθ_trop, defined as the mean atmospheric heating between 700 and 100 hPa for each case, and then choose for training the cases closest to 30,000 equally spaced values of c_pΔθ_trop from its minimum to its maximum value. This attempts to build an emulator that is equally competent at representing the full range of convective events rather than just the most common ones. The largest events are rare, and therefore each is typically represented on multiple occasions in the training data. A further 30,000 nonconvecting cases (defined as c_pΔθ_trop < 0.05 MJ m−2, equivalent to a temperature change ΔT_trop ≃ 0.1 K day−1, although results are insensitive to the precise choice) are chosen at random. For each convection scheme, we compose control emulators, which take half of their input from control January and half from control July. For standard LLCS and GR, we also compose combined control-4 × CO2 emulators, which take a quarter of their input from each of control January, control July, 4 × CO2 January, and 4 × CO2 July. The combined emulators show only minor differences from their control counterparts but are useful for running emulators online in the GCM and testing the behavior of emulated convection under climate change. The choice of 60,000 cases was made because it is realistic to perform analysis on matrices of this size with available computing resources. Using smaller sample sizes is found to have little effect on accuracy (section 4.1).
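The equally spaced selection of training cases can be sketched as follows (our own implementation of the selection rule described above; as in the text, rare strong events can be selected more than once):

```python
import numpy as np

def select_equally_spaced(heating: np.ndarray, n_select: int) -> np.ndarray:
    """Return indices of the cases nearest to n_select equally spaced
    target values of column heating, from its minimum to its maximum."""
    order = np.argsort(heating)
    sorted_h = heating[order]
    targets = np.linspace(sorted_h[0], sorted_h[-1], n_select)
    # For each target, find its insertion point, then keep the nearer
    # of the two neighboring cases.
    pos = np.searchsorted(sorted_h, targets)
    pos = np.clip(pos, 1, len(sorted_h) - 1)
    left_closer = (targets - sorted_h[pos - 1]) < (sorted_h[pos] - targets)
    pos = pos - left_closer.astype(int)
    return order[pos]
```

Because targets are equally spaced in heating rather than in rank, sparse regions of the distribution (the strongest events) map repeatedly to the same few cases, which is exactly the oversampling of extremes the text describes.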
The input and output eigenvectors U and V are calculated via singular value decomposition from the 30,000 equally spaced samples, and their weights P and Q are calculated for all 60,000 equally spaced and nonconvecting cases for each convection scheme. The γ coefficients in Equation 2 are estimated from the 30,000 equally spaced cases only. Cases from the equally spaced group deemed nonconvecting (c_pΔθ_trop < 0.05 MJ m−2) are then discarded, as are an equal number of nonconvecting cases. This leaves an equal number of convecting and nonconvecting cases from which we estimate the β coefficients in Equation 1. For our simulations, around 85% of cases are retained (see Table S1). (Optimally, a different set of input eigenvectors that also considers the nonconvecting sample would be calculated to remove correlations between components of P when considering nonconvecting data. However, in practice, the very slight benefit of doing this is outweighed by
the tractability of using one set of eigenvectors.) Both β and γ are fitted using the scikit-learn Python package; see Acknowledgments. The fidelity of the emulator is tested using a data set independent of the training data.
There is then the option of truncating the matrices to improve interpretability. The eigenvectors are ordered by the proportion of variance that they represent in the training data, each representing the largest remaining fraction of variance possible after the variance associated with the previous eigenvectors has been removed. In our aquaplanet simulations, it is found that the vast majority of output convective behavior can be described with relatively few eigenvectors. We typically retain two or three for discussion in the results sections and use 10 when the emulators are run online as part of the GCM. Truncating the input space, on the other hand, is difficult because convection is a rare event, and successfully predicting its occurrence and strength relies on retaining small signals in the input data. Hence, instead of truncating, we rotate β and γ back into θ,q on levels by forming the 2m component vector β_θ,q = U^T β and the 2m × 2m matrix γ_θ,q = U^T γ to interpret our results. β_θ,q is the sensitivity of convective triggering to departures of θ,q from their mean values on each level. The columns of γ_θ,q are the sensitivities of each output mode V to departures of θ,q from their mean values, given that convection is occurring. Having identified β_θ,q and γ_θ,q, a new low-dimensional input space that does preserve the input signals necessary to describe convection can be built. We demonstrate this in section 4.2 and show its use for comparing multiple convection schemes. For the Cascade data, around 15% of cases convect, much more frequent than the 2-3% of cases seen to convect in the GCMs. Nevertheless, the small amount of data available forces a change in our experimental design: the emulator is trained on 2,000 cases and then evaluated on the entire data set, including the training data.
The training set contains the majority of deep convecting cases, so our assessment of the ability to emulate Cascade convection should be considered preliminary, as the test data are not substantially independent of the training data.
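The rotation of coefficients back onto model levels is a single matrix product. A short sketch (ours), which also shows why the rotated coefficients are interpretable as level-by-level sensitivities:

```python
import numpy as np

def rotate_to_levels(beta: np.ndarray, gamma: np.ndarray, U: np.ndarray):
    """Rotate coefficients from eigenvector-weight space back to theta,q
    on levels. Since P = X U^T, the triggering score P @ beta equals
    X @ (U^T beta), so U^T beta gives the sensitivity of triggering to
    level-by-level anomalies in the raw inputs."""
    return U.T @ beta, U.T @ gamma
```

The identity P β = X (U^T β) is what makes β_θ,q directly readable as a profile of sensitivities in θ,q space, with no truncation of the input modes required.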

Emulation of LLCS, GR, and Cascade Control Data
This section presents results for statistical models fitted to the first 10 components of the output modes, V, for each representation of convection. The most important modes of response for the control CO2 runs LLCS CON, GR CON, and Cascade are depicted in Figure 1. Shown are the mean convective responses when convection is occurring (defined as c_pΔθ_trop > 0.05 MJ m−2, equivalent to a temperature change ΔT_trop ≃ 0.1 K day−1) (Figures 1a-1c) and the first and second eigenvectors, V_1 and V_2, which describe how convecting cases vary across the training data (Figures 1d-1i). Physically, the mean responses and V_1 are identified with deep convection. For LLCS, positive V_2 is associated with stronger convection and more heating higher in the troposphere. For GR and Cascade, positive V_2 is associated with shallow convection. The first two components of V account for 75% of the variance in the convecting training data for LLCS, 79% for GR, and 93% for Cascade. In units of enthalpy change, the combined sum over vertical levels of the dry and moist components of V is near zero for LLCS and GR, meaning that enthalpy is almost conserved by the convection schemes, as expected. Agreement is less good for Cascade, which is unsurprising given that instances of convection are estimated rather than exactly known. Figures 2a-2c show the corresponding mean input associated with the mean convecting case for each model control run. Figures 2d-2i show the rotated γ_θ,q,1 and γ_θ,q,2, which are the variations from the mean input necessary to achieve variations of size V_1 and V_2 from the mean output. Also shown are the ranges of responses for 1,000 subsamples of the training data, where 10,000 cases are chosen at random without replacement and γ is recalculated (1,000 subsamples of 1,000 cases for Cascade). Evidently, our calculations are likely to be affected by sampling errors, especially near the surface and especially for Cascade.
Results for β θ,q , which control convective triggering, are similar to γ θ,q,1 (in other words deep convection) in each case, so we omit them for brevity.
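As an illustrative sketch of the decomposition underlying these modes, the following toy example (all data, sizes, and names are hypothetical, not the UM training data) extracts output eigenvectors V by singular value decomposition, computes the variance captured by the first two modes, and estimates sampling uncertainty by recomputing the leading mode for random subsamples, as done for Figures 1 and 2:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the convecting training data: n cases of convective
# response stacked over vertical levels (shapes and names hypothetical).
n_cases, n_levels = 2000, 20
levels = np.linspace(0.0, 1.0, n_levels)
deep = np.sin(np.pi * levels)          # deep-convective shape
shallow = np.exp(-8.0 * levels)        # shallow shape
Y = (rng.normal(1.0, 0.3, (n_cases, 1)) * deep
     + rng.normal(0.0, 0.2, (n_cases, 1)) * shallow
     + rng.normal(0.0, 0.05, (n_cases, n_levels)))

# Subtract the mean convecting response, then take the SVD; the rows of Vt
# are the output eigenvectors V_1, V_2, ... described in the text.
Y_anom = Y - Y.mean(axis=0)
_, s, Vt = np.linalg.svd(Y_anom, full_matrices=False)

# Fraction of variance captured by the first two modes (cf. 75-93% above).
var_frac = (s[:2] ** 2).sum() / (s ** 2).sum()

# Sampling uncertainty: recompute V_1 for subsamples without replacement.
v1_samples = []
for _ in range(100):
    idx = rng.choice(n_cases, size=500, replace=False)
    _, _, Vt_sub = np.linalg.svd(Y_anom[idx], full_matrices=False)
    v1 = Vt_sub[0] * np.sign(Vt_sub[0] @ Vt[0])  # fix the arbitrary sign
    v1_samples.append(v1)
spread = np.ptp(np.array(v1_samples), axis=0)    # per-level range of V_1
```

In this toy setting the two imposed modes dominate the variance, and the per-level spread across subsamples plays the role of the shaded sampling ranges in Figure 2.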
Journal of Advances in Modeling Earth Systems
LAMBERT ET AL.
10.1029/2020MS002085

Taking Figures 1 and 2 together, we can identify clear differences between the convection that occurs in the different data sets. Analyzing the mean and deep convective components, V 1 , it is plain that LLCS consumes far too much boundary layer moisture (Figures 1a and 1d) compared with GR (Figures 1b and 1e). LLCS convection occurs when the atmosphere is cooler and drier than GR (Figures 2a-2c) and strengthens as the surface layer becomes wetter and warmer than the layers aloft (Figure 2d). This is in contrast to GR, where deep convection also relies on a warm atmospheric boundary layer (Figure 2e). It is more difficult to make similar arguments using the Cascade data set, perhaps due to the small sample size and limitations of the input data (section 2.2). However, the convection realized is similar to GR, if apparently weaker, although this may be due to the temporal and spatial averaging undertaken (Figures 1c, 1f, and 1i).
We now test the ability of our statistical models to reproduce convection simulated in independent data sets not used for fitting. (Due to the small amount of data available, results for Cascade include the training data, so these results should be treated as preliminary.) Summary statistics for C = 0.6 are shown in Table 2. Agreement is quite good, especially for GR. It is also the case that predicting both CON and 4 × CO 2 cases using one emulator does not substantially degrade performance for either LLCS or GR. At first sight, percentage results are particularly encouraging for nonconvecting cases. However, because nonconvecting cases are by far the majority of all cases, the number of cases that would be incorrectly predicted to convect is high. This could pose problems when using the emulator online in a GCM. The proportion of nonconvecting cases correctly predicted can be increased by increasing the value of C. However, this increases the number of convecting cases that are incorrectly predicted to be nonconvecting. Experience shows that C = 0.6 provides a balance between the convecting and nonconvecting prediction errors that gives reasonable results when run online in the GCM. Still, it is necessary to further reduce the number of convecting cases online, which we do by allowing only a fraction of diagnosed cases to convect (section 4.3).

Figure 3 shows the performance of the emulator in predicting ΔT trop . Values of R 2 for convecting cases are given in Table 2. Predictions for LLCS are the most accurate, followed by GR. Predictions for Cascade are weaker and show poor R 2 . Using the subsampled estimates of γ has almost no effect on R 2 for CON and 4 × CO 2 fits: subsampled LLCS fits show 0.65 < R 2 < 0.67, and GR fits show 0.49 < R 2 < 0.50. The effects of statistical parameter choices, including the impact of sampling error on distributions of actual and emulated convection, are shown in Text S2.
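The trade-off controlled by the triggering threshold C can be illustrated with a toy sketch (hypothetical data and names, not the UM emulator): raising the threshold rejects more nonconvecting cases at the cost of missing more convecting ones, which is the compromise behind C = 0.6:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy triggering problem: convecting cases are a minority, as in the data.
n = 20000
x = rng.normal(0.0, 1.0, n)                      # a single toy predictor
p_true = 1.0 / (1.0 + np.exp(-(3.0 * x - 4.0)))  # rare-event probability
convecting = rng.random(n) < p_true

# Stand-in for the emulated triggering probability beta: here just the
# truth plus noise, since only the thresholding behavior matters.
beta = np.clip(p_true + rng.normal(0.0, 0.05, n), 0.0, 1.0)

def rates(C):
    """Fractions of convecting / nonconvecting cases correctly predicted
    when convection is triggered wherever beta > C."""
    pred = beta > C
    hit = (pred & convecting).sum() / max(convecting.sum(), 1)
    correct_reject = (~pred & ~convecting).sum() / max((~convecting).sum(), 1)
    return hit, correct_reject

hit_lo, rej_lo = rates(0.3)   # permissive threshold
hit_hi, rej_hi = rates(0.9)   # strict threshold
# Raising C improves rejection of nonconvecting cases but misses more
# convecting cases, motivating an intermediate value such as C = 0.6.
```

Because the set of cases triggered at a strict threshold is a subset of those triggered at a permissive one, the two error rates move in opposite directions by construction.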

Joint Analysis of LLCS, GR, and Cascade Control Data
This subsection presents LLCS, perturbed physics LLCS, GR, and Cascade emulators built in terms of a common set of input, U C , and output, V C , eigenvectors, allowing direct comparison of the values of β and γ that determine the convective response to a given input. We build our joint input and output spaces from the combined LLCS CON and GR CON training data. (We use control data because those are the only data available for the LLCS perturbed physics versions we will consider.) Common output eigenvectors, V C , and their corresponding weights, Q C , are derived from the singular value decomposition of 60,000 equally spaced cases taken from the relevant January and July training runs. This is sufficient to capture the dominant behavior of convection in a few modes, as with the individual decompositions in the previous subsection. Because a large number of input modes is important to convection in each data set, we derive the common input modes in a slightly different way in order to obtain a small, tractable set. First, common means on each vertical level, c p θ i and Lq i , are calculated for and subtracted from all the input data, so that the mean differences between the LLCS and GR inputs are preserved. We form Xγ θ,q,1-3 , where γ θ,q,1-3 are the first three regression coefficients linking anomalies in θ,q inputs X to anomalies in outputs Y. New θ,q input data sets containing only those data determined to be linked to convection are then written Xγ θ,q,1-3 γ T θ,q,1-3 . Finally, concatenating the 30,000 LLCS and 30,000 GR equally spaced training cases, we apply singular value decomposition one more time and arrive at combined input eigenvectors, U C .

Figure 4 shows the most important first four components of U C and the first two components of V C . U C,1 reflects the warmer, moister atmosphere found when convection is occurring in GR compared with LLCS, which is apparent because a common mean is used for the inputs. U C,2 is a mode with a warm boundary layer and moist near surface, U C,3 is a very moist surface mode, and U C,4 is difficult to interpret physically but has strongly anticorrelated dry and moist components. The first output, V C,1 , is a deep convective mode similar to that seen in the individual GR decomposition; the second output, V C,2 , describes large near-surface drying similar to that seen for deep convection in LLCS. In the training data, V C,1 accounts for 21% of the output variance of LLCS, 63% in GR, and 84% in Cascade; V C,2 accounts for 51% of the output variance of LLCS, 5% in GR, and 1% in Cascade.

Table 2 Note. For LLCS and GR, results are given for emulators of the control (CON) simulations and for emulators of the combined control and 4 × CO 2 simulations. The "convecting" and "nonconvecting" columns are the percentages and numbers of cases correctly identified as convecting and not convecting, respectively, in the simulations. R 2 is the coefficient of determination for c p Δθ trop for all convecting cases (including those labeled as nonconvecting by the emulator).

Figure 4. (a-d) The first four joint input eigenvectors for the LLCS CON and GR CON data sets. As in Figure 2, dry temperature components are red and moist components are blue, in units of K. (e, f) The first two joint output eigenvectors. As in Figure 1, the red lines are the effective convective heating rate, Q1, and the blue lines are the effective convective drying rate, Q2, in units of K day −1 .
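A minimal sketch of this construction, with toy data standing in for the LLCS and GR training sets (all sizes and names hypothetical): subtract a common mean so mean differences between schemes survive, retain only the input variation linked to the outputs via Xγγ T , and take a final singular value decomposition to obtain the common input eigenvectors:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy inputs X and outputs Y for two "schemes" (names hypothetical).
n, p, q = 3000, 12, 6
def make_scheme(shift):
    X = rng.normal(0.0, 1.0, (n, p)) + shift
    # Only the first three inputs influence the outputs in this toy setup.
    W = rng.normal(0.0, 1.0, (p, q)) * (np.arange(p)[:, None] < 3)
    Y = X @ W + rng.normal(0.0, 0.1, (n, q))
    return X, Y

X_a, Y_a = make_scheme(0.0)
X_b, Y_b = make_scheme(0.5)

# Subtract a COMMON mean so mean differences between schemes are preserved.
X_all = np.vstack([X_a, X_b])
X_all = X_all - X_all.mean(axis=0)
Y_all = np.vstack([Y_a, Y_b])

# Regression coefficients gamma linking input anomalies to output anomalies,
# keeping the first three columns (cf. gamma_{theta,q,1-3} in the text).
gamma, *_ = np.linalg.lstsq(X_all, Y_all - Y_all.mean(axis=0), rcond=None)
gamma3 = gamma[:, :3]

# Keep only input variation linked to convection, X gamma gamma^T, then one
# more SVD gives the combined input eigenvectors U_C.
X_proj = X_all @ gamma3 @ gamma3.T
_, s, UCt = np.linalg.svd(X_proj, full_matrices=False)
U_C = UCt.T  # columns are the common input eigenvectors
```

Because the projected data have rank three by construction, only the first three singular values are nonzero, which is how the procedure yields a small, tractable set of common modes.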
Statistical models that describe V C in terms of U C are then composed for all control training data sets, including the LLCS r c = 0.7 and r c = 0.9 cases. First, we estimate values of β and γ for the individual data sets as before using their original input weights, P, to take advantage of their orthogonality, but using the common LLCS-GR outputs, Q C . β and γ are then rotated into the U C basis by taking β C = U C U T β and γ C = U C U T γ. Projecting the statistical models into the truncated rotated basis reduces their fidelity. The proportion of convecting and nonconvecting cases correctly predicted in an independent data set is altered to 26% and 53%, respectively, for LLCS, 84% and 87% for GR, and 39% and 70% for Cascade. R 2 is reduced to 0.62 for LLCS, 0.46 for GR, and 0.01 for Cascade. Hence, the rotated basis retains the ability to predict changes in convective strength in LLCS and GR, presumably because these are the eigenvectors it is built from, but most other predictions are damaged, especially for triggering. Note, however, that the degradation depends on the truncation chosen. Using a larger set of eigenvectors would increase fidelity at the expense of tractability. The choice depends on the application.
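The rotation and its truncation can be sketched as follows (toy orthonormal bases; all names hypothetical). With full bases the rotation γ C = U C U T γ is exactly invertible; truncating U C projects the coefficients onto a subspace and loses fidelity, consistent with the degraded predictions reported above:

```python
import numpy as np

rng = np.random.default_rng(3)

p, k = 10, 3
# QR factorization of random matrices gives orthonormal toy bases.
U, _ = np.linalg.qr(rng.normal(size=(p, p)))    # a scheme's own eigenvectors
U_C, _ = np.linalg.qr(rng.normal(size=(p, p)))  # common eigenvectors
gamma = rng.normal(size=(p, 2))                 # coefficients in the U basis

# Rotation into the common basis, gamma_C = U_C U^T gamma, as in the text.
gamma_C = U_C @ U.T @ gamma

# With full bases the rotation is exactly invertible ...
gamma_back = U @ U_C.T @ gamma_C

# ... but truncating U_C to its leading k columns projects the coefficients
# onto a k-dimensional subspace and discards the remainder.
U_Ck = U_C[:, :k]
coeff_full = U @ gamma                          # coefficients in physical space
coeff_trunc = U_Ck @ (U_Ck.T @ coeff_full)
truncation_error = np.linalg.norm(coeff_full - coeff_trunc)
```

The size of `truncation_error` relative to the coefficient norm plays the role of the fidelity loss discussed in the text, and shrinks as k grows toward p.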
Values of γ C that link the first two input and output eigenvectors are shown in Figures 5a and 5b. The effect on convection of the "warm, moist atmosphere" mode, U C,1 , per unit anomaly is weak, but its standard deviation across the training data is large, and so it plays an important role in increasing the strength of convection in all simulations through V C,1 . We judge this through "importance," which we define as a given component of γ C multiplied by the standard deviation of the relevant component of U C . For all LLCS model variants, increasing U C,1 also reduces V C,2 , reducing boundary layer drying and enhancing drying aloft. Neither GR nor Cascade shows this mode very strongly, and V C,2 is therefore not sensitive to the presence of U C,1 or U C,2 in their input data. Increasing the "warm boundary layer" mode, U C,2 , increases V C,1 in GR but reduces V C,1 in all LLCS versions. The Cascade data are largely insensitive to U C,2 . Components of U C beyond U C,2 have lower importance and contribute less to convection and to intermodel differences. However, both LLCS and GR V C,1 respond positively to the "moist surface" mode, U C,3 (not shown).
The LLCS perturbed physics versions, r c = 0.7 and r c = 0.9, show very similar sensitivities to standard LLCS, so we do not discuss them in detail. However, we note that the zonal mean precipitation produced by LLCS r c = 0.9 is more similar to GR than to standard LLCS (Figure 7b). Changes in zonal mean precipitation under 4 × CO 2 warming are more like standard LLCS, however (Figure 7c). The model simulations make it clear that LLCS can be tuned to reproduce GR zonal mean precipitation satisfactorily. However, the rotated basis shows that the fundamental sensitivities of LLCS to input are little altered by changing r c , and it is therefore not necessarily a surprise that the climate change simulation is not improved.

Figure 6 compares the predicted convective triggering probability, β, with the actual amount of convection realized for 60,000 cases from the independent data sets for LLCS CON and GR CON. Using the same method used to choose the original training data, we select 30,000 convecting cases that represent the range of mean tropospheric heating and 30,000 nonconvecting cases at random. (Using the entire data set swamps the parameter space with nonconvecting cases, even where the percentage error in predicting the occurrence of convection is small, because the number of nonconvecting cases is so large; Table 2.) The two-dimensional slices show which parts of the U C parameter space defined by the corresponding weights P C are expected to experience convection. The black idealized contours are values of β when varying components of P C within the relevant plane but holding others at mean values. The blue contours are for the independent data set when all components of P C are allowed to vary. The blue and black contours are not coincident because components of P C are correlated, which occurs because the relevant singular value decomposition was done for the combined LLCS-GR training data set and not for LLCS or GR individually.

Figure 6. Convective triggering predictions compared with true simulated tropospheric warming as a function of P C for 60,000 cases (including 30,000 convecting) from the independent data sets. Planes in the P C parameter space for (top) P C,1,2 and (bottom) P C,3,4 for both (left) LLCS CON and (right) GR CON. Black contours are the predicted probability of triggering convection, C, when varying components of P C in the plane but holding others at mean values. C = 0.6 is the threshold used for triggering convection in our UM simulations. The 0.9 contour is also shown to indicate which side of the 0.6 contour is expected to trigger. The blue 0.6 contours are predictions of β where all components of P C are allowed to vary. Red contours and accompanying shading are values of simulated ΔT trop in K day −1 . In our analysis the threshold for convection is ∼10.1 K day −1 . For a perfect prediction of convective triggering, the blue contour would overlay the red contour.
Results for the triggering and strength of convection are complementary within the U C,1,2 plane. P C,1 varies strongly across both the LLCS and GR data sets. More positive values of P C,1 are associated with more triggering of convection and stronger convection in GR. In LLCS, stronger convection is also associated with more positive P C,1 , but its effect on triggering is apparently small (black contours) and confounded by correlations with other components of P C in practice (blue contours). As with the strength of convection, the effect of P C,2 on convective triggering is markedly opposite in GR and LLCS: More positive values of P C,2 trigger convection in GR but suppress it in LLCS. GR responds positively to increases in both P C,3 and P C,4 . The response of LLCS is less clear. The central region of the P C,3,4 plane is convecting (red contours), but this is not expected purely from varying P C,3 and P C,4 (black contours); correlations with other components are required.

Overall, our joint analysis shows clear differences between the convection schemes that can be understood in simple terms. Compared with GR, LLCS condenses too much boundary layer moisture, is relatively insensitive to an important mode of warm, moist free-atmosphere variation, and has the wrong sign of response to boundary layer warming. This suggests pathways via which LLCS might be improved: adjusting the scheme's ability to bring unsaturated parcels from the boundary layer into moist convection aloft could reduce boundary layer moisture consumption, and a simple representation of entrainment could improve interaction with the free atmosphere. It is interesting to note that values of γ C for Cascade have some similarity with those for GR, but this must be treated with caution, given that the rotated model describes the Cascade data poorly. The results of this section are largely clear from the individual analyses of section 4.1.
However, our example demonstrates that a low-dimensional parameter space can express key differences between a large number of representations of convection concisely. Together, sections 4.1 and 4.2 approach Goals 1 and 2 of CSP, described in section 1. In the next section, we approach Goal 3 by running the statistical emulators within the GCM to assess the extent to which they capture relevant physical processes.

UM Simulations With Emulated Convection
The 10 year control and 4 × CO 2 UM simulations with emulated convection were run for GR and LLCS r c = 0.8 using the combined control-4 × CO 2 emulators (LLCSEMU CON and 4 × CO 2 , and GREMU CON and 4 × CO 2 ; Table 1). To reduce the bias caused by incorrectly diagnosing nonconvecting cases as convecting when running the emulators online, we reduce the number of convecting cases by allowing only a fraction of diagnosed cases to convect. First, the ratio, F, of predicted to actual mean tropospheric heating due to convection, c p Δθ trop , is calculated for the independent data set. Then, for each case diagnosed as convecting online, N crit = 1 − 1/F is compared with a random number, N, drawn from a uniform distribution such that 0 ≤ N ≤ 1. If N > N crit , convection is allowed to trigger; if N ≤ N crit , then convection does not trigger. For LLCS, N crit = 0.86; for GR, N crit = 0.59, reflecting the larger error in diagnosing triggering for LLCS (Table 2).
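A sketch of this thinning step (function and variable names hypothetical): a case diagnosed as convecting is allowed to trigger with probability 1 − N crit = 1/F, so only a fraction 1/F of diagnosed cases convect on average:

```python
import numpy as np

rng = np.random.default_rng(4)

def allow_convection(n_diagnosed, F, rng):
    """For each diagnosed case, draw N uniform on [0, 1) and allow
    convection when N > N_crit = 1 - 1/F (True -> convection triggers)."""
    N_crit = 1.0 - 1.0 / F
    N = rng.random(n_diagnosed)
    return N > N_crit

# For LLCS the text quotes N_crit = 0.86 (so F ~ 7); for GR, N_crit = 0.59.
F_llcs = 1.0 / (1.0 - 0.86)
triggered = allow_convection(100000, F_llcs, rng)
frac = triggered.mean()   # close to 1/F = 0.14 for many draws
```

With many diagnosed cases, the realized triggering fraction converges to 1/F, which is the correction needed to offset the over-diagnosis of convection offline.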
All simulations have a stable equilibrium climate and reproduce broad features of the original parameterized runs (LLCS CON and 4 × CO 2 , and GR CON and 4 × CO 2 ) with reasonable fidelity. Figure 7a shows values of global and tropical mean precipitation and temperature in the original and emulator parameterization simulations. Control emulator simulations are biased with respect to the corresponding original simulations in the global mean by 1.6 K and 0.1 mm day −1 for LLCSEMU, and −0.1 K and −0.1 mm day −1 for GREMU. Tropical mean biases are 1.2 K and 0.4 mm day −1 for LLCSEMU and −0.04 K and 0.1 mm day −1 for GREMU. The 4 × CO 2 -control climate change is quite well simulated in LLCSEMU. Climate change is more disappointing for GREMU, particularly in the tropics, where precipitation increases at only 0.7% K −1 tropical mean temperature change compared with original GR values of 2.3% K −1 .
More detailed precipitation statistics are shown in Figures 7b-7e. Zonal mean precipitation in the LLCSEMU and GREMU control runs is quite reasonable and clearly captures the difference between LLCS and GR (Figure 7b). The 4 × CO 2 -control zonal mean changes are fair for LLCSEMU but disappointing for GREMU (Figure 7c). The sharp features at 30°N and 30°S in panels b and c for LLCSEMU and GREMU occur because the convection emulator is switched off poleward of 30°. Results for convective precipitation only are very similar (not shown). Figures 7d and 7e are histograms of grid box total daily precipitation for July in Year 5 of the simulations and the 4 × CO 2 -control changes. LLCSEMU totals are satisfactory, while GREMU tends to predict too many heavy precipitation events and too few light precipitation events. The 4 × CO 2 -control changes show the correct sense of change for both LLCSEMU and GREMU: more light events tend to occur in LLCS, while heavier events increase at the expense of lighter events in GR. The emulated changes tend to be too weak for both LLCSEMU and GREMU, however, particularly for lighter events.
Overall, the online LLCSEMU and GREMU results are encouraging. The model is stable, and the equilibrium climate is close to LLCS and GR, although LLCSEMU rains too much in the subtropics and GREMU has too many heavy precipitation days. Climate change simulations are reasonable, although changes in zonal mean precipitation in GR are disappointing and changes in daily precipitation totals are too weak in both models.

Discussion
Our analysis achieves each of Goals 1-3 set out for CSP in section 1 to at least some degree. We have demonstrated that statistical emulators of two GCM convection schemes and a high-resolution data set can have skill in predicting the onset and magnitude of atmospheric convection. The representation is quite approximate but could surely be improved. To form a CSP, a framework need only provide a structure that represents a group of parameterizations and the differences between them smoothly and unambiguously by providing as much orthogonality between modes as possible. A straightforward improvement to our CSP would be to introduce higher order and cross terms into the regression calculations using discrete orthogonal polynomials. We could also introduce more variables into the analysis, although we note that past work has found θ and q to be satisfactory for analyzing both model output and observed effects of convection (Johnson et al., 2016; Mapes et al., 2019; Yanai et al., 1973). Another framework entirely is evolutionary genetic programming, which uses Darwinian evolution to produce models from combinations of simple functions (e.g., Makkeasorn et al., 2008).
We also showed that a rotated, reduced input space allows us to describe the most important differences between different representations of convection more easily and might assist in future model development.
Care must be taken in the analysis as the reduced input and output spaces lose skill in predicting aspects of convection. In our demonstration, representation of triggering was particularly affected perhaps because we built an input space based on modes known to control the strength of convection. There is a balance between emulator skill and tractability that is set by the degree of truncation of our input and output spaces. We may compose as many representations as we like, each optimized for a different purpose. A key advantage of our approach over others is that it is possible in principle to define eigenvectors that allow estimation of the relationship between the most important inputs and outputs without contamination from linear correlations between variables. A good basis for many applications might be derived from observations or high-resolution simulations that explicitly resolve convection. We did not attempt this because the Cascade high-resolution data set was small and our ability to represent it was limited. The inaccuracy of our Cascade emulator may stem from a combination of having too few cases to fit to and the fact that different parameterization schemes were used in the original Cascade simulations and the SCM experiments. Alternatively, it may come from fundamental limitations of our technique. Defining the convection that should be parameterized and separating it cleanly from other processes is difficult in resolved simulations. It is even more challenging for observations as explored by Mapes et al. (2019) for the impulse-response method, although their analysis did yield useful conclusions regarding the sensitivity of observed versus resolved simulation of convection to q.
While one major goal for CSP is to develop metrics for model development, another is to develop emulators with sufficient fidelity that they can be run within a GCM. Success in this goal would mean that the emulators reproduce their targets well enough that we might explore the parameter space of possible parameterization schemes online within a GCM. Our GCM simulations that run the LLCS and GR emulators interactively show stable equilibrium climates with broadly similar characteristics to GCMs run with the original parameterizations. This is encouraging. Nevertheless, some aspects of the CSP emulator performance are disappointing, particularly for climate change where emulator simulations tend to respond too weakly. Performance is certainly weaker than that achieved with the random forest technique of O'Gorman and Dwyer (2018). Random forest or other machine learning representations of a range of convection schemes may themselves be analyzed with a linear model, but our complete emulators have an unambiguous relationship with each other and with the results they achieve when applied within a GCM. Hence, further work that improves our emulators would be useful.
Our emulators are deterministic-a given input always leads to the same output. However, a body of recent work suggests that performance can be improved in some cases through making parameterizations "stochastic" by adding noise to the parameterization output (e.g., Christensen, 2019;Lin & Neelin, 2003;Plant & Craig, 2008). It is trivial to introduce this extension to our statistical emulators by perturbing their parameters. When applied to a high-resolution model or observed data set, CSP is also well adapted to discovering the range of outputs that occur for a given set of coarse-grained inputs, potentially providing new routes to building stochastic parameterizations.
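A minimal sketch of this extension (toy coefficients; all names hypothetical): perturbing the regression coefficients on each call turns a deterministic emulator step into a stochastic one whose ensemble scatters around the deterministic output:

```python
import numpy as np

rng = np.random.default_rng(5)

# A deterministic emulator step: project input weights through regression
# coefficients gamma (toy sizes; names hypothetical).
gamma = rng.normal(size=(8, 2))
x = rng.normal(size=8)
deterministic = gamma.T @ x

def stochastic_step(gamma, x, scale, rng):
    """One stochastic call: jitter the coefficients, then apply them.
    The noise scale could be set, e.g., from the subsampling spread of
    gamma estimated in section 4.1 (an assumption of this sketch)."""
    gamma_pert = gamma + rng.normal(0.0, scale, gamma.shape)
    return gamma_pert.T @ x

draws = np.array([stochastic_step(gamma, x, 0.02, rng) for _ in range(200)])
# The ensemble mean sits near the deterministic output; the spread is the
# stochastic component added to the parameterization.
```

Tying the perturbation scale to the sampling uncertainty of the fitted coefficients would make the stochastic spread reflect genuine uncertainty in the emulator rather than an arbitrary tuning choice.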
If a future study is to build emulators good enough to probe the effect of the range of possible convective parameterization on gross features of future climate change, then it needs to engage with clouds and cloud radiative effects. A move to a more realistic land and ocean configuration may not be necessary in the first instance, however, as it has been demonstrated that global mean temperature sensitivity to increased atmospheric CO 2 concentration in comprehensive land-ocean-atmosphere GCMs is well related to that in corresponding aquaplanet simulations (Ringer et al., 2014). Nevertheless, if the full range of regional climate responses to our uncertainty in physical processes is to be explored using CSP, then CSP also must be applied to other parameterizations in the atmosphere, ocean, and land surface.

Conclusion
Using the example of convection, we describe Continuous Structural Parameterization (CSP), which is a method for writing different representations of the same subgrid-scale process as functions of the same grid-scale variables. It is found that CSP can represent two convection schemes implemented within the Met Office Unified Model (UM) with reasonable fidelity. When emulated convection is implemented within the UM, the GCM produces a stable equilibrium climate with features broadly similar to the case where the original convection scheme is used.
Using our CSP, key differences between parameterization schemes can be expressed concisely within a new parameter space that is agnostic to model structure and offers the possibility of comparison with high-resolution models of convection or observations. Here, a CSP representation of a high-resolution data set taken from the Cascade experiment has some success, even though the data set is small and not optimally designed for our purposes. Further CSP development is necessary, and a large high-resolution data set designed specifically for emulation is needed to produce cleaner results. Nevertheless, our work suggests that CSP can assist parameterization development both by indicating realistic areas of the relevant parameter space and by providing parameterization prototypes directly. Our long-term goal is that CSP can assist ensemble prediction of climate change by highlighting how the set of model parameterizations we have relates to our true uncertainty in physical processes.

A2. Convective Adjustment
Once convection is triggered, a preliminary profile is established through convective adjustment. Where dry convection is triggered, θ k + 1 is adjusted so that the preliminary value of θ k + 1 , θ k + 1,p = θ k . Dry convection continues upward, providing that the new value of θ k + 1,p satisfies θ k + 2 < θ k + 1,p . Moisture is mixed upward by setting q k + i,p = q k , where i is the ith level above k.
If moist convection is triggered on level k, then levels above k involved in the convective event are adjusted to the moist pseudoadiabat, following the pseudoadiabatic lapse rate

Γ s = g(1 + Lr v /(R d T)) / (c p + ηL²r v /(R d T²)),

where g is the gravitational acceleration, r v is the mass-mixing ratio of water vapor, L is the latent heat of vaporization of water vapor, R d is the gas constant for dry air, and η ≃ 0.622 is the ratio of the dry air and water vapor gas constants. Preliminary q is set to its saturation value, q k = q sat,k , on each level on which moist convection is occurring, including the bottom level, unless q k + i,p > q sat,k + i,p . Similar to dry convection, moist convection continues upward if the new value of θ k + i,p derived from the pseudoadiabat satisfies θ k + i + 1 < θ k + i,p .
If the dry or moist convective event terminates below the highest model level, then subsequent levels are tested to determine whether another event can trigger in the same vertical column. Note that LLCS does not consider the freezing level and assumes that all condensation and precipitation is liquid.
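The upward sweep of the dry adjustment can be sketched as follows (a simplified toy implementation, not the LLCS code; moist adjustment and the pseudoadiabat are omitted, and all names are hypothetical):

```python
import numpy as np

def dry_adjust(theta, q, k):
    """Dry-adjustment sweep from triggering level k: set the preliminary
    theta on each successive level to the parcel value while the level
    above remains cooler than the adjusted parcel, mixing q up from the
    source level. Returns preliminary profiles and the top event level."""
    theta_p = theta.copy()
    q_p = q.copy()
    top = len(theta) - 1
    j = k
    while j < top and theta_p[j + 1] < theta_p[j]:
        theta_p[j + 1] = theta_p[j]   # preliminary theta_{j+1,p} = theta_j
        q_p[j + 1] = q_p[k]           # moisture mixed up from level k
        j += 1
    return theta_p, q_p, j

# Toy profile, unstable near the surface (values in K and kg/kg).
theta = np.array([302.0, 300.0, 299.0, 301.0, 303.0])
q = np.array([0.014, 0.010, 0.008, 0.006, 0.004])
theta_p, q_p, top = dry_adjust(theta, q, k=0)
```

After the sweep terminates below the highest model level, the levels above `top` would be tested for a further event in the same column, as described above.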

A3. Relaxation Time Scale and Conservation
Recognizing that evolution to a new stable profile is not instantaneous, the original input θ and q are relaxed toward the preliminary values, θ p and q p , via

ξ r = ξ + (t step /τ)(ξ p − ξ),

where ξ represents the original input value (θ or q), ξ p represents either θ p or q p , ξ r represents either θ r or q r , subscript r corresponds to values after the relaxation time scale has been applied, t step is the GCM time step (1,200 s in our experiments), and τ is a relaxation time scale, a free parameter of the scheme. The standard value used in our simulations is the "pure mixing time scale" of 3,600 s of Tompkins and Craig (1998).
Moisture and enthalpy are then conserved within each separate convective event in the column. First, moisture is adjusted so that the total mass of water vapor within each convective event, M q,r , is the same as in the input, M q , less the amount of water condensed to produce latent heating, M L , by adjusting specific humidity via

q f = q r (M q − M L )/M q,r ,

where subscript f refers to final calculated values. M L is output by the scheme as precipitation at the surface, thus conserving the moist component of enthalpy. This is done on all convecting levels of a given event, including dry convection below the level at which condensation first occurs. Hence, the scheme has a tendency to eliminate large amounts of boundary layer moisture, producing behavior that arguably should be simulated via the UM boundary layer scheme. This feature may be revised in future versions but is probably useful for suppressing the occurrence of grid point storms.
Dry enthalpy must be conserved to take account of heat added to the column during dry adjustment. As for moist enthalpy, this includes all levels of convective events that begin as dry adjustments and then trigger moist events above the bottom level. For each level, the implied dry heating is written ΔQ d = M k c p (T d,k − T k ), where M k is the total mass of the level, T k is the initial temperature, and T d,k is the implied temperature if latent heating is neglected (equal to the entire convective adjustment for events with no moist component).
The final temperature change, ΔT f , is calculated by subtracting ΔQ d from the relaxation value, ΔT r , uniformly per unit mass:

ΔT f,k = ΔT r,k − Σ j ΔQ d,j /(c p Σ j M j ),

where the sums run over all levels of the convective event. Final output θ f is calculated via

θ f = T f (p 0 /p) κ ,

where p 0 = 1,000 hPa and κ = R d /c p .
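The relaxation and moisture-conservation steps can be sketched as follows (a toy implementation assuming linear relaxation over one time step; names and profile values are hypothetical):

```python
import numpy as np

c_p = 1004.0            # J kg-1 K-1, specific heat of dry air
t_step, tau = 1200.0, 3600.0   # GCM time step and relaxation time scale (s)

def relax(xi, xi_p):
    """Relax the input profile xi toward the preliminary profile xi_p
    over one time step (linear form assumed in this sketch)."""
    return xi + (t_step / tau) * (xi_p - xi)

def conserve_moisture(q, q_r, M, M_L):
    """Rescale the relaxed humidity so the total vapor mass in the event
    equals the input mass M_q minus the condensed mass M_L (kg), given
    level masses M (kg)."""
    M_q = (q * M).sum()
    M_q_r = (q_r * M).sum()
    return q_r * (M_q - M_L) / M_q_r

# Toy three-level convective event.
q = np.array([0.012, 0.009, 0.006])     # input specific humidity (kg/kg)
q_p = np.array([0.010, 0.010, 0.008])   # preliminary adjusted humidity
M = np.array([2.0e3, 2.0e3, 2.0e3])     # level masses (kg)
q_r = relax(q, q_p)
q_f = conserve_moisture(q, q_r, M, M_L=1.0)
# Total vapor after adjustment equals the input total minus M_L.
```

With t step = 1,200 s and τ = 3,600 s, each call moves the profile one third of the way to its preliminary value, while the rescaling guarantees the water budget closes to within rounding.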

Data Availability Statement
The scikit-learn package is available online (from https://scikit-learn.org/). Both the Unified Model simulation data (https://doi.org/10.5281/zenodo.3836042) and the statistical model training data (https://doi.org/10.5281/zenodo.3837627) are available from Zenodo. The coarse-grained Cascade data are available from the Natural Environment Research Council Centre for Environmental Data Analysis (Christensen et al., 2018a).