Potential and Limitations of Machine Learning for Modeling Warm‐Rain Cloud Microphysical Processes

The use of machine learning based on neural networks for cloud microphysical parameterizations is investigated. As an example, we use the warm‐rain formation by collision‐coalescence, that is, the parameterization of autoconversion, accretion, and self‐collection of droplets in a two‐moment framework. Benchmark solutions of the kinetic collection equations are performed using a Monte Carlo superdroplet algorithm. The superdroplet method provides reliable but noisy estimates of the warm‐rain process rates. For each process rate, a neural network is trained using standard machine learning techniques. The resulting models make skillful predictions for the process rates when compared to the testing data. However, when solving the ordinary differential equations, the solutions are not as good as those of an established warm‐rain parameterization. This deficiency can be seen as a limitation of the machine learning methods that are applied, but at the same time, it points toward a fundamental ill‐posedness of the commonly used two‐moment warm‐rain schemes. More advanced machine learning methods that include a notion of time derivatives, therefore, have the potential to overcome these problems.


Introduction
During the last decade, machine learning (ML) has become an essential tool in data science, engineering, medical research, and many other applied sciences. The success of ML techniques can be explained by their general ability to extract (learn) complex nonlinear relationships from data. Hence, ML methods like random forests or deep neural networks (NNs) provide very powerful techniques for many kinds of classification and nonlinear regression problems. Moreover, these methods have been greatly popularized by the availability of easy-to-use Python libraries like TensorFlow and Keras (Abadi et al., 2015; Chollet et al., 2015; Géron, 2019).
In atmospheric science, ML methods are very appealing for nowcasting applications (Sønderby et al., 2020; Xingjian et al., 2015), remote-sensing retrievals (Lary et al., 2016; Maxwell et al., 2018), and many other applications that require a postprocessing of numerical weather prediction (NWP) or climate model output. This area of research is growing rapidly, and for a general overview we refer to recent reviews of ML techniques and their application in climate science (Huntingford et al., 2019; Reichstein et al., 2019) and fluid mechanics (Brunton et al., 2020). A more fundamental application is the parameterization problem, that is, the representation of unresolved subgrid processes, which requires formulating relationships between resolved model variables and those unresolved quantities. This problem might be accessible to ML methods given that appropriate training data are available either from observations or from highly resolved benchmark simulations. Some efforts have been made to use ML to replace convection parameterizations in general circulation models (O'Gorman & Dwyer, 2018; Rasp et al., 2018) or even to describe all subgrid processes in the atmosphere (Brenowitz & Bretherton, 2018, 2019; Yuval & O'Gorman, 2020). The results are promising, but not without challenges, which are especially related to the generalization of the ML predictions outside of the chosen training data. Further improvements can probably be expected if conservation principles and/or invariance and similarity laws can be built into ML methods (Ling, Jones, et al., 2016; Ling, Kurzawski, et al., 2016). Advanced regularization methods might be another important ingredient to stabilize the behavior of complex ML-based parameterizations (Brenowitz et al., 2020).
When we discuss ML methods for the parameterization problem, it can be helpful to distinguish two different approaches. First, the use of ML to emulate an existing parameterization to achieve an increase in computational performance, that is, a speedup (Chevallier et al., 2000), and second, to derive new parameterizations from original data when the parameterizability of the problem is not necessarily clear. There is no sharp line between the two approaches, though, because the emulation of expensive highly resolved models shares many problems with the general parameterization problem.
In the following, we apply regression using NNs to the parameterization of warm-rain cloud microphysics. The formulation of the corresponding autoconversion and accretion rates is a classic problem in cloud modeling, going back to the first parameterizations of Kessler (1969) and Berry and Reinhardt (1974). This problem seems especially well suited for ML methods because most warm-rain parameterizations used today are based on data from numerical solutions of the kinetic collection equation (KCE). Hence, this looks like a well-defined regression problem.
The paper is organized as follows. In section 2 we introduce the KCE and define the bulk microphysical conversion rates. In section 3 we introduce the superdroplet method as a benchmark solver and discuss the training data in section 4. In section 5 ML is applied to the parameterization problem. In section 6 we attempt to interpret the predictions from the ML method. In section 7 we solve the corresponding ordinary differential equations (ODEs) and compare the ML results with an established warm-rain parameterization. Sensitivities to hyperparameters and other assumptions are discussed in section 8, followed by our conclusions in section 9.

The Kinetic Collection Equation
The collision-coalescence of droplets in a cloud is described by the KCE, also known as the Smoluchowski equation or quasi-stochastic collection equation (Drake, 1972; von Smoluchowski, 1916, 1917):

∂f(x, t)/∂t = (1/2) ∫_0^x K(x − y, y) f(x − y) f(y) dy − ∫_0^∞ K(x, y) f(x) f(y) dy.   (1)

Here f(x) dx is the number of droplets per unit volume in the mass range [x, x + dx], and x and y are drop masses. The collision-coalescence kernel K(x, y) describes the physics of the collision process. For clouds it is specified as

K(x, y) = π (r(x) + r(y))^2 |v(x) − v(y)| E_coll(x, y),

where r(x) is the drop radius, v(x) is the terminal fall velocity, and E_coll(x, y) is the collision efficiency. The first two factors constitute the geometric swept volume. The collision efficiency quantifies the details of the fluid dynamic interaction and is tabulated from detailed trajectory calculations. In the following, we use the collision efficiency of Hall (1980) with modifications by Beheng (1994).
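As an illustration, the hydrodynamic kernel can be evaluated as follows. The power-law fall speed and the constant collision efficiency are placeholder assumptions of this sketch; the actual simulations use the tabulated Hall (1980) efficiencies and a realistic fall-speed relation.

```python
import math

RHO_W = 1000.0  # water density in kg m^-3

def radius(x):
    """Drop radius in m from drop mass x in kg (spherical drops)."""
    return (3.0 * x / (4.0 * math.pi * RHO_W)) ** (1.0 / 3.0)

def v_term(x):
    """Illustrative power-law terminal fall velocity (hypothetical
    coefficient, stands in for a realistic fall-speed relation)."""
    return 8.0e2 * x ** (2.0 / 3.0)

def kernel(x, y, e_coll=1.0):
    """Hydrodynamic collection kernel K(x, y); e_coll stands in for the
    tabulated collision efficiency E_coll(x, y)."""
    swept_area = math.pi * (radius(x) + radius(y)) ** 2
    return swept_area * abs(v_term(x) - v_term(y)) * e_coll
```

Note that the kernel is symmetric in its arguments and vanishes for equal masses, because equal drops fall at the same speed in this approximation.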
For small droplets the hydrodynamic collision kernel is nonlinear with K ∼ x^2, and analytical approaches to solve the KCE are of limited use in this case. The solutions describe the colloidal instability, which is the formation of larger and larger drops due to binary collisions. The kinetic equation is invariant under the transformation f → c f, t → t/c with a constant c, and therefore similarity solutions of the form f̃(x, t) = c f(x, ct) exist for each solution f(x, t).
Atmospheric models do not necessarily predict f(x, t); instead, only partial moments of this distribution in two size ranges are used, which are defined as

M_c^(k) = ∫_0^{x*} x^k f(x) dx,   M_r^(k) = ∫_{x*}^∞ x^k f(x) dx,

that is, drops smaller than the mass x* = 2.6 × 10^−10 kg are cloud droplets, and larger drops are raindrops. M_i^(0) = N_i is the number density, and M_i^(1) = L_i the mass density, that is, the cloud or rain water content. Depending on the number of moments that are used for each particle category, parameterizations are classified as single-, double-, or triple-moment schemes. In the following, we focus on double-moment schemes with the variables N_c, L_c, N_r, and L_r. It is straightforward to prove that the liquid water content L = L_c + L_r is conserved by Equation 1, and the scaling constant c of the similarity solutions can be identified as c = L̃/L (Drake, 1972).
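The partial moments can be sketched for a binned number density as follows; the mass grid and distribution in the usage example are hypothetical.

```python
import numpy as np

X_STAR = 2.6e-10  # cloud/rain separation mass in kg

def partial_moment(k, x, f, dx, category):
    """k-th partial moment of a binned number density f(x) (per unit
    volume per unit mass) over the cloud (x < x*) or rain (x >= x*) range."""
    mask = x < X_STAR if category == "cloud" else x >= X_STAR
    return np.sum(x[mask] ** k * f[mask] * dx[mask])
```

With this, N_c = partial_moment(0, ...) and L_c = partial_moment(1, ...) on the cloud range, and by construction L_c + L_r equals the full first moment of f.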
The time evolution described by Equation 1 establishes a system of ODEs for the partial moments N_c, L_c, N_r, and L_r given by

dN_c/dt = −AU_N − AC_N − SC_c,   (6)
dL_c/dt = −AU − AC,   (7)
dN_r/dt = AU_N/2 − SC_r,   (8)
dL_r/dt = AU + AC,   (9)

with AU_N = 2 AU/x*, AC_N = AC/x̄_c, and the mean cloud droplet mass x̄_c = L_c/N_c. The autoconversion rate AU, the accretion rate AC, and the two self-collection rates SC_c and SC_r are unknown, and to specify these process rates in terms of the partial moments is known as the warm-rain parameterization problem. Note that we have already made the approximation to couple the number rates of autoconversion and accretion, AU_N and AC_N, to the mass rates (Beheng, 1994, 2010) by the assumption that autoconversion creates droplets of the mass x* and accretion collects on average the cloud droplets with mass x̄_c. All moments and the process rates are positive semidefinite quantities. The time evolution of L_c is monotonically decreasing, because it has only sink terms. Correspondingly, L_r increases monotonically.
The autoconversion rate AU and the accretion rate AC can be calculated directly from known solutions of the kinetic equation by

AU = (1/2) ∫_0^{x*} ∫_{x*−x}^{x*} (x + y) K(x, y) f(x) f(y) dy dx,   (10)
AC = ∫_0^{x*} ∫_{x*}^∞ x K(x, y) f(x) f(y) dy dx,   (11)

and similar integral forms exist for SC_c and SC_r (Beheng, 2010; Doms & Beheng, 1986).
Having such data reduces the parameterization problem to a regression task, and depending on the input data and regression assumptions, different parameterizations have been derived over the last decades (Beheng, 1994; Berry & Reinhardt, 1974; Khairoutdinov & Kogan, 2000). In the following, we will compare with the double-moment parameterization of Seifert and Beheng (2001; SB2001 hereafter) given by

AU = k_c/(20 x*) · (ν+2)(ν+4)/(ν+1)^2 · L_c^2 x̄_c^2 · [1 + Φ_au(τ)/(1−τ)^2],   (12)
AC = k_r L_c L_r Φ_ac(τ),   (13)
SC_c = k_c (ν+2)/(ν+1) · L_c^2,   (14)
SC_r = k_r N_r L_r,   (15)

where k_c = 9.44 × 10^9 m^3 kg^−2 s^−1 and k_r = 5.78 m^3 kg^−1 s^−1 are coefficients of the Long kernel (Long, 1974), and ν is the shape parameter of an assumed Gamma distribution. The functions Φ_au(τ) and Φ_ac(τ) are given by

Φ_au(τ) = 600 τ^0.68 (1 − τ^0.68)^3,
Φ_ac(τ) = (τ/(τ + 5 × 10^−4))^4,

and depend only on the dimensionless liquid water ratio τ = L_r/(L_c + L_r) = L_r/L, which is motivated by the above-mentioned invariance of the kinetic equation.
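As a sketch, the SB2001 autoconversion and accretion mass rates can be implemented as follows. The numerical constants inside the universal functions Φ_au and Φ_ac are quoted from the published scheme and should be treated as assumptions of this sketch; the coefficients k_c and k_r are the Long-kernel values named in the text.

```python
K_C = 9.44e9      # m^3 kg^-2 s^-1, Long-kernel coefficient (cloud)
K_R = 5.78        # m^3 kg^-1 s^-1, Long-kernel coefficient (rain)
X_STAR = 2.6e-10  # kg, cloud/rain separation mass

def phi_au(tau):
    # universal function for autoconversion (constants as in SB2001)
    return 600.0 * tau ** 0.68 * (1.0 - tau ** 0.68) ** 3

def phi_ac(tau):
    # universal function for accretion (constant as in SB2001)
    return (tau / (tau + 5.0e-4)) ** 4

def sb2001_rates(L_c, N_c, L_r, nu):
    """Mass rates (AU, AC) in kg m^-3 s^-1 for the SB2001 scheme."""
    tau = L_r / (L_c + L_r)
    xbar_c = L_c / N_c
    au = (K_C / (20.0 * X_STAR)
          * (nu + 2.0) * (nu + 4.0) / (nu + 1.0) ** 2
          * L_c ** 2 * xbar_c ** 2
          * (1.0 + phi_au(tau) / (1.0 - tau) ** 2))
    ac = K_R * L_c * L_r * phi_ac(tau)
    return au, ac
```

At τ = 0 (no rain yet) accretion vanishes while autoconversion remains positive, which is exactly the behavior that initiates rain formation.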

The Benchmark Simulations
For the benchmark solutions of the KCE, we apply the superdroplet method of Shima et al. (2009), which is a Monte Carlo algorithm that simulates the collision-coalescence processes explicitly. The Monte Carlo algorithm actually implements a more general equation which takes into account stochastic fluctuations (Dziekan & Pawlowska, 2017). The KCE (1) is in fact only the mean-field approximation of this more general stochastic collection equation (Alfonso et al., 2008).
As initial condition, we use a Gamma distribution for cloud droplets,

f(x) = A x^ν exp(−B x),

where A and B are calculated from the initial mean radius r̄_0 and the initial liquid water content L_0 = L_c,0, and ν is the shape parameter. In the following, we will make use of simulations from the three-dimensional phase space L_0 ∈ [0.3, 2] g m^−3, r̄_0 ∈ [9, 15] μm, and ν ∈ [0, 4]. The Monte Carlo algorithm draws samples from the initial conditions that represent a certain number of real droplets given by the (average) multiplicity ξ_0 = 25,600 in a control volume of 25 m^3. The initialization follows the single-SIP approach of Unterstrasser et al. (2017), which improves the convergence properties of the Monte Carlo method. Typically, more than 10,000 superdroplets are used for a single simulation, which ensures a converged solution. The actual number of superdroplets scales with the number density of the initial conditions; if it would otherwise exceed 50,000, it is capped at that value. Due to the stochastic nature of the Monte Carlo algorithm, which also causes additional variance, it can be useful to run simulations with the same initial condition multiple times. This will be denoted as a simulation ensemble.
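A sketch of how A and B can be obtained from L_0 and r̄_0, assuming that r̄_0 defines the initial mean droplet mass via the volume of a water sphere (the exact radius convention used in the simulations may differ). For a Gamma distribution in mass, the mean mass is (ν+1)/B, and A normalizes the first moment to L_0.

```python
import math

RHO_W = 1000.0  # kg m^-3

def gamma_init(L0, rbar0, nu):
    """Parameters A, B of the initial distribution f(x) = A x^nu exp(-B x).

    Assumes the initial mean mass is xbar0 = (4/3) pi rho_w rbar0^3, so that
    B = (nu + 1) / xbar0, and A = L0 B^(nu+2) / Gamma(nu+2) makes the first
    moment of f equal to L0.
    """
    xbar0 = 4.0 / 3.0 * math.pi * RHO_W * rbar0 ** 3
    B = (nu + 1.0) / xbar0
    A = L0 * B ** (nu + 2.0) / math.gamma(nu + 2.0)
    return A, B
```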
The autoconversion and accretion rates could be calculated from the integral representations, Equations 10 and 11, but the Monte Carlo algorithm allows a direct calculation of those conversion rates. For each superdroplet collision, it is known to which drop category the drops belong, and hence, the autoconversion rate can be calculated exactly. Let k denote the superdroplet with the smaller multiplicity and j the superdroplet with the larger multiplicity, and let a prime denote quantities after the collision event. Then k ∈ c ∧ k′ ∈ r means that the superdroplet with the smaller multiplicity became a raindrop after the collision event, and the mass x′_k is the mass of the superdroplet k after the collision event with x′_k > x*. The portion of the superdroplet j that is added to k is given by a probability that is defined and calculated as specified in section 5.1.3 of Shima et al. (2009). Hence, autoconversion is simply the sum over x′_k for all superdroplets that became raindrops, weighted by their multiplicity ξ′_k. For accretion we have to distinguish two cases, with AC = AC_1 + AC_2. In the first case, the superdroplet k is a raindrop and j is the cloud droplet; the raindrop k grows by collecting the corresponding fraction of j, so the transferred mass is x′_k − x_k. In the second case, the superdroplet j is the raindrop and k is the cloud droplet; hence, the mass transfer to rain due to accretion is only the mass of k before the collision event. The self-collection rates are computed analogously, with the sum taken over the list of all superdroplet pairs for this time step. In passing, we note that the number rates AU_N and AC_N can be calculated simply by dropping the mass weights in the sums above.
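The bookkeeping described above can be sketched as follows. The event record format (multiplicity of k, masses of k before and after the collision, mass of j before the collision) is a hypothetical simplification of the actual superdroplet pair data, and the multiplicity update of Shima et al. (2009) is omitted.

```python
X_STAR = 2.6e-10  # kg, cloud/rain separation mass

def is_rain(x):
    return x >= X_STAR

def tally_rates(events, volume, dt):
    """Accumulate autoconversion and accretion mass rates from a list of
    collision events (xi_k, x_k, x_k_new, x_j), where k is the superdroplet
    with the smaller multiplicity xi_k. Returns rates per volume and time."""
    au = ac = 0.0
    for xi_k, x_k, x_k_new, x_j in events:
        if not is_rain(x_k) and not is_rain(x_j):
            if is_rain(x_k_new):
                # two cloud droplets formed a raindrop: autoconversion
                au += xi_k * x_k_new
        elif is_rain(x_k) and not is_rain(x_j):
            # raindrop k collected (part of) a cloud droplet: accretion case 1
            ac += xi_k * (x_k_new - x_k)
        elif is_rain(x_j) and not is_rain(x_k):
            # cloud droplet k merged into raindrop j: accretion case 2
            ac += xi_k * x_k
    return au / (volume * dt), ac / (volume * dt)
```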
The simulations are performed with a time step of 1 s, and output is written every 20 s. Each individual simulation of the collision-coalescence process is long enough that all cloud water is eventually converted to rain. The length of the individual simulations depends on L_0 and r̄_0. For example, for L_0 = 1 g m^−3 and r̄_0 = 13 μm, it is 60 min. For different L_0, the simulation time scales with 1/L_0. Figure 1 shows the normalized autoconversion rate in terms of Φ_au; that is, it reproduces Figure 1 from SB2001 for a subset of our simulation output. This confirms that the superdroplet simulations and the derived autoconversion rate are comparable to SB2001. Although the superdroplet method should have a higher accuracy given that we are using a very large number of particles, it is also obvious that the Monte Carlo method exhibits significant noise. This becomes even more severe for larger ν, that is, narrower initial cloud droplet distributions. Some of these fluctuations might be correct physical behavior, though, and especially for larger ν, the droplet growth can be dominated by rare events ("lucky droplets"). Nevertheless, it is encouraging that the old fit of SB2001 (black dashed line) is a good approximation to the new data, although some bias is visible.

Preparation of Training and Testing Data
As predictors, we use the moments N c , L c , N r , and L r , which would be the prognostic variables in a double-moment warm-rain scheme. In addition, we assume that the shape parameter is known, for example, from the properties of a given aerosol distribution. Hence, these five variables are the basic features or predictors in the training data. The process rates AU, AC, SC c , and SC r are the labels or target variables.
The solutions of the KCE exhibit exponential growth, and therefore, the moments and the process rates span several orders of magnitude. Hence, it is advisable to use log-transformed features and labels for training. This can also be motivated by the fact that established regression-based parameterizations like Beheng (1994) or Khairoutdinov and Kogan (2000) use a multiplicative power law ansatz. The log-transform requires that we specify a minimum value ε, and feature-label vectors with values smaller than ε are removed from the training and testing data. We use ε_0 = 10^−15 for all data except the autoconversion rate, for which a larger value of ε_1 = 10^−12 is applied (Table 1). The reasoning for this choice is discussed in section 7.
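This data reduction and log-transform can be sketched as follows; the array layout (samples in rows, features in columns) is an assumption of this sketch.

```python
import numpy as np

def prepare(features, labels, eps=1e-15):
    """Remove feature-label vectors containing values below eps and
    log-transform the remaining data."""
    keep = (features > eps).all(axis=1) & (labels > eps)
    return np.log(features[keep]), np.log(labels[keep])
```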
Number and mass densities are linearly dependent in the sense that they both increase by a factor c when the size distribution f(x) is multiplied by that constant factor. This can be avoided when x̄_i = L_i/N_i is used instead of N_i; the alternative choice N_i and x̄_i would be truly orthogonal. Here we prefer the combination L_i and x̄_i for the mass rates because L_c and L_r are the most important predictors for AU and AC. Finally, we may replace L_r with τ for predicting autoconversion to make the set of predictors identical to SB2001.
[Table 1 summarizes the selection of the training and testing data: the initial values of r̄_0 (in μm) and ν, the potential predictors P (features), and the total number of samples after data reduction (without validation data); n is the number of ensembles using different random number seeds for the same initial condition.]

In the following, we train a neural net for each label separately. This allows us to test different sets of predictors for each process rate. Another consideration is that autoconversion is most important when L_r is still small; later on, the system is dominated by accretion. Hence, we do not want to contaminate the training
data with rather irrelevant autoconversion labels, for example, when L r ≫ L c . Even accretion becomes irrelevant when L c is small. Hence, we can remove labels based on , L c , and L r to focus the training on the relevant part of the phase space.
As testing data, we use a separate set of simulations with different values of L_0 and ν. This makes sure that the testing data are clearly separated from the training data. The selection of the training and testing data is summarized in Table 1. The testing data span a marginally smaller range of initial conditions than the training data. This means the ML models will, in some sense, only need to interpolate to achieve good skill on the testing data. This is sufficient here because cases with lower L_0 and/or smaller r̄_0 are nonprecipitating and will not produce rain by the warm-rain collision-coalescence process within reasonable time. Cases with higher L_0 and larger r̄_0 rain so quickly that the error in the timing becomes irrelevant. The shape parameter ν is not a prognostic variable in a two-moment scheme; it is set a priori and is usually within the range shown here.
Finally, the training, validation, and testing data are standardized using the mean and standard deviation to ensure that all features have zero mean and a standard deviation of one. Hence, we apply the transformation x̌ = (x − x̄)/σ, where x is a feature vector with mean x̄ and standard deviation σ, and x̌ is the standardized value of the feature.
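As a sketch of this transformation, assuming the training-set statistics are reused for the validation and testing sets so that all splits share the same mapping:

```python
import numpy as np

def standardize(train, other):
    """Standardize both arrays with the training-set mean and standard
    deviation (column-wise, one column per feature)."""
    mu = train.mean(axis=0)
    sigma = train.std(axis=0)
    return (train - mu) / sigma, (other - mu) / sigma
```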
A subset of the training data before standardization is visualized by a pairs plot in Figure 2, showing AU and the predictors L_c, x̄_c, ν, and τ. This shows clearly that we have defined ν as the shape parameter of the initial conditions at t = 0. Note that discrete values of, for example, ν and x̄_c are artificially smeared out in this plot by the kernel density estimator, which assumes a Gaussian distribution. It can also be seen that x̄_c does not change much, especially not for large ν. This is explained by the fact that initially, only the tail of the distribution is affected by collisional growth, but not the mean, and later accretion collects all cloud droplets. Only for broad distributions does x̄_c decrease during the final collection stage, because very small cloud droplets are not collected efficiently by the raindrops (see, e.g., Figure S40).

The ML Models
To parameterize the process rates with ML, we train several small NNs. Each has three fully connected layers with 16 nodes per layer and only a single output node, corresponding to our choice to train a separate NN for each individual process rate. This gives a total of 609 trainable parameters for each NN. The sigmoid activation function is applied. Larger nets and other activation functions have been tested, and those results are documented in the supporting information; however, they did not lead to any improvements. We choose the mean squared error (MSE) as loss or cost function and RMSprop as the optimizer. RMSprop is a variant of gradient descent that automatically adjusts the gradient magnitude during the minimization. The initial learning rate is set to 10^−3. We use early stopping with a patience of 20 epochs and a maximum of 200 training epochs.
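A quick consistency check of the quoted parameter count: 609 trainable parameters correspond to a network with two input features (as for a model using only L_c and x̄_c); models with more predictors have correspondingly slightly larger networks, a detail the text does not spell out.

```python
def n_params(n_in, hidden=(16, 16, 16), n_out=1):
    """Count trainable weights and biases of a fully connected network."""
    total, prev = 0, n_in
    for width in list(hidden) + [n_out]:
        total += prev * width + width  # weight matrix + bias vector
        prev = width
    return total
```

For example, two inputs give 48 + 272 + 272 + 17 = 609 parameters, while five inputs give 657.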

Journal of Advances in Modeling Earth Systems
In this first ML step, we primarily test different predictors for the process rates. For autoconversion, our first models follow classic formulations like Beheng (1994) or Khairoutdinov and Kogan (2000) and use only the cloud variables L_c and x̄_c (Model 1). As a second model, we can, in addition, assume that the (initial) shape parameter ν is known as well (Model 2). [Table 2 lists all models; MAE and MSE of AU for testing data are given in 10^−9 m^3 kg^−2 s^−1; MAE, MSE, and ME of t_10 in min, and MRE of t_10 in %; all ML models use sigmoid activation and ε_0 = 10^−15 for accretion and both self-collection rates.] Our Models 3 and 4 make use of the rainwater content L_r as an
additional predictor of AU, either explicitly (Model 3) or via τ (Model 4). The latter two models are ML counterparts of SB2001 in the sense that they are provided the same information as predictors.
Two additional ML models, using only x̄_c−ν−τ and x̄_c−ν as sets of predictors, are an attempt to improve the ML models by borrowing the dependencies from SB2001. Hence, for Model 5 with x̄_c−ν−τ we have eliminated the L_c dependency from AU by dividing the training data (labels) by L_c^2. For Model 6 with predictors x̄_c−ν we have, in addition, removed the τ dependency with the corresponding term of Equation 12. See also Table 2, where all six models are listed.
All six models show good convergence. The MSE and mean absolute error (MAE) of the validation data follow the training loss perfectly, as shown for the first four models in Figure 3. Hence, there is little sign of overfitting. Overfitting is also not to be expected, because we have much more training data than trainable parameters. Interestingly, the lines of Models 3 and 4 can hardly be distinguished, which already indicates that it is rather irrelevant whether we use L_r or τ. At the same time, it is obvious that having more predictors provides a better model; especially the jump from Model 2 to Models 3 and 4 is large. Models 5 and 6 use different training and validation data and cannot be directly compared with Models 1-4 in this kind of diagnostic.
More important than training or validation loss is the comparison with the testing data, which have not been used to optimize the model parameters. This is shown in Figure 4 and in Table 2. Also for the testing data, we see that Models 3 and 4 are clearly superior to Models 1 and 2. We compare with four classic, in the sense of non-ML, parameterizations. First, SB2001 with its original parameters: ML Models 3 and 4, which use the same set of predictors, are significantly better than SB2001, whereas SB2001 is better than ML Models 1 and 2. For "SB new" we have recalibrated Φ_au on the training data using a Levenberg-Marquardt minimization. Third, we compare with Khairoutdinov and Kogan (2000; KK2000 hereafter), which is of similar quality as ML Models 1 and 2 that also use only cloud variables. The model named "power" is a regression model using a power law ansatz for L_c, N_c, and ν, optimized using Levenberg-Marquardt minimization on the training data. This is only marginally better than KK2000.
For Models 5 and 6, borrowing dependencies from SB2001 seems to make little difference for the scores against the testing data: Model 5 is slightly worse than Model 4, but Model 6 with x̄_c−ν has the best MAE and MSE against the testing data. This result can be interpreted such that ML Models 3 and 4 are able to approximate those dependencies well enough from the original data and do not benefit from a simplification of the problem when we remove them. That Model 6 is better than Models 3 and 4 suggests that there are some issues with the generalization of those two models to the testing data.

Figure 5 shows a similar comparison of ML models with SB2001 for accretion and both self-collection rates. In general, we find that the use of additional predictors leads to an improvement of the models. The resulting ML models provide a significant improvement over SB2001 (and KK2000 for accretion). The additional improvement from including ν in the self-collection rates is small, though. For accretion, it is not necessarily favorable to include ν, but this difference is not significant for our training and testing data. In a full atmospheric model, one could hope that simpler parameterizations or ML models might generalize better and drop those variables that provide only a marginal improvement.
In the following (sections 6 and 7), we use the predictors L_c−x̄_c−ν−τ for autoconversion, L_c−L_r−x̄_c−x̄_r−ν for accretion, L_c−x̄_c−ν−τ for self-collection of cloud droplets, and L_r−x̄_r for the self-collection of rain.
As a pure ML application, we would already be done at this point. The ML models can make skillful predictions of the process rates that are of similar quality as, or even better than, standard parameterizations used in atmospheric models. From the point of view of model development for NWP and climate models, on the other hand, we want to go at least one step further and test whether the ML-based process rates lead to skillful solutions of the ODE system that approximates the original KCE. Before we do so, we first analyze the ML models in some more detail in the next section.

Interpreting the NNs With Partial Dependence Plots
While we have seen that the ML models can achieve similarly good predictions of the conversion rates as SB2001, they do not directly reveal how they do so. However, there are ways to peek inside the "black box" that allow us to gain some insight from the trained models (McGovern et al., 2019). In particular, we use a method called partial dependence plots (PDPs) (Friedman, 2001). To create a PDP, one predictor is set to a fixed value for all samples in the test set, which is then passed to the trained model to create a prediction. By repeating this process for a range of predictor values and averaging the results, one obtains the effect of changing this one predictor on the prediction, all else being equal. This can then be repeated for every feature to obtain the plots in Figure 6. Note that in these univariate PDPs, dependencies between the predictors are thus not taken into account. We also plot the dependencies of SB2001 for comparison (see figure caption for details).
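The PDP procedure can be sketched for any model that maps a feature matrix to per-sample predictions; the model and grid in the usage example are hypothetical.

```python
import numpy as np

def partial_dependence(model, X, feature, grid):
    """Univariate partial dependence (Friedman, 2001): fix one feature at
    each grid value for all samples, predict, and average the predictions."""
    pdp = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value
        pdp.append(model(X_mod).mean())
    return np.array(pdp)
```

For a linear model the PDP recovers the corresponding linear dependence exactly, which is a useful sanity check before applying it to a trained NN.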

For accretion, the dependencies of the ML model as revealed by the PDP technique are very close to SB2001 for L_c and L_r, whereas the additional effects from taking into account x̄_c, x̄_r, and ν are small. This is also consistent with the results using different sets of predictors from the previous sections, which already suggested that L_c and L_r are the dominant predictors. For autoconversion ML Model 4, the PDP analysis shows that the ML approach finds dependencies on L_c, x̄_c, ν, and τ that are very similar to SB2001. The dependency on ν is somewhat, the one on x̄_c only marginally, weaker than the analytic relations of SB2001. The latter originate directly from the Long (1974) kernel, but the Long kernel itself is only an approximation, and, hence, it is not obvious that it provides the correct functional relationships. ML Model 4 predicts an increase of AU with τ that is qualitatively similar to Φ_au(τ) of SB2001. Note that the dependency that results from the PDP analysis corresponds to the full factor 1 + Φ_au(τ)/(1 − τ)^2 rather than just Φ_au(τ) itself. This explains why AU of SB2001 levels off at small τ in Figure 6. For small τ the ML model shows smaller AU than SB2001. Also, for the two self-collection rates the agreement between ML Model 4 and SB2001 is surprisingly good, although we have to emphasize that Figure 6 uses log scales. For a small cloud liquid water content L_c < 10^−4 the ML model predicts a lower self-collection rate SC_c than SB2001. The same is observed for rain with L_r < 10^−5, leading to a lower self-collection rate SC_r compared to SB2001. The dependency of cloud droplet self-collection on ν shows a good agreement, whereas the additional dependency on x̄_c is weak. Cloud droplet self-collection shows an increase for large τ, which should be interpreted with caution due to the fact that this is only a box model without sedimentation or other effects. Note that the SB2001 cloud droplet self-collection rate as shown here includes the loss due to autoconversion.
Thus, the PDP technique confirms that the ML algorithm is able to extract physically reasonable nonlinear relationships for the warm-rain processes from the training data. The main dependencies are consistent with the well-established warm-rain parameterization of SB2001. The additional sensitivities are physically reasonable and promise to provide an improvement over SB2001. Whether they actually hold that promise will be the focus of the next section.

Results for the ODE System
In atmospheric models the warm-rain ODE system, Equations 6-9, is part of a much larger PDE system, and these source and sink terms are usually integrated with a simple Euler forward time stepping. Hence, we do the same here using a sufficiently small time step of 5 s. In NWP and climate models time steps of 20 s and up to several minutes are common, but this can deteriorate the solution of the warm-rain ODE system.
Note that we solve here only the ODE system as given by Equations 6-9, and no additional processes like drop sedimentation, drop breakup, or large-scale dynamics are taken into account. This ODE system should therefore parameterize the KCE as defined by Equation 1 and be interpreted as a box model or as a spatially homogeneous cloud.
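The Euler forward update for Equations 6-9 can be sketched as follows, with the process rates supplied by any parameterization or ML model. The number-rate couplings (autoconversion creating raindrops of mass x*, accretion collecting cloud droplets of mean mass x̄_c) follow the assumptions stated in section 2.

```python
X_STAR = 2.6e-10  # kg, cloud/rain separation mass

def euler_step(state, rates, dt=5.0):
    """One Euler forward step of the warm-rain box model.

    state = (N_c, L_c, N_r, L_r); rates(state) returns (AU, AC, SC_c, SC_r)
    with AU, AC as mass rates and SC_c, SC_r as number rates.
    """
    N_c, L_c, N_r, L_r = state
    au, ac, sc_c, sc_r = rates(state)
    xbar_c = L_c / N_c
    N_c += dt * (-2.0 * au / X_STAR - ac / xbar_c - sc_c)
    L_c += dt * (-au - ac)
    N_r += dt * (au / X_STAR - sc_r)
    L_r += dt * (au + ac)
    return N_c, L_c, N_r, L_r
```

By construction, the step conserves the total liquid water L_c + L_r for any choice of rate function, mirroring the conservation property of the KCE.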
In the following, we focus on comparing the different ML models for autoconversion with benchmark solutions of the KCE and the parameterization of SB2001. In this section, all ML models use the same choices (and hyperparameters) for accretion and the two self-collection rates, namely, the ML models that gave the best results against the testing data. Figure 7 shows the rainwater content L_r for four solutions with different initial conditions with zero initial rainwater, but with different initial liquid water content L_0 = L_c(t = 0), different initial mean volume radius r̄_0 = r̄_c(t = 0), and different shape parameter ν. The time evolution is typical of all solutions of the KCE and shows a conversion from the pure cloud water initial condition with no rain to a pure rainwater state where all cloud water has been depleted. Hence, the solutions differ primarily in their time evolution, namely, the onset of the cloud-to-rain conversion, which is dominated by autoconversion, and the speed of the conversion to rain once a first significant amount of rain has formed. This slope is set by the accretion rate. Obviously, the amount of rain is limited by the total liquid water in the system, L_0. Lower L_0, smaller r̄_0, and larger ν lead to a slower formation of rain and a delay of the transition to a pure rain state. For these four cases, SB2001 provides almost perfect solutions that are very close to the benchmark solutions of the KCE. The ML model, here Model 4, shows a delay for the three cases with ν = 0 and a too fast formation of rain for the slowest case with ν = 2. Hence, the dependency on ν seems to be suboptimal in the ML model. To analyze those dependencies more systematically, we define a time scale t_p as the time when p percent of the cloud water has been converted to rainwater. The time scale t_50 is a good measure for the overall performance of the ODE systems, whereas t_10 focuses on the initial stage, where autoconversion tends to dominate.
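The timing metric t_p can be sketched as a simple lookup on the model output time series; the linear search below returns the first output time at which the threshold is reached and ignores interpolation between output times, an assumption of this sketch.

```python
def t_percentile(times, L_c, L_r, p):
    """First output time at which p percent of the liquid water has been
    converted to rain (used for timing metrics such as t10 and t50)."""
    for t, lc, lr in zip(times, L_c, L_r):
        if lr / (lc + lr) >= p / 100.0:
            return t
    return None  # threshold never reached within the simulation
```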
Figure 8 presents the results for t_10 for ML Model 2 with predictors L_c, x̄_c, and ν, and for Model 4, which, in addition, includes τ. ML Model 2 has clear deficiencies and fails to represent the timing of the benchmark solutions in a large part of the parameter space. In contrast, Model 4 shows a reasonable behavior, and also the dependencies on r̄_0 and ν are qualitatively correct. But ML Model 4 shows a significant bias with a delay in most parts of the phase space and a too weak sensitivity to r̄_0. The dependency on ν is actually captured quite well here and is only slightly too strong. SB2001 is significantly better than ML Model 4 and matches the benchmark simulations surprisingly well, given that it has not been retrained or retuned for this data set but is the original parameterization as published in 2001. On the other hand, we have used exactly the same collection kernel as in SB2001 and only applied a different numerical method. This shows that the good scores of the ML models against the testing data for the process rates do not directly transfer to the ODE solutions. In contrast, SB2001 is better for the ODE solutions, although it is worse against the testing data for the process rates. This will be discussed and explained in section 8. Here we just note that ML Model 4 would be an acceptable warm-rain parameterization, but in this metric, it is clearly outperformed by SB2001.
The deficiencies of ML Model 2 can also clearly be seen in the time series of AU and AC shown in Figure 9. ML Model 4 provides a very good approximation of both AU and AC for this case and is in some aspects even superior to SB2001; the latter shows a remarkable overestimation of AU in the first 10 min. Hence, for cases like this, for which ML Model 4 has only a small temporal bias, the solution reproduces the detailed evolution of the KCE reasonably well. In contrast, the autoconversion rate of Model 2 starts much too high, decreases with time, and even shows some signs of instability later in the simulation. The inability to represent the time evolution of AU, and especially the increasing AU in the first stage of rain formation, can also be observed for many other parameterizations that are only based on cloud variables, like Berry and Reinhardt (1974), Beheng (1994), and Khairoutdinov and Kogan (2000). Hence, this is caused by our choice of predictors, not by the ML approach itself.
To compare with the other ML models, we analyze the overall errors for t_10 as shown in Figure 10 and in Table 2. The scores are evaluated over a wide range of initial conditions with L_0 ∈ [0.3, 2] g m^−3, r̄_0 ∈ [10, 15] μm, and ν ∈ [0, 4]. This reveals that the other ML models with different predictors, including those borrowing some dependencies from SB2001, cannot improve significantly over Model 4. Only Model 3, which uses L_r instead of τ, is slightly better for the ODE solutions. In our opinion, this small difference is not significant. In terms of mean error (ME) and mean relative error (MRE), Model 5 is the best ML model. Overall, none of the ML models comes even close to the low MAE and MSE of SB2001. A detailed inspection shows that all ML-based models have difficulties with the slowly evolving cases of low L_0, small r̄_0, and large ν.
In the next section, we analyze this behavior in more detail and test additional ML models with the goal of remedying these deficiencies.

Sensitivities and Discussion
ML models can be fine-tuned by altering the design of the neural net and the optimization algorithm that determines the trainable parameters (Géron, 2019). In the ML community, such choices are called hyperparameters. The most prominent one for neural nets is the size of the network: a wider and deeper neural net has more trainable parameters and can fit the training data better, but this will not necessarily generalize to the testing data or the application.

Table 3. Mean absolute error (MAE) and mean squared error (MSE) for t_10 for ML Models 13 to 18 and 4 using different thresholds ϵ_1, as well as SB2001 (left), and mean error (ME) and mean relative error (MRE) for the same models (right).
As part of this study, we have made many attempts to further improve the performance of the ML model to come closer to the ODE results of SB2001. We have, for example, tested deeper and wider NNs as well as different activation functions. None of this leads to a significant improvement in the behavior of the ODE solutions, and these further experiments are therefore documented in the supporting information. The main bottleneck for a further improvement seems to be that improved results for the training or test set do not necessarily transfer to the ODE solutions.
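The size trade-off described above can be made concrete by counting the trainable parameters of a fully connected net. A minimal sketch; the layer widths below are illustrative and not the actual configurations used in this study:

```python
def n_trainable(layer_sizes):
    """Number of weights plus biases in a fully connected net
    with the given layer widths (input, hidden..., output)."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# e.g. 4 predictors, two hidden layers, 1 output process rate
small = n_trainable([4, 32, 32, 1])   # 4*32+32 + 32*32+32 + 32*1+1 = 1249
wide = n_trainable([4, 64, 64, 1])    # doubling the width roughly
                                      # quadruples the hidden-to-hidden term
```

More parameters mean a more flexible fit to the training data, but, as noted above, not necessarily a better ODE solution.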
Some hint at the underlying cause of the difficulties in further improving the ML models can be gained from experiments that change not the ML model itself, but the threshold ϵ_1 in the data reduction of the training data. Changing ϵ_1 has a pronounced and systematic impact on the timing of the ODE solutions: a smaller ϵ_1 leads to a delayed rain formation, a larger ϵ_1 to a too fast one. Further evidence for the cause of the problem can be gained from Figure 9. The autoconversion rate AU can be very small at initial times, and for large ν it becomes virtually zero for the benchmark solutions and, hence, in the training data. If AU is zero at initial time in the ODE system, though, no rain will ever develop, because AC is also zero due to L_r being zero by definition. When we remove the small values of AU at initial time with the help of an increased ϵ_1, the ML model extrapolates, ending up with a higher value for AU. This explains the change in the ME and MRE for different ϵ_1, switching from a delay at low ϵ_1 to a too fast development at high ϵ_1. Hence, we have to conclude that the benchmark simulations do not provide useful training data for AU at initial time. The underlying cause for this behavior is that the concept of autoconversion is, in fact, ill-posed for large ν and small r̄_0. The two-moment parameterization is only based on the moments M_c^(0) = N_c and M_c^(1) = L_c. For a very slowly evolving narrow cloud droplet distribution, these two moments contain no information about the evolution of the drop size distribution during this earliest stage of the rain formation. The collisional growth initially affects only the higher moments of the cloud droplet distribution, which are, unfortunately, not part of the two-moment model. Hence, following this interpretation, the ML models have to fail, because the training data contain no information that can lead to an ODE system that describes those solutions of the KCE.
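The data reduction via the threshold ϵ_1 discussed above amounts to masking out training samples whose target process rate falls below the threshold. A minimal sketch of this step; the array shapes, variable names, and the synthetic rate distribution are illustrative assumptions, not the actual training set:

```python
import numpy as np

def reduce_training_data(X, y, eps1):
    """Keep only samples whose target process rate exceeds eps1.
    Near-zero autoconversion rates at initial times are removed,
    which forces the trained model to extrapolate there."""
    mask = y > eps1
    return X[mask], y[mask]

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 3))                      # predictors, e.g. L_c, x_c, nu
y = rng.lognormal(mean=-20.0, sigma=4.0, size=1000)  # wide-ranging process rates
X_red, y_red = reduce_training_data(X, y, eps1=1e-10)
```

Raising eps1 removes more of the small-rate samples, which is exactly the knob whose effect on the ODE timing is described above.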
The puzzling piece here is that such an ODE system seems to exist in form of the SB2001 parameterization.
So how and why does SB2001 actually work? SB2001 does overestimate AU for τ ≪ 0.1, especially for the difficult cases with small L_0, small r̄_c, and large ν. This can clearly be seen in Figure 9 and becomes even more severe for more extreme cases. The parameterization compensates for this by a decrease of AC for small τ that is not based on data (see the supporting information). The parameters in Φ_ac(τ), Equation 15, have been chosen such that the solutions of the ODE system match the timing of the reference solution of the KCE. Thus, the good agreement of SB2001 for t_10 is a combination of the self-similarity of the solutions, which is the mathematical foundation that makes this possible, and a tuning of the ODE system for t_10. The self-similarity is instrumental here because for each slowly evolving solution with small r̄_c, small L_0, and large ν we can in principle find a corresponding fast solution with nonvanishing AU by taking the limit of large L_0. The self-similarity then guarantees that the rescaled similarity solution matches the time evolution of the true slowly evolving solution. This neither requires nor guarantees that AU and AC are both correct for the slowly evolving solution, though. Hence, for those extreme cases, SB2001 is not a rigorous approximation of the process rates based on the numerical data, but a parameterization that mimics the time evolution of solutions of the KCE.
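The time-scaling invariance of the KCE invoked above can be stated concisely. Writing the KCE for the drop size distribution f(m, t) with collection kernel K, one checks directly that rescaling the number density and time in opposite directions maps solutions to solutions, a standard property of the quadratic collision operator:

```latex
\partial_t f(m,t) = \frac{1}{2}\int_0^m K(m-m',m')\,f(m-m',t)\,f(m',t)\,dm'
                    \;-\; f(m,t)\int_0^\infty K(m,m')\,f(m',t)\,dm' .
```

If f solves this equation, then f_λ(m, t) := λ f(m, λt) solves it as well, since both the time derivative and the quadratic collision integrals scale as λ². A dilute, slowly evolving solution is therefore, up to this rescaling, equivalent to a denser, faster one, which is the correspondence that SB2001 exploits.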
In some sense, ML, at least in the approach that we have chosen here, is an attempt to be more rigorous than SB2001, but such a direct solution based only on data may not exist for this problem. In the parameterization community, this is known as a closure problem. The chosen model, here the ODE system for the two moments M_c^(0) = N_c and M_c^(1) = L_c, cannot be derived rigorously from the fundamental physical equations (here the KCE) without the help of additional and independent assumptions, for example, making use of conservation laws or invariance properties that may be part of the underlying equations but are not yet contained in the parameterized model. Due to the closure problem associated with vanishing autoconversion for small r̄_c and large ν, it remains unclear whether a two-moment scheme is actually able to represent autoconversion properly over the full range of parameters. The results presented here indicate that it is not. Hence, the problem might be fundamentally ill-posed. Nevertheless, the SB2001 model provides a good parameterization because it makes use of the invariance of the KCE to time transformations as an additional closure assumption.

Conclusions
In this study we have applied standard ML methods in the form of fully connected neural nets to the parameterization problem of warm-rain cloud microphysics. We find that ML-based models provide a reasonable representation of the microphysical process rates. The ML approach confirms previous results that including rainwater information is essential to parameterize warm-rain autoconversion; that is, in a standard two-moment model, cloud water variables alone are insufficient to predict autoconversion. The dependencies of appropriate ML models of the cloud microphysical process rates are consistent with established warm-rain parameterizations. The resulting ML-based parameterizations can be applied in actual atmospheric models to describe the warm-rain processes.
A great advantage of the ML approach is that different model formulations can be implemented and evaluated quite quickly, for example, models using different sets of predictors. In the supporting information, we document an extension of the warm-rain two-moment model to predict the number change due to autoconversion and accretion explicitly. To our knowledge, this has never been done with standard bulk schemes, but it can, in our case, improve the prediction of cloud droplet size.
More generally, the ML approach seems especially well suited to parameterize the complications and details of ice microphysical processes, which often require a mix of analytical relations and look-up tables in state-of-the-art parameterizations. Hence, the ML approach has the potential to greatly simplify and speed up the development process of microphysical parameterizations for NWP and climate models.

The main problem of the ML approach presented here is that for the actual application in the form of an ODE system, the quality of the ML-based models is inferior to the classic non-ML parameterizations. The ML models show significant biases, especially for slowly evolving solutions of the KCE. This behavior cannot be improved by modifying the ML model via standard hyperparameters like the size of the network, the activation function, and so forth. We posit that these biases of the ML models are not so much a deficiency of the ML approach itself but are caused by inherent limitations of the chosen two-moment warm-rain model. During the first stage of rain formation, the two-moment model is insufficient to describe the evolution of the process rates, and, hence, it cannot be trained properly with the process rates of the benchmark simulations and standard ML approaches. In physically based parameterizations, such limitations can be overcome by making additional closure assumptions, but often this introduces compensating errors.
A logical next step would be to include additional higher moments as predictors. This might overcome the problems, and ML methods offer a powerful technique for deriving such more complicated parameterizations. Still, there might be new challenges when going to more and higher moments, like the treatment of activation and turbulent mixing and the effect of these processes on the higher moments of the cloud droplet distribution. In addition, three-moment schemes are computationally more costly when applied in a full NWP or climate model. More advanced ML techniques, for example, neural ODEs (Chen et al., 2018; Rackauckas et al., 2020), might resolve the issues even on the level of a two-moment scheme by learning the equations from data, in the sense that the ML models include the concept of an ODE or of a function and its derivative. Being able to include the time evolution of the model variables (the ODE solutions) and the process rates (the derivatives of the ODE solutions) in the training data would probably resolve the closure problem related to the ill-posedness of the two-moment warm-rain parameterization. Then we would only have to remove the misleading autoconversion rates during the initial stage from the training data and replace them with the corresponding benchmark solution. By being able to train directly on the time evolution of the model variables, such advanced ML methods might succeed in extracting an ODE system that is as good as or even better than SB2001, even more so if it were possible to build the time invariance of the KCE into the neural net, similar to the Galilean invariance that is included in ML-based turbulence models (Ling, Jones, et al., 2016; Ling, Kurzawski, et al., 2016). To achieve this, however, the ML model first needs a concept of time and time derivatives.
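The training target of such time-aware methods can be illustrated with a toy problem: fit the right-hand side of dy/dt = f_θ(y) by matching an explicit Euler step to trajectory data. Here f_θ is just a linear model θ·y and the data come from simple exponential decay; this is a conceptual sketch of the idea, not the adjoint-based method of Chen et al. (2018):

```python
import numpy as np

# trajectory data generated by the "true" ODE dy/dt = -0.5 * y
dt = 0.01
t = np.arange(0.0, 5.0, dt)
y = np.exp(-0.5 * t)

# model: f_theta(y) = theta * y
# minimize sum_k ( y[k+1] - (y[k] + dt * theta * y[k]) )^2 over theta;
# for this linear model the least-squares solution is closed form:
dy = y[1:] - y[:-1]
theta = np.sum(y[:-1] * dy) / (dt * np.sum(y[:-1] ** 2))
# theta recovers approximately -0.5, the rate hidden in the trajectory
```

The key point is that the loss is formulated on the time evolution y(t) itself, so the fitted right-hand side is constrained to reproduce the trajectories, which is precisely what the process-rate regression above cannot guarantee.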
In this study, we have not analyzed the computational performance or efficiency of the ML approach versus classic parameterizations. This has to be done in a full model framework and should probably take into account hardware accelerators like GPUs. For the simple warm-rain ODE and using CPUs, the ML models are in fact more expensive than the classic parameterizations. But this might change dramatically for full mixed-phase microphysics schemes with an implementation and hardware tailored toward ML models. Hence, we do not want to draw any conclusions from the current study regarding computational performance or efficiency.