On the Relationship Between Shallow Cumulus Cloud Field Properties and Surface Solar Irradiance

Shallow cumulus clouds exhibit highly three‐dimensional (3‐D) spatial structure leading to complex variability in the surface solar irradiance (SSI) beneath. This variability is captured by the typically bimodal shape of the SSI probability density function (PDF). Using large eddy simulation to generate well‐resolved cloud fields and Monte Carlo 3‐D radiative transfer to reproduce realistic associated SSI PDFs, we seek direct relationships between the cloud field properties and the SSI PDF shape. Applying both random forest and artificial neural network algorithms, we find variations in the two modes of the SSI PDF are well predicted by just a handful of cloud field properties. The two algorithms utilize cloud properties similarly, with indistinguishable performance despite their different architectures. These results offer a marked improvement in realism relative to one‐dimensional radiative transfer while bypassing computationally expensive 3‐D radiative transfer, with immediate application to renewable energy assessments, and potential for several other geophysical applications.


Introduction
Solar irradiance drives our weather and climate (e.g., Stephens et al., 2012;Trenberth & Fasullo, 2012;Wild et al., 2015) and its distribution at the surface impacts crop yields (e.g., Campillo et al., 2012), creation and destruction of air pollution (e.g., Peterson & Flowers, 1977), and renewable energy production (e.g., Perez et al., 2016) to name but a few applications. When clouds are present, they are usually the primary modulator of surface solar irradiance (SSI). Omnipresent shallow cumulus clouds exhibit complex three-dimensional (3-D) spatial structure and evolve quickly (Berg & Kassianov, 2008;Lamer & Kollias, 2015) presenting both an intellectual and computational challenge to predict their influence on SSI.
Observed SSI beneath shallow cumuli typically manifests as a bimodal probability density function (PDF), representing separately cloud shadows and the gaps between Schmidt et al., 2007Schmidt et al., , 2009. For widely used one-dimensional (1-D) radiation schemes, which consider vertical columns independently and apply a plane-parallel approximation within each column, the observed PDF bimodality is not always reproduced even with well-resolved cloud fields from large eddy simulation (LES). However, for identical cloud fields, the bimodality can be reliably reproduced with Monte Carlo 3-D radiative transfer . While it is clear that the shape of the bimodal PDF is related to the properties of the cloud field, a direct mapping between the two has never been investigated. Such a mapping is of scientific interest to reveal relationships between complex cloud fields and their corresponding SSI, as well as practical value given the computational expense of Monte Carlo 3-D radiative transfer.
The aim of the present study is to examine the feasibility of a direct mapping between shallow cumulus cloud field properties and the shape of the corresponding SSI PDF. Cloud field properties are derived from LES and SSI from 3-D radiative transfer, providing known inputs and outputs and the need to learn the complex relationship between them. This problem is an ideal candidate for machine learning, which is increasingly being adopted in the atmospheric sciences (e.g., Glassmeier et al., 2019;Masuda et al., 2019;McGovern et al., 2017;O'Gorman & Dwyer, 2018;Thampi et al., 2017). One of the common criticisms of machine learning approaches is that they are used as a "black box" providing little physical interpretability. However, this paradigm is now being challenged (McGovern et al., 2019;Toms et al., 2020). Here we make a concerted effort to peer inside the black box.
The remainder of the manuscript is structured as follows. Section 2 describes our simulated cloud and SSI fields. Section 3 outlines our method of quantifying these fields and building relationships between them. Section 4 examines the quality of the derived relationships against an independent test data set. A short summary and concluding remarks are provided in section 5.

Data
In this study, we use LES of 30 nonprecipitating shallow cumulus cases, selected from the LES Atmospheric Radiation Measurement (ARM) Symbiotic Simulation and Observation activity (LASSO; Gustafson et al., 2020), spanning the summers of 2015 (5 days), 2016 (10 days), and 2017 (15 days). Simulations are performed using the System for Atmospheric Modeling (SAM; Khairoutdinov & Randall, 2003) with twomoment bulk microphysics (Morrison & Gettelman, 2008), a domain size of 24 × 24 × 15 km and a horizontal grid spacing of 100 m; further details are given in Gristey et al. (2020) and Glenn et al. (2020) who used an identical setup. Output is considered every tenth simulated minute throughout the duration of the runs from 07:00-22:00 local time (LT; Central Daylight Time or UTC-5). Specifically, we extract the following combination of radiatively relevant outputs that we determined are not strongly correlated (not shown): domain mean cloud fraction defined based on an optical depth of unity f c , the relative dispersion (standard deviation divided by the mean) of liquid water path for cloudy columns only D(LWP), and the mean in-cloud drop number concentration N c . We also calculate the mean of horizontally projected cloud areas A c and the mean distance from each cloud center to the cloud center of its nearest neighbor D c − NN . Together with the cosine of the solar zenith angle cos(SZA), these variables are henceforth referred to as the cloud field properties: f c , D(LWP), N c , A c , D c − NN , and cos(SZA).
Simultaneous 3-D LES output (specifically cloud liquid water content and drop number concentration) is then ingested into offline 3-D radiative transfer to compute SSI that is subsequently normalized by cos (SZA) to remove variations in the magnitude of incoming solar irradiance. We use the Education and Research 3-D Radiative Transfer Toolbox (EaR 3 T), which is based on the Monte Carlo Atmospheric Radiative Transfer Simulator (MCARaTS Iwabuchi, 2006), with an identical configuration as described by Gristey et al. (2020). Since some of the LASSO cases deviate substantially from shallow cumulus, we subset the LES 10-min output for 3-D radiative transfer using the following criteria: cloud fraction 5%-40%, no significant cloud ice, greater than 10 individual clouds in our domain, mean cloud size less than 2 km, and solar zenith angle less than 40°to capture the afternoon evolution of shallow cumulus where variations in the cloud field properties are most likely to influence the surface irradiance. After performing 3-D radiative transfer on this subset of LES, we have 531 snapshots of simultaneous cloud field properties and 3-D SSI PDFs for our analyses, as described next.

Quantifying the Shape of the 3-D SSI PDF
To provide a mapping to the SSI PDF, we first need to quantify its shape. To achieve this quantification, we create SSI PDFs using 5 W m −2 bins. This bin width is chosen such that each bin is sufficiently populated while variability across the SSI range remains well captured (e.g., blue bars in Figures 1c and 1d). We then fit distributions to each mode of the SSI PDF as follows.
For the larger irradiance mode that exhibits a tail toward maximum irradiance values, we found the three-parameter lognormal distribution fits the data well (right black line in Figures 1c and 1d): where ϴ is the location parameter (a horizontal shift of the distribution), s is the shape parameter (the standard deviation of the log of the distribution), and m is the scale parameter (the median of the distribution). Parameters are fitted from the start of the large irradiance mode to the last irradiance bin. The start of the large irradiance mode is defined as the first irradiance bin that has probability density more than half of the preceding bin counting backward from the maximum bin of the large irradiance mode; essentially this is where the probability density starts to increase rapidly.
For the smaller irradiance mode that is generally more symmetrical and bell-shaped, we found that a twoparameter normal distribution fits the data well (left black line in Figures 1c and 1d): (2) where μ is the location parameter (the mean of the distribution) and σ is the shape parameter (the standard deviation of the distribution). Parameters are fitted from the first irradiance bin to the end of the small irradiance mode. The end of the small irradiance mode is defined as the bin where the normalized SSI is twice the normalized SSI of the bin with the highest probability density within the small irradiance mode; essentially data are considered evenly either side of the maximum of the small irradiance mode.
We also quantify the weight of each mode, w 1 and w 2 , simply defined as the integral of the PDF under each mode. The data between the two modes are given by 1 − (w 1 + w 2 ) but typically account for a small fraction of the overall SSI PDF (around 10%) so are not considered further. In total, this provides seven parameters that quantify the shape of any given bimodal SSI PDF henceforth referred to as SSI PDF fit parameters: ϴ, s, m, μ, σ, w 1 , and w 2 .

Mapping Between Cloud Field Properties and SSI PDF Fit Parameters
Our analysis seeks to use cloud field properties (inputs) to predict the SSI PDF fit parameters (outputs). This presents a regression problem for which we use a random subset of 75% of the data to determine relationships (henceforth "training data") and retain 25% for testing (henceforth "test data"). To build these regression relationships, we consider two machine learning algorithms with fundamentally different architectures: a random forest (RF) and an artificial neural network (ANN).
Our RF implementation (Figure 2a) consists of a set of decision trees. During training, the "root" node of each decision tree has access to a random subset of the training data. Data are split at each node of the decision tree using a value of a single input cloud field property that results in the maximum reduction of mean-squared-error (MSE) of the outputs for samples in the two resulting branches. This process is repeated until the decision trees are fully grown. The mean outputs of samples in the final "leaf " nodes provide the predictions of each decision tree. When making predictions with the trained RF, the result is the mean of decision tree predictions.
Our ANN implementation (Figure 2b) consists of an input layer with each node representing one of inputs, an output layer with each node representing one of the outputs, and a number of hidden layers between. The hidden layers contain auxiliary nodes that apply nonlinear activation functions to transform their weighted input from the previous layer to output for the next layer. During training, the weights that connect nodes between each layer are learned by iteratively minimizing the MSE between the predicted output and the true output.
To infer the relative value of each input cloud field property in arriving at these trained algorithms we consider impurity-based and permutation-based importance metrics (Breiman, 2001). The impurity-based metric can only be applied to the RF and determines how effective questions about each cloud field property are at reducing the MSE. The permutation-based metric can be applied to both the RF and ANN and works by randomly permuting the cloud field properties one-by-one and assessing the degradation in the MSE of predictions. To compare with the impurity-based metric, only training data are used (although results from test data are similar) and degradations are scaled to a total of 100%.
Hyperparameters that control the model setup and complexity, such as the depth of decision trees in the RF or number of hidden layers in the ANN, must be defined for both of these machine learning algorithms before training commences. We perform tuning to determine appropriate hyperparameters (see Text S1 in the supporting information).

Results
After training the machine learning models described in the previous section, we now evaluate their performance. Evaluations are performed by making predictions on the independent test data that were held out from the training.
Predictions by the RF for each SSI PDF fit parameter (Figures 3a-3g) show positive correlations with their truth counterparts, indicating model skill. Prediction accuracies are greater than 80% for all parameters and data points fall approximately evenly either side of the 1:1 line giving relatively small biases. Closer inspection reveals that weight parameters (Figures 3f and 3g) are better predicted than the other parameters (Figures 3a-3e). This can be physically interpreted as the model doing better at predicting the fractional surface covered by cloud shadows, as opposed to predicting the magnitude of the irradiance within those cloud shadows or the clear sky between them. Some of the largest discrepancies occur for the θ parameter, where predictions at low magnitudes are overestimated. We found that these discrepancies are not associated with any particular case or any particular cloud field property (not shown), suggesting the training data set may benefit from more samples of these outliers.
Predictions by the ANN for each SSI PDF fit parameter (Figures 3h-3n) are remarkably similar to predictions by the RF (Figures 3a-3g) despite the different model architectures (Figure 2). Consistent with the RF predictions, ANN prediction accuracies are always greater than 80%, biases are small, and weight parameters are better predicted than other parameters. Comparing these statistics and the root-mean-squarederror (RMSE) between ANN and RF predictions (bottom right of plots in Figure 3) suggests that neither is consistently better.
The predicted SSI PDF fit parameters in Figure 3 can also be used in Equations 1 and 2 to reconstruct the SSI PDFs (Figure 4). We find that the bimodal shape of the SSI PDFs is well reproduced. Predictions from both the RF and the ANN are able to capture variations in shape, size and location of both modes but neither method shows consistent improvement over the other. For context, commonly used 1-D radiation schemes do not capture the bimodal shape at all, let alone the variations in the details of each mode (see Gristey et al., 2020Figure 4 or Schmidt et al., 2007. Figures 3 and 4 therefore demonstrate, both quantitatively and qualitatively, that machine learning methods using just a few shallow cumulus cloud field properties as input can produce more realistic SSI PDFs than 1-D radiative transfer. This is expected to be highly complementary to existing approaches for parameterizing 3-D radiative effects (e.g., Hogan & Bozzo, 2018;Jakub & Mayer, 2015) since, although we only consider a limited set of conditions, we provide a direct mapping to Monte Carlo results.
Finally, we comment on the relative value of the various input cloud field properties (Table 1). For the RF, the impurity-based and permutationbased importance both indicate that f C is the most valuable input followed by D(LWP). The permutation-based importance assigns slightly larger relative value to f C at the expense of the lesser important inputs, but the two metrics are generally consistent. For the ANN, the permutation-based importance also indicates f C is the most valuable input followed by D(LWP). The ANN assigns larger value to A C but, otherwise, the relative values are comparable. The dominance of f C is not surprising given this is the primary factor that separates the two modes of the SSI PDF and that f C is closely related to the relative area under the two modes (see Text S2). This is additionally supported by Tables S1 and S2 that show the relative value of f C for the weight parameters is >90%, as well as other interesting features such as the enhanced value of D(LWP) for σ (the standard deviation of the small irradiance mode), or the enhanced value of cos(SZA) for ϴ (the start of the large irradiance mode). One notable result is that even the lesser important inputs for both the RF and ANN appear to provide predictive value. This demonstrates the complexity of the 3-D SSI PDF; detailed cloud characteristics such as D C − NN and A C matter.

Summary and Conclusions
Prediction of SSI variability, as captured by its PDF shape, is important for many applications but can be challenging beneath shallow cumulus clouds. We propose a novel approach to predict SSI PDFs from shallow cumulus cloud field properties using machine learning.
RF and ANN algorithms are trained to map directly between cloud field properties from LES and SSI PDFs from Monte Carlo 3-D radiative transfer. Predictions by the RF and ANN on independent test data are capable of capturing not only the bimodal shape of the SSI PDF but also variations in the shape and size of each mode. This presents a substantial improvement in realism of SSI relative to commonly used 1-D radiation schemes that do not capture these features. The two algorithms utilize the input cloud field properties similarly and perform indistinguishably well, despite their different architectures.
The results demonstrate that SSI PDFs beneath shallow cumulus clouds can be reproduced with both accuracy and efficiency. This has direct application to modeling assessments of solar renewable energy, providing a realistic frequency distribution of SSI that can be combined with photovoltaic cell characteristics. This also paves the way for several other potential applications, for example, in LES to parameterize 3-D radiative effects or in NWP to correct for 3-D SSI biases and access subgrid SSI variability. However, several challenges lay ahead for such applications including spatial redistribution of the SSI PDF in any potential LES parameterization or the fact that some of the cloud field properties used here may not be easily accessible at NWP scales. A similar argument holds for using cloud properties derived from passive observations, which are likely to represent different physical quantities and suffer from their own biases due to 3-D radiative effects. We expect these issues can be overcome by extending the work here to examine spatial distributions or consider other closely related cloud field properties. This, as well as implementing these approaches for other cloud regimes will be the focus of future work.

Data Availability Statement
LASSO data can be accessed at https://archive.arm.gov/lassobrowser. SAM is available at http://rossby. msrc.sunysb.edu/~marat/SAM.html. EaR3T is openly available at http://doi.org/10.5281/zenodo.4110565 (funding from the National Aeronautics and Space Administration grant 80NSSC18K0146 [CAMP 2 Ex]). Note. The permutation-based importance metric includes ± one standard deviation from 100 random permutations. Permutation-based importance for each SSI PDF fit parameter separately is provided in Tables S1 and S2.