Evaluation of Cloud and Precipitation Simulations in CAM6 and AM4 Using Observations Over the Southern Ocean

This study uses cloud and radiative properties collected from in situ and remote sensing instruments during two coordinated campaigns over the Southern Ocean between Tasmania and Antarctica in January–February 2018 to evaluate clouds and precipitation in nudged‐meteorology simulations with the CAM6 and AM4 global climate models, sampled at the times and locations of the observations. Fifteen SOCRATES research flights sampled cloud water content, cloud droplet number concentration, and particle size distributions in mixed‐phase boundary layer clouds at temperatures down to −25°C. The 6‐week CAPRICORN2 research cruise encountered all cloud regimes across the region. Data from the deployed vertically pointing 94 GHz radars were compared with radar simulator output from both models. Satellite data were compared with simulated top‐of‐atmosphere (TOA) radiative fluxes. Both models simulate observed cloud properties fairly well within the variability of observations. Cloud base and top in both models are generally biased low. CAM6 overestimates cloud occurrence and optical thickness while cloud droplet number concentrations are biased low, leading to excessive TOA reflected shortwave radiation. In general, low clouds in CAM6 precipitate at the same frequency but are more homogeneous compared to observations. Deep clouds are better simulated but produce snow too frequently. AM4 underestimates cloud occurrence but overestimates cloud optical thickness even more than CAM6, causing excessive outgoing longwave radiation fluxes but comparable reflected shortwave radiation. AM4 cloud droplet number concentrations match observations better than CAM6. Precipitating low and deep clouds in AM4 have too little snow. Further investigation of these microphysical biases is needed for both models.


Introduction
General circulation models (GCMs) are challenged by uncertainties and biases in the simulation of Southern Ocean clouds, aerosols, and precipitation, and these uncertainties affect simulated global cloud feedback on climate change. The clouds simulated by GCMs participating in the third and fifth Coupled Model Intercomparison Projects (CMIP3 and CMIP5; Meehl et al., 2005) mostly reflected too little sunlight back to space over the Southern Ocean (45-65°S) (Ceppi et al., 2012; Trenberth & Fasullo, 2010; Williams et al., 2013). Bodas-Salcedo et al. (2014) and others identified insufficient low cloud cover and insufficient supercooled liquid water in the cold sector of frontal cyclonic systems as likely causes of this bias. Trenberth and Fasullo (2010) suggested that too little low cloud in the current climate might cause an underestimation of positive low cloud feedback on future climate change over this region. Models which glaciate mixed-phase clouds at overly warm temperatures also have a spuriously negative high-latitude cloud optical depth feedback, driven by a simulated warming-induced transition from ice-dominated to liquid-dominated low clouds, while satellite observations suggest these clouds are already liquid-dominated (Gordon & Klein, 2014; McCoy et al., 2016; Tan et al., 2016; Terai et al., 2016). Improved simulation of Southern Ocean clouds in climate models will help us to better simulate the radiative energy budget in the current climate and to make more reliable future projections of Earth's climate.
Several recent GCM sensitivity studies have shown that the Southern Ocean (SO) cloud bias can be substantially reduced by inhibiting several uncertain stratiform and convective cloud microphysical processes that can glaciate mixed-phase SO clouds (Bodas-Salcedo et al., 2019; Gettelman et al., 2019; Kay et al., 2016). This may have led the Coupled Model Intercomparison Project Phase 6 (CMIP6) versions of several GCMs (Eyring et al., 2016) with revised treatments of mixed-phase clouds to have more positive global cloud feedback than their CMIP5 counterparts (Bodas-Salcedo et al., 2019; Gettelman et al., 2019; Zelinka et al., 2020).
Until recently, there were very few in situ observations available to test and constrain such modeling choices. Satellite observations from active and passive sensors are an invaluable resource, but they have interpretational uncertainties that need to be anchored by in situ measurements. An evaluation of the CMIP6 GCM simulations of SO clouds and precipitation based on in situ observations coordinated with collocated active remote sensing is a key step for future improvement of cloud representations in the models.
Motivated by this, two coordinated field studies were conducted over the sector of the Southern Ocean between Tasmania and the Antarctic sea ice edge in January-February 2018: (1) a U.S. aircraft study based in Hobart, Tasmania, the Southern Ocean Clouds, Radiation, Aerosol Transport Experimental Study (SOCRATES) and (2) an Australian ship-based study, the second Clouds, Aerosols, Precipitation, Radiation, and atmospheric Composition Over the southeRn ocean field study (CAPRICORN2). These two studies used complementary sampling strategies. The research flights targeted weather regimes with low-lying clouds at altitudes below 4 km during daytime and primarily in cyclone cold sectors, providing detailed multivariate spatial cross sections through complex cloud fields but no temporal continuity. The ship sampled all weather regimes and times of day, but its only in situ measurements above the surface were radiosondes (4-8 times daily). Both platforms had vertically pointing cloud radar and lidar. The data from these two studies pair well because they test different aspects of GCM simulations.
One of the key novel features of this paper is the use of this diverse and unique range of in situ and remote sensing measurements, together with satellite measurements, to characterize Southern Ocean mixed-phase cloud and radiative properties. We perform a detailed comparison of observations with the atmospheric components of two state-of-the-art GCMs, with the intention of identifying the strengths and deficiencies of the models and shedding some light on potential solutions. The cloud and radiative properties of interest include the following: Southern Ocean cloud morphology, cloud and precipitation occurrence and frequency, cloud droplet number concentration (N_d), hydrometeor size distribution, and shortwave (SW) and longwave (LW) radiative effects at the top of atmosphere (TOA). Radiosondes launched on the ship and dropsondes from the aircraft map out the tropospheric relative humidity field. The two GCMs evaluated in this paper are as follows: The Community Atmosphere Model version 6 (CAM6, Bogenschutz et al., 2018) is the atmospheric component of Version 2 of the Community Earth System Model (CESM2), developed by the National Center for Atmospheric Research (NCAR) and many other partners. The Atmosphere Model Version 4 (AM4, Zhao et al., 2018) is part of the CM4 climate model (Held et al., 2019) and the ESM4 Earth system model (Dunne et al., 2020) developed by the Geophysical Fluid Dynamics Laboratory (GFDL).
A centerpiece of our approach for comparing GCMs with observations is the use of nudged-meteorology simulations in which the GCM winds and temperature field are lightly nudged with a 24-hr time scale toward reanalysis, while other simulated fields (e.g., humidity, clouds, aerosols, and precipitation) are not nudged and freely evolve. This allows us to focus on model errors in water processes that are probably derived from the local action of physical parameterizations rather than an incorrect synoptic environment.
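The nudging described above can be pictured as a Newtonian relaxation term added to the model tendency. The following is a minimal sketch under stated assumptions, not the CAM6 or AM4 implementation: the function names, the forward-Euler time stepping, and the scalar example values are all hypothetical; only the 24-hr time scale and the 30-min step length come from the text.

```python
import numpy as np

TAU_S = 24 * 3600.0  # 24-hr relaxation time scale quoted in the text


def nudging_tendency(model_value, reanalysis_value, tau_s=TAU_S):
    """Relaxation tendency (per second) pulling the model toward reanalysis."""
    return -(np.asarray(model_value, dtype=float)
             - np.asarray(reanalysis_value, dtype=float)) / tau_s


def step_nudged(model_value, reanalysis_value, physics_tendency, dt_s, tau_s=TAU_S):
    """One forward-Euler step (an assumption): model tendency plus nudging."""
    total = physics_tendency + nudging_tendency(model_value, reanalysis_value, tau_s)
    return model_value + dt_s * total


# Hypothetical example: a 2 K temperature error with no other tendencies acting.
t_model, t_ref = 272.0, 270.0
for _ in range(48):  # 48 steps of 30 min = 24 hr
    t_model = float(step_nudged(t_model, t_ref, 0.0, 1800.0))
remaining_error = t_model - t_ref  # decays to roughly 1/e of the initial 2 K
```

With such a long time scale the nudging term is small compared with fast physical tendencies, which is why un-nudged fields such as humidity and clouds remain a meaningful test of the parameterizations.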
The models are sampled along the same paths followed by the plane and the ship, so that every observation can be meaningfully compared with model output at the same simulated time and place, without need for compositing or other statistical averaging, similar to Bretherton et al. (2019) and Wu et al. (2017). The nudged-meteorology approach is particularly useful in capturing the rapidly evolving storm systems of the SO.
Recently Gettelman et al. (2020) used SOCRATES and satellite measurements to look at cloud location, cloud phase, and boundary layer structure in CAM6 simulations and evaluate the improvement of CAM6 simulations compared to CAM5 using monthly averaged satellite retrievals. They found that improvements to the ice nucleation scheme in CAM6 result in significant improvements in the representation of supercooled liquid water. Our paper complements Gettelman et al. (2020) by assessing cloud and precipitation occurrence and its radiative impacts from a more statistical perspective, and combines unique CAPRICORN2 data and radar simulators for a comprehensive assessment.

The remainder of this paper is organized as follows. Section 2 describes our observations and models, including more detail on the nudged-meteorology approach taken here. Section 3 evaluates the macrophysical and radiative properties of cloud and precipitation in CAM6 and AM4, including temperature, humidity, cloud water and precipitation, low cloud occurrence, low and deep cloud macrophysics inferred from observed and simulated radar reflectivities, and TOA upwelling SW and OLR. Section 4 discusses microphysical properties of clouds and precipitation in CAM6 and AM4 simulations, including hydrometeor size distributions, cloud droplet number concentration, and hydrometeor microphysics for low and deep clouds inferred from COSP reflectivity decomposition. Section 5 presents conclusions.

SOCRATES and CAPRICORN2 Measurements
During the SOCRATES campaign, 15 research flights of the U.S. National Science Foundation Gulfstream V (GV) research aircraft (UCAR/NCAR - Earth Observing Laboratory (EOL), 2005) were conducted from Hobart, Tasmania (42°S, 147°E) out over the Southern Ocean between 15 January and 24 February 2018. The GV aircraft flew roughly southward at its ferry altitude of 6 km to a southernmost waypoint, typically near 58-62°S, chosen to optimize sampling of cold-sector boundary layer stratocumulus and cumulus. The GV then descended to conduct standardized sampling modules during the generally northbound return legs. Each 45-50 min module, spanning 400-500 km, was made up of 10-min above-cloud, in-cloud, and below-cloud (150-200 m altitude) legs, and a sawtooth leg consisting of an ascent to 600 m above cloud top, a descent to 150 m above sea surface, and another ascent above the cloud top. Over 70% of the in situ sampling from vertical profiles of SOCRATES was made in single or multiple stratiform layer clouds, followed by ~20% in cumulus rising into stratocumulus, and 5% in open cell cumulus (Atlas et al., 2020). A comprehensive suite of instrumentation for sampling mixed-phase cloud, aerosols, and turbulence was deployed (https://www.eol.ucar.edu/content/socratesaircraft-payload), as well as a vertically pointing cloud radar and lidar and dropsondes.
The primary in situ instruments used in the current study are the Vertical-Cavity Surface-Emitting Laser (VCSEL; SouthWest Sciences (SWS) & UCAR/NCAR - Earth Observing Laboratory (EOL), 2008), the Cloud Droplet Probe (CDP), and the Two-Dimensional Stereo probe (2DS; Wu & McFarquhar, 2019). The VCSEL reported relative humidity (RH), derived as the ratio of the measured water vapor pressure to the saturation vapor pressure over liquid water at the ambient temperature (per Wexler's formula; Wexler, 1976), at a 25 Hz temporal resolution. HARCO heated total air temperature sensors were used to measure temperature (T) at 25 Hz.
We use GV remote sensing measurements from the 94 GHz (W band) HIAPER cloud radar (HCR; EOL, 2014) and the high spectral resolution lidar (HSRL; EOL, 2010). The radar and HSRL operated at a 2 Hz temporal resolution and could be manually switched to point up or down. The goal was generally to point toward the nearest clouds. Both instruments have a minimum range or "dead zone" of 150-200 m from the plane, but this was rarely an issue unless the aircraft was flying within a thin cloud layer. Past its dead zone, the HSRL could detect the occurrence of essentially all clouds (with attenuation for thicker clouds), even when the aircraft was flying at its ferry altitude of 6 km. Thus, in this study the combination of the HSRL and the in situ aircraft cloud probes was used to determine lower-tropospheric cloud occurrence. The CDP measured liquid water content and cloud droplet size distribution from 1-50 μm at a sampling rate of 10 Hz. The 2DS provided hydrometeor images, from which data processing software synthesized cloud and precipitation size distributions from 10-1,028 μm diameter.
The CAPRICORN2 cruise of Australia's Research Vessel (RV) Investigator spanned 10 January to 21 February 2018. It was a sequel to earlier voyages in 20-29 March 2015 and March-April 2016 described in Mace and Protat (2018) and Protat et al. (2017). We use radar reflectivity profiles collected by an onboard calibrated 95 GHz W band vertically pointing cloud radar (see Mace & Protat, 2018 for more details). The radar reflectivity has been corrected for wet radome attenuation. We also use radiosondes from the cruise.
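As an illustration of the RH definition above, the sketch below computes RH with respect to liquid water from a vapor pressure and a temperature. It is only a sketch under stated assumptions: it substitutes a widely used Magnus-type approximation for the saturation vapor pressure rather than Wexler's (1976) formulation used for the VCSEL data, and the function names and the choice of vapor pressure in hPa as input are hypothetical.

```python
import math


def saturation_vapor_pressure_hpa(temp_c):
    """Saturation vapor pressure over liquid water in hPa.

    Magnus-type approximation (an assumption made for brevity);
    this is NOT Wexler's (1976) formula.
    """
    return 6.112 * math.exp(17.62 * temp_c / (243.12 + temp_c))


def relative_humidity_percent(vapor_pressure_hpa, temp_c):
    """RH with respect to liquid water, in percent."""
    return 100.0 * vapor_pressure_hpa / saturation_vapor_pressure_hpa(temp_c)


# Hypothetical sample: 1.5 hPa of water vapor at -15 degC (supercooled conditions).
rh_example = relative_humidity_percent(1.5, -15.0)
```

Because saturation is taken over liquid water even below 0°C, values near 100% indicate conditions that can sustain supercooled liquid cloud.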

10.1029/2020EA001241
Earth and Space Science

Satellite Measurements
To assess the GCM-simulated top-of-atmosphere (TOA) radiative fluxes, we use edition 4A of National Aeronautics and Space Administration (NASA) Clouds and the Earth's Radiant Energy System (CERES; Wielicki et al., 1996) synoptic (SYN) cloud and radiation products (Doelling et al., 2013; Rutan et al., 2015). We use the hourly TOA fluxes of reflected shortwave radiation (RSW) and outgoing longwave radiation (OLR). The CERES SYN data are available on a 1° × 1° grid (https://ceres.larc.nasa.gov/products.php?product=SYN1deg). We extract the nearest grid points to the contemporaneous aircraft and ship locations for comparison with models.
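The nearest-grid-point extraction can be sketched as follows. This is an illustrative example only: the function name, the assumption that grid cells are centered on half degrees with longitudes running from 0° to 360°, and the sample platform position are hypothetical and not taken from the CERES product documentation.

```python
import numpy as np


def nearest_grid_index(lat, lon, grid_lats, grid_lons):
    """Return (i, j) of the grid cell center closest to a platform position.

    Longitude differences are wrapped so that positions near the
    0/360 degree seam are matched correctly.
    """
    i = int(np.argmin(np.abs(grid_lats - lat)))
    dlon = np.abs((grid_lons - (lon % 360.0) + 180.0) % 360.0 - 180.0)
    j = int(np.argmin(dlon))
    return i, j


# Assumed 1 x 1 degree grid with cell centers at half degrees.
grid_lats = np.arange(-89.5, 90.0, 1.0)
grid_lons = np.arange(0.5, 360.0, 1.0)

# Hypothetical aircraft position near Hobart.
i, j = nearest_grid_index(-42.9, 147.3, grid_lats, grid_lons)
nearest_lat, nearest_lon = grid_lats[i], grid_lons[j]
```

The same lookup applied at each observation time, together with selecting the matching hourly flux record, would give a satellite time series along the flight or ship track.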

GCMs

CAM6 Model Description
CAM6 was comprehensively described in Bogenschutz et al. (2018) and Gettelman et al. (2019). This section summarizes key features of CAM6 for this study. CAM6 implements the Cloud Layers Unified By Binormals (CLUBB, Golaz et al., 2002; Larson et al., 2002) parameterization to replace the planetary boundary layer, shallow convection, and cloud macrophysical parameterization schemes used in CAM5. The unified CLUBB scheme bypasses the complexity of interactions between schemes to improve performance for the simulation of boundary layer clouds, especially of intermediate types of regimes such as the stratocumulus to cumulus transition (Bogenschutz et al., 2013; Guo et al., 2015). CAM6 retains the deep convection scheme of Zhang and McFarlane (1995) used in CAM4 and CAM5. The precipitation from the CLUBB and deep convection schemes is referred to as large-scale (stratiform) and convective precipitation, respectively. CLUBB diagnoses cloud fraction and cloud liquid water from a joint double-Gaussian probability density function (PDF). Ice and liquid cloud fractions in CLUBB are the same and are analytically diagnosed by integrating oversaturated portions of the joint PDF (Guo et al., 2014). The total cloud fraction in CAM6 combines CLUBB and deep convective cloud cover fractions, and an ice cloud fraction assuming maximum overlap.
The CAM6 microphysics package incorporates a two-moment scheme for four classes (liquid, ice, and large scale rain and snow) with updated ice nucleation parameterization, MG2 (Gettelman & Morrison, 2015). MG2 is coupled to a physically based mixed phase ice nucleation scheme (Hoose et al., 2010) implemented in CAM6 with modifications for a PDF of contact angle by Wang et al. (2014). MG2 accounts for preexisting ice during cirrus ice nucleation (Shi et al., 2015).
Aerosols are predicted by a four-mode version of the Modal Aerosol Module (MAM4) (Liu et al., 2016), initialized based on climatological profiles in year 2000 from CMIP6 emissions inventory. The activation of aerosols into cloud droplets in CAM6 is diagnosed as a function of the modeled subgrid-scale updraft velocity and aerosol compositions and size distribution (Abdul-Razzak & Ghan, 2000).
The CAM6 simulations in this paper are run with prescribed sea surface temperature. A finite-volume (FV) dynamical core of 0.9° latitude × 1.25° longitude resolution is used with 32 vertical levels and a model time step of 30 min. To facilitate model evaluation against observations, CAM6 was run in a nudged configuration using the NASA Modern-Era Retrospective analysis for Research and Applications version 2 (MERRA-2; Molod et al., 2015; Rienecker et al., 2011) horizontal winds, temperature, and monthly mean sea surface temperature (SST) with a relaxation time scale of 24 hr. MERRA-2 nudging fields are interpolated to the CAM6 vertical levels before nudging. The CAM6 simulation is performed starting on 1 January 2017, to ensure proper spin-up of aerosol and land surface fields well before any observational comparisons. Model outputs along the tracks of the aircraft and ship (specifically, from the nearest model grid points to the current ship and aircraft locations) are calculated in-line and output at time steps of 1 and 10 min, respectively.

AM4 Model Description
AM4 was comprehensively described by Zhao et al. (2018). Here we summarize those physical parameterizations from the model that are particularly relevant to its simulation of Southern Ocean clouds and aerosols. AM4 uses a double plume shallow convection scheme adapted from Bretherton et al. (2004), and a deep convection scheme based on a cloud work function relaxation closure (Zhao et al., 2018). The macrophysical scheme of large-scale clouds in AM4 follows Tiedtke (1993). Cloud water content and fractional cloud cover are described prognostically by large-scale budget equations. The increase in cloud cover is determined by the fraction of the cloud-free area exceeding saturation. AM4 implements a one-moment microphysics scheme for liquid water following Rotstayn (1997) and Rotstayn et al. (2000) with the inclusion of a prognostic scheme for cloud droplet number concentration (Ming et al., 2007), as in AM3. A rain profile is diagnosed at each time step from the cloud properties (Rotstayn, 1997).
Ice is predicted from water vapor diffusion at the expense of liquid water (the Wegener-Bergeron-Findeisen process) and homogeneous freezing of liquid water at temperatures colder than −40°C. Ice melts to form liquid water at temperatures warmer than 0°C. In AM4, there is no distinction between falling ice, snowflakes, and graupel. All forms of atmospheric ice are represented by a single variable. The ice particles fall with a mass-weighted mean velocity calculated assuming fall speed is proportional to the 0.16 power of particle diameter. Falling ice particles are approximated by a negative exponential distribution with effective radius determined by temperature that ranges from 15-100 μm (Donner et al., 1997).
Aerosols in AM4 are predicted based on climatological sources in year 2016 from the CMIP6 emissions inventory; only the mass is prognosed for each aerosol type with a fixed assumed size distribution (Zhao et al., 2018). The activation of aerosols into droplets uses the parameterization of Ming et al. (2006).
AM4 uses the GFDL Finite-Volume Cubed-Sphere dynamical core (FV3; Harris & Lin, 2013; Putman & Lin, 2007) with a grid of ~100 km horizontal resolution and 33 vertical levels. For the simulations presented here, AM4 was run in a nudged configuration (Jeuken et al., 1996) similar to that used for CAM6, with the same 24-hr relaxation time scale, but nudged to horizontal winds, temperature, and surface pressure from the fifth generation of the European Centre for Medium-Range Weather Forecasts (ECMWF) atmospheric reanalysis of the global climate (ERA5; Hersbach & Dee, 2016). Like CAM6, the AM4 simulation starts on 1 January 2017. Data are output every 3 hr for radiation fields and 1 hr for other quantities. The nearest model grid points to the ship and aircraft locations were extracted from the AM4 simulations by linearly interpolating to the observation point for comparison with observations and CAM6.

COSP Radar Simulator
Within each grid column, the profiles of cloud and precipitation are converted to profiles of synthetic radar reflectivity using implementations of the Cloud Feedback Model Intercomparison Project (CFMIP) Observation Simulator Package (COSP; Bodas-Salcedo et al., 2011) in the two GCMs. CAM6 and AM4 use COSP Versions 2.1 and 1.4.1, respectively (Bodas-Salcedo et al., 2011;Swales et al., 2018), but there is no crucial scientific difference between COSP versions. In this study, we focus on use of the CloudSat simulator within COSP. It provides synthetic radar reflectivity at a frequency of 94 GHz and can be compared with the observed W band reflectivity.
The implementation of COSP in a GCM usually makes some additional model-specific assumptions that are not part of the GCM, are not necessarily well documented, and which may impact the synthetic radar reflectivity. For example, the hydrometeor size distribution assumptions can be slightly different between COSP and the parent GCM microphysics scheme. In the CAM6 COSP, all hydrometeors are described with modified gamma distributions. In the CAM6 microphysics scheme, cloud drops are described with a gamma distribution while ice, rain, and snow are assumed to have exponential distributions (gamma with μ = 0). The AM4 microphysics scheme has a single ice category that includes both cloud ice and snow and has an aggregate fall speed. In this sense, snow is simply falling ice. AM4 treats the total ice and snow concentration as cloud ice in COSP, which is assigned to have the temperature-determined effective radii of cloud ice particles in AM4. Furthermore, the clear-sky ice flux (flux of ice outside of cloud entering the unsaturated portion of the grid box from above) is used for snow in COSP with effective radii computed internally in COSP. Snow inside clouds is not accounted for explicitly. The impact on the synthetic radar reflectivity of these differences in the assumptions made between COSP and the GCM microphysics scheme is discussed in Appendix B.
The COSP interface varies between host models. CAM6 uses COSP's default column generator to produce 10 homogeneous subcolumns, while AM4 treats the subgrid cloud and precipitation fields from the radiation scheme as the COSP subcolumns, rather than using the default COSP subcolumn generator. We observed little difference between the subcolumns. The insufficient subcolumn variability in COSP's default subcolumn generator may lead to overestimated radar reflectivity and probability of precipitation compared to the satellite observations (Song et al., 2018).

Macrophysical and Radiative Properties of Clouds and Precipitation in CAM6 and AM4 Simulations
In this section, we use in situ and remote sensing observations from SOCRATES and CAPRICORN2, and the NASA CERES SYN product, to evaluate the macrophysical and radiative properties of clouds and precipitation in CAM6 and AM4. SOCRATES sampling focused on low clouds with cloud top height lower than 4 km and little or no precipitation falling from any overlying clouds through the 4 km level. We select RF09 (a case of cumulus rising into stratocumulus) and RF12 (a stratocumulus case) as two SOCRATES examples to demonstrate single-flight comparisons of observations and GCM simulations of shallow cumulus and stratocumulus regions. The upward pointing 94 GHz shipborne radar deployed on the R/V Investigator during CAPRICORN2 sampled whatever clouds were overhead, including many periods of deep clouds with cloud tops above 4 km that were not targeted in SOCRATES. We use this unique radar data set to evaluate the representation of both deep and low clouds in the GCMs. Figure 1 shows time-height plots of T along Flight RF09 (inside the black channel) overlaid on the corresponding fields simulated by CAM6 and AM4, respectively. The temperature field (and the relative humidity field in section 3.2) is plotted from 0-8 km altitude to encompass the ferry leg and provide synoptic-scale context. RF09 (Figures 1a and 1b) targeted an extensive deck of cold, low-level cloud in the cold sector of a midlatitude cyclone south and east of Tasmania. Two sampling modules were completed in the cold sector regions south of 50°S. The boundary layer cloud tops, at a height of 2.5 km, have a cloud top temperature near −15°C. RF12 (Figures 1c and 1d) targeted an extensive stratocumulus case south of 55°S. The stratocumulus deck topped a fairly well-mixed 1,500 m deep boundary layer, with a cloud top temperature around −9°C capped by a 5°C temperature inversion. The cloud deck was in the cold sector of a weak cyclone.
Figure 1 reveals that the temperature in the two nudged GCM simulations agrees with the in situ observations to within 1-2°C. Since temperature is a nudged field, this indicates that the nudged-meteorology approach is working as hoped.

Temperature
To test the accuracy of the large-scale meteorology in the GCMs, the root-mean-square (RMS) error between the SOCRATES observations and the GCMs was calculated. Across all campaign flights, observed temperatures along the flight track were averaged over 50 m bins in altitude during each 2 min time interval. The two nudged GCMs were similarly sampled. CAM6 and AM4 had RMS temperature errors of 1.3 and 1.4 K, respectively, remarkably small considering the remote sampling region and large synoptic variability. This is mostly a testament to the accuracy of the reanalyses to which the GCMs were being nudged (which match the observations within even smaller RMS errors of less than 1 K). However, it also shows that both GCMs are very good short-term weather forecast models that are able to retain this level of accuracy for at least a day (the nudging time scale).
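The RMS error between collocated model and observed samples can be computed as in the sketch below. It assumes that the observed and simulated values have already been averaged into matching altitude-time bins and stored as aligned arrays, with NaN marking bins lacking data; the function name and the sample values are hypothetical.

```python
import numpy as np


def rms_error(model, obs):
    """Root-mean-square difference over bins where both values are finite."""
    model = np.asarray(model, dtype=float)
    obs = np.asarray(obs, dtype=float)
    valid = np.isfinite(model) & np.isfinite(obs)
    return float(np.sqrt(np.mean((model[valid] - obs[valid]) ** 2)))


# Hypothetical binned temperatures (K); the third bin has no model sample.
model_t = [271.0, 265.0, np.nan]
obs_t = [270.0, 263.0, 260.0]
temperature_rmse = rms_error(model_t, obs_t)
```

The same function applies unchanged to binned RH, which allows the temperature and humidity errors to be compared on an equal footing.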

Relative Humidity
Relative humidity (Figure 2) is important for producing clouds. It is a more challenging test for the nudged GCM simulations, since their humidity fields are not constrained with reanalysis data. For both observations and models, the RH in this paper is computed based on liquid saturation. In RF09 (Figures 2a and 2b), the high-RH boundary layer is capped by dry, low-RH, subsiding air above 2.5 km. The free-tropospheric RH is fairly well simulated by both models. Inside the boundary layer, the observed RH is horizontally variable, and is relatively low in the ascent portion of a cloud-free sawtooth near 57°S in the return leg of RF09. This is suggestive of shallow cumulus rising into a broken stratocumulus layer, a common cold-sector cloud type. Both models capture the boundary layer depth qualitatively well except that they underestimate RH at the top of the boundary layer. The boundary layer RH in CAM6 is comparable to observations (Figure 2a), but the AM4 boundary layer is drier than observed (Figure 2b). In RF12 (Figures 2c and 2d), both CAM6 and AM4 clearly show low RH at the top of the boundary layer, suggesting that the boundary layers in these GCMs are too shallow. Figure 3a shows a time-height section of RH during the 1-15 February 2018 period of the CAPRICORN2 campaign. The regions with high RH (>80%) are correlated with the cloud regimes (as will be shown in section 3.5). The corresponding RH for CAM6 and AM4 is shown in Figures 3b and 3c. Both models qualitatively reproduce the RH profiles for low cloud regimes sampled along the ship track. CAM6 slightly overestimates the observed RH in regions of deep cloud while AM4 substantially underestimates RH in those regions.
We use RH as a measure of humidity, since it has comparable variability and RMS errors across the whole range of sampled heights. Across all flight samples, the RMS error of RH is 23% and 22% for CAM6 and AM4, respectively. For comparison, the ERA5 and MERRA-2 reanalyses had slightly smaller RMS RH errors of 17% and 19%. Such errors are large enough to affect the existence and placement of cloud layers, even in a GCM with perfect microphysics.

Cloud Water Content and Precipitation
Figure 4 shows observed and modeled in-cloud cloud water content (CWC), and precipitating particle number density (NLarge, described below) during RF09 and RF12. The microphysical fields are only plotted over the 0-4 km altitude range to highlight low clouds and their environment. This is an even more challenging comparison for the models because it requires the models to have both accurate cloud placement and cloud microphysics. The observed CWC is taken from the GV CDP and is plotted when its value exceeds 0.01 g m⁻³. CWC less than 0.01 g m⁻³ is masked in gray. For the GCMs, the cloud-containing grid cells are distinguished from clear-sky grid cells by having nonzero cloud water mixing ratio and the in-cloud water content is calculated by dividing grid-mean cloud water content by simulated cloud fraction. To be consistent with observations, the GCM CWC is plotted when its value exceeds 0.01 g m⁻³.
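The conversion from grid-mean to in-cloud CWC, with the 0.01 g m⁻³ plotting threshold, can be sketched as below. This is a hedged illustration rather than the processing code used in the study: the function name and array layout are hypothetical, and a positive grid-mean CWC is used here as a stand-in for the nonzero cloud water mixing ratio criterion.

```python
import numpy as np

CWC_THRESHOLD = 0.01  # g m^-3, threshold quoted in the text


def in_cloud_cwc(grid_mean_cwc, cloud_fraction, threshold=CWC_THRESHOLD):
    """Divide grid-mean CWC by cloud fraction; mask clear or sub-threshold cells."""
    grid_mean_cwc = np.asarray(grid_mean_cwc, dtype=float)
    cloud_fraction = np.asarray(cloud_fraction, dtype=float)
    cloudy = (grid_mean_cwc > 0.0) & (cloud_fraction > 0.0)
    ratio = np.zeros_like(grid_mean_cwc)
    # Only divide where the cell contains cloud, avoiding division by zero.
    np.divide(grid_mean_cwc, cloud_fraction, out=ratio, where=cloudy)
    return np.where(cloudy & (ratio > threshold), ratio, np.nan)


# Hypothetical grid cells: cloudy, clear, and cloudy but below the threshold.
example_cwc = in_cloud_cwc([0.05, 0.0, 0.002], [0.5, 0.0, 0.5])
```

Masked cells are returned as NaN so that they drop out of later averages in the same way as the gray-masked observations.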
To shed light on the representation of precipitation in the GCMs, we compute in-cloud NLarge (Figures 4c and 4f). NLarge is computed from 2DS particle size distributions (PSD) as the concentration of large precipitating particles with radius greater than 100 μm. The observed NLarge is compared against the CAM6 counterpart along the flight track, computed in the same way as in observations based on the model PSD of fraction-mean cloud and precipitation. The "fraction-mean" cloud and precipitation are calculated by dividing grid-mean cloud and precipitation quantities by simulated cloud and precipitation fraction, respectively. The CAM6 precipitation fraction is set to be the same as the cloud fraction in each cloud-containing grid cell and, below cloud, to the cloud fraction of the lowest cloud-containing grid cell. Because precipitation in AM4 is treated diagnostically, NLarge is not computed for AM4.
The RF09 sawtooth legs sampled a broken cloud field with intermittent CWC (Figure 4a). CAM6 generally underestimates its cloud water content (Figure 4a) but overestimates NLarge (Figure 4c). The CWC in AM4 in RF09 (Figure 4b) agrees better with observations than CAM6, but the AM4 clouds have lower cloud base heights compared to observations, a bias seen in many cases during the SOCRATES campaign. As one might expect, the CAM6 CWC and NLarge in the comparatively horizontally homogeneous stratocumulus decks of RF12 (Figures 4d and 4f) agree better with observations than these quantities in the more heterogeneous cumulus regions of RF09. However, CAM6 tends to miss the light precipitation and spatially intermittent snow that formed in the thicker centers of mesoscale closed cells during RF12 (e.g., 58°S and 56°S in the return flight in Figure 4f). CAM6 and especially AM4 simulate a cloud base height that is too low compared to the ~1 km base observed in RF12 (Figures 4d and 4e). In AM4, the simulated clouds extend down to the ground level.
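A large-particle concentration of this kind can be obtained by integrating a binned size distribution above a size threshold. The sketch below is an assumption-laden illustration: the function name, the bin edges, and the sample dN/dD values are hypothetical, the size coordinate is whatever the PSD is expressed in, and bins straddling the threshold are simply excluded.

```python
import numpy as np


def n_large(bin_edges_um, dn_dsize, threshold_um=100.0):
    """Integrate a binned PSD over bins lying entirely above the threshold.

    bin_edges_um: bin edges in micrometers (length n + 1).
    dn_dsize: number concentration per unit size in each bin (length n).
    """
    edges = np.asarray(bin_edges_um, dtype=float)
    dn_dsize = np.asarray(dn_dsize, dtype=float)
    widths = np.diff(edges)
    above = edges[:-1] >= threshold_um
    return float(np.sum(dn_dsize[above] * widths[above]))


# Hypothetical PSD: four bins, concentrations in L^-1 um^-1.
edges_um = [10.0, 50.0, 100.0, 200.0, 400.0]
psd = [5.0, 2.0, 0.1, 0.01]
n_large_example = n_large(edges_um, psd)  # 0.1 * 100 + 0.01 * 200 = 12 L^-1
```

Applying the identical integration to the model PSD is what makes the observed and simulated values directly comparable.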

Cloud placement errors reduce the value of a point-by-point comparison of GCM versus observed cloud properties. It is more illuminating to make a statistical comparison of mean biases in GCM versus observed CWC over the same overall region, altitude range, and time. We bin the observed and simulated CWC for the 15 SOCRATES flights into boxes of 500 m in altitude and 25 min (equivalent to 210 km at a typical flight speed of 140 m s⁻¹) in time along the flight track. This binning box is chosen to be big enough to reduce sampling noise but small enough to still represent the local CWC. Boxes in which the binned average CWC < 0.01 g m⁻³ for either the models or the observations are excluded from the statistics. Boxes with fewer than 10 observed samples are screened out. This leaves 133 binned samples, most of which are at altitudes below 3 km. Figure 5 presents the bin mean and range of CWC over all time bins within each altitude band. The model and observed CWC interquartile ranges generally agree with each other between 1.5 and 2 km (although with large spread). CWC is clearly overestimated, especially by AM4, below 1 km, an indication that the simulated cloud base is systematically too low as in the RF09 and RF12 examples. On the other hand, in-cloud CWC for both GCMs, especially AM4, is biased low above 2.5 km compared to observations, suggesting that the GCM clouds tend to have a slightly lower cloud top height. This is consistent with the low RH bias at the top of the boundary layer seen for RF12 in Figure 2c (CAM6) and more prominently in Figure 2d (AM4).
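The binning and screening steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the dictionary-based bookkeeping, and the sample data are hypothetical; only the box sizes (500 m, 25 min), the 10-sample minimum, and the 0.01 g m⁻³ threshold come from the text.

```python
import numpy as np


def bin_mean_cwc(time_s, alt_m, cwc, dt_s=25 * 60.0, dz_m=500.0, min_samples=10):
    """Average CWC in (time, altitude) boxes; drop boxes with too few samples."""
    time_s = np.asarray(time_s, dtype=float)
    alt_m = np.asarray(alt_m, dtype=float)
    cwc = np.asarray(cwc, dtype=float)
    t_idx = np.floor((time_s - time_s.min()) / dt_s).astype(int).tolist()
    z_idx = np.floor(alt_m / dz_m).astype(int).tolist()
    sums, counts = {}, {}
    for key, value in zip(zip(t_idx, z_idx), cwc.tolist()):
        if np.isfinite(value):
            sums[key] = sums.get(key, 0.0) + value
            counts[key] = counts.get(key, 0) + 1
    return {k: sums[k] / counts[k] for k in sums if counts[k] >= min_samples}


def common_cloudy_boxes(obs_boxes, model_boxes, threshold=0.01):
    """Keep boxes where both observed and modeled means reach the threshold."""
    shared = obs_boxes.keys() & model_boxes.keys()
    return sorted(k for k in shared
                  if obs_boxes[k] >= threshold and model_boxes[k] >= threshold)


# Hypothetical observations: 12 samples in one box, 3 samples in another.
times = list(range(12)) + [1600.0, 1601.0, 1602.0]
alts = [700.0] * 12 + [1200.0] * 3
values = [0.2] * 12 + [0.3] * 3
obs_boxes = bin_mean_cwc(times, alts, values)
kept = common_cloudy_boxes(obs_boxes, {(0, 1): 0.15, (1, 2): 0.05})
```

Here the second box is dropped because it holds only three observed samples, so only the first box enters the comparison.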

Low Cloud Occurrence
Occurrence of low clouds with tops below 4 km in CAM6 and AM4 columns cannot be evaluated using 1 Hz in situ point measurements. Instead, it is evaluated in this section using a column cloud fraction that combines two detectors during SOCRATES: an HSRL backscatter threshold to detect cloud above or below the aircraft, and the GV CDP liquid water content to detect cloud at the aircraft level, which may not extend outside the 150 m lidar dead zone or which may attenuate the lidar beam before it reaches the cloud edges. Within a lidar sampling time of 0.5 s, low cloud is flagged if any of the 10 Hz CDP liquid water content [...]. Examples of the lidar backscatter for RF09 and RF12 are shown in Figures 6a and 6e, where cloud boundaries (i.e., cloud tops when the aircraft was above and cloud bases when below) are well captured by HSRL, as seen from the strong lidar backscatter near 1 to 2 km. The observed upper cloud boundaries (cloud tops) are slightly higher than those implied by the GCM cloud fraction maps.
We define the observed low cloud fraction as the fraction of low cloud flags during every 10 min (equivalent to ~1° at a typical flight speed of 200 m s⁻¹). We compare this with the corresponding low cloud fraction in CAM6 and AM4 averaged over the same time periods when there are observational data (e.g., Figures 6b, 6c, 6f, and 6g). The low cloud fraction for each GCM is computed following that GCM's vertical cloud overlap assumptions: maximum-random overlap for CAM6 (i.e., maximum overlap within all adjacent cloudy layers and random overlap between noncontiguous blocks of cloudy layers) and exponentially decaying overlap with a length scale of 2 km for AM4 (i.e., a specified length scale controls the decorrelation of clouds in the column; Zhao et al., 2018). The regions outside of the HSRL view zone (i.e., regions above/below the aircraft when the HSRL pointed down/up) are masked out before computing the GCM low cloud fraction (gray shading in Figure 6).
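The two overlap assumptions named above can be illustrated with a short sketch. It uses the standard product form of maximum-random overlap and, as an assumption, generalizes it to exponential decay by blending the maximum and random combinations of each adjacent layer pair with weight exp(−Δz/L). The function name `column_cloud_fraction` and its arguments are hypothetical, and the actual CAM6 and AM4 implementations may differ in detail.

```python
import numpy as np

def column_cloud_fraction(cf, dz_m=None, decorr_length_m=None):
    """Total cloud fraction of a column from layer cloud fractions `cf`.

    decorr_length_m=None gives maximum-random overlap. Otherwise adjacent
    layers blend maximum and random overlap with weight
    alpha = exp(-dz_m / decorr_length_m), where dz_m is the layer spacing
    (a scalar, or one value per adjacent layer pair).
    """
    cf = np.clip(np.asarray(cf, dtype=float), 0.0, 1.0)
    if cf.size == 0:
        return 0.0
    if np.any(cf >= 1.0):
        return 1.0  # an overcast layer covers the whole column
    if decorr_length_m is None:
        alpha = np.ones(cf.size - 1)
    else:
        alpha = np.exp(-np.asarray(dz_m, dtype=float) / decorr_length_m)
        alpha = np.broadcast_to(alpha, (cf.size - 1,))

    clear = 1.0 - cf[0]  # clear-sky fraction accumulated layer by layer
    for k in range(1, cf.size):
        c_max = max(cf[k], cf[k - 1])                     # maximum overlap of the pair
        c_rand = cf[k] + cf[k - 1] - cf[k] * cf[k - 1]    # random overlap of the pair
        pair = alpha[k - 1] * c_max + (1.0 - alpha[k - 1]) * c_rand
        clear *= (1.0 - pair) / (1.0 - cf[k - 1])
    return 1.0 - clear
```

With two adjacent half-covered layers, maximum-random overlap returns 0.5, whereas separating them by a clear layer returns 0.75, the random-overlap value. A call such as `column_cloud_fraction(cf, dz_m=500.0, decorr_length_m=2000.0)` gives an intermediate result for the 2 km length scale quoted for AM4.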
The low cloud fraction comparisons for RF09 and RF12 are shown in Figures 6d and 6h. As suggested by the lidar backscatter profiles in Figures 6a and 6e, the observed low cloud fraction in the cumulus regions in RF09 is smaller than that in the stratocumulus regions in RF12. In both flights, CAM6 typically simulates a low cloud fraction that is too large, whereas that in AM4 is too small. Similar low cloud fraction biases are present across the 15 SOCRATES flights. Figure 7a shows an all-flight histogram of 10-min average low cloud fraction. Low clouds, either alone or co-occurring with cloud layers aloft, are observed in 96% of the 10-min intervals during SOCRATES. About half of the intervals have a low cloud fraction greater than 80%. Only ~10% of the intervals have a low cloud fraction less than 20%. In CAM6, intervals of nearly complete low cloud cover (greater than 90%) occur 60% of the time versus 30% of the time in AM4 and 45% in the observations. Over half of the intervals including low clouds in AM4 are characterized by a low cloud fraction smaller than 50%, about twice as frequent as in CAM6 and the observations.
Another way to present these data is by binning the 10-min intervals by the observed low cloud fraction and testing how well the models replicate the low cloud fraction within each bin (Figure 7b). Ideally, a model would lie on the 1:1 line with no scatter about the observations in this box-whisker plot, but from our other comparisons we expect both large scatter (a large interquartile range of simulated cloud fraction for a given observed cloud fraction) and bias. Indeed, the scatter is large, and the interquartile ranges show that in most bins, about 75% of the CAM6 samples lie above the observed cloud fraction, while about 60% of the AM4 samples lie below the observed cloud fraction. One exception for AM4 is that it produces too much cloud when the observed cloud fraction is less than 10%. This could be due to geographical misplacement of scattered cloud rather than parameterization biases, given its agreement with observations for the 10-20% low cloud fraction bin. In summary, CAM6 overestimates and AM4 underestimates low cloud fraction in the cold-sector low cloud regimes sampled by SOCRATES.

Low and Deep Cloud Macrophysics Inferred From Observed and Simulated Radar Reflectivities
Cloud macrophysical properties can be characterized by radar reflectivity. Here, we evaluate the low and deep cloud macrophysics in CAM6 and AM4 by comparing observed and simulated radar reflectivities during CAPRICORN2. We note that the HCR radar reflectivity data during SOCRATES (see supporting information) show essentially the same results as the low clouds during CAPRICORN2; hence, we only present the analysis of CAPRICORN2 here. Figure 8 shows reflectivities from the radar, CAM6 COSP, and AM4 COSP simulators for a half-month period during the CAPRICORN2 campaign. As seen in Figure 8a, low clouds with cloud tops below 4 km were regularly observed, while deep cloud layers reaching above 6 km were also frequent. The deep clouds are often associated with significant precipitation indicated by strong reflectivity (>0 dBZ) near the surface, which also often attenuates the W band radar echo below detectability above 6 km. The precipitation from the thin low clouds is much weaker. As one would expect, the cloudy, precipitating regions are collocated with high relative humidity in a time-height section created from the ship-launched radiosondes. For this study, CAM6 COSP provided reflectivity with and without hydrometeor and gas attenuation as viewed from the ground (Figures 8b and 8c), while AM4 COSP only output attenuated reflectivity as viewed from space (Figure 8d). As seen by comparing Figures 8b and 8c, the inclusion of attenuation can reduce the reflectivity by several dB for deep precipitating clouds, but it has no significant impact on cloud morphology and low cloud reflectivity. Since AM4 COSP reflectivity is significantly weaker than that of CAM6 COSP (Figure 8d), the hydrometeor attenuation is of only minor importance. As such, we expect the space-based attenuated reflectivity of AM4 COSP to be qualitatively comparable to its ground-based counterpart.
In the rest of the study, unless otherwise mentioned, we compare attenuated CAM6 and AM4 COSP reflectivity with observations. CAM6 COSP reflectivity (Figure 8c) agrees fairly well with the ship-observed reflectivity (Figure 8a) but has longer and less interrupted periods of deep cloud occurrence (e.g., 1-3 February; 11-13 February). The AM4 COSP reflectivity is significantly too weak in the deep clouds, indicating underestimation of snow (Figure 8d), for reasons to be discussed in section 4.3. An abrupt change in reflectivity occurs at the freezing level at 1-2 km, below which the AM4 COSP reflectivity matches the observations better.
For a quantitative statistical comparison of observed and modeled reflectivity, we construct Contoured Frequency by Altitude Diagrams (CFADs; Yuter & Houze, 1995) of observed and COSP reflectivity along the entire ship track during the CAPRICORN2 campaign (Figure 9). The joint histograms are created for every 2 hr with a 100 m vertical resolution and 2 dBZ increments from −40 to 10 dBZ in the horizontal, then conditionally averaged over the desired cloud regimes. Unlike in some studies of deep convection (e.g., Houze et al., 2007), our CFADs are not normalized to exclude regions with no detectable reflectivity.
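A CFAD of this kind can be sketched in a few lines. The example below is illustrative only: the function name `cfad`, the input layout (profiles by height levels, with NaN for no detectable echo, already on a regular height grid), and the use of NumPy are assumptions. It handles a single period, so the 2-hr histograms and the conditional averaging over cloud regimes would be built on top of it.

```python
import numpy as np

def cfad(dbz, dbz_edges=None):
    """Frequency of reflectivity by altitude.

    dbz: array (n_profiles, n_levels), NaN where there is no detectable echo,
         assumed to be on a regular height grid already (e.g., 100 m).
    Returns (freq, dbz_edges) with freq shaped (n_levels, n_bins).
    """
    dbz = np.asarray(dbz, dtype=float)
    if dbz_edges is None:
        dbz_edges = np.arange(-40.0, 10.0 + 2.0, 2.0)  # 2 dBZ bins from -40 to 10
    n_profiles, n_levels = dbz.shape
    freq = np.zeros((n_levels, dbz_edges.size - 1))
    for k in range(n_levels):
        column = dbz[:, k]
        counts, _ = np.histogram(column[np.isfinite(column)], bins=dbz_edges)
        # Divide by all profiles, including echo-free ones, so the CFAD is
        # not normalized to exclude regions with no detectable reflectivity.
        freq[k] = counts / n_profiles
    return freq, dbz_edges
```

Because the denominator includes echo-free profiles, the sum of each row of `freq` is the fraction of profiles with a detectable echo at that level, so differences in cloud occurrence between model and observations remain visible.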
The CFAD averaged over all CAPRICORN2 observations (Figure 9a) shows a shadowy boomerang shape with a horizontal arm due to low clouds below 4 km and a diagonal arm due to deep convective clouds that extend beyond 6 km. The CAM6 COSP CFAD (Figure 9b) displays a shape analogous to observations but with much higher occurrence of reflectivities exceeding −10 dBZ. The upper arm of the AM4 COSP reflectivity CFAD is strongly shifted by ~25 dBZ toward reflectivities lower than observed (Figure 9c). Figure 9 also shows separate CFADs for low versus deep cloud columns, which are defined as having a maximum reflectivity above 4 km less (vs. greater) than −40 dBZ. We note that the deep cloud columns might include some high clouds, but we do not expect this to change our assessment given their infrequency. The observed low-cloud CFAD (Figure 9d) has a mode between −10 and 0 dBZ at 0-1 km altitude associated with lightly precipitating cloud, with a lower tail extending to −40 dBZ contributed by low-level nonprecipitating clouds. The CAM6 low-cloud CFAD (Figure 9e) shows a comparable histogram of reflectivities, but with the maximum occurrence frequency at a slightly lower reflectivity near −10 dBZ and no tail of reflectivities below −20 dBZ and 1 km altitude. The AM4 low-cloud CFAD (Figure 9f) is fairly similar to observations below 1 km altitude but underestimates reflectivities above 1 km altitude.
The observed deep-cloud CFAD (Figure 9g) constitutes the broader upper arm of the boomerang, with typical reflectivities clustering around 0 dBZ below 4 km and decreasing to ~−20 dBZ at ~6 km. The CAM6 deep clouds (Figure 9h) cluster at a comparable reflectivity range but occur more frequently than observed. Larger reflectivities are maintained at a much higher altitude in CAM6 as well. The AM4 deep clouds (Figure 9i) have a −15 dBZ low bias in reflectivity except near the surface, where they are comparable in frequency and magnitude to observations.

TOA Upwelling SW and OLR
Biases in CWC and cloud placement contribute to radiative biases in the GCMs. A conventional way to evaluate the impact of cloud on radiation is to compute cloud radiative forcing, defined as the difference of net downward radiative fluxes at TOA with and without cloud. However, since the retrieval of clear-sky radiation from satellite observations inevitably involves uncertainty, in this study we instead compare observed and simulated TOA reflected shortwave (RSW) and outgoing longwave radiation (OLR) fluxes as more reliably observed proxies for cloud effects on radiation. We recognize that they may also incorporate biases not related to cloud, for example, in humidity or surface properties. The radiative flux estimates are matched to the same locations and times as the low cloud fraction estimates.
Figure 10 shows the TOA RSW and OLR fluxes along the flight tracks during SOCRATES from CERES SYN and from the two models, binned by observed low cloud fraction. Consistent with the overestimated cloud fraction in CAM6, the RSW in CAM6 is biased high for all bins of observed low cloud fraction. This high bias remains significant even when the observed low cloud fraction is 90-100%, suggesting that the low clouds in CAM6 are not only too frequent but also too bright. As a result, the average RSW in CAM6 over the entire SOCRATES field campaign is about 20% higher than observed. The overestimate of low cloud cover in CAM6 also leads to underestimated OLR in bins with 50% or less observed low cloud cover. Since the CAM6 cloud tops are at altitudes comparable to those observed, although slightly low-biased, they appear not to have large cloud-top temperature biases. Thus, when the observed and CAM6 cloud fractions are near 100%, the average OLR of CAM6 is similar to that observed. The radiation bias of CAM6 ("too frequent, too bright") is consistent with the climatological cloud radiative effect shown in Gettelman et al. (2020).
In contrast, the underestimated low cloud fraction in AM4 allows more OLR originating from the sea surface to escape to space, contributing to a sizable high OLR bias in all cloud fraction bins. Surprisingly, the AM4 TOA upwelling SW is comparable to observations in all observed cloud fraction bins. This implies that the clouds are optically thicker than observed; that is, AM4 has a "too few, too bright" bias for SO low clouds, which is common in CMIP5 models (Engström et al., 2015; Nam et al., 2012).
Figure 11 compares TOA RSW (a) and OLR (b) from CERES SYN observations with the two models for the same period during CAPRICORN2. The deep clouds in CAM6 tend to reflect more shortwave radiation (are "brighter") than observed, leading to a 10% high bias in the mean reflected SW over the whole period. The CAM6 OLR has a time mean comparable to the observations but has a low bias in the deep cloud regions (e.g., 1-3 and 11 February). In AM4 the RSW is comparable to CERES with intermittent high biases, while the OLR is typically slightly high. Overall, these biases are similar to those found for the low clouds observed in SOCRATES. They imply that deep clouds, like low clouds, are in general too bright in both CAM6 and AM4, and are too frequent in CAM6 but too broken in AM4.

10.1029/2020EA001241

Microphysical Properties of Clouds and Precipitation in CAM6 and AM4 Simulations
We now investigate some underlying model-observation discrepancies in microphysics that may contribute to the radiation biases in models associated with Southern Ocean low clouds.

Hydrometeor Size Distributions
We quantify the occurrence of precipitating and nonprecipitating low clouds in observations and CAM6 along the SOCRATES flight track sorted by ambient temperature (Figure 12a). An observed or CAM6 low cloud is classified as precipitating if NLarge (defined in section 4.3 as the concentration of cloud particles with radius bigger than 100 μm; recall also that this cannot be computed for the simpler AM4 microphysics) is greater than 1 × 10⁻⁴ m⁻³ in observations or CAM6 simulations. The occurrence is computed in cloud regions where CDP CWC exceeds 0.01 g m⁻³. Eighty-five percent of the SOCRATES samples were collected in cold clouds (at temperatures below freezing), of which only ~10% were precipitating. This is partly because the GV intentionally avoided long flight legs in drizzling supercooled clouds for safety. Repeating the analysis based on the nearest CAM6 grid cells along all 15 SOCRATES flight tracks (Figure 12b), we find that the CAM6 clouds span a generally similar temperature range with comparable precipitation occurrence, although precipitation occurrence in CAM6 clouds does not agree as well with observed clouds during individual flights (e.g., precipitation is overestimated in RF09 but underestimated in RF12 in CAM6; Figures 4c and 4f). The imperfect match during individual flights might be due to the deficient representation of cloud intermittency in CAM6.
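The classification above amounts to two thresholds followed by a count per temperature bin. A minimal sketch is given below, assuming collocated one-dimensional arrays of temperature, CWC, and NLarge. The function name `precip_occurrence_by_temperature` and the choice of temperature bin edges are hypothetical, while the thresholds are those quoted in the text.

```python
import numpy as np

def precip_occurrence_by_temperature(temp_c, cwc, n_large, temp_edges,
                                     cwc_min=0.01, n_large_min=1e-4):
    """Fraction of in-cloud samples flagged as precipitating per temperature bin."""
    temp_c, cwc, n_large = (np.asarray(a, dtype=float)
                            for a in (temp_c, cwc, n_large))
    in_cloud = cwc > cwc_min                              # CWC threshold, g m^-3
    precipitating = in_cloud & (n_large > n_large_min)    # NLarge threshold from the text

    n_cloud, _ = np.histogram(temp_c[in_cloud], bins=temp_edges)
    n_precip, _ = np.histogram(temp_c[precipitating], bins=temp_edges)

    # Leave bins without any in-cloud samples undefined (NaN).
    frac = np.full(n_cloud.shape, np.nan)
    has_cloud = n_cloud > 0
    frac[has_cloud] = n_precip[has_cloud] / n_cloud[has_cloud]
    return frac, n_cloud
```

Applying the same function to the observations and to the nearest CAM6 grid cells would give directly comparable occurrence curves.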
We compared the hydrometeor size distributions observed from the CDP and 2DS averaged over the nonprecipitating and precipitating clouds with those inferred along the flight tracks from CAM6 (Figure 13), summed over cloud, rain, ice, and snow. As seen in Figure 13a, nonprecipitating clouds display a unimodal distribution with a peak around 10 μm radius. This unimodal distribution is well represented in CAM6 and is dominated by liquid. CAM6 underestimates the number of cloud droplets with radii less than 20 μm, which dominate the overall cloud droplet number concentration. This bias is larger for the precipitating clouds (Figure 13b).
By definition, the observed number of particles with radius >50 μm is larger for precipitating clouds, leading to a shoulder in the observed droplet size distribution seen in Figure 13b. The CAM6 simulations have a comparable increase in rain (blue dash) at 50-300 μm radii and in snow (red dash) at radii exceeding 300 μm, suggesting that there is slightly more snow on average in CAM6 than in observations. The model PSDs should not be expected to agree perfectly with observations on the large-radius tail, given the simple bulk two-moment scheme in CAM6. Note that the PSD in this study is computed from in-cloud legs defined as CWC > 0.01 g m⁻³. CAM6 is found to have more rain than observations if a less strict in-cloud threshold is used (Gettelman et al., 2020).
The hydrometeor PSDs in CAM6 (Figure 13) are dominated by supercooled liquid droplets at small sizes. This is consistent with observations. We find that the supercooled boundary layer clouds sampled by the GV at temperatures of −5 to −25°C were a mix of small liquid drops that dominate the cloud optical depth and (when precipitating) larger ice and snow particles. This conclusion is based on several complementary lines of evidence shown in Appendix C. A more comprehensive phase partitioning analysis is deferred to future work.

Cloud Droplet Number Concentration (Nd)
We compare observed in-cloud Nd, computed as the summation of cloud droplets measured by the CDP when the CDP CWC > 0.01 g m⁻³, with the GCM-simulated in-cloud Nd. Figure 14 shows the RF09 and RF12 examples. AM4 Nd is comparable to observations, but CAM6 significantly underestimates Nd.
These flights are representative of SOCRATES as a whole. Figure 15 shows interquartile range boxes of observed and GCM in-cloud Nd across all 15 SOCRATES flights, binned similarly to the in-cloud CWC described in section 4.3. Points where binned average Nd < 1 cm⁻³ for either the models or the observations are excluded from the statistics. Figure 15 shows that the observed Nd clusters around 25-150 cm⁻³, with the highest Nd (>100 cm⁻³) occurring mostly near 0.5-1.5 km. CAM6 shows a low bias in Nd above 500 m that amplifies with height. AM4 simulates more high Nd outliers than observed for clouds above 2 km and does not simulate the relatively uncommon occurrences of observed Nd lower than 40 cm⁻³. On average, however, AM4 produces a mean Nd at all altitudes much closer to observations than CAM6.

CAM6's low Nd bias could be due to insufficient CCN production or too small a fraction of aerosol activated in the model. We find that there is no significant statistical bias in precipitation scavenging of CCN in CAM6 when all cases are considered. Atlas et al. (2020) find that CAM6 simulates too little cloud layer turbulence in stable and neutral boundary layers, which could lead to an underactivation of CCN. However, CAM6 also underestimates Nd in unstable boundary layers, for which its simulated turbulence is on average consistent with observations. This suggests that there may be multiple competing biases in the model. Disentangling these compounding influences will be necessary to understand the cause of the Nd bias in CAM6 and should be the topic of future investigations.

Hydrometeor Microphysics Inferred From COSP Reflectivity Decomposition

CAM6
Hydrometeor microphysics can also be inferred from reflectivity. Here, we partition the nonattenuated COSP synthetic reflectivity during CAPRICORN2 (e.g., Figure 8b) into contributions from cloud liquid, cloud ice, rain, and snow. We only consider large-scale precipitation, since convective precipitation rarely occurs in CAM6 along the ship track. The synthetic reflectivities of liquid, ice, and rain are calculated from their respective grid mean number concentrations and effective radii following the formulas in COSP. The synthetic snow reflectivity is computed as the residual of the total nonattenuated COSP reflectivity and the sum of synthetic reflectivities from the other three hydrometeors. AM4 only outputs an attenuated reflectivity, which cannot be exactly partitioned in this way.
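The residual step can be illustrated with a short sketch. It assumes that the subtraction is carried out on linear reflectivity factors (mm⁶ m⁻³) rather than on dBZ values, which is the physically additive quantity; the text does not state this explicitly. The function names and the −60 dBZ floor are hypothetical choices made for the example.

```python
import numpy as np

def dbz_to_linear(dbz):
    """Convert reflectivity in dBZ to linear units (mm^6 m^-3)."""
    return 10.0 ** (np.asarray(dbz, dtype=float) / 10.0)

def snow_residual_dbz(total_dbz, liquid_dbz, ice_dbz, rain_dbz, floor_dbz=-60.0):
    """Snow reflectivity as total minus the other three species, in linear units."""
    residual = dbz_to_linear(total_dbz) - (
        dbz_to_linear(liquid_dbz) + dbz_to_linear(ice_dbz) + dbz_to_linear(rain_dbz)
    )
    # Clip tiny or negative residuals (round-off, inconsistent inputs) to a floor
    # so that the logarithm below is always defined.
    residual = np.maximum(residual, dbz_to_linear(floor_dbz))
    return 10.0 * np.log10(residual)
```

For example, if liquid, ice, and rain contribute −20, −10, and −3 dBZ and snow contributes 0 dBZ, the total is about 2.1 dBZ, and the function recovers 0 dBZ for snow.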
We decompose the CAPRICORN2 CAM6 CFADs into cloud liquid, cloud ice, rain, and snow for all clouds (Figures 16a-16d), low clouds (Figures 16e-16h), and deep clouds (Figures 16i-16l). In all cases, stronger reflectivities are dominated by snow. CAM6 also simulates a substantial amount of cloud liquid and drizzle with reflectivity below −20 dBZ at altitudes below 2 km (Figures 16a, 16e, and 16i). Above 2 km, cloud ice becomes more prevalent in CAM6 but has low reflectivity below −10 dBZ. However, such low reflectivity is missing in the nonpartitioned reflectivity (Figures 9b, 9e, and 9h), suggesting that snow is more frequent in CAM6 than in the observations. The missing tail of low reflectivities might also be partly due to the insufficient subgrid variability of cloud and precipitation in CAM6 COSP, such that almost all simulated clouds have precipitation dominating their reflectivity, or due to the underrepresentation of thin clouds in the models.
The snow mass or size in CAM6 low clouds appears underestimated, since its maximum frequency (Figure 16h) is located at a lower reflectivity than in the observations (Figure 9d). This indicates that snow in CAM6 low clouds is more homogeneous but less intense compared to the observations. For deep clouds, the frequency of occurrence of snow (Figure 16l) is much higher than in observations, while the grid average reflectivity is similar to that observed at ~0 dBZ. This implies that the snow in CAM6 deep clouds is similarly homogeneous and moderate. Note that the high snow occurrence could partially be attributed to the insufficient subgrid variability of cloud and precipitation in CAM6 COSP, as mentioned earlier.

AM4
To better understand the representation of hydrometeors in AM4, we compare time-height sections of grid mean liquid water and ice mixing ratios and precipitation fluxes from CAM6 and AM4 (Figure 17). AM4 generally shows substantially more cloud ice than CAM6 (Figure 17f compared to Figure 17b). The reason is that its microphysics scheme does not distinguish snow from ice, so the cloud ice in AM4 is the sum of ice and snow. The AM4 downward ice flux is vertically continuous with the rain flux (Figures 17g and 17c), confirming that above the freezing level the AM4 precipitation from deep and shallow clouds is in the form of sedimenting cloud ice particles. The snow flux approximated from the clear-sky ice flux as used in AM4 COSP (Figure 17h) is less frequent and less intense than the snow flux in CAM6 (Figure 17d). AM4 has less supercooled liquid water above 2 km than CAM6 (Figures 17e and 17a), but our CAPRICORN2 and SOCRATES observational analyses cannot as yet clearly test which model is closer to the truth.
To evaluate the snow intensity in AM4, we compare the hydrometeor PSDs in AM4 COSP with CAM6 COSP (Figure 18). Here the PSDs are computed from area-weighted mean cloud liquid, cloud ice, rain, and snow. AM4 has more ice and much less rain and snow. Compared with the CAM6 COSP snow PSDs, AM4 COSP significantly underestimates large snow particles with radius greater than 100 μm, leading to lower reflectivities. The AM4 COSP snow PSD is not taken from the AM4 microphysics, which would give no separate snow contribution to the PSD and worsen the AM4 underestimate of reflectivity.

Summary
Observations of cloud properties from sophisticated in situ and remote sensing instruments over the Southern Ocean during airborne (SOCRATES) and ship-based (CAPRICORN2) measurement campaigns in January-February 2018 are used to evaluate two state-of-the-art atmospheric general circulation models (GCMs): CAM6 and AM4. These GCMs were nudged to reanalysis wind and temperature fields to minimize differences between modeled and observed synoptic conditions.
These measurements, together with collocated CERES TOA radiative flux estimates, provide a valuable data set for evaluating simulations of cloud and precipitation in CAM6 and AM4 and for understanding their radiation biases. The major conclusions and implications are as follows:
1. The nudged-meteorology simulation method facilitates detailed comparison of measured and simulated cloud properties from a limited set of observations in a synoptically variable environment.
2. Both GCMs correctly simulate the observed composition of Southern Ocean supercooled boundary layer clouds (i.e., they are mostly composed of small cloud droplets and larger precipitating ice particles).
3. CAM6 has too much cloud, and that cloud is too bright ("too frequent, too bright").
4. Cloud droplet number concentration in CAM6 is typically too low.
5. Precipitation in CAM6 is too frequent and too homogeneous.
6. AM4 has too little cloud occurrence, but the clouds are too bright ("too few, too bright").
7. AM4 clouds include too much small ice and too little snow.
The low bias in cloud droplet number concentration in CAM6 is consistent with discrepancies seen between other state-of-the-art models and satellite observations of Southern Ocean cloud droplet number concentrations in summertime low clouds (Revell et al., 2019). This low bias is a widespread issue remaining in GCMs that presumably contributes to the TOA SW bias for low-lying liquid clouds over the Southern Ocean. Both CAM6 COSP and AM4 COSP make assumptions about microphysics, size distributions, and horizontal homogeneity that are not fully consistent with their host GCM. Ideally such assumptions should be minimized, but at a minimum they must be kept in mind when comparing cloud radar data with COSP output. CAM6 COSP seems to simulate too large an area fraction of snow. AM4 simulates snow as a tail of the cloud ice distribution, while COSP expects a separate snow category. With or without COSP, this results in AM4 simulating snow crystals that are too small and have far too little radar reflectivity.
The biggest challenge is still ahead: how to use the insights from this comprehensive analysis to improve the participating GCMs and their COSP simulators. We hope that the approach presented here will prove beneficial in testing other GCMs and developing improvements for future GCM versions.

Appendix A: HSRL Backscatter Coefficient Threshold in Determining Cloud Occurrence
HSRL obtains the lidar return signal with high spectral resolution (<75 MHz laser bandwidth), which enables the separation of aerosol and cloud returns from molecular returns. Here we further separate cloud from aerosol returns using the calibrated HSRL aerosol and cloud backscatter coefficient.
Examining the probability density function of the HSRL cloud and aerosol backscatter coefficient for all 15 flights during SOCRATES (Figure A1), we find a trimodal distribution with three peaks located near 10⁻⁷, 10⁻⁶, and 10⁻³ m⁻¹ sr⁻¹, respectively. Through inspection of HSRL lidar backscatter profiles (e.g., Figures 6a and 6e), we interpret the two left modes as being contributed by aerosols within and outside of the boundary layer, which are associated with lower backscatter coefficients than the rightmost cloud mode. We adopt 3 × 10⁻⁵ m⁻¹ sr⁻¹ as the HSRL backscatter coefficient threshold separating the cloud mode from the two aerosol modes (the blue line in Figure A1). This threshold was determined by a sensitivity test in which we compared the HCR and HSRL cloud detection using different HSRL backscatter thresholds ranging from 10⁻⁵ to 10⁻⁴ m⁻¹ sr⁻¹. We find that the frequency of cloud occurrence as detected by HSRL is insensitive to thresholds up to 3 × 10⁻⁵ m⁻¹ sr⁻¹ but decreases quickly once the threshold increases beyond this value.
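The HSRL side of such a sensitivity test can be sketched as a sweep over candidate thresholds. This is only an illustration: the function name `cloud_occurrence_vs_threshold` and the logarithmic spacing of the candidates are assumptions, and the comparison against HCR cloud detection described above is not reproduced here.

```python
import numpy as np

def cloud_occurrence_vs_threshold(backscatter, thresholds):
    """Fraction of valid lidar samples flagged as cloud for each candidate threshold."""
    backscatter = np.asarray(backscatter, dtype=float)
    valid = backscatter[np.isfinite(backscatter)]  # drop missing samples
    return np.array([np.mean(valid > t) for t in thresholds])

# Candidate thresholds spanning 1e-5 to 1e-4 m^-1 sr^-1, the range quoted in the text.
thresholds = np.logspace(-5, -4, 11)
```

If the cloud and aerosol modes are well separated, the returned occurrence stays nearly flat over the part of the range that falls in the gap between the modes, which is the behavior used to justify a threshold there.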

Appendix B: Droplet Size Distribution in CAM6 Microphysics Scheme and CAM6 COSP
Use of CFADs as an observational constraint on GCM snowfall rate is complicated because the hydrometeor size distributions assumed in COSP do not match the internal distributions within the GCM microphysics.
Here we compare CAM6 and CAM6 COSP DSDs for low clouds during CAPRICORN2 based on their respective hydrometeor size distribution assumptions described in section 2.4 (Figure 18). The hydrometeor PSDs are computed from their fraction mean masses and effective radii. We only compare the CAM6 microphysics with CAM6 COSP, since AM4 COSP snow is not taken from the AM4 microphysics.
Rain and snow DSDs are represented well in CAM6 COSP. COSP slightly underestimates cloud liquid and overestimates ice particles, which leads to an underestimation (overestimation) in liquid (ice) reflectivities. However, this bias is not expected to significantly alter the net synthetic reflectivities in the frequently precipitating CAM6 mixed-phase low clouds during CAPRICORN2, where snow dominates the reflectivity. A discrepancy is found for snow DSDs between CAM6 and CAM6 COSP, where CAM6 COSP has a greater concentration of small snowflakes (Figure 18d). We note that this discrepancy is caused by the inconsistency in snow densities assumed in CAM6 and CAM6 COSP. CAM6 COSP assumes a snow density of 100 kg/m³, but the effective radius used by CAM6 COSP is computed in CAM6 by assuming a snow density of 250 kg/m³. The larger snow density leads to a smaller effective radius, and therefore more small snowflakes and fewer large ones. This discrepancy vanishes when the snow effective radius input into COSP is computed using a snow density of 100 kg/m³ (not shown). The density inconsistency barely affects the large particle number and has little impact on reflectivity.

Figure A1. Probability density function of HSRL backscatter coefficient for 15 flights during SOCRATES.
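The size of the effect of the density mismatch can be estimated with a simple scaling. As an assumption for illustration, treat the snow particles as spheres of fixed mass, so that the radius scales as density to the power −1/3; the function name `effective_radius_ratio` is hypothetical, and the actual CAM6 size distribution calculation is more involved.

```python
def effective_radius_ratio(rho_assumed, rho_reference):
    """Ratio r(rho_assumed) / r(rho_reference) for spheres of equal mass.

    For a sphere, mass = (4/3) * pi * rho * r**3, so r scales as rho**(-1/3).
    """
    return (rho_reference / rho_assumed) ** (1.0 / 3.0)

# Densities quoted in the text: 250 kg/m^3 (CAM6) versus 100 kg/m^3 (CAM6 COSP).
ratio = effective_radius_ratio(rho_assumed=250.0, rho_reference=100.0)
print(f"radius at 250 kg/m^3 is {ratio:.2f} times the radius at 100 kg/m^3")
```

Under this simplification the effective radius passed to COSP is roughly a quarter smaller than one computed with the COSP density, which is in line with the shift toward smaller snowflakes described above.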
It is reasonable to assume that the snow size distributions during CAPRICORN2 are similar to those during SOCRATES. Comparing Figures 8 and 18 suggests that the mean snow PSD in CAM6 including all cloud types in SOCRATES is on average qualitatively consistent with the mean SOCRATES-observed DSD for precipitating low clouds, although the frequency of occurrence of snow is much higher.

Appendix C: Phase Partitioning
We visually inspected representative images from the 2DS and the PHIPS HALO (Schnaiter, 2018), a new imaging instrument deployed on the GV for SOCRATES that is optimized to detect ice particles with radii between 20-300 μm and liquid drops with radii of 60-300 μm (Abdelmotaleb et al., 2016;Schnaiter et al., 2018). These images suggest that in the precipitating boundary layer clouds sampled by the GV at temperatures of −5 to −25°C, most of the larger particles (radius > 100 μm) are aspherical frozen hydrometeors.
The SOCRATES 2DS data have insufficient spatial resolution to clearly discriminate the phase of small particles with radii less than 100 μm. We instead used a comparison between the liquid water content inferred from the CDP and from a CSIRO (Commonwealth Scientific and Industrial Research Organisation) King hotwire probe to test for the presence of small ice particles of radius less than 25 μm, the size range dominating the cloud droplet number concentration and thus optical depth. Such particles would be detected by the CDP, but the data processing algorithm would treat them as liquid water droplets, which introduces a high bias in CDP-inferred cloud water content due to their lower density. Small ice particles should affect the CSIRO King probe's LWC measurement rather differently. For instance, ice might partly bounce off the hot wire, causing the King probe to underestimate the cloud ice contribution to the cloud water content. Hence a comparison of the LWC inferred from the two instruments can test for the presence of cloud ice. Figure C1 shows a two-dimensional histogram of the two LWC measurements over all SOCRATES low cloud sampling at temperatures of −5 to −25°C. The strong concentration of data along the 1:1 line is evidence that small particles (radius <25 μm) are predominantly supercooled liquid droplets. Mace and Protat (2018) report that the light scattering from supercooled Southern Ocean boundary layer stratocumulus clouds mostly comes from liquid droplets, based on an analysis of ship-borne lidar depolarization ratios during CAPRICORN. Our visual inspection of plots of HSRL depolarization ratios from boundary layer cloud tops observed during SOCRATES supports this conclusion.

Data Availability Statement
SOCRATES data are provided by EOL (https://data.eol.ucar.edu/). We thank Alain Protat (alain.protat@bom.gov.au) for providing radar reflectivity data during CAPRICORN2. CERES SYN data used in this study were obtained from the NASA Earth Science Data Systems program (https://search.earthdata.nasa.gov).