UKESM1: Description and Evaluation of the U.K. Earth System Model

We document the development of the first version of the U.K. Earth System Model UKESM1. The model represents a major advance on its predecessor HadGEM2‐ES, with enhancements to all component models and new feedback mechanisms. These include a new core physical model with a well‐resolved stratosphere; terrestrial biogeochemistry with coupled carbon and nitrogen cycles and enhanced land management; tropospheric‐stratospheric chemistry allowing the holistic simulation of radiative forcing from ozone, methane, and nitrous oxide; two‐moment, five‐species, modal aerosol; and ocean biogeochemistry with two‐way coupling to the carbon cycle and atmospheric aerosols. The complexity of coupling between the ocean, land, and atmosphere physical climate and biogeochemical cycles in UKESM1 is unprecedented for an Earth system model. We describe in detail the process by which the coupled model was developed and tuned to achieve acceptable performance in key physical and Earth system quantities and discuss the challenges involved in mitigating biases in a model with complex connections between its components. Overall, the model performs well, with a stable pre‐industrial state and good agreement with observations in the latter period of its historical simulations. However, global mean surface temperature exhibits stronger‐than‐observed cooling from 1950 to 1970, followed by rapid warming from 1980 to 2014. Metrics from idealized simulations show a high climate sensitivity relative to previous generations of models: Equilibrium climate sensitivity is 5.4 K, transient climate response ranges from 2.68 to 2.85 K, and transient climate response to cumulative emissions is 2.49 to 2.66 K TtC−1.


Introduction
Government and society require reliable guidance on how the Earth's climate will evolve in the coming century and how this can be mitigated through changes in human activity. This requires detailed understanding of how the climate will respond to future emissions of pollutants, as well as to anthropogenic land use. In addition to the physical climate response, there is a growing need to understand the response of the global biosphere and associated biogeochemical cycles, both to a changing climate and to increasing concentrations of atmospheric CO 2 . Changes in the efficiency of the Earth's natural carbon cycle may provide a strong feedback onto future climate warming, reducing available carbon emissions to realize key policy targets, such as limiting global warming to less than 2 • C above pre-industrial levels (Friedlingstein, 2015). Warming and increased absorption of CO 2 by the global oceans are expected to have negative impacts on seawater chemistry and ocean ecosystems (Gehlen et al., 2014), while warming on land may trigger vegetation change (Walther, 2010) as well as permafrost thaw and release of ancient carbon stores (Burke et al., 2018;Chadburn et al., 2017). Changes in atmospheric composition and temperature have the potential to alter the rate and balance of chemical reactions with implications for stratospheric ozone recovery (Chipperfield et al., 2017) and future air quality (Silva et al., 2017). To address these challenges, the U.K.'s Met Office and Natural Environment Research Council (NERC) have jointly developed the U.K. Earth System Model (UKESM1) as a successor to the HadGEM2-ES model (Collins et al., 2011). UKESM1 will deliver simulations to the Coupled Model Intercomparison Project Phase 6 (CMIP6; Eyring et al., 2016) and will be a community research tool for studying past and future climate. UKESM1 uses the coupled climate model HadGEM3-GC3.1 (GC3.1 hereafter, Kuhlbrodt et al., 2018;Williams et al., 2018) as its physical core, with the following components interactively coupled to GC3.1: terrestrial carbon and nitrogen cycles, including dynamic vegetation and representation of agricultural land use change; ocean biogeochemistry (BGC) with prognostic diatom and non-diatom concentrations and a unified troposphere-stratosphere chemistry model, tightly coupled to a multi-species modal aerosol scheme. These additional Earth system components, as well as couplings across components and coupling with the physical climate model, allow a number of important policy and science questions to be addressed that would not be possible with GC3.1. Primary among these are estimates of available carbon emission budgets commensurate with limiting global warming below specific targets (Matthews et al., 2017). Introducing prognostic atmospheric chemistry into GC3.1 allows an assessment of the mitigation potential from reducing short lived climate forcers such as methane, black carbon, and tropospheric ozone (Collins et al., 2018;Stohl et al., 2015), while coupling atmospheric chemistry to terrestrial biogeochemistry opens up the potential to study the risk of permafrost thaw, methane release, and their impacts on Arctic warming. A full atmosphere treatment of ozone chemistry supports a joined-up study of stratospheric ozone recovery and climate change as well as analysis of interactions between climate warming, tropospheric ozone, and vegetation (Sadiq et al., 2017). In the marine domain, interactive treatment of ocean heat and carbon uptake, carbonate chemistry, and ocean biology enables future risks around ocean acidification and warming to be assessed (Bopp et al., 2013). Finally, feedbacks from changes in Earth system components of the model on future climate warming can also be investigated.
To provide a model capable of addressing the above questions, in developing UKESM1, we emphasized Earth system process-completeness, particularly with respect to couplings between prognostic model components, across different model domains (e.g., atmosphere, ocean, land, and ice) and between different process parametrizations. We aimed to make UKESM1 internally consistent, using the model to predict processes as much as possible, rather than prescribe them from an external data source. This treatment increases the model degrees of freedom when predicting future changes in the coupled Earth system. As one example we highlight some of the interactions involved in the UKESM1 representation of dust; similar interactions exist for numerous other model processes and variables. Dust is produced from the land surface, depending on model predicted vegetation type, soil moisture, and surface winds. The dust produced is then advected by the model atmosphere and interacts with the radiation parametrization. Atmospheric dust can be deposited into the model ocean, thousands of kilometers from where it was produced, and act as a source of soluble iron influencing ocean biological activity and thus marine uptake of CO 2 . Surface chlorophyll, resulting from this marine biological activity, feeds into surface emission parametrizations of dimethyl sulphide (DMS) and primary marine organic aerosol, both of which increase cloud condensation nuclei in the model atmosphere, influencing the model's radiation and cloud microphysical schemes and thus the physical climate response. Retaining such interactions in a fully prognostic sense means that UKESM1 is one of the most process-complete Earth system models available today.
Developing an Earth system model that retains this degree of process realism, process couplings, and prognostic treatment comes with risks. Primary among these is constraining the performance of the new Earth system components that often do not have the same degree of observational cover as more traditional physical climate processes. In addition, the increased degree of coupling between processes, as well as the preferential use of model predicted fields as input to dependent parametrizations (e.g., predicted vegetation type, seawater DMS, and atmospheric chemical oxidants) rather than prescribed, temporally constant fields, increases the risk that errors in one model component will negatively impact dependent variables or parametrizations in another part of the model, with the danger of a cascade of biased interactions driving the overall simulated climate to be significantly in error. In developing UKESM1, when a new Earth system process was included or made prognostic when it had previously been constrained by prescribed input, we carefully assessed the performance of the full model as well as the dependent model components. Our aim was to ensure new processes performed as realistically as possible and that dependent processes as well as the full model retained an acceptable level of skill when assessed against a basket of performance indicators. If the introduction of a new Earth system process or prognostic coupling was neutral or positive for model performance, then it was included. If the impact was negative, the degree of performance degradation was weighed against the potential benefit of more realistic future projections. Wherever possible we leaned toward inclusion of increased process realism or coupling.
In section 2 we describe the component models of UKESM1 and how they are coupled to one another. Section 3 discusses the methodology for tuning the fully coupled Earth system model to achieve acceptable performance. In section 4 we evaluate the model against observations and examine a few of its emergent properties, and finally in section 5 we summarize and discuss some of the challenges arising from the development of UKESM1. Appendix C describes some non-standard configurations of the model: first, configurations of reduced complexity intended for carbon cycle research and second, the configuration which runs with prescribed emissions (rather than concentrations) of CO 2 . Appendix D lists the code versions and simulation identifiers for experiments used in this work.

Component Models and Coupling
UKESM1 is built upon component models which have been developed to study climate processes for specific science domains such as ocean biogeochemistry or atmospheric composition. Each of these has been evaluated and tuned in uncoupled configurations (atmosphere-only, ocean-only, or offline land), before their inclusion in the coupled Earth system model. Its physical core GC3.1 has been tested in coupled atmosphere-land-ocean-sea ice mode, but without feedbacks from any of the additional Earth system components in UKESM1. The components themselves have been developed as models in their own right, with corresponding descriptions in the literature. Below we list the component models and their primary references; in Appendix A we give more detailed descriptions.
• Physical core atmosphere-land-ocean-sea ice model: HadGEM3-GC3.1 (Kuhlbrodt et al., 2018;Williams et al., 2018) • Terrestrial biogeochemistry: JULES (Joint UK Land Environment Simulator; Clark et al., 2011), with developments to plant physiology and functional types (Harper et al., 2016(Harper et al., , 2018, land use, and the introduction of nitrogen cycling. • Ocean biogeochemistry: MEDUSA (Model of Ecosystem Dynamics, nutrient Utilisation, Sequestration and Acidification; Yool et al., 2013), with the addition of a diagnostic submodel for DMS surface seawater concentration (Anderson et al., 2001) to support interactive DMS emissions to the atmosphere. • Atmospheric composition: UKCA (United Kingdom Chemistry and Aerosol model; Archibald et al., 2019;Mulcahy et al., 2018) with unified stratospheric-tropospheric chemistry and close coupling between chemistry and aerosols. UKESM1 is a successor to HadGEM2-ES (Collins et al., 2011), and many of its components, namely, the land surface, atmospheric chemistry, and the core physical model, have been developed from that model's components. The most significant developments for century-scale climate projections are (a) the inclusion of interactive nitrogen cycling in the terrestrial carbon cycle, which addresses a common bias across CMIP5 models to simulate excessive land carbon uptake, and (b) the interactive stratospheric chemistry, which enables us to predict the radiative forcing of ozone, methane, and nitrous oxide in the full model domain and allows stratospheric ozone recovery to evolve with climate change. Additionally, aerosol is represented by prognostic mass and number, allowing better representation of aerosol size distribution and hence cloud-aerosol forcing.
Of prime importance for an Earth system model (ESM) is the coupling between its components, since these couplings determine which feedbacks the model is capable of simulating and hence the potential applications of the model. Table 1 lists the couplings which are active in the default configuration of UKESM1, with a summary of the rationale for their inclusion. UKESM1 uses the OASIS3-MCT coupler (Craig et al., 2017) as the interface between marine and atmospheric components. Coupling information is exchanged between the ocean and atmosphere every 3 hr. Between the land and atmosphere the coupling frequency is every dynamical model timestep (20 min). Couplings involving aerosols, chemistry, or radiation are performed once per hour. Ocean physics, sea ice, and BGC are coupled every ocean model timestep (45 min).

Performance Tuning of Final Coupled Model
In the final stages of development of any coupled climate model or ESM, there is a phase of fine-tuning model parameters or settings in order to minimize model biases and satisfy high-level performance metrics such as global TOA radiation balance, water, and energy conservation. For discussion of these issues and a summary of common practices, see, for example, Mauritsen et al. (2012), Hourdin et al. (2016), Schmidt et al. (2017). This process is generally more complex for an ESM because of the large number of interactions between component models, and the concomitant potential for biases in one component to degrade the performance of other parts of the coupled system. In this section we summarize all of the parameters tuned in the development of the fully coupled UKESM1 and the rationale for that tuning. Note that prior to coupled testing, the component models were developed and tuned in uncoupled mode, and this tuning is described in the documentation papers for those components (Harper et al., 2016(Harper et al., , 2018Morgenstern et al., 2009;Mulcahy et al., 2018;O'Connor et al., 2014;Walters et al., 2019;Williams et al., 2018;Yool et al., 2013). The parameter values from that component model tuning acted as the starting point for the tuning of UKESM1; in the case of physical model parameters, the starting point was the values used in GC3.1 (the physical core of UKESM1).
The goals of the UKESM1 tuning were twofold: first, to keep long-term global mean carbon fluxes and net TOA radiation close to zero under pre-industrial forcing, to avoid significant drifts in either the climate or carbon pools, and second, to avoid large biases between observations and the model state. Despite the majority of reference observations being taken in the last 40 years, most of the UKESM1 coupled model tuning was performed using pre-industrial tests with perpetual 1850 forcing, during the model's spin-up phase. Of course, present-day tests would allow a cleaner comparison to observations, but such tests are complicated by the fact that present-day climate and the carbon cycle are not in equilibrium with their forcing and by the difficulty of choosing initial conditions. The only truly reliable way to test the present-day impact of changes to ESM components is to allow the change to spin-up in a pre-industrial configuration and then run transient historical tests through to the present-day. However, this is prohibitively slow to perform iteratively and instead we made pragmatic judgements to account for expected differences between the pre-industrial and present-day model state.
A wide range of model outputs were used to inform the tuning process. These included metrics (single numbers such as global mean TOA radiation or global total plant productivity) as well as maps of key model variables. In Appendix B, we list the main metrics and variables which were used during the tuning process to identify priority biases and monitor the stability of the pre-industrial state. While these summarize the primary tuning targets, other variables were examined from time to time, particularly when a particular bias or process needed checked in more detail. Now that we have completed pre-industrial spin-up and historical simulations with the tuned model, it is possible to assess the accuracy of these judgements with hindsight. To this end we have performed two "untuned" experiments parallel to the pre-industrial and historical simulations described in section 4. The untuned configuration is identical to the final model except that the parameters described below have been reset to their original pre-tuned values. The pre-industrial untuned run branched from the beginning of the UKESM1 pre-industrial control (i.e., from the end of the spin-up described in section 4), while the untuned historical run branched from the year 1990 of a single member of the UKESM1 historical ensemble. presents a set of metrics summarizing the biases which were tackled by this tuning exercise. The first 5 years have been discarded to exclude initial spin-up effects, and there is minimal drift in these metrics during the 20-year period over which they are calculated. In practice, the tuning was guided by a wide array of information including the pattern of biases across multiple quantities as described above; the metrics in Table 2 serve only to illustrate the impact of the tuning at a headline level.

Snow-Vegetation Interactions
The most significant tuning of the coupled ESM concerned the interaction between shortwave (SW) radiation, snow, and vegetation. We found it necessary to tune the parameters governing this interaction in order to reduce a substantial SW radiation bias. The extent to which vegetation is covered by snow makes an enormous difference to surface albedo, and there is a large diversity in how models parametrize this process (Boisier et al., 2012;Bright et al., 2015;Ménard et al., 2014). JULES represents the bending and partial burying of vegetation by snow (Ménard et al., 2014) on a per-PFT basis, by reducing the leaf area index (LAI) used in the albedo calculation as snow depth increases (Walters et al., 2019). When the depth of snow (d) is less than the canopy height h i of a given PFT, the partial burying is described by the equation where i denotes the PFT, L i is the true LAI for each PFT, and L i ′ is the "exposed" LAI used in the albedo calculation. The parameter n i thus governs the rate at which vegetation is covered by snow as snow depth increases, but it is not well constrained by observations and must be tuned for a given LAI distribution. The LAI in GC3.1 is prescribed using a seasonal climatology derived from MODIS satellite observations, while the phenology in UKESM1 is simulated interactively using the scheme of Clark et al. (2011). Figure 1 shows LAI for C3 grass in the two models in JJA and DJF. GC3.1 shows an almost complete loss of LAI in winter in central Asia and North America. This seasonal reduction is more extreme than seen in other observational products (Garrigues et al., 2008) and may in part result from the obscuring of vegetation by snow as seen by the satellite data which is used in constructing the LAI climatology. Even the needleleaf tree PFT in this climatology sees LAI reductions of 80-95% in regions dominated by evergreen needleleaf trees (not shown). Therefore, in GC3.1 there is a little need for the parametrization in equation (1), since LAI is already very small and n i is accordingly tuned to very low values.
In contrast, the LAI in UKESM1 is more similar between summer and winter for grasses and evergreen PFTs, with reductions of 20-35%. This wintertime LAI may be too high because UKESM1 does not fully represent the phenology of grasses, but the large difference from the LAI climatology used in GC3.1 requires a different tuning for the snow-vegetation parametrization. Accordingly n i is given higher values in UKESM1, particularly for grasses, so that the radiative effect of snow partially covering vegetation is taken into account. The specific values of n i were chosen to maximize the similarity in grid box mean wintertime surface albedo between UKESM1 and GC3.1, to avoid large differences in regional mean radiative fluxes. Using the GC3.1 values of n i in UKESM1 produces a widespread negative bias in DJF clear-sky outgoing SW across northern mid-latitude land masses ( Figure 2c). The tuning of n i increases the wintertime albedo in seasonally snow-covered regions, increasing the TOA outgoing clear-sky SW (Figure 2b), which reduces the negative bias but does introduce positive bias in some regions ( Figure 2d). The net impact of this tuning can be summarized by the second metric in Table 2, defined as TOA outgoing clear-sky SW mean over land between 30 • N and 60 • N for DJF, representing the clear-sky radiative impact of snow on TOA radiation. A crude observational range of 50.5-53.6 W m −2 has been derived by calculating the same metric for two satellite data sets: CERES-EBAF Ed4.0 (Loeb et al., 2018) and ISCCP-FD (Zhang et al., 2004), using the 2000-2010 period common to both data sets. Table 2 shows that the impact of the tuning was to increase this metric from well below the observational range to just above it.

Bare Soil and Dust Emissions
In UKESM1, emissions of mineral dust are tightly coupled to the fraction of bare soil in a grid box and hence are very sensitive to biases in vegetation fraction in arid and semi-arid areas. In particular the model, prior to tuning, was prone to excessive emissions of dust in Australia, India, and central Asia. We tuned bare soil fraction in these regions by increasing the vegetation dynamics parameter lai min  for grass PFTs. When LAI for a given PFT exceeds this lai min , some excess productivity is allocated to horizontal spread of the PFT within the grid box. For the same productivity, smaller values of lai min allow greater vegetation fractions and hence smaller bare soil area. Two satellite land cover data sets were used as a reference for this tuning: European Space Agency (ESA) Climate Change Initiative Land Cover data (CCI-LC; Poulter et al., 2015) and that from the International Geosphere-Biosphere Programme Land Use and Cover Change project (IGBP-LUCC; Loveland et al., 2000). The impact of increasing lai min is illustrated by considering the total area of Australian bare soil in the model, which is reduced by nearly a factor of two to 2.8 ×10 6 km 2 , covering around one third of the continent. The resulting bare soil area lies between the equivalent values for IGBP-LUCC and CCI-LC of 2.5 ×10 6 km 2 and 4.0 × 10 6 km 2 , respectively. Bare soil areas in central Asia were similarly reduced, but the same is not true for India where the model suffers from a deficit in monsoon rainfall (Williams et al., 2018), which leads to such low plant productivity that the lai min parameter has little or no impact.
Even with reduced biases in vegetation cover, there was also a need to tune the parameters of the dust emission scheme itself. These have been described by Woodward (2011), and in summary they control the emission response to soil moisture and friction velocity and include an overall scaling factor for the total emission. These were tuned to maximize agreement with in situ surface concentration measurements from the University of Miami network (18 sites), and Aerosol Robotic Network (AERONET; Holben et al., 2001) total aerosol optical depth (AOD) measurements at eight dust-dominated sites. While regional metrics are used in the tuning process (see section 4.6), overall performance is summarized by a single global dust error metric, defined as the RMS difference between √ model and √ obs for all of these University of Miami and AERONET observations. The square root is used as a weighting function to increase the contribution of lower concentrations at remote sites to the global metric. The calculation uses seasonal means of the model and observations; each site receives equal weight; thus the concentration observations contribute more strongly than AOD: 18 versus 8 sites. The final UKESM1 configuration has a global dust error of 0.47 (unit-less), which is just over half the error of the untuned configuration; the reduction comes from a combination of the changes to dust parameters and the reduction in bare soil area described above. For context, the value of this metric is 0.51 for GC3.1 and was 0.98 for HadGEM2-ES.

Terrestrial Productivity
Carbon fluxes and stores are key performance indicators for an Earth system model, due to their importance in regulating carbon feedbacks. The global total GPP of plants in UKESM1 was tuned downward to better agree with observations by reducing the quantum efficiency of photosynthesis. Quantum efficiency is specified separately for each PFT, and the UKESM1 values were adjusted to the lower end of the ranges found by the literature review of Skillman (2008). To inform this tuning we combined GPP from our pre-industrial tests with an estimate of the likely pre-industrial-to-present-day change based on previous models and compared this against the present-day GPP estimated by the FLUXNET model tree ensembles (MTE) data product of Jung et al. (2011). The goal was to improve both the annual mean and seasonal cycle of the global total GPP. The global total GPP for the last 20 years of the resulting UKESM1 historical simulations is 130 GtC year −1 , a little higher than the Jung et al. (2011) 5-95% range of 119 ± 6 GtC year −1 .

Surface Chlorophyll
The ocean-atmosphere coupling of PMOA emissions described in section A5 makes aerosol emission dependent upon the interactively simulated surface chlorophyll concentrations, which allows the model to capture potential feedbacks between ocean biogeochemistry and climate. However, ocean models typically have larger biases in ocean chlorophyll than other marine biogeochemistry properties (Kwiatkowski et al., 2014;Yool et al., 2013), posing the risk of introducing a large bias in aerosol load. We mitigate this risk by multiplying the chlorophyll concentration by a global scaling factor. A value of 0.5 is used for this factor, chosen to produce a similar global mean outgoing SW radiation flux at TOA to that in the atmosphere-only implementation of the PMOA emission scheme, which was driven by the observation-based GlobColour data set (Maritorena et al., 2010). This scaling factor reflects the fact that MEDUSA chlorophyll concentrations are higher than those of GlobColour. Note that this scaling of surface chlorophyll is performed only within the PMOA emission parametrization: There is no impact on the prognostic chlorophyll simulated by the ocean model.

Energy Balance
The physical core of UKESM1 was already well-tuned before adding ESM components (Kuhlbrodt et al., 2018;Williams et al., 2018), and with the exception of the snow-vegetation interaction above it was not deemed necessary to re-tune the physical model parameters. The TOA radiation balance was however altered by the addition of ESM components, through a combination of changes in cloud, surface albedo, and radiatively active gases. Without further tuning the net downward radiation at TOA in 1850 was −0.81 W m −2 , which would have resulted in a significant downward drift in model temperatures. We brought the TOA radiation into balance by tuning parameters in the Anderson et al. (2001) parametrization of DMS seawater concentration to permit lower minimum values of DMS. The standard configuration of this parametrization has a prescribed minimum value of 2.29 nM for seawater DMS concentration, while gridded observational DMS data sets, such as those of Kettle et al. (1999) or Lana et al. (2011), contain significantly smaller values than this over large regions of the ocean. Anderson et al. (2001) themselves point out that the data from which their parametrization is derived is likely to contain a sampling bias toward higher values, so a lower minimum is reasonable. We reduced the minimum from 2.29 to 1.00 nM, extending the Anderson et al. (2001) linear relationship (between DMS concentration and log 10 of chlorophyll concentration, surface SW, and nutrient availability) to lower values of DMS. The ensuing reduction in DMS gave widespread decreases in cloud droplet number (and thus cloud albedo) across the Southern Ocean and stratocumulus regions, resulting in a drop in reflected SW at TOA of 2 to 5 W m −2 over large areas of the ocean.
Using the ocean DMS parametrization to balance the TOA, rather than for example cloud parameters, allows greater traceability between UKESM1 and its physical core GC3.1, by minimizing the number of physical model parameter changes between the two models. While the goal of this tuning was to balance the pre-industrial TOA, we note that the mean TOA over the last 20 years of the tuned historical run is within the Johnson et al. (2016) present-day observational estimate of 0.61-0.81 W m −2 (5-95% confidence).

Effectiveness of Model Tuning
Despite the difficulties noted at the start of this section, the tuning exercise was successful in reducing the biases it sought to address. The metrics in table 2 which summarize these biases are all improved relative to the untuned equivalents; that is, the pre-industrial net TOA is closer to zero, and present-day values of the other metrics are closer to their observed values. Recall that the tuning decisions were based mainly on pre-industrial tests, so the present-day performance could not be assured until historical simulations were completed. In the case of the snow-vegetation interactions affecting the outgoing SW metric, the tuning possibly went too far since the present-day value of the "snow SW" metric changed from low-biased to high-biased, though still closer to the observations. This illustrates the imprecise nature of the tuning exercise.
Finally we note that, while the parallel runs presented in Table 2 give a good indication of the magnitude of the direct impact of tuning, they are likely to underestimate the full impact. This is because these tests are designed to be relatively short (25 years), in order to minimize confounding internal variability. While there is minimal drift after the first 5 years, the increased GPP and dust emissions in particular would produce slow responses in terrestrial carbon stores and ocean biogeochemistry respectively which could take more than a century to equilibrate. Assessing the full impact of these changes would require the untuned pre-industrial configuration to be run to equilibrium, before initializing a new ensemble of historical simulations; such an exercise is too costly to perform at this time.
We have not yet assessed the impact of this tuning on emergent properties such as radiative forcing or climate sensitivity (as done for example by Mauritsen et al., 2012); this will be considered in future work.

Evaluation and Characterization of Model Behavior
In this section we characterize the behavior of the model, using coupled simulations that form the core set of experiments for CMIP6  1. piControl: a pre-industrial control with 1850 forcing; over 1,100 years of this simulation have been completed. 2. abrupt4xCO2: as piControl but with CO 2 concentration instantaneously quadrupled; 150 years duration. 3. 1pctCO2: as piControl but with CO 2 concentration increasing at 1% per year; 150 years duration. 4. historical: simulating the evolving climate since the pre-industrial era, with transient forcing from 1850 to 2014.
The pre-industrial control is initialized from the end of a 500-year coupled spin-up. The coupled spin-up was itself preceded by 5,000 years of ocean-only simulation which used an asynchronous coupling procedure, and by 1,000 years of land-only spin-up. The spin-up methodology and evolution will be documented in a companion paper in this special issue. Historical results in the present paper make use of nine members of an initial condition ensemble, with each member initialized from a different date in the pre-industrial global mean net TOA radiation; global mean 1.5 m air temperature; hemispheric sea ice extent (area of model grid boxes with >15% ice cover); global total area of aggregated plant functional types (note reduced scale for shrubs); cumulative carbon uptake (dashed lines are ±0.1GtCyear −1 ). In the upper three panels, thin lines represent annual means and thick lines a centered 21-year rolling mean. In the lower two panels, all lines represent annual means or accumulations.
control. Unless otherwise stated, results from the historical simulation use a mean of these nine ensemble members.
In section 4.1 we explore the drift and variability of the pre-industrial control. Sections 4.2-4.6 evaluate the latter period of the historical simulations with reference to observations. Finally in section 4.7 we examine emergent properties of the model diagnosed from the full historical period and from the idealized CO 2 experiments.

Pre-Industrial Trends and Variability
The pre-industrial control simulation is important for quantifying the model's natural variability and any long-term drifts in the model state. Significant drifts in climate or the carbon cycle would compromise the model's utility for many applications, particularly in assessing future Earth system responses to anthropogenic emissions. Figure 3 shows how five key properties of the model vary over the pre-industrial control. Generally drift is very small relative to internal variability.
There is a relatively constant outgassing of CO 2 from the ocean at a rate of ≈ 0.02GtCyear −1 , slowing slightly as the run progresses, and a release of soil carbon of a similar magnitude. Both of these, and the total carbon uptake, are well below the ±0.1GtCyear −1 threshold set by C4MIP ) as a guide for acceptable drift in carbon pools, shown as dotted lines in Figure 3.
Net TOA radiation is very close to zero, which ensures that there are no discernible drifts in surface temperature or sea ice extent. TOA radiation does drift very slowly over the course of the run with a mean of 0.009 W m −2 over the first 500 years, and 0.038 W m −2 over years 500 to 1,000.
Surface air temperature shows strong multi-century variability, most notably during years 400 to 900, with swings of up to 0.3 K over a 100-year period ( Figure 3). This is linked to episodes of oceanic deep convection in the Southern Ocean which bring warm saline deep Atlantic water to the surface, leading to a significant reduction in sea ice and loss of accumulated ocean heat to the model atmosphere. The warm, convecting, phases of these cycles coincide with reductions in Antarctic sea ice cover.  Aggregated vegetation types show variability on a range of timescales. Grasses expand and retract into regions of bare soil on interannual and decadal timescales in response to local variations in rainfall and surface climate, resulting in a clear anti-correlation between grass and bare soil in Figure 3. Tree coverage evolves on a much slower timescale, with a small decrease of around 1% during the first 600 years of the run, followed by a slight recovery, mostly at the expense of shrubs.
The UKESM1 historical simulations are initialized using atmosphere-sea ice-land-ocean conditions taken at different points in the pre-industrial control simulation. Ideally, these initial conditions will sample the model's multi-decadal internal variability so that each historical run is initialized using well-separated states of the pre-industrial coupled model. To achieve this, we consider two large-scale modes of sea surface temperature (SST) multi-decadal variability, the Inter-decadal Pacific Oscillation (IPO; Power et al., 1999;Zhang et al., 1997) and the Atlantic Multi-decadal Oscillation (AMO; Kerr, 2000).
The IPO pattern is the largest multi-decadal quasi-periodic SST variability over the Pacific basin. It is represented by the first area-weighted Empirical Orthogonal Function (EOF) of deseasonalized, detrended, and low-pass filtered monthly SSTs. Figure 4 shows the IPO pattern for HadISST1.1 (1870-2017; Rayner et al., 2003) and for the UKESM1 pre-industrial control simulation. The IPO index is the principal component of this EOF. The AMO pattern is a quasi-periodic oscillation of North Atlantic SSTs. The AMO index is calculated as the area-weighted mean time series of the deseasonalized, detrended, and low-pass filtered North Atlantic SSTs. Low-pass filtering the data minimizes the influence of the El Nino/Southern Oscillation (ENSO). Observed and modeled AMO pattern are shown in Figure 5.
The observed and simulated patterns are similar, for both modes of variability. The most notable difference between the two IPO patterns is the westward extent of the equatorial warm region, which extends further toward the maritime continent in the simulation. The AMO patterns differ in terms of the northerly position of the warmest region. Nevertheless, the similarity of these climate mode patterns provides confidence that the simulation is, to first order, replicating the observed multi-decadal variability in the Pacific and North Atlantic basins.
The phase space spanned by the observed and modeled IPO and AMO indices is shown in Figure 6. The observations show no clear relationship between the IPO and AMO and populate all four sectors of phase space. The model has a much longer time series and hence explores the phase space more fully. Consistent with the observations, the IPO and AMO appear to be independent of each other in the model simulation. We use the simulation phase space to choose well-separated initial conditions for each historical run (marked in    Figure 6), thus ensuring that the beginning of the historical ensemble samples the multi-decadal variability of the ocean. For the same reason, we also prescribed that initial conditions for historical members were separated by at least 30 years in the pre-industrial control.
The model also exhibits a realistic simulation of the El Nino/Southern Oscillation (ENSO). The addition of ESM components has made no significant difference to the ENSO behavior of the physical model GC3.1 documented in Kuhlbrodt et al. (2018). We have repeated the ENSO analysis of Kuhlbrodt et al. (2018), and the results are statistically indistinguishable (not shown).

Present-Day Surface Climate
In Figure 7 we evaluate climatological seasonal mean sea surface temperature against the HadISST1.1 observation data set (Rayner et al., 2003). As is common in ocean models of this resolution (Flato et al., 2013), UKESM1 exhibits a strong cold bias in the NW Atlantic, related to the too-zonal representation of the Gulf Stream and North Atlantic current. Again, similar to other low-resolution models, we see a year-round cold bias in the north Pacific, and a too-westward extension of the equatorial Pacific cold tongue during JJA.
Overall the pattern of SST bias is similar to that of GC3.1 (Kuhlbrodt et al., 2018), including a warm bias of 0.5-2 K across much of the Southern Ocean.  (Adler et al., 2003), respectively. UKESM1 shows a wintertime cold bias over North America, Europe, and NW Russia. One driver of this cold bias relates to the SST biases noted above, influencing land temperatures through atmospheric advection. These continental cold biases are larger than the mean upwind SST bias, suggesting a potential feedback involving snow cover, snow albedo, and surface temperatures whereby a cold surface temperature bias is maintained in the model through excess snow cover, both spatially and temporally through the winter season.
In common with the CMIP5 generation of models (Flato et al., 2013), UKESM1 exhibits a double intertropical convergence zone in the tropical Pacific, linked to the Equatorial cold tongue bias. Tropical Atlantic precipitation is too far south, and in DJF this pattern extends westward over South America. In JJA UKESM1 exhibits the same dry Indian monsoon bias described by Williams et al. (2018) for GC3.1. This dry bias leads Precipitation rates are binned with pseudo-exponential bin sizes (see Klingaman et al., 2017) to account for more frequent rain events at low rain rates. Each bin frequency is then multiplied by the mean bin rate, which gives the precipitation contribution from each bin. The x axis is plotted with a log scale so the area under the curve is approximately the mean precipitation.
to a vegetation deficit (described in section 4.3) and some of the Earth system feedbacks discussed by Martin and Levine (2012).
While our emphasis in this paper is on evaluating UKESM1 at larger spatial scales (i.e., continental or ocean basin scales) and longer time scales (multi-decadal to centennial), UKESM1 will be used to make a set of future projections following the CMIP6 scenarioMIP protocol (O'Neill et al., 2016). Output from these projections will be used to drive impacts models (e.g., to assess future risks to food security or water resources). Such models require high temporal and spatial resolution surface climate data as forcing. Furthermore, assessing a model's higher time frequency statistics increases confidence in the simulated long-term means, variability, and future changes, all of which are based on the underlying high time frequency behavior. As an initial evaluation of UKESM1, Figure 10 shows simulated and observed daily timescale precipitation for four distinct climatic regions, and Figure 11 shows a similar analysis for monthly mean near-surface air temperature.
For each of the regions, Figure 10 plots the fractional contribution from different daily precipitation rates to the total seasonal accumulation, calculated for the seasons specified in the figure labels. Observations include the GPCC/HOAPS v1 global daily precipitation (Andersson et al., 2010;Schamm et al., 2014), based on both gauge and satellite observations, the Global Precipitation Climatology Project (GPCPv1.1; Adler et al., 2003), and the Tropical Rainfall Measuring Mission 3B42v6 product (TRMM; Huffman et al., 2007). All observational data were first re-gridded onto the UKESM1 grid. TRMM data are only used over the SEA and AMZ regions, whereas GPCC is not used over the predominantly ocean area of SEA.
Over EUR we show winter (DJF, full lines) and summer (JJA, dashed lines) precipitation, with the nine UKESM1 members in black and the observations in different shades of blue. In both seasons the distributions are skewed toward a long tail of light precipitation (<1 mm day −1 ) that contributes to the seasonal accumulation. UKESM1 captures this skewness quite well. In DJF the model slightly overestimates precipitation contributions from moderate to heavy daily rates (3 to 10 mm day −1 ) but does capture the heavy rain tail of the distribution. Conversely, in JJA moderate to heavy rainfall is underestimated by the model while Figure 11. Frequency of occurrence of monthly mean near-surface air temperature over the period 1961-2014 binned into 1 • C wide classes for the same four regions as in Figure 10. Observations are from the CRUTS4.02 data set (thick black line), and nine UKESM1 historical ensemble members are shown (thin colored lines). Only grid points that include a fraction of land greater than 0.995 are included in the spatial means.
very heavy events (>20 mm day −1 ) contribute too much to the seasonal accumulation. As the moderate to heavy daily rates contribute more to the seasonal accumulation (i.e., they occur more frequently) than the very heavy rates, the underestimate of the former results in a modest dry bias over EUR in JJA, primarily in central and south-east Europe (Figure 9).
Turning to the EAS region, where precipitation is dominated by the summer monsoon, we plot distributions for the summer (wet) season (JJAS) and the winter (dry) season (NDJFM). In JJAS UKESM1 significantly overestimates the contribution of rainfall events >30 mm day −1 , with some unrealistic daily rates in excess of 100 mm day −1 . Figure 9 indicates that this results in a significant wet bias, primarily over the ocean part of EAS (i.e., the west Pacific warm pool region). In NDJFM seasonal precipitation is significantly lower than JJAS in both observations and the model, although the model systematically overestimates precipitation from most of the bins. Again Figure 9 suggests this overestimate is mainly along the coastal ocean in EAS.
To the south of the EAS region, over the maritime continent and west Pacific warm pool (SEA) the rainfall distribution in the main wet months (NDJFM) is very well captured, with only a modest overestimate in the model across most bins. Finally, the AMZ region shows precipitation over the Amazon rainforest. While the overall spread of the observed distribution is captured by the model, there is a significant underestimate of moderate to heavy rainfall days (approximately 8 to 30 mm day −1 ) that leads to the seasonal mean dry bias seen over the central Amazon basin in Figure 9, which is surrounded to the south by a wet bias of equivalent magnitude.
In Figure 11 for the same four regions we compare spatially averaged 1.5 m monthly mean air temperatures between UKESM1 and CRUTS4.02 observations (Harris et al., 2014). From this we wish to see if UKESM1 captures the spread of cold/warm months around its climatological mean value, providing an evaluation of the inter-annual variability of near-surface climate in the model. For all regions we first interpolate the 0.5 • CRUTS data onto the UKESM1 grid and then mask out ocean and coastal points, using the UKESM1 land-sea mask and a criterion that a grid point has to have fractional land amount >0.995 to be included in the spatial average. Considering EUR, in winter (DJF) a systematic bias is evident in the model distribution, with mode values shifted relative to CRUTS by 3-4 • C. Nevertheless, the spread of cold/warm months around the mode of the distribution, as well as the shape (skewness), are well captured in the model. Figure 8 indicates the bulk of the cold bias in EUR is over Scandinavia and European Russia, as noted above. In the summer (JJA) a smaller cold bias remains in the model distribution, which continues to be located in the northern part of EUR. A similar cold bias is seen over EAS during boreal winter, while the summer distribution is well simulated, albeit with a slight tendency for too frequent cold months (surface temperatures <18 • C) not seen in CRUTS. Over AMZ, the mode of the UKESM1 distribution is quite accurate, although the model now has a bias toward extreme warm months (mean temperatures >27 • C) not seen in CRUTS. These are likely associated with the dry precipitation bias over the region (Figure 9), shifting the surface energy balance from predominantly evaporative toward a larger sensible heat component. Finally, the SEA region shows a cold bias in the model distribution. It should be noted for SEA that due to the small number of land points used in the spatial means and the extreme topography of islands in this region, neither the model nor CRUTS data are likely a robust representation of true land temperatures in this region. We plan to extend our analysis of UKESM1 near-surface temperatures in a future study, considering daily temperature variability, as well as daily maximum and minimum values to better isolate model processes controlling biases in regional surface temperature. Figure 12 compares the areal extent of snow with the GlobSnow observation product (Takala et al., 2011) for four different snow depths. In addition to the area for observed monthly mean snow depths, the equivalent for monthly maximum depth is shown as an indicator of inter-annual variability. The model has too much very thin snow compared with the observations, but a comparable amount of deeper snow. The deeper snow tends to thaw slightly later than the observations in the spring but increase at a similar rate to the observations in the autumn.  In general there is very good agreement between the modeled (shading) and the observed (orange lines) location of the ice edge as depicted by the 15% concentration contour. In the Arctic the model sea ice is, overall, a little more extensive than the observations, particularly so in the north Pacific (Bering Sea and Sea of Okhotsk) in winter (March). The model also has too much winter sea ice north of Iceland and does not reproduce the Oden ice tongue, which is captured in the HadISST data set. In the summer the Arctic sea ice is slightly over extensive within the Arctic Ocean basin, except for in the Kara Sea where the sea ice extent is much smaller than observations. In the Antarctic the sea ice is slightly less extensive than observed, particularly in the Ross Sea and east of the Weddell Sea sector south of Africa.
In response to the cool northern hemisphere surface temperature reported above and in section 4.7, the present-day sea ice is much thicker than observational estimates. Mean September sea ice thickness for . Areas with sea ice less than 15% are masked out. Also shown (orange lines) are corresponding 15% sea ice concentration contours from the HadISST.2.2.0.0 data set (Titchner & Rayner, 2014). Inlaid boxes show the observed and modeled sea ice and the corresponding Integrated Ice Edge Error in units of 10 6 km 2 (IIEE; Goessling et al., 2016). 1990-2009 is over 4.5 m throughout the central Arctic (not shown). However, despite the high modeled thicknesses, the spatial extent of sea ice in UKESM1 agrees well with the HadISST observational data set.

Terrestrial Biogeochemistry
To have confidence in the processes responsible for terrestrial carbon uptake, one needs to evaluate model simulated vegetation structure, surface carbon fluxes, and carbon stores. If a model performed well for one or two of these properties, but not all three, it would suggest that the model was getting the right answer for the wrong reasons. In this section we evaluate and characterize the behavior of each of these three aspects.
UKESM1 represents vegetation structure with nine natural PFTs and four crop/pasture PFTs, and the fractional coverage of each PFT within a grid box is determined by a competition hierarchy. We evaluate the vegetation distribution using two complementary approaches: first with spatial maps of aggregated PFTs ( Figure 15) and second by aggregating fractions within biomes (Figure 16). The biomes are regions of the world defined by Olson et al. (2006) based on their common ecosystem characteristics. Within each of these regions Figure 16 shows the fractional coverage of aggregated PFTs for UKESM1 and the IGBP-LUCC and CCI-LC land cover data sets; the same region definition is used for all three data sets. Broadly the model does a very good job in representing vegetation structure of the biomes. Notable biases include an excess of C3 grass in tundra regions which the observations indicate should contain more bare ground; this can be seen in the northern Siberian C3 grass fraction in Figure 15. Some of this bias is a side effect of the lai min tuning described in section 3, which improved the bare soil fraction in dust-producing regions but has allowed grass to spread too extensively in northern Russia and Canada. In contrast, Saharan bare soil extends too far south, with a deficit of grassland in the Sahel. Similarly there is excess bare soil in eastern India. Both of these biases result from precipitation deficits in these regions, associated with errors in the position and intensity of monsoon rains (see Williams et al., 2018). There is a small overestimation of tree fraction in savanna biomes, most notably on the southern edge of the Amazon forest. This is due to the lack of fire disturbance, the inclusion of which would be expected to improve vegetation structure in these regions (Burton et al., 2019). Figure 17 shows the evolution of terrestrial carbon fluxes in the ensemble mean of the historical simulations. From 1850 to 1980, decadal-mean GPP slowly increases from 118 to 122 Pg year −1 , after which it increases rapidly to 134 Pg year −1 in the last decade of the experiment. GPP in UKESM1 is slightly higher than observational estimates, which are around 120 Pg year −1 in the last few decades (Jung et al., 2011). Net ecosystem productivity (NEP), the balance of GPP and respiration, also increases more rapidly after 1980. NEP is balanced by a small land use change emission flux and a much larger crop harvest flux; the sum of these three Figure 16. Fractional coverage of aggregated PFTs within seven "biome" regions defined by Olson et al. (2006). Each biome shows three columns: UKESM1 (historical ensemble mean for the year 2005), IGBP land cover, and CCI land cover.
fluxes is the net biome productivity (NBP), the total carbon flux from the atmosphere to the terrestrial biosphere. The land is a net source of carbon prior to 1980: cumulative NBP is −40 Pg from 1850 to 1979. After 1980 the land becomes a sink of atmospheric CO 2 : cumulative NBP between 1980 and 2015 is 22 Pg.
Generally the model has an excellent zonal distribution of vegetation carbon (Figure 18), lying between the two observational estimates. The exception is a high-bias between 25 • S and 10 • S, reflecting the excess of trees in savanna biomes noted above. Global total vegetation carbon is 551 Pg in year 2000 and 583 Pg in 1850 which is consistent with observation-based estimates (450-650 Pg in 1750; Ciais et al., 2013); 72% of vegetation carbon is stored in tropical forests and 26% in extra-tropical forests, with the remaining 2% being stored in shrubs and grasses.
At 1,780 GtC, the UKESM1 global total soil carbon is less than that estimated by the Carvalhais et al. (2014) data set for the whole soil profile (2,450 GtC) and slightly less than that estimated by Batjes (2016) for the top 2 m (1,970 GtC). The model represents the top 3 m of soil. There is not enough soil carbon simulated for the northern high latitudes ( Figure 18); this is partly because turnover times are too fast (not shown) but also because there is not yet a representation of permafrost carbon included within UKESM1. The simulated peak of soil carbon in the tropics is slightly further south than in any of the observational data sets.

Ocean Biogeochemistry
Ocean biochemistry in UKESM1 plays a fundamental role in the carbon cycle, acting as a source or sink of atmospheric CO 2 and in the aerosol simulation via natural emissions of DMS and PMOA. In this section we focus on these two roles, evaluating the present-day surface flux of CO 2 and the productivity which underpins the aerosol emissions. Figure 19 shows simulated net primary production, averaged over the period 2000-2009, compared with three satellite-derived observational estimates for the same period (Behrenfeld & Falkowski, 1997;Carr et al., 2006;Westberry et al., 2008, these three estimates are averaged here to a single field, as per Yool et al., 2013). In general, the model simulates the main geographical patterns of high productivity in upwelling regions, low production in subtropical oligotrophic areas, and seasonally high productivity in high latitude areas. However, modeled productivity is more tightly focused on the equator, especially in the Pacific Ocean. Oligotrophic areas exhibit lower production in the model, in part due to unrepresented mesoscale processes, and high latitude productivity is elevated above that observed, especially in the Southern Ocean.    Ocean, and ingassing (red) at higher latitudes, in particular the North Atlantic. The model does show some discrepancies, for instance excessive outgassing in the upwelling areas associated with the Humboldt and Benguela Currents, and excess ingassing close to Antarctica linked to a deficit in sea ice cover in spring and summer.

Atmospheric Chemistry
A thorough evaluation of the chemistry performance of UKESM1 can be found in Archibald et al. (2019); here we focus on those aspects which represent major enhancements relative to HadGEM2-ES, namely, stratospheric chemistry, interactive photolysis, and biogenic emissions of VOCs.
First we assess the quality of the simulation of total column ozone (TCO), which is dominated by the stratospheric ozone layer, versus the NIWA-Bodeker Scientific TCO database (version3.4; Bodeker et al., 2005) covering 1979-2016. We also use the TCO record measured at South Pole using a Dobson photospectrometer (Rosen et al., 1993;WOUDC, 2014). Figure 21 shows the zonal-mean TCO averaged over 1995-2014, averaged over two historical UKESM1 simulations, and the difference versus the climatology. In the extra-polar regions, there is a general overestimation of ozone of around 20 to 40 Dobson Units (DU), whereas in the Antarctic, the model underestimates ozone relative to the climatology by up to −60 DU in November and December. Since satellite-based observation of TCO is particularly difficult at high latitudes, we compare this result directly to the Dobson long-term measurements at South Pole ( Figure 22). The comparison shows a good agreement with the Dobson record in October, but in November and December, the model only tracks the most stable, deepest ozone holes and does not reproduce the observed interannual variability. Linear trends during the period of increasing ozone depleting substances (1979)(1980)(1981)(1982)(1983)(1984)(1985)(1986)(1987)(1988)(1989)(1990)(1991)(1992)(1993)(1994)(1995)(1996)(1997) mirror this finding ( Figure 23). In the tropics the model does not produce significant trends, in agreement with the climatology, but in both polar regions the model produces too large  negative trends during this period during spring and summer in both hemispheres. A separate analysis of trends during the ozone recovery period (1998)(1999)(2000)(2001)(2002)(2003)(2004)(2005)(2006)(2007)(2008)(2009)(2010)(2011)(2012)(2013)(2014) shows generally non-significant trends, in agreement with the observations. An analysis of the vertical distribution of ozone and its trends (not shown) reveals a generally good correspondence with observations but too strong a downdraught of mesosphere air during polar winter which contributes to low TCO during spring and also the general overestimation of ozone in the extra-polar ozone layer discussed above.
The good correspondence in October suggests that the sensitivity of ozone to anthropogenic chemical ozone depletion is generally correct, broadly in agreement with scientific consensus on the underlying gas-phase chemistry (World Meteorological Organization, 2018), but the underestimation of variability in November and December suggest insufficient, and insufficiently variable, meridional heat transport toward both poles which would serve to inject variability into TCO and would shorten, on average, the lifetime of the polar vortex (Garfinkel et al., 2013). This is consistent with a stratospheric cold bias of 5-8 K over the Antarctic and a polar night jet that is too strong by more than 10 m s −1 (not shown).
The general overestimation of ozone in the extra-polar regions would be partly addressed by a more comprehensive inclusion of heterogeneous reactions on sulphate aerosols involving bromine compounds (Dennison et al., 2019), which is not part of UKESM1. The nine CMIP5 models that did simulate stratospheric ozone interactively exhibited substantial biases (Eyring et al., 2013). Relative to this previous generation of models, UKESM1 compares well to the reference climatology.
Turning now to the troposphere, Figure 24 shows that for the historical simulation of UKESM1, the burden of ozone in the troposphere (defined as in Tilmes et al., 2016) has increased steadily over the period 1850-2014. The burden of ozone has recently been evaluated for the present-day (2014-2016) by Gaudel et al. (2018) as 300 ± 12 Tg from 60 • S-60 • N. The UKESM1 tropospheric ozone burden, subset over the same latitude range for the year 2014 is 314 Tg, in excellent agreement with the observations. UKESM1 simulates an Figure 24. Time series of the evolution of the tropospheric ozone burden in six members of the historical ensemble of the UKESM1 simulations (colored lines). The black line shows the ensemble mean. Tropospheric ozone burden is calculated as the total mass of ozone between the surface and the tropopause, using a tropopause based on monthly mean ozone mixing ratio of 150 ppb. increase over the historical period (1850-2014) of more than 60 Tg of ozone in the troposphere, an increase of approximately 23% from a pre-industrial baseline of 260 ± 4 Tg (1850-1875) to 320 ± 7 Tg (1995-2010).
Methane lifetime is a core metric for atmospheric chemistry models due to the role methane plays in multiple chemical cycles, with impacts on ozone, aerosols (via oxidant concentrations), as well as exerting a strong direct radiative effect in its own right. The methane lifetime in UKESM1 is 8.4 years (calculated as total atmospheric burden divided by chemical loss and deposition for the period 1995-2010). This lies within the 9.1 ± 0.9 year range estimated by Prather et al. (2012) based on constraints from methyl chloroform, and the 7.6 to 14 year range of SPARC (2013). Table 3 summarizes methane lifetime along with five other metrics of the global chemical state of the model.  Figure 26 shows the UKESM1 ensemble annual mean AOD at 440 nm for the period 1992 to 2008. Annual mean observations of AOD covering the same period from the ground-based, global AERONET  are overlaid on the simulated global AOD map. Overall, the model captures the spatial distribution of AOD well. This is reflected in the good correlation found in the point-wise comparison shown in Figure 26 (r 2 = 0.63). AOD over continental polluted regions and tropical biomass regions is generally well-represented in the model, including the east-west contrast in continental North America. There are fewer observations in remote marine regions, but where observations exist, the model is able to capture the low AOD values in the Pacific Ocean and downwind of anthropogenic pollution source regions in the Atlantic Ocean.

Aerosols
The model AOD in Figure 26 is biased low at sites close to dust sources in West Africa and the tropical Atlantic. This is examined in more detail in Figure 27 which compares the model against seasonal mean AERONET AOD at dust-dominated sites and near-surface dust concentrations from the University of Miami observing network. The AODs in this figure are consistent with the annual mean result, with seasonal mean AOD lower than the observations by 39% on average. We note that these stations are all at similar latitudes in a region strongly affected by the West African Monsoon, which is challenging to simulate, and small biases in wind-speed or precipitation in this area would strongly affect local dust production. Thus the bias in AODs may not be representative of the wider dust simulation. Indeed, dust concentrations over the Atlantic are generally well-simulated ( Figure 27). Further afield, there is some high bias in concentration over the Pacific and low bias over the Southern Ocean.

Emergent Behavior of the Coupled Model
We consider some emergent properties of UKESM1 using metrics of global climate sensitivity and carbon uptake. First, equilibrium climate sensitivity (ECS) estimates the equilibrium response of global mean surface air temperature to a doubling of CO 2 . While this cannot be directly evaluated against observations, it is a useful metric for comparing the response of models to CO 2 perturbations across multiple generations of models. Figure 28 shows the temporal evolution of surface air temperature in response to an abrupt CO 2 quadrupling (the CMIP6 abrupt4xCO2 experiment; Eyring et al., 2016). We calculate effective climate sensitivity as in Gregory et al. (2004) by regressing the change in TOA net radiation against surface temperature over the first 150 years of the run. The projected equilibrium temperature change from the abrupt4xCO2 experiment is divided by 2 to correspond to a CO 2 doubling. The resulting ECS for UKESM1 is 5.4 K.
Transient climate response (TCR) is calculated from the CMIP6 1pctCO2 experiment in which CO 2 is increased at 1% per year. TCR is defined as the temperature change from the pre-industrial control after CO 2 has doubled, averaged over years 61-80. A four-member initial condition ensemble for UKESM1 ( Figure 28) have TCR in the range 2.68-2.85 K. Transient climate response to cumulative emission (TCRE; Gillett et al., 2013) extends the TCR concept to use carbon feedbacks from an ESM in order to project the warming due to a cumulative CO 2 emission of 1TtC. Following Gillett et al. (2013) we calculate TCRE as TCR (as defined above) divided by the cumulative compatible anthropogenic emission up to year 70 of the same 1pctCO2 experiment. CO 2 concentration is prescribed in these experiments, but the anthropogenic emission compatible with the model's carbon uptake can be calculated as the change in the total mass of carbon in the ocean, land, and atmosphere. To remove the (very small) background drift in ocean and land carbon stores ( Figure 3) we calculate the compatible emission as the difference between the 1pctCO2 store at year 70 and that of pre-industrial control in the same year, yielding a range of compatible emissions from 1.07 to 1.08 TtC and TCRE from 2.49 to 2.66 K TtC −1 .
The UKESM1 values of ECS, TCR, and TCRE are all higher than those of CMIP5 models (respectively 2.1 K to 4.7 K, 1.0 K to 2.6 K, and 0.8 K TtC−1 to 2.4 K TtC−1; Andrews et al., 2012;Gillett et al., 2013). Elsewhere in this special issue, Bodas-Salcedo et al. (2019) analyze the increase in atmospheric climate feedbacks in HadGEM3-GC3.1 relative to the previous version of HadGEM3, whose ECS (3.2 K; Senior et al., 2016) is within the range of CMIP5 models. They find that the feedbacks have become more positive as a result of improvements to cloud microphysics and cloud-aerosol interactions. Both of these developments reduce errors in the present-day simulation of clouds relative to observations. These stronger feedbacks contribute to increases in all three of these metrics; TCRE is additionally increased relative to CMIP5 models by the  inclusion of nitrogen limitation which reduces the terrestrial carbon sink for a given CO 2 atmospheric concentration and hence increases the CO 2 concentration and corresponding warming compatible with a 1TtC cumulative emission. More detailed analysis of the climate sensitivity and carbon feedbacks in UKESM1 will be reported in subsequent papers.
Carbon uptake can also be diagnosed from historical, rather than idealized, simulations and thus compared against estimates rooted in historical observations. In Table 4 Table 4 represent the carbon flux due to clearance of above-ground biomass, including crop harvest and deforestation. The model's total emission due to land use includes an additional contribution from decomposition of below-ground biomass which we have not diagnosed. The total land use emissions will hence will be closer to, or above, the GCB central estimate.
Surface temperature is one of the few variables for which reliable observations cover the full period of the 1850-2014 historical simulation, which allows us to evaluate the model's first-order climate response to the evolving forcing over this period. The UKESM1 global mean surface temperature anomaly in the historical ensemble is shown in Figure 29, alongside the HadCRUT4 observation data set . The observations represent only a single realization of the internal variability of the climate system, so one should not expect the model ensemble to be centered on the observations, but rather that the range of observational uncertainty overlaps the ensemble range (under the assumption that the model ensemble is large enough to sample the relevant internal variability). Most ensemble members begin to warm in the early 20th century, then cool strongly between 1950 and 1970 before warming rapidly through to the end of the simulation. The observations show a limited cooling of 0.1 to 0.2 K during 1940 to 1970, but the model ensemble mean cools by nearly 0.4 K over the same period. All ensemble members also show a stronger cooling response to the large volcanic eruptions of 1883, 1963, and 1991 than is seen in the observations. Note. UKESM1 land use emissions in this table include only the CO 2 flux from above-ground biomass; GCB land use emission estimate includes both above-and below-ground biomass. Figure 29 also separates the mean surface temperature into northern and southern hemisphere time series. The stronger-than-observed cooling is restricted to the northern hemisphere which, together with its temporal evolution, points to either aerosol or land use forcing as the prime driver. Further investigation into this discrepancy will be the subject of future work. In the southern hemisphere the model ensemble overlaps with the observational uncertainty for the entire duration of the experiment, with the exception of the dip following the Mt. Pinatubo eruption in 1991 noted above.

Discussion and Conclusions
In this paper we describe and evaluate the new Earth system model UKESM1, which has been jointly developed by NERC and the Met Office. UKESM1 represents a major advance on its predecessor model HadGEM2-ES, both in the sophistication of its component models and in the complexity of the couplings between these components. The most important advances are 1. inclusion of nitrogen cycling within the terrestrial carbon cycle, and increased ocean biological complexity, in order to better predict allowable anthropogenic emissions; 2. major advances in the treatment of land management, with the separation of crop and pasture areas, and the introduction of a harvest carbon flux; 3. a well-resolved stratosphere with interactive stratospheric chemistry to assess the impact of changes in ozone depleting substances; 4. a state-of-the-art aerosol scheme with prognostic size distribution for a better understanding of aerosol radiative forcing; 5. coupling between the ocean biogeochemistry and atmospheric composition via interactive emissions of primary marine organic aerosol; 6. coupling between the terrestrial biosphere and atmospheric composition in the form of interactive BVOC emissions from vegetation; and 7. coupling between atmospheric chemistry and physics through water vapor changes arising from chemical reactions.
Achieving acceptable performance in the presence of this additional model complexity posed major challenges during the development and tuning of UKESM1. The greatest difficulties arose where one component model was sensitive to biases in the variables it received from another component. One example of this is the dependence of dust emissions on bare soil fraction, and the sensitivity of bare soil to the model's climate in semi-arid regions. Through careful tuning of vegetation and dust parameters we were able to significantly reduce both bare soil and dust errors. However, in some regions (particularly western India) the climate biases were such that we could not obtain a good simulation of vegetation. In this respect we still have not answered some of the difficult questions posed by Martin and Levine (2012) regarding coupling and component model biases in Earth system models, though we have made significant steps forward.
Generally, the model performs very well. The pre-industrial control is stable, with acceptably small drifts in carbon pools and virtually no drift in surface climate over 1,100 years. The Southern Ocean exhibits large multi-century oscillations in surface temperature and sea ice extent. Further work is in progress to understand the mechanisms behind this variability, but there is a growing body of evidence which indicates that such oscillations may be a feature of the natural state of the Earth system (e.g., Kurtakoti et al., 2018;Reintges et al., 2017;Zhang et al., 2018).
Evaluation of the latter portion of the historical experiments against observations is generally positive. Particular highlights are vegetation structure, which has reduced or eliminated biases in the Boreal forests and Australia that were seen in the predecessor model HadGEM2-ES (Collins et al., 2011), and stratospheric ozone, which exhibits smaller biases than the previous generation of coupled chemistry-climate models with interactive stratospheric chemistry (Eyring et al., 2013).
The most significant deficiency in the model's performance is its stronger-than-observed global mean surface cooling in the mid-20th century, followed by stronger-than-observed warming after 1990. The cooling is most likely related to aerosols and/or land use change, suggesting that the combined radiative forcing of these processes is excessive during this period. While these processes will not have a first-order impact on century-scale projections at a global scale (which are dominated by greenhouse gas forcing), understanding and improving this bias will be a priority for future configurations of UKESM1. The strong warming toward the end of the century may be due to the reduction of aerosol emissions in Europe and North America, potentially adding weight to the hypothesis that aerosol forcing is too strong in this model prior to 1990. Or it could be a result of the model's high climate sensitivity (discussed below) or a combination of both. There are large uncertainties in the magnitude of the true historical forcing due to aerosols and land use change, and these results present an opportunity to better understand and constrain the relevant processes, with the potential to improve ESMs more generally.
In the introduction to this special issue, Senior et al. (2020) outline the development strategy for UKESM1 and its physical core HadGEM3-GC3.1, which included a step (before model finalization) of estimating effective radiative forcing (ERF) due to key anthropogenic forcing agents (GHGs, aerosols, and land use change), in order to avoid unrealistic historical performance at a global mean scale. See Mulcahy et al. (2018) in this special issue for action which was taken on aerosol forcing as a result of this strategy. These ERF experiments assessed only the 1850-2000 forcing, but as Figure 29 demonstrates, it is possible to have good agreement in global mean surface temperature toward the end of the historical period (implying that the integrated historical forcing is accurate, perhaps through a compensation of errors) and still have significant deviations prior to this. This brings interesting new evidence to bear on the questions about model development strategy which have been discussed by, for example, Hourdin et al. (2016) and Schmidt et al. (2017), and will prompt an evolution of the strategy summarized in Senior et al. (2020).
The high sensitivity of the model to CO 2 concentrations and emissions, as evidenced by the high values of ECS, TCR, and TCRE will also be a focus of further study. Bodas-Salcedo et al. (2019) show that some of the changes responsible for the high sensitivity represent significant improvements in process evaluation, but better observations and more in-depth analysis are required to assess all the processes involved. Note that these climate sensitivity and carbon feedback metrics are emergent properties of the model and were not subject to tuning during the development process.
Near-to medium-term plans for model development in the UKESM project include new model configurations with additional capabilities. First will come a model with interactive ice sheets on Greenland and Antarctica, with two-way coupling between the ice sheets and both the atmosphere and ocean. Following that, we are developing a hybrid-resolution functionality to allow the most expensive model components (e.g., atmospheric composition) to run at a reduced horizontal resolution relative to the model physics. These ambitious developments will be reported in future papers.
Other priorities for further work include evaluation of the interactions between component models. The parametrizations used in coupling schemes (such as those for emissions or deposition of gases and aerosols) are generally based on good physical understanding and empirical relationships derived from in situ observation, but it is hard to evaluate these relationships at the scale of a model grid box, let alone regional or global scales. For example, is the response of the model's bare soil fraction to its climate more or less sensitive than that of the real world? Such questions have an important bearing not only for obtaining a good present-day simulation but also for interpreting its response in model projections.
The complexity of UKESM1, both in terms of its component models and its internal coupling, is unprecedented for an Earth system model. The sophisticated interactions were included to facilitate the exploration of future feedbacks but increase the difficulty of obtaining skilful simulation of the observed period. In the face of this challenge, the generally successful performance of the model is a major accomplishment.

Appendix A: Component Model Details
In this appendix we summarize the component models of UKESM1. We do not include a detailed description of each component; rather we refer the reader to previously published descriptions of each component and focus instead on developments which have been made since these publications, and on summarizing the key model elements from an Earth system perspective.

A1. Underlying Physical Model
The components of an ESM are dependent on the core physical model for accurate simulation of winds and ocean currents (which advect the biogeochemical tracers), atmospheric moist and turbulent processes (which control chemical reaction rates and aerosol-cloud interactions), and temperature, precipitation, and soil moisture (upon which aerosol and gas removal processes, vegetation, and CO 2 uptake depend). Biases in the physical model which are tolerable in their own right can impact disproportionately on Earth system components (e.g., Hardiman et al., 2015;Martin & Levine, 2012). Simulated ocean-land surface temperature and surface and top-of-atmosphere (TOA) radiative fluxes retain their primary importance, even for the ESM components, because of their role in feedbacks.
The Earth system components of UKESM1 are built upon the state-of-the-art physical climate model HadGEM3-GC3.1 (Kuhlbrodt et al., 2018;Williams et al., 2018). The ocean component of GC3.1  uses the NEMO dynamical ocean on a nominally 1 • tripolar grid with 75 vertical levels and an explicit nonlinear free surface. The sea ice (Ridley et al., 2018) is modeled by CICE on the same horizontal grid as NEMO, with five ice thickness categories, multi-layer thermodynamics with four layers of ice and one of snow, and elastic-viscous-plastic rheology. The atmosphere (Walters et al., 2019) uses the Met Office Unified Model (UM) with 85 vertical levels on a terrain-following hybrid height coordinate, with an 85 km model top. The land surface is modeled using JULES . Land and atmosphere share the same horizontal grid: a regular latitude-longitude grid with 1.25 • × 1.875 • resolution. Surface and sub-surface runoff is transported to the ocean using the TRIP river routing scheme (Total Runoff Integrating Pathways; Oki & Sud, 2006). Comparing UKESM1 to its predecessor HadGEM2-ES, the ocean and sea ice models have been completely replaced, and the atmosphere has a new dynamical core (Wood et al., 2014), a well-resolved stratosphere, major advances in cloud physics including prognostic cloud, condensate and rain (Wilson et al., 2008), and a new modal aerosol scheme (Mann et al., 2010). The coupled physical model as a whole is evaluated in Kuhlbrodt et al. (2018) and has been shown to represent a substantial improvement upon its predecessors.
We believe that we have achieved our goal of keeping the physical core of UKESM1 as close to GC3.1 as possible. The differences are restricted to the following components: 1. enhancements to aerosol emissions to improve their process realism (adding couplings between the biosphere and atmospheric composition which were previously missing), described in section A5; 2. tuning parameters which control dust emissions and snow-vegetation interactions, the motivation for which is explained in section 3; 3. a change to the soil hydrology scheme to address a problem with soil moisture. GC3.1, as in recent versions of HadGEM3, uses an adaptation of the van Genuchten (1980) empirical equations for soil hydraulic conductivity properties . During early tests of UKESM1, it was discovered that the implementation of these equations in JULES produced unrealistic soil moisture profiles in the Earth system model. In some regions the soil moisture profile was inverted relative to observed profiles, with less moisture in the deepest soil layer (1-3 m) than the surface layer (0-10 cm). Such a lack of moisture at depth would limit the availability of moisture to vegetation during dry periods and had the potential to distort the dynamic simulation of vegetation distribution. To remedy this, UKESM1 instead uses equations based on the work of Brooks and Corey (1964) with consistent ancillaries files for soil properties, as employed in HadGEM2. This results in much higher soil hydraulic conductivity and thus more moisture in the deeper soil layers. The change of scheme makes little difference to the surface energy budget. Work is in progress to refine the van Genuchten implementation for future versions of HadGEM3 in order to remedy this issue; 4. the land surface characteristics of the Greenland and Antarctic ice sheets are represented using the sub-gridscale elevation class tiling scheme and albedo and snow pack physics described in Shannon et al. (2019); we do not however use the glacier-specific tunings or precipitation and surface wind downscaling from that work. In UKESM1 we use 10 irregularly spaced elevation classes and allow for up to 10 discrete layers in the snow pack. These schemes provide significant improvements to the surface mass balance of the ice sheets simulated by UKESM1 compared to GC3.1.
Additionally, the physical simulation in UKESM1 is affected by its Earth system components through the radiative impact of changes in atmospheric composition and through the impact of vegetation structure on surface energy and momentum fluxes. These couplings are described alongside the Earth system component models in the rest of this appendix.

A2. Terrestrial Biogeochemistry
The terrestrial biogeochemistry in UKESM1 is based largely on that of HadGEM2-ES (Collins et al., 2011), though the underlying code has been migrated from MOSES (Met Office Surface Exchange Scheme) to JULES Clark et al., 2011), and major enhancements have been developed for UKESM1. As in HadGEM2-ES, fractional cover and canopy height of PFTs are simulated by the TRIFFID vegetation dynamics (Cox, 2001), and the resulting carbon fluxes drive the RothC soil carbon model (Coleman & Jenkinson, 1999). The model simulates the fractional extent of wetland within each grid box, which is used to derive an emission of CH 4 ; in the standard configuration of the model, this emission is diagnostic, but the model has the capability to run with interactive CH 4 emissions.
Parameters related to photosynthesis, respiration, and leaf turnover have been updated based on the TRY database (Kattge et al., 2011). These include nitrogen content of leaves, roots, and stems, leaf mass per unit area, leaf growth rate, and the relationship between leaf nitrogen and the maximum rate of carboxylation of the enzyme Rubisco at 25 • C (Harper et al., 2016(Harper et al., , 2018. The number of natural PFTs was increased from five to nine to represent the distinction between evergreen and deciduous plants and between tropical and temperate evergreen trees. The canopy scheme was updated so that the exponential decay of both photosynthetically available radiation and leaf nitrogen through the canopy now depend on the total leaf area index. In offline land-only tests, these developments improved model simulation of gross and net primary productivity (GPP and NPP, respectively) at the site level and globally (Harper et al., 2016). In particular, the NPP of C4 grasses was reduced and GPP and NPP of trees was increased.
Plants still compete for space based on their NPP and phenology as in Cox (2001), but the competition is now based on a pure height ranking, where the tallest trees get access to space in a grid box first. In uncoupled component model tests, the new dynamic vegetation and PFTs yield a closer match to observed vegetation distribution, with particular improvements to tropical and boreal forests and the high latitudes, where previously the model overestimated the distribution of shrubs (Harper et al., 2018).  Figure A1 illustrates the overall distribution of these PFTs with the dominant vegetation type, that is, the PFT having the greatest fractional coverage in each grid box. The additional PFT sub-divisions added by Harper et al. (2016) can be seen most clearly in the Boreal forests, with needleleaf deciduous trees generally dominating over needleleaf evergreen trees in more northerly continental climates. Tropical broadleaf evergreen trees dominate most of South America, sub-Saharan Africa, SE Asia, and the maritime continent. In contrast, there are very few areas in which broadleaf deciduous trees or temperate broadleaf evergreen dominate, as these tend to grow sparsely in regions dominated by grasses or shrubs.
Terrestrial carbon uptake is limited by the availability of nitrogen and other nutrients. An ecosystem becomes limited when insufficient nitrogen is available for plants to allocate net photosynthate to growth. In such cases photosynthesis is down-regulated to match the available inorganic nitrogen. In UKESM1, nitrogen does not directly affect photosynthetic capacity through leaf nitrogen concentrations but acts indirectly by controlling the biomass and leaf area index within the TRIFFID vegetation dynamics. A second mechanism acts through the soil by limiting the decomposition of litter into soil carbon in the RothC model. This occurs when insufficient nitrogen is available to decompose high C:N litter into low C:N soil organic matter. The vegetation model includes retranslocation of nitrogen during senescence of leaves and roots into a labile pool to supply nutrients for the following season's leaf growth. The soil model simulates mineralization and immobilization, with mineralized nitrogen becoming available for plant uptake and ecosystem loss. Inorganic nitrogen is represented by a single grid box pool to which all PFTs have equal access. As PFTs have differing nitrogen demands and turnover, the inclusion of nitrogen processes influences vegetation competition and therefore distribution.
To simulate the impact of agriculture and anthropogenic land use on the climate (via biophysical effects) and carbon cycle (via biogeochemical effects), a representation of land use and land management is implemented. Going beyond the HadGEM2-ES treatment of all agriculture as "disturbed" in a common way , UKESM1 distinguishes crop and pasture land. This is done by reserving a fraction of each grid box for crops and a fraction for pasture; these land use fractions are prescribed as an external forcing and may vary in time. Within the crop (pasture) fraction only the C3 crop (pasture) and C4 crop (pasture) PFTs are allowed to grow, with the area of each crop PFT determined by TRIFFID. Only the nine natural PFTs (including natural C3 and C4 grass) may grow the remainder of the grid box. When the prescribed crop or pasture fraction increases, natural vegetation is removed from the portion of the grid box into which agriculture has expanded, representing anthropogenic land clearance. The vegetation carbon thus removed is distributed as follows: Root biomass enters the soil carbon pools, and above-ground biomass is distributed to three product pools which release CO 2 to the atmosphere at turnover rates of 1, 10, and 100 years to represent a range of fast-, medium-, and slow-decay wood products. Conversely, when crop and pasture areas are reduced, the natural PFTs are allowed to recolonize the vacated grid box fraction.
The crop and pasture PFTs are physiologically identical to the natural grasses, but simple representations of fertilizer application and harvesting are applied to the crop PFTs. The crop PFTs are prevented from being nitrogen-limited by a fertilizer flux, which is calculated to exactly meet the nitrogen demand which cannot be met by available soil nitrogen. A crop harvest carbon flux is included by preventing 30% of crop PFT litter from entering the soil; instead this carbon flux enters the atmosphere, representing the rapid turnover of crop products. The harvest flux prevents unrealistic accumulation of soil carbon in croplands.
Finally, JULES now simulates biogenic emissions of VOCs, described in section A4.

A3. Ocean Biogeochemistry
The Model of Ecosystem Dynamics, nutrient Utilisation, Sequestration and Acidification (MEDUSA; Yool et al., 2013) is an intermediate-complexity plankton ecosystem model designed to incorporate sufficient complexity to address key feedbacks between anthropogenically-driven changes (climate and acidification) and oceanic biogeochemistry. MEDUSA-2 resolves a dual size-structured ecosystem of small (nanophytoplankton and microzooplankton) and large (microphytoplankton and mesozooplankton) components. This explicitly includes the biogeochemical cycles of nitrogen, silicon, and iron nutrients as well as the cycles of carbon, alkalinity, and dissolved oxygen. Large phytoplankton are treated as diatoms and utilize silicic acid in addition to nitrogen, iron, and carbon.
Similar to its living components, MEDUSA-2's detrital components are split into two size classes. Small, slow-sinking detritus is represented explicitly with separate nitrogen and carbon tracers; a fixed Fe:N ratio is assumed. Large, fast-sinking detritus is represented implicitly, and created and remineralized within a single timestep. Fast-sinking detritus consists of organic nitrogen, carbon, and iron, together with silicon and calcium carbonate biominerals that are involved in a ballast parameterization (Armstrong et al., 2001). At the seafloor, MEDUSA-2 resolves five reservoirs to store sinking organic material reaching the sediment.
The model's nitrogen, silicon and alkalinity cycles are closed and conservative (e.g., no riverine inputs), while the other three cycles (carbon, iron, and oxygen) are open. The iron cycle includes aeolian and benthic sources and is depleted by scavenging. The ocean's carbon cycle exchanges CO 2 with the atmosphere. The oxygen cycle exchanges with the atmosphere, and dissolved oxygen is additionally created by primary production and depleted by remineralization. The various elemental cycles include both fixed and variable stoichiometry: Iron has a fixed ratio with nitrogen throughout; nitrogen and carbon have fixed (but different) ratios in phytoplankton and zooplankton, and variable ratios in detritus; diatom silicon has a variable ratio with nitrogen; calcium carbonate (cf. alkalinity) is produced at a variable rate relative to organic carbon; oxygen production and consumption reflects the C:N ratio of organic matter produced and consumed.
During its development for UKESM1, several changes have occurred to the science and code of MEDUSA-2 compared to that originally described in Yool et al. (2013). The principal science changes are as follows: 1. The carbonate chemistry submodel in MEDUSA-2 has been upgraded to MOCSY-2.0 (Orr & Epitalon, 2015), increasing the calculation accuracy of pCO 2 and thus CO 2 air-sea flux. 2. A diagnostic submodel for DMS surface seawater concentration (Anderson et al., 2001) was added to support interactive DMS emissions to the atmosphere (see section 4.6). 3. For consistency with earlier work, the diel light cycle within UKESM1 is time-averaged for MEDUSA-2.

A4. Atmospheric Chemistry
Atmospheric composition in UKESM1 is simulated by the U.K. Chemistry and Aerosols (UKCA) model. UKCA has several chemistry configurations available; the specific configuration used in UKESM1 is described in Archibald et al. (2019) and comprises a combination of the stratospheric scheme of Morgenstern et al. (2009) (with minor updates to reaction rates as in Morgenstern et al., 2017) and the "TropIsop" tropospheric chemistry of O'Connor et al. (2014), which includes extensions to the volatile organic compound (VOC) representation relative to the tropospheric chemistry in HadGEM2-ES. The merged scheme thus simulates interactive chemistry from the surface to the top of the model (85 km), enabling a holistic treatment of ozone, the third most important radiatively active gas in the atmosphere and the primary source of the hydroxyl radical (OH) in the troposphere. In total, the chemistry solves 291 thermal and photolytic reactions which describe the chemistry of 81 species. This includes the oxidation reactions which are a source of sulphate and secondary organic aerosol, and hence the chemistry is tightly coupled to aerosol production.
In UKESM1, prognostic chemistry influences radiation through the three-dimensional concentrations of four gases: O 3 , CH 4 , N 2 O, and H 2 O. CH 4 and N 2 O are forced with globally constant surface concentrations, and above the surface are treated as interactive chemical tracers. This allows the model to represent the spatiotemporal variability of these species (in particular their decay in the upper troposphere) while constraining their global scale evolution to match past observations or a particular forcing scenario. The model has the capability to run with interactive CH 4 emissions, but in the standard configuration surface concentrations are prescribed as above.
In contrast, O 3 is fully interactive and is forced by surface emissions of ozone precursors such as VOCs, three-dimensional emissions of NOx (including emissions from aircraft and interactive production by lightning), and concentrations of ozone depleting substances. Chemical reactions which produce or destroy water vapor increment the humidity tracer of the dynamical core. This allows interactive simulation of chemical water vapor feedbacks, the most significant of which is due to stratospheric methane oxidation (see e.g., Hegglin et al., 2014;Hurst et al., 2011;Rohs et al., 2006). Concentrations of ozone depleting substances (CFCs, HCFCs, and Halons) are treated as in Morgenstern et al. (2009) and thus are not coupled between UKCA and radiation.
As an inert gas, CO 2 is not modified by UKCA chemistry and is specified as a global constant in the default configuration of UKESM1. See Appendix C2 for a description of the CO 2 emission-driven configuration.
The photolysis scheme that is used to drive the photochemistry in UKCA is that of Fast-JX (Neu et al., 2007;Telford et al., 2013) v6.4, augmented by the offline scheme after Lary and Pyle (1991) above 60 km for wavelengths between 113 and 177 nm. Fast-JX includes scattering calculations over a broad wavelength range (177-850 nm) to cover the stratosphere and lower mesosphere. This allows photolysis rates to vary with the optical depth associated with cloud liquid and frozen water as well as both the surface albedo calculated by the model's radiation scheme (Edwards & Slingo, 1996) and UKCA's own prognostic ozone distribution. However, there is no coupling to the aerosol scheme, and so the interactive photolysis rates are independent of aerosol loading.
The main purpose of this interactive calculation is to allow photolysis and hence methane lifetime to respond to stratospheric ozone recovery. This limitation made the future methane lifetime projections of HadGEM2-ES (which used fixed photolysis rates) anomalous relative to other models in the Atmospheric Chemistry and Climate Model Intercomparison Project (Voulgarakis et al., 2013). Figure A2 compares interactive rates in UKESM1 against the offline tabulated rates used by HadGEM2-ES, for July 2005-2014. The morphology and magnitude of the photolysis rates match reasonably well, and there is a strong correlation between the offline and online rates (r 2 > 0.96). The Fast-JX scheme gives rise to photolysis rates that are more variable, and higher than the offline rates by a factor of 1.2-1.3. This difference persists throughout the annual cycle (not shown) and leads to the methane lifetime being approximately 25% shorter with interactive photolysis than that with offline photolysis (O'Connor et al., 2014;Telford et al., 2013). For UKESM1, such a reduction in methane lifetime brings it within the observational range, highlighting the importance of the improved, interactive, scheme.
Biogenic volatile organic compounds (BVOCs) play an important role in atmospheric composition and consequently in climate. BVOCs act as ozone precursors, impact strongly on the chemical lifetime of methane, and their oxidized derivatives contribute to the formation of secondary organic aerosols (SOA). The most important BVOCs for climate and air quality are isoprene and terpenes, which typically contribute 50-70% of the total BVOC emission flux in contemporary chemistry-climate models. In UKESM1, isoprene feeds into the chemical mechanism for isoprene photooxidation, while a monoterpene tracer (which oxidizes and condenses to form SOA) is used to represent terpenes.
In UKESM1, biogenic emissions of isoprene and monoterpene are calculated interactively by means of the iBVOC model (Pacifico et al., 2012(Pacifico et al., , 2015, which applies the isoprene emission scheme of Pacifico et al. (2011) and the terpene emission parametrization from Guenther et al. (1995). These parametrizations take as inputs the modeled temperature and gross primary productivity (GPP) and hence add an important coupling between climate, atmospheric composition, and vegetation dynamics. In order to minimize differences in organic carbon aerosol load between UKESM1 and GC3.1, the per-PFT monoterpene emission capacity parameters were adjusted within observed ranges to achieve a present-day global total monoterpene emission of approximately 125 TgC year −1 , to match the emission data set of Guenther et al. (1995) used by GC3.1. Similarly, isoprene emission factors were adjusted in line with published values to achieve a present-day global total emission of approximately 500 TgC year −1 (Arneth et al., 2008;Guenther et al., 2012;Lathière et al., 2006;Messina et al., 2016).

A5. Aerosols
The aerosol configuration of UKESM1 is the same as that of GC3.1 and is described by Mulcahy et al. (2018). It uses the UKCA-GLOMAP modal scheme (Mann et al., 2010) for sulphate, black carbon, organic carbon, and sea salt, and a variant of the Woodward (2011) bin scheme for mineral dust. UKESM1 does however differ from GC3.1 in its treatment of natural emissions of aerosols and aerosol precursors, as follows: 1. Biogenic emissions of monoterpene, a precursor of SOA, are interactively calculated in the land surface scheme as described in section A4, whereas GC3.1 prescribes a climatological emission which is representative of the present-day. 2. DMS emissions are coupled to ocean biogeochemistry, using the Anderson et al. (2001) parametrization of seawater DMS concentration rather than prescribed using a present-day observational climatology. 3. UKESM1 introduces an emission of primary marine organic aerosol (PMOA), using the parametrization of Gantt et al. (2011Gantt et al. ( , 2012, acting as a source of Aitken mode organic aerosol. This parametrization is coupled to the interactive surface chlorophyll concentration of MEDUSA. GC3.1 has no explicit PMOA representation but instead applies a scale factor of 1.7 to its DMS emissions as a proxy for PMOA (Mulcahy et al., 2018). This DMS scale factor is not applied in UKESM1 since PMOA is explicitly simulated.
These three couplings allow the model aerosol and cloud-radiative behavior to respond directly to changes in both the marine and terrestrial biospheres, as well as indirectly through the impact of climate on the biosphere. A detailed assessment of the impact of these emissions developments on the aerosol and radiation performance of the model is in progress.
Another difference in the treatment of aerosol between UKESM1 and GC3.1 is that the latter prescribes time-invariant concentrations of the chemical oxidants involved in the formation of sulphate and secondary organic aerosol, namely, O 3 , OH, NO 3 , HO 2 , and H 2 O 2 . In UKESM1 these oxidants are interactively simulated by the chemistry component and are themselves depleted by the oxidation reactions. This oxidant coupling means that changes in methane and other ozone precursors can affect the model's aerosol loading and aerosol radiative forcing.
Emissions of mineral dust in UKESM1 are dependent upon the modeled bare soil fraction as well as on soil moisture and near-surface wind. Deposition of dust to the ocean in turn acts as a source of iron nutrient to MEDUSA, fertilizing phytoplankton growth wherever iron is limited. Dust therefore acts as a unique link between the terrestrial and marine biospheres. The dust scheme in UKESM1 is similar to that in GC3.1 and is described in Woodward (2011), with the following changes. First, in UKESM1 the bare soil fraction (which dictates the area available for dust emissions) is interactively simulated by the TRIFFID vegetation dynamics, rather than prescribed. This significant difference in driving data necessitated tuning of the dust emission parameters, as described in section 3. Second, dust emission from seasonally vegetated sources was disabled in UKESM1 in order to reduce the complexity of the coupling to TRIFFID, so that dust emissions depend on bare soil fraction but not on simulated leaf area index. Note that following a careful tuning of the vegetation simulation, also described in section 3, it was not found necessary to impose a map of preferential dust source regions such as those used in earlier models (e.g., Collins et al., 2011;Ginoux et al., 2001;Tegen et al., 2002;Zender et al., 2003). Table B1 lists the primary metrics which were monitored during the tuning and spin-up of UKESM1. These were calculated based on annual mean model outputs (in some cases with longer term meaning applied to smooth interannual noise) and plotted as time series. These time series plots were updated routinely as test runs progressed, to rapidly understand the model's response to parameter changes. Priority 1 metrics were checked more frequently than the priority 2 metrics.

Appendix B: Metrics and Variables Used in Model Tuning
In addition to these time series, we also periodically examined seasonal mean, long-term averaged maps of these and other quantities to ensure there was not error cancelation in the spatial means, and to understand better the processes which were likely responsible for driving differences between test runs.

C1. Faster Configurations: UKESM1-CN and UKESM1-C
Because the chemistry scheme in UKESM1 is particularly expensive, we have implemented a cheaper configuration of the model, known as UKESM1-CN (a carbon-nitrogen cycle model) which reduces computational expense by utilizing the simpler "offline oxidant" chemistry mechanism of GC3.1 (Walters et al., 2019). This scheme simulates sulphur chemistry for the production of sulphate aerosol using prescribed oxidants but excludes other interactive chemical processes. In addition, UKESM1-CN does not have interactive emissions of BVOCs. Apart from these two changes, UKESM1-CN is identical to the full UKESM1 configuration.
It runs approximately twice as fast as UKESM1. UKESM1-CN will not be used for submissions to CMIP6 but is intended for use in carbon-nitrogen cycle research and other applications which do not depend upon interactive chemistry.
Because it is lacking fully interactive chemistry, the UKESM1-CN configuration needs additional input files of the following variables: 1. O 3 , for radiation and as an oxidant for aerosol chemistry, 2. other oxidants for aerosol chemistry (OH, NO 3 , HO 2 , and H 2 O 2 ), 3. and biogenic emissions of monoterpene.
For consistency with the full model, each of these inputs are derived from the interactive model fields produced by a prior run with UKESM1. Experience thus far indicates that headline metrics such as ECS are consistent between UKESM1 and UKESM1-CN. Future work will study the traceability between these configurations of differing complexity.
A further simplification incorporates only the carbon cycle (UKESM1-C), primarily for understanding the impact of the nitrogen processes. The nitrogen processes do not add significant cost, so UKESM1-C runs at the same speed as UKESM1-CN.

C2. CO 2 Emission-Driven Configuration
All of the results in section 4, and most of the simulations that will be submitted to CMIP6, are driven by CO 2 concentrations rather than CO 2 emissions. When UKESM1 runs in CO 2 concentration-driven mode, atmospheric CO 2 concentrations are externally prescribed as a global constant, and the biogeochemical fluxes of CO 2 are purely diagnostic. From these natural fluxes, compatible anthropogenic emissions can be derived, as per Friedlingstein et al. (2006).
When UKESM1 runs in CO 2 emission-driven mode, atmospheric CO 2 is held as an advected 3D tracer, updated by the natural surface fluxes and anthropogenic emissions. In this mode CO 2 concentrations are free to evolve and thus drive a radiative response the model's carbon cycle feedbacks. The 3D nature of the free tracer means that (unlike in concentration-driven mode) the model resolves the spatial structure of CO 2 , including hemispheric and stratosphere-troposphere differences, and its seasonal cycle. Two emission-driven simulations have been performed: a pre-industrial control with zero anthropogenic emissions and a historical run from 1850 with evolving CMIP6 anthropogenic emissions. In Figure C1 we compare the seasonal cycle of the historical run against observations from the Carbon Dioxide Information Analysis Centre (CDIAC; Keeling et al., 1994). These show that the timing of the model's seasonal cycle compares well against the observations, though the summer draw-down begins too early. The seasonal range is weaker than observations at Mauna Loa by approximately 40% and stronger at Barrow by approximately 20%. Figure C2 shows the temporal evolution of CO 2 for these experiments at Mauna Loa, again compared to historical observations from CDIAC. The pre-industrial control concentrations vary on decadal timescales within the range 280-288 ppm, which encompasses the observed value of 284 ppm prescribed in concentration-driven experiments (Meinshausen et al., 2017). The historical experiment drifts slightly higher than the observations, reaching a bias of about 10 ppm by 2000. Such close agreement with the observations is a good result given the number of degrees of freedom in the emission-driven configuration and compares well with the CMIP5 generation of models.
It is critical that the fully coupled model conserves carbon to a high accuracy: A significant loss of conservation would confound the analysis of carbon feedbacks. The UKESM1 atmospheric convection and boundary layer schemes only conserve tracer mass to within order 0.1 Gt of CO 2 per year, which is not sufficiently accurate for emission-driven simulations. An atmospheric mass correction has therefore been implemented, as was required in HadGEM2-ES . In UKESM1 this correction uses an implementation of the Priestley (2002) conservation algorithm to distribute the correction increments globally. The ocean and land components did not require any such correction scheme to achieve this level of conservation.
The finalized model conserves carbon to high accuracy. Over the first 64 years of the emission-driven pre-industrial control and historical simulations, the model gained an average of −1.3 × 10 −3 GtC year −1 and1.2 × 10 −4 GtC year −1 , respectively. These rates are two or more orders of magnitude smaller than the C4MIP-recommended 0.1 Gt year −1 threshold for acceptable drift in carbon pools in CO 2 concentration-driven experiments , so we consider this an acceptable loss of conservation for CO 2 emission-driven experiments.